Part 1: Fit a Neural Field to a 2D Image

The network is a sequential MLP: the input 2D pixel coordinate is first lifted with positional encoding (PE), passed through a series of linear layers interspersed with ReLU activations, and a final linear layer maps to 3 channels followed by a Sigmoid, yielding RGB color values in [0, 1]. Training used a batch size of 100,000 pixels per iteration, 1,000 epochs, a dataset of 10,000 samples, a learning rate of 1e-4, and a hidden dimension of 512. The loss is the mean squared error (MSE) between predicted and ground-truth colors; since PSNR = 10 * log10(1 / MSE) for images in [0, 1], minimizing MSE directly maximizes the Peak Signal-to-Noise Ratio (PSNR) used to track reconstruction quality.
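
As a concrete reference, here is a minimal PyTorch sketch of this setup; the number of hidden layers and the PE frequency count are assumptions, since the writeup only fixes the hidden dimension and learning rate:

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=10):
    # Lift each coordinate to [x, sin(2^0 pi x), cos(2^0 pi x), ...,
    # sin(2^(L-1) pi x), cos(2^(L-1) pi x)].
    out = [x]
    for i in range(num_freqs):
        out.append(torch.sin(2.0 ** i * torch.pi * x))
        out.append(torch.cos(2.0 ** i * torch.pi * x))
    return torch.cat(out, dim=-1)

class NeuralField2D(nn.Module):
    # Sequential MLP: PE'd (u, v) -> hidden ReLU layers -> 3-channel Sigmoid output.
    def __init__(self, num_freqs=10, hidden_dim=512, num_layers=4):
        super().__init__()
        in_dim = 2 * (1 + 2 * num_freqs)  # 2 coords, each kept plus sin/cos per frequency
        layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
        for _ in range(num_layers - 2):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers += [nn.Linear(hidden_dim, 3), nn.Sigmoid()]  # RGB in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, uv):
        return self.net(positional_encoding(uv))
```

During training, the reported PSNR falls straight out of the loss: psnr = 10 * torch.log10(1.0 / mse).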

Fox PSNR Curve (every 50 epochs)
Fox Training Sequence
Original Fox
BWW PSNR Curve (every 50 epochs)
BWW Training Sequence
Original BWW

Part 2: Fit a Neural Radiance Field from Multi-view Images

Part 2.1: Create Rays from Cameras

Camera to World Coordinate Conversion

This function takes a 4x4 camera-to-world (c2w) transformation matrix and a batch of points in camera coordinates. The points are extended to homogeneous coordinates (appending a 1 as the fourth component), multiplied by the c2w matrix, and then converted back to regular 3D coordinates.
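
A short sketch of that conversion, assuming points arrive as an (N, 3) tensor:

```python
import torch

def transform(c2w, x_c):
    # c2w: (4, 4) camera-to-world matrix; x_c: (N, 3) points in camera coordinates.
    ones = torch.ones_like(x_c[..., :1])
    x_h = torch.cat([x_c, ones], dim=-1)   # lift to homogeneous coordinates: (N, 4)
    x_w = x_h @ c2w.T                      # apply the c2w transform to each point
    return x_w[..., :3]                    # back to regular 3D coordinates
```

Because the bottom row of c2w is [0, 0, 0, 1], the homogeneous component stays 1 and can simply be dropped.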

Pixel to Camera Coordinate Conversion

Given the intrinsic matrix K, pixel coordinates, and a depth s, this function converts 2D pixel coordinates into 3D points in the camera coordinate system: each pixel is lifted to homogeneous coordinates, multiplied by K⁻¹, and scaled by the depth s.
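
A matching sketch, continuing from the imports above (s may be a scalar or an (N, 1) tensor of per-pixel depths):

```python
def pixel_to_camera(K, uv, s):
    # K: (3, 3) intrinsic matrix; uv: (N, 2) pixel coordinates; s: depth(s).
    ones = torch.ones_like(uv[..., :1])
    uv_h = torch.cat([uv, ones], dim=-1)    # homogeneous pixels [u, v, 1]: (N, 3)
    x_c = uv_h @ torch.linalg.inv(K).T      # apply K^-1 to each pixel
    return x_c * s                          # scale by depth to leave the image plane
```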

Pixel to Ray

The ray origin is the camera position in world space (the translation column of the c2w matrix), and the ray direction is computed by transforming the pixel at depth 1 from camera to world space, subtracting the ray origin, and normalizing the result.
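
Combining the two helper sketches above (the names are my own, not necessarily the writeup's):

```python
def pixel_to_ray(K, c2w, uv):
    # Origin: the camera center in world space, i.e. the translation part of c2w.
    ray_o = c2w[:3, 3].expand(uv.shape[0], 3)
    # Direction: send each pixel to depth 1, map it to world space, normalize.
    x_w = transform(c2w, pixel_to_camera(K, uv, 1.0))
    ray_d = x_w - ray_o
    return ray_o, ray_d / ray_d.norm(dim=-1, keepdim=True)
```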

Part 2.2: Sampling

Sampling Rays from Images

This involves randomly sampling pixels across the training images to obtain pixel coordinates and their colors. The coordinates are then converted into ray origins and directions using the camera intrinsics and extrinsics, adding a 0.5 offset so each ray passes through the pixel center.
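
A vectorized sketch of this step (rather than looping over pixel_to_ray), assuming all images share one resolution and each has its own K and c2w:

```python
def sample_rays(images, Ks, c2ws, num_rays):
    # images: (M, H, W, 3) in [0, 1]; Ks: (M, 3, 3); c2ws: (M, 4, 4).
    M, H, W, _ = images.shape
    idx = torch.randint(M, (num_rays,))
    u = torch.randint(W, (num_rays,))
    v = torch.randint(H, (num_rays,))
    pixels = images[idx, v, u]                                 # ground-truth colors
    uv = torch.stack([u, v], dim=-1).float() + 0.5             # offset to pixel centers
    uv_h = torch.cat([uv, torch.ones(num_rays, 1)], dim=-1)    # homogeneous pixels
    x_c = torch.einsum('nij,nj->ni', torch.linalg.inv(Ks[idx]), uv_h)  # depth-1 points
    R, t = c2ws[idx, :3, :3], c2ws[idx, :3, 3]
    x_w = torch.einsum('nij,nj->ni', R, x_c) + t               # to world coordinates
    rays_o = t                                                 # camera centers
    rays_d = x_w - rays_o
    return rays_o, rays_d / rays_d.norm(dim=-1, keepdim=True), pixels
```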

Sampling Points along Rays

Each ray is discretized into sample points in 3D space by placing samples uniformly between near and far bounds along the ray; during training, small random perturbations are added to the sample depths so that, across iterations, every location along the ray gets covered.
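
A sketch of this discretization; the near/far bounds and sample count here are assumptions (typical values for the Lego scene):

```python
def sample_points(rays_o, rays_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    # Uniform depths between the near and far bounds, shared across rays.
    t = torch.linspace(near, far, n_samples).expand(rays_o.shape[0], n_samples)
    if perturb:
        # Jitter each sample within its bin so, over iterations, every depth is seen.
        t = t + torch.rand_like(t) * (far - near) / n_samples
    # Points along each ray: o + t * d, shape (num_rays, n_samples, 3).
    return rays_o[:, None, :] + t[..., None] * rays_d[:, None, :], t
```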

Part 2.3: Data Loading

Sampling Rays from Images

The dataloader ties the steps above together: each iteration, it randomly samples pixels across all training images, converts them into ray origins and directions using the camera intrinsics and extrinsics (with the pixel-center offset), and returns the rays along with their ground-truth colors, as in the short usage sketch below.
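
With the helper sketches from Part 2.2, a training iteration's data loading reduces to a couple of lines (names as sketched above; the batch size is an assumption):

```python
# One iteration's batch: fresh random rays plus their 3D sample points.
rays_o, rays_d, target_rgb = sample_rays(images, Ks, c2ws, num_rays=10_000)
points, t_vals = sample_points(rays_o, rays_d)   # (num_rays, n_samples, 3)
```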

Viser Visualization

Part 2.4: Neural Radiance Field

The network is an MLP that takes a 3D world coordinate and a 3D ray direction as input and outputs an RGB color and a volume density. Both inputs are encoded with positional encoding. Compared to the 2D model, the network is deeper, and the encoded inputs are re-injected partway through the MLP via a skip connection.
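
A sketch of such a network, reusing the torch/nn imports and positional_encoding helper from the Part 1 sketch; the exact widths, depths, and frequency counts are assumptions:

```python
class NeRF(nn.Module):
    def __init__(self, hidden=256, pos_freqs=10, dir_freqs=4):
        super().__init__()
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        pos_dim = 3 * (1 + 2 * pos_freqs)
        dir_dim = 3 * (1 + 2 * dir_freqs)
        self.stage1 = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Mid-MLP injection: concatenate the encoded position back in.
        self.stage2 = nn.Sequential(
            nn.Linear(hidden + pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())  # sigma >= 0
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, x, d):
        x_enc = positional_encoding(x, self.pos_freqs)
        d_enc = positional_encoding(d, self.dir_freqs)
        h = self.stage1(x_enc)
        h = self.stage2(torch.cat([h, x_enc], dim=-1))
        sigma = self.density_head(h)
        rgb = self.color_head(torch.cat([h, d_enc], dim=-1))
        return rgb, sigma
```

Conditioning color (but not density) on the ray direction is what lets the model capture view-dependent effects like specular highlights.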

Figure from CS180 Proj5 Spec

Part 2.5: Volume Rendering

This computes the discrete approximation of the volume rendering equation from the colors and densities predicted by the NeRF network: C = Σᵢ Tᵢ (1 − exp(−σᵢ δᵢ)) cᵢ, with transmittance Tᵢ = exp(−Σ_{j<i} σⱼ δⱼ), where σᵢ and cᵢ are the density and color of sample i and δᵢ is the spacing between samples. It is implemented entirely in PyTorch tensor operations so gradients can backpropagate through the renderer into the network.
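
A differentiable sketch, assuming equally spaced samples so the spacing δ is a constant step_size:

```python
def volrend(sigmas, rgbs, step_size):
    # sigmas: (num_rays, n_samples, 1); rgbs: (num_rays, n_samples, 3).
    # alpha_i = 1 - exp(-sigma_i * delta): chance the ray terminates in sample i.
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    # T_i = prod_{j<i} (1 - alpha_j): chance the ray reaches sample i unblocked.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alphas * trans                       # contribution of each sample
    return (weights * rgbs).sum(dim=1)             # (num_rays, 3) rendered colors
```

Since this is plain tensor arithmetic, autograd differentiates through the renderer, which is exactly what lets the MSE-on-rendered-colors loss train the network.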

Lego PSNR Curve
Lego Training Sequence
Full-Circle Lego Render