Here's the result of training the network on the provided demo image.
By 500 iterations, the model had already learned the input image well. Here we demonstrate the result on another image of the Painted Ladies in San Francisco.
For this picture, the hyperparameters chosen were num_iterations=3000, batch_size=10000, and max_freq_L=10.
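The max_freq_L hyperparameter controls the positional encoding applied to input coordinates before they reach the network. As a minimal sketch, assuming the standard sinusoidal encoding (frequencies 2^0 through 2^(L-1), with both sine and cosine per frequency; the exact convention in my code may differ):

```python
import numpy as np

def positional_encoding(x, max_freq_L=10):
    """Map each coordinate to [x, sin(2^0*pi*x), cos(2^0*pi*x), ...,
    sin(2^(L-1)*pi*x), cos(2^(L-1)*pi*x)] along the last axis."""
    feats = [x]
    for i in range(max_freq_L):
        feats.append(np.sin(2.0**i * np.pi * x))
        feats.append(np.cos(2.0**i * np.pi * x))
    return np.concatenate(feats, axis=-1)
```

With max_freq_L=10, a 2D pixel coordinate expands to a 2 * (1 + 2 * 10) = 42-dimensional feature vector.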
1. x_w = transform(c2w, x_c)
This function converts camera-space coordinates to world-space coordinates using the camera-to-world (c2w) transformation matrix. c2w is the inverse of the world-to-camera (w2c) matrix, which is represented as
\[ w2c = \begin{bmatrix} \mathbf{R}_{3\times3} & \mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix} \]
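A minimal sketch of this transform, assuming points are stored as an (N, 3) array and c2w is the 4x4 homogeneous matrix above:

```python
import numpy as np

def transform(c2w, x_c):
    """Apply a 4x4 camera-to-world matrix to (N, 3) camera-space points."""
    # append a homogeneous coordinate of 1 to each point
    x_h = np.concatenate([x_c, np.ones((x_c.shape[0], 1))], axis=-1)
    # multiply by c2w and drop the homogeneous coordinate
    return (x_h @ c2w.T)[:, :3]
```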
2. pixel_to_camera(K, uv, s)
This function transforms a point from the pixel coordinate system back to the camera coordinate system.
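Concretely, this is x_c = s * K^{-1} [u, v, 1]^T, where K is the intrinsic matrix and s the depth. A minimal sketch, assuming (N, 2) pixel coordinates:

```python
import numpy as np

def pixel_to_camera(K, uv, s):
    """Back-project (N, 2) pixel coordinates at depth s through intrinsics K."""
    # homogeneous pixel coordinates [u, v, 1]
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=-1)
    # x_c = s * K^{-1} @ [u, v, 1]^T for each pixel
    return s * (uv_h @ np.linalg.inv(K).T)
```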
3. pixel_to_ray(K, uv, s)
This function converts a pixel coordinate to a ray with origin and normalized direction.
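A sketch of the ray construction: the origin is the camera center c2w[:3, 3], and the direction points from the origin toward the pixel back-projected into world space. Note this sketch takes the camera pose c2w as an extra argument (my actual signature may pass it differently):

```python
import numpy as np

def pixel_to_ray(K, c2w, uv):
    """Convert (N, 2) pixel coords to world-space ray origins and unit directions."""
    # back-project pixels to camera space at depth 1
    uv_h = np.concatenate([uv, np.ones((uv.shape[0], 1))], axis=-1)
    x_c = uv_h @ np.linalg.inv(K).T
    # move the points into world space with the camera-to-world matrix
    x_h = np.concatenate([x_c, np.ones((x_c.shape[0], 1))], axis=-1)
    x_w = (x_h @ c2w.T)[:, :3]
    # ray origin is the camera center; direction is normalized
    r_o = np.broadcast_to(c2w[:3, 3], x_w.shape)
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=-1, keepdims=True)
    return r_o, r_d
```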
In addition, I implemented a DataLoader class that supports sampling rays from images, as well as sampling points along rays.
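The point-sampling step can be sketched as stratified sampling between near and far bounds along each ray. The bounds near=2.0, far=6.0 and n_samples=64 here are illustrative defaults, not necessarily the values used:

```python
import numpy as np

def sample_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=64, perturb=True):
    """Sample 3D points along each of N rays between the near and far bounds."""
    t = np.linspace(near, far, n_samples)  # (n_samples,)
    if perturb:
        # jitter each depth within its bin so training sees fresh points
        t = t + np.random.uniform(0, (far - near) / n_samples, size=t.shape)
    # points: (N, n_samples, 3) via broadcasting r_o + t * r_d
    pts = r_o[:, None, :] + r_d[:, None, :] * t[None, :, None]
    return pts, t
```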
We can visualize the samples:
Next, I implemented the neural architecture as described in the following graph.
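For readers without the graph, the forward pass can be sketched as a plain MLP with a ReLU density head and a sigmoid color head. This is a simplified stand-in (it omits the skip connection and view-direction conditioning of the standard NeRF network, and the layer sizes are assumptions):

```python
import numpy as np

def init_mlp(in_dim, hidden=256, seed=0):
    """Random weights for a small MLP: hidden layers, then sigma and rgb heads."""
    rng = np.random.default_rng(seed)
    dims = [in_dim, hidden, hidden, hidden]
    layers = [(rng.normal(0, 0.1, (d_in, d_out)), np.zeros(d_out))
              for d_in, d_out in zip(dims[:-1], dims[1:])]
    sigma_head = (rng.normal(0, 0.1, (hidden, 1)), np.zeros(1))
    rgb_head = (rng.normal(0, 0.1, (hidden, 3)), np.zeros(3))
    return layers, sigma_head, rgb_head

def mlp_forward(params, x):
    """Map encoded inputs to (density, color) per sample."""
    layers, (w_s, b_s), (w_c, b_c) = params
    h = x
    for w, b in layers:
        h = np.maximum(h @ w + b, 0)           # ReLU hidden layers
    sigma = np.maximum(h @ w_s + b_s, 0)       # density must be non-negative
    rgb = 1.0 / (1.0 + np.exp(-(h @ w_c + b_c)))  # sigmoid keeps color in [0, 1]
    return sigma, rgb
```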
I also implemented the discrete approximation of the volume rendering equation:
\[
\begin{align} C(\mathbf{r})=\int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t))
\mathbf{c}(\mathbf{r}(t), \mathbf{d}) d t, \text { where } T(t)=\exp \left(-\int_{t_n}^t \sigma(\mathbf{r}(s)) d s\right)
\end{align}
\]
\[
\begin{align}
\hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \text { where } T_i=\exp
\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}
\]
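The discrete sum above can be sketched directly in NumPy: alpha_i = 1 - exp(-sigma_i * delta_i) is the per-sample opacity, and T_i is the accumulated transmittance up to (but not including) sample i:

```python
import numpy as np

def volume_render(sigmas, rgbs, deltas):
    """Render colors from sigmas (N, S, 1), rgbs (N, S, 3), deltas (N, S, 1)."""
    # per-sample opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance up to sample i
    T = np.exp(-np.cumsum(sigmas * deltas, axis=1))
    T = np.concatenate([np.ones_like(T[:, :1]), T[:, :-1]], axis=1)
    # weighted sum over the S samples along each ray -> (N, 3)
    return np.sum(T * alphas * rgbs, axis=1)
```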
Here are the PSNR results from the trained model:
Because the model was trained for a relatively short time (longer training yields higher PSNR), the result is somewhat blurry, but the Lego Technic tractor is still clearly recognizable.
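For reference, PSNR here is computed from the mean squared error between the rendered and ground-truth images; assuming pixel values in [0, 1], a sketch is:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (dB) for images with values in [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)
```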