CS 194 - Final Projects

Haoyan Huo

Table of Contents

Project 1 - Poor Man's Augmented Reality Go to TOC

In this project, I demonstrate inserting an artificial object into a captured video: Poor Man's AR!

Note: click the videos to view them on YouTube.

1. Creating keypoints and capturing video Go to TOC

The first step is to create an object with predefined keypoints. For this project, I used a small soapbox wrapped in a sheet of A4 paper. I drew a mesh grid on the surface; each cell in the mesh is 1x1 inch. The 3D world coordinates of the keypoints are defined as shown in the left panel of the following figure. The screen coordinates of the keypoints are manually selected using matplotlib.pyplot.ginput, as shown in the right panel.
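The annotation step only takes a few lines of matplotlib. Below is a minimal sketch of how the 2D screen coordinates can be collected; the file name and the partial list of 3D grid coordinates are placeholders, not my actual data.

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

# Hypothetical file name for the first video frame.
first_frame = mpimg.imread("first_frame.jpg")

plt.imshow(first_frame)
# Click the keypoints in a fixed order; ginput returns a list of (x, y) pixel coordinates.
points_2d = np.array(plt.ginput(n=24, timeout=0))
plt.close()

# The corresponding 3D world coordinates (in inches) are written down by hand in the
# same order, following the 1x1 inch mesh grid drawn on the box.
points_3d = np.array([
    [0, 0, 0], [1, 0, 0], [2, 0, 0],
    [0, 1, 0], [1, 1, 0], [2, 1, 0],
    # ... remaining grid corners
], dtype=float)
```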

Then, I captured a video using my iPhone XR rear camera:

2. Track keypoints in a video Go to TOC

Obviously, we couldn't annotate all keypoints in every frame. So I used the median flow tracker implemented in OpenCV to automatically propagate keypoints from one frame to the next. To do this, I defined a 16x16 patch around each keypoint and initialized 24 median flow trackers at the initial positions. Note that I didn't use an 8x8 patch because my video has a higher resolution, so a larger patch works better.
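A minimal sketch of this setup is shown below, assuming `first_frame` and `points_2d` from the annotation step; the exact factory name for the MedianFlow tracker depends on the installed OpenCV/contrib version.

```python
import cv2
import numpy as np

PATCH = 16  # 16x16 patch around each keypoint (an 8x8 patch is too small at this resolution)

def make_tracker():
    # Newer opencv-contrib builds expose the legacy trackers under cv2.legacy.
    if hasattr(cv2, "legacy") and hasattr(cv2.legacy, "TrackerMedianFlow_create"):
        return cv2.legacy.TrackerMedianFlow_create()
    return cv2.TrackerMedianFlow_create()

# One tracker per keypoint, initialized on a PATCH x PATCH box around it.
trackers = []
for x, y in points_2d:
    tracker = make_tracker()
    tracker.init(first_frame, (x - PATCH / 2, y - PATCH / 2, PATCH, PATCH))
    trackers.append(tracker)

def track_frame(frame):
    """Update every tracker on a new frame and return the patch centers."""
    new_points = []
    for tracker in trackers:
        ok, (x, y, w, h) = tracker.update(frame)
        new_points.append((x + w / 2, y + h / 2) if ok else (np.nan, np.nan))
    return np.array(new_points)
```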

The median flow tracker is not perfect: when the image changes too fast, it loses track of some keypoints. To solve this problem, I used the RANSAC algorithm to estimate a homography between the two frames, identified keypoints with large deviations, corrected them using the homography, and re-initialized their trackers.
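The correction step could look like the sketch below; the 5-pixel deviation threshold is a placeholder, and the trackers of the flagged keypoints are re-initialized at the corrected positions afterwards.

```python
import cv2
import numpy as np

def correct_lost_keypoints(prev_pts, cur_pts, threshold=5.0):
    """Fit a homography between two consecutive frames with RANSAC and use it to
    correct keypoints that were lost or that deviate too much from the consensus."""
    valid = ~np.isnan(cur_pts).any(axis=1)
    H, _ = cv2.findHomography(prev_pts[valid].astype(np.float32),
                              cur_pts[valid].astype(np.float32),
                              cv2.RANSAC, ransacReprojThreshold=threshold)
    # Where the homography predicts each previous keypoint should land in the current frame.
    predicted = cv2.perspectiveTransform(
        prev_pts.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 2)
    # Lost keypoints and large outliers are snapped back to the predicted positions.
    deviation = np.linalg.norm(cur_pts - predicted, axis=1)
    bad = (~valid) | (deviation > threshold)
    corrected = cur_pts.copy()
    corrected[bad] = predicted[bad]
    return corrected, bad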

The left video below shows keypoint tracking without the RANSAC correction; the right video shows it with the correction. Points lost by the tracker are quickly corrected by the RANSAC step.

3. Calibrate the camera and cube-augmented reality Go to TOC

The camera projection matrix projects homogeneous 3D world points into 2D image coordinates: $P_{2d} = M P_{3d}$. The camera projection matrix has 11 degrees of freedom: $$M = \begin{bmatrix} M_{11} & M_{12} & M_{13} & M_{14} \\ M_{21} & M_{22} & M_{23} & M_{24} \\ M_{31} & M_{32} & M_{33} & 1 \end{bmatrix}$$ As in the image mosaicing project, I set up linear equations and used the least squares method $\hat{M} = \arg\min_{M} \sum_{i=1}^n ||M P_{3d}^i - P_{2d}^i||^2$ to solve for the 11 unknowns.
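A sketch of the least-squares fit is shown below. Each correspondence contributes two rows: expanding $u = (M_{11}X + M_{12}Y + M_{13}Z + M_{14})/(M_{31}X + M_{32}Y + M_{33}Z + 1)$ (and similarly for $v$) and moving the denominator to the left gives a linear equation in the 11 unknowns.

```python
import numpy as np

def fit_projection_matrix(points_3d, points_2d):
    """Least-squares fit of the 3x4 camera projection matrix with M34 fixed to 1.

    points_3d: (n, 3) world coordinates; points_2d: (n, 2) pixel coordinates."""
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.extend([u, v])
    m, *_ = np.linalg.lstsq(np.asarray(A, dtype=float), np.asarray(b, dtype=float),
                            rcond=None)
    return np.append(m, 1.0).reshape(3, 4)
```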

Once the camera projection matrix is known for each frame, we can create a cube in the 3D world coordinate system and use the matrices to project its corners into 2D image coordinates. Since straight lines in 3D project to straight lines in 2D, we can use OpenCV drawing functions such as polylines/fillPoly/line to render the cube directly on the images.
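A minimal sketch of the projection and drawing step, assuming the 3x4 matrix `M` fitted above and a unit cube (color and line thickness are arbitrary choices):

```python
import cv2
import numpy as np

def project(M, pts_3d):
    """Project (n, 3) world points with the 3x4 projection matrix M and dehomogenize."""
    homogeneous = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])
    projected = homogeneous @ M.T
    return projected[:, :2] / projected[:, 2:]

# A unit cube centered at (1, 1, 0.5), spanning z in [0, 1], i.e. resting on the z = 0 plane.
center, half = np.array([1.0, 1.0, 0.5]), 0.5
corners = center + half * np.array([[-1, -1, -1], [1, -1, -1], [1, 1, -1], [-1, 1, -1],
                                    [-1, -1, 1], [1, -1, 1], [1, 1, 1], [-1, 1, 1]])
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 5), (5, 6), (6, 7), (7, 4),
         (0, 4), (1, 5), (2, 6), (3, 7)]

def draw_cube(frame, M):
    pts = project(M, corners)
    for i, j in edges:
        cv2.line(frame, tuple(int(v) for v in pts[i]), tuple(int(v) for v in pts[j]),
                 (0, 255, 0), 2)
    return frame
```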

In the following video, we put a cube that is centered at $(1, 1, 0.5)$.

Bells and whistles: Rendering AR video with Sather Tower! Go to TOC

1) Theory of decomposing camera projection matrix into a pose matrix

A camera projection matrix can be decomposed into a camera intrinsics matrix and a 3D rotation/translation matrix: $M=C[R|t]$. The camera intrinsics matrix describes how the camera projects 3D coordinates onto its sensor in the pinhole camera model: $$C = \begin{bmatrix} f_x & 0 & p_x\\ 0 & f_y & p_y\\ 0 & 0 & 1 \end{bmatrix} $$ where $f_x, f_y$ are the focal lengths in units of pixels and $p_x, p_y$ are the image coordinates of the optical axis. The 3D rotation/translation matrix $[R|t]$ transforms 3D world coordinates into the camera-centered coordinate system: $$[R|t] = \begin{bmatrix} R_{11} & R_{12} & R_{13} & t_x\\ R_{21} & R_{22} & R_{23} & t_y\\ R_{31} & R_{32} & R_{33} & t_z \end{bmatrix} $$

2) Using OpenCV to guess $[R|t]$ and get object/camera pose for rendering

OpenCV provides two functions that can recover the $[R|t]$ matrix. The first is cv2.calibrateCamera. This function can optimize both the $C$ matrix and the $[R|t]$ matrix, as well as learn distortion coefficients that correct for lens distortion. However, I decided this is not the best choice for my project, because the additional degrees of freedom from the distortion coefficients make the $C$ and $[R|t]$ matrices unstable.

The other function is cv2.solvePnP, which requires a fixed $C$ matrix and distortion coefficients. My intuition is that the iPhone XR rear camera has little distortion, so I used all zeros for the distortion coefficients. I also used a fixed $C$ matrix, constructed as described below.
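The per-frame pose recovery could look like the following sketch, where `C` is the fixed intrinsics matrix from the next section and the distortion coefficients are all zeros as stated above.

```python
import cv2
import numpy as np

# Distortion is assumed to be negligible for the iPhone XR rear camera.
dist_coeffs = np.zeros(5)

def estimate_pose(points_3d, points_2d, C):
    """Recover the [R|t] pose for one frame from its tracked 2D-3D correspondences."""
    ok, rvec, tvec = cv2.solvePnP(points_3d.astype(np.float64),
                                  points_2d.astype(np.float64),
                                  C, dist_coeffs)
    R, _ = cv2.Rodrigues(rvec)                 # rotation vector -> 3x3 rotation matrix
    return np.hstack([R, tvec.reshape(3, 1)])  # 3x4 pose, so that M = C @ [R|t]
```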

3) Creating camera intrinsics matrix for iPhone XR rear camera

For the iPhone XR I'm using, I can get the focal length $f=4mm$ from the EXIF information of the video. The pixel size is believed to be $1.4\mu m$. The video size is 960x544 in my experiments, which is scaled down from the maximum sensor resolution of 4032x3024. So, the focal length in units of pixels should be $f_x=f_y=4/(1.4\times 10^{-3}\times \frac{4032}{960}) = 680$. I used the width dimension because I believe the sensor pixels are square and the width axis uses all of the sensor pixels.

For $p_x, p_y$, I assume Apple has carefully aligned the camera hardware and firmware, so I used the image center as $p_x, p_y$.
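Putting these assumptions together, the fixed intrinsics matrix is simply:

```python
import numpy as np

# Assumptions from above: f = 4 mm, pixel pitch = 1.4 um, and the 960x544 video
# is downscaled from the 4032-pixel-wide sensor readout.
f_mm, pixel_um = 4.0, 1.4
full_width, video_w, video_h = 4032, 960, 544

f_pix = f_mm / (pixel_um * 1e-3 * full_width / video_w)   # ~680 pixels

C = np.array([[f_pix, 0.0,   video_w / 2],   # principal point at the image center
              [0.0,   f_pix, video_h / 2],
              [0.0,   0.0,   1.0]])
```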

4) Render Sather Tower using pyrender

The Sather Tower model is downloaded from here. The model is scaled down by a factor of 1000 to fit into my 3D world coordinate system. To set up the scene, I applied the $[R|t]$ matrix to the Sather Tower model to move it into the pyrender camera's coordinate system. Note that I also need to apply a $180^\circ$ rotation around the X axis for pyrender's camera. This is because the pinhole camera model assumes the camera faces the $+z$ direction, but pyrender assumes the camera faces the $-z$ direction.
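A sketch of the scene setup is shown below, assuming `C` and the per-frame 3x4 pose `Rt` from the previous steps; the model file name and the light intensity are placeholders.

```python
import numpy as np
import trimesh
import pyrender

# Hypothetical file name for the downloaded model.
tower = trimesh.load("sather_tower.obj", force="mesh")
tower.apply_scale(1.0 / 1000.0)              # scale the model down by a factor of 1000
mesh = pyrender.Mesh.from_trimesh(tower)

def render_tower(Rt, width=960, height=544):
    scene = pyrender.Scene(bg_color=[0.0, 0.0, 0.0, 0.0])

    # Apply [R|t] to the model so it lives in the pinhole camera's coordinate system.
    model_pose = np.vstack([Rt, [0.0, 0.0, 0.0, 1.0]])
    scene.add(mesh, pose=model_pose)

    # The pinhole model looks down +z while pyrender's camera looks down -z,
    # so the camera pose is a 180-degree rotation about the x axis.
    flip_x = np.diag([1.0, -1.0, -1.0, 1.0])
    camera = pyrender.IntrinsicsCamera(fx=C[0, 0], fy=C[1, 1], cx=C[0, 2], cy=C[1, 2])
    scene.add(camera, pose=flip_x)
    scene.add(pyrender.DirectionalLight(intensity=3.0), pose=flip_x)

    renderer = pyrender.OffscreenRenderer(width, height)
    color, depth = renderer.render(scene)
    renderer.delete()
    return color, depth   # composite `color` onto the video frame, using `depth` as a mask
```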

The final rendered AR scene is shown below.

Conclusion Go to TOC

This is a super interesting project! The camera projection matrix method is similar to the image mosaicing project, but here we can put a 3D object into our video, which adds lots of fun. Although I still need to manually identify keypoints and assign 3D coordinates to them, I imagine there are learning algorithms that automatically extract 3D coordinates, i.e., learn the camera projection matrix with machine learning.

The Bells & Whistles part took me a lot of time to figure out how to extract the $[R|t]$ matrix from a camera projection matrix, partly due to the lack of OpenCV documentation and examples. But I learned a lot in the process, especially how different conventions (camera heading, for instance) can require special handling. Also, putting arbitrary 3D objects into a video adds another level of fun.

Project 2 - Neural Style Transfer Go to TOC

1. Style transfer neural network Go to TOC

Following the original paper by Gatys et al., the style transfer neural network used in this project is the stack of convolutional blocks from the VGG-19 network, with the following layers:

  1. layer 1-1: Conv 3x3x64, layer 1-2: Conv 3x3x64, layer 1-3: AvgPooling 2x2
  2. layer 2-1: Conv 3x3x128, layer 2-2: Conv 3x3x128, layer 2-3: AvgPooling 2x2
  3. layer 3-1: Conv 3x3x256, layer 3-2: Conv 3x3x256, layer 3-3: Conv 3x3x256, layer 3-4: Conv 3x3x256, layer 3-5: AvgPooling 2x2
  4. layer 4-1: Conv 3x3x512, layer 4-2: Conv 3x3x512, layer 4-3: Conv 3x3x512, layer 4-4: Conv 3x3x512, layer 4-5: AvgPooling 2x2
  5. layer 5-1: Conv 3x3x512, layer 5-2: Conv 3x3x512, layer 5-3: Conv 3x3x512, layer 5-4: Conv 3x3x512, layer 5-5: AvgPooling 2x2

All convolutional layers have a padding of 1 and are followed by a ReLU layer. All AvgPooling layers have a stride of 2.
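The sketch below shows one way to assemble such a feature extractor. For brevity it loads torchvision's VGG-19 weights; as noted further down, my actual experiments used the original caffe weights converted to pytorch. The layer indices and helper names are my own.

```python
import torch
import torchvision

# Truncated VGG-19 with max pooling replaced by average pooling, following Gatys et al.
vgg = torchvision.models.vgg19(pretrained=True).features.eval()
for i, layer in enumerate(vgg):
    if isinstance(layer, torch.nn.MaxPool2d):
        vgg[i] = torch.nn.AvgPool2d(kernel_size=2, stride=2)
for p in vgg.parameters():
    p.requires_grad_(False)

# Indices of the ReLU outputs used below: relu1_2, relu2_2, relu3_2, relu4_2, relu5_2.
STYLE_LAYERS = [3, 8, 13, 22, 31]
CONTENT_LAYER = 31          # layer 5-2 output, as described in the loss section

def extract_features(x, layers):
    """Run x through the truncated VGG-19 and collect the requested layer outputs."""
    features = {}
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            features[i] = x
    return features
```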

The loss function has three parts, whose weights were chosen by inspecting the quality of the constructed image:

  1. Content loss: MSE loss of the outputs of layer 5-2 for both the original image and the constructed image.
  2. Style loss: MSE loss of the Gram matrix of the outputs of layer 1-2, 2-2, 3-2, 4-2, and 5-2 for both images, each weighted by 0.2.
  3. Total variation (TV) loss: Total variation loss of the constructed image

I used the Adam optimizer with a learning rate of 10 to optimize the constructed image directly. A sketch of the loss terms and the optimization loop is shown below.
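The sketch assumes `extract_features`, `STYLE_LAYERS`, and `CONTENT_LAYER` from the feature extractor above; the loss weights, step count, and image preprocessing are placeholders.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a (1, C, H, W) feature map, normalized by its size."""
    _, c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.t() / (c * h * w)

def tv_loss(img):
    """Total variation loss: penalize differences between neighboring pixels."""
    return (img[..., 1:, :] - img[..., :-1, :]).abs().mean() + \
           (img[..., :, 1:] - img[..., :, :-1]).abs().mean()

# content_img and style_img are preprocessed (1, 3, H, W) tensors.
content_target = extract_features(content_img, [CONTENT_LAYER])[CONTENT_LAYER].detach()
style_targets = {i: gram_matrix(f).detach()
                 for i, f in extract_features(style_img, STYLE_LAYERS).items()}

image = content_img.clone().requires_grad_(True)
optimizer = torch.optim.Adam([image], lr=10)

w_content, w_style, w_tv = 1.0, 10.0, 1e-3     # hypothetical weights, alpha/beta = 0.1
for step in range(1000):
    optimizer.zero_grad()
    feats = extract_features(image, set(STYLE_LAYERS) | {CONTENT_LAYER})
    loss = w_content * F.mse_loss(feats[CONTENT_LAYER], content_target)
    for i in STYLE_LAYERS:
        loss = loss + w_style * 0.2 * F.mse_loss(gram_matrix(feats[i]), style_targets[i])
    loss = loss + w_tv * tv_loss(image)
    loss.backward()
    optimizer.step()
```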

The differences from the original paper are: 1) the introduction of the TV loss to denoise images, 2) the use of the second convolutional layer output of each block for the style loss, and 3) a much larger $\alpha/\beta$ (ratio of the content loss weight to the style loss weight), typically 0.1 - 1.

One caveat here is that I did not use the VGG-19 pretrained model provided by torchvision: I believe some variation in training has caused the torchvision weights to perform poorly on the style transfer task. I ended up downloading the caffe model (as stated in the paper) and converting all weights to pytorch format. (Typical bad deep learning reproducibility!)

2. Reproducing Neckarfront results in the paper Go to TOC

I then reproduced several style transfers for the photograph of the Neckarfront:

  1. The Starry Night by Vincent van Gogh
  2. Der Schrei der Natur by Edvard Munch
  3. The Shipwreck of the Minotaur by J.M.W. Turner
  4. Composition VII by Wassily Kandinsky
  5. Femme nue assise by Pablo Picasso

3. Transferring other styles and other images Go to TOC

I also collected a few images and found these results interesting.

  1. Neckarfront to the style of One Hundred Horses by Giuseppe Castiglione
  2. My photo of Lake Tahoe to the style of The Great Wave off Kanagawa by Hokusai. The clouds become huge waves in the sky, and sunlight penetrates the waves through a big broken hole.
  3. A photo of pears to the style of the still life painting Teller mit Pfirsichen by Giovanni Ambrogio Figino. The pear skins have some funny strokes.

These photo style transfers aren't perfect, as there are still some irrelevant brush strokes, for example the unrealistic colors on the pear skins, in the sky of the Neckarfront, and in the Tahoe sky. I think they would disappear with more training iterations.

Finally, I show one failed style transfer. I tried to transfer a photo of a street to the style of one of my favorite anime movies. There are a few problems. First, the sky mixes real and anime clouds that do not fit together. Second, it seems a few patches were copied wholesale from the anime picture into the constructed image, and they do not blend into the background. Third, it seems the algorithm failed to find a matching style for the railway and the road at the bottom of the picture.

However, the algorithm did not fail completely. For example, it correctly repaints the traffic lights, the buildings are transferred into the anime painting style, and the style transfer seems to work well for the greenery too. I think the main problem is that I should find a street photo that is more similar to the anime image.

Conclusion Go to TOC

This project is also super fun. I had lots of difficulty figuring out why the VGG-19 network from torchvision did not produce good results (which still puzzles me). I also looked through other tutorials and resources to understand how slight hyperparameter differences (e.g., padding, kernel size, learning rate) could affect the result. Unfortunately, none of them really worked for my network. I guess the field of deep learning art is similar to real fine art: no one really knows how it works, even though it produces beautiful visual effects.