
Programming Project
CS194-26: Image Manipulation, Computer Vision and Computational Photography

Poor Man's Augmented Reality

 


[Setup images: example input video (left) and augmented output with the rendered cube (right)]
In this augmented reality project, you will capture a video and insert a synthetic object into the scene! The basic idea is to use 2D points in the image whose 3D coordinates are known to calibrate the camera for every video frame, and then use the camera projection matrix to project the 3D coordinates of a cube onto the image. If the camera calibration is correct, the cube should appear consistently anchored in the scene in every frame of the video.

Setup

Find a flat surface and place a box (e.g. a shoebox) on it. Draw a regular pattern on the box. Decide on at least 20 points from the pattern which you will mark in the image, and label their corresponding 3D points. Make sure that they are not all coplanar. Capture a video with the box at the center; the input video can be something like the one shown above on the left.

Keypoints with known 3D world coordinates

We will start by marking the points in the first frame of the video (using plt.ginput) and recording their 3D world coordinates (you can use skvideo.io.vread to read and skvideo.io.vwrite to write videos). You need to measure the lengths of the sides of the box and the distance between consecutive points in the pattern. Having a regular pattern helps you automate the labeling of the 3D points. Once you have all the measurements, you can get the 3D coordinates of each point by fixing the 3D world coordinate axes to be centered at one of the corners of the box, as shown. Note that regardless of which frame you are looking at, the world coordinates of these points remain the same.
[Figure: example setup showing the 3D world coordinate axes fixed to a corner of the box]
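
As a minimal sketch of this labeling step (the filename, grid layout, spacing, and box height below are placeholder values; adapt them to your own pattern and measurements):

    import numpy as np
    import matplotlib.pyplot as plt
    import skvideo.io

    # Read the video and grab its first frame.
    frames = skvideo.io.vread("input.mp4")          # placeholder filename
    first_frame = frames[0]

    # Click the 20 pattern points in a fixed, pre-decided order.
    plt.imshow(first_frame)
    pts_2d = np.array(plt.ginput(n=20, timeout=0))  # (20, 2) pixel coordinates

    # Corresponding 3D world coordinates, with the origin at one corner of the
    # box. Example layout: 10 points on the front face (y = 0) and 10 on the
    # top face (z = box_height), so the points are not all coplanar.
    spacing = 2.5      # cm between consecutive pattern points (example value)
    box_height = 10.0  # cm (example value)
    front = [[i * spacing, 0.0, k * spacing] for k in range(2) for i in range(5)]
    top = [[i * spacing, j * spacing, box_height] for j in range(2) for i in range(5)]
    pts_3d = np.array(front + top, dtype=np.float64)
    assert len(pts_2d) == len(pts_3d)

Click the points in exactly the order in which the rows of pts_3d are generated; otherwise the 2D-3D correspondences will be scrambled.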

Propagating Keypoints to Other Images in the Video

There are several ways of propagating the points from the first image to the subsequent images. The end result of this procedure should be a paired set of 2D and 3D points for every frame in the video:

1. A Hacky Corner Detector: One way to propagate the points from one image to the next is to exploit the temporal coherence of the video. First, detect the corners in img[i] and img[i+1] using a Harris corner detector (harris.py, harris.m); let's call them ci and cnext respectively. Since the points will not move by a large amount between consecutive frames, we can find the closest point (in pixel space) in cnext for every point in ci. We further accept only those matches whose pixel-space distance is below some threshold. This approach critically depends on the motion between consecutive frames being small and on there being no spurious corners within the threshold radius. We start with the marked points from the first image, compute the tracked points in the next image, and continue all the way to the last image, keeping track of the successfully tracked 2D points and their corresponding 3D coordinates (which we know from the first image). A sketch of this nearest-neighbor matching is given after this list.

2. Off-the-Shelf Tracker: You can also use an off-the-shelf tracker. This tutorial [4] explains the usage of the various trackers available in cv2. The one I was able to use successfully was the MedianFlow tracker: cv2.TrackerMedianFlow_create(). You need to initialize a separate tracker for each point; I used an 8x8 patch centered around the marked point to initialize each tracker. Note that the bbox describes the bounding box using 4 values: the first two are the top-left coordinate of the box, followed by its width and height. Update the trackers on each new frame to get the points in that frame, and keep track of the points and their corresponding 3D coordinates. A sketch of this setup also follows the list.
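
Here is a minimal sketch of the nearest-neighbor matching in option 1. It reuses frames, pts_2d and pts_3d from the labeling sketch above and assumes a hypothetical helper harris_corners(img) that returns an (M, 2) array of (x, y) corner locations; adapt it to the interface of the provided harris.py.

    import numpy as np
    from scipy.spatial import cKDTree

    def propagate_points(prev_pts, corners_next, max_dist=6.0):
        """Match each tracked point to its nearest corner in the next frame.

        Returns the matched corner locations and a boolean mask marking the
        points whose nearest corner lies within max_dist pixels.
        """
        tree = cKDTree(corners_next)
        dists, idx = tree.query(prev_pts)   # nearest corner for every point
        ok = dists < max_dist               # reject points that moved too far
        return corners_next[idx], ok

    # Walk through the video, dropping any 3D point whose 2D track is lost.
    tracked = [(pts_2d, pts_3d)]
    for i in range(len(frames) - 1):
        corners_next = harris_corners(frames[i + 1])   # hypothetical helper
        pts_2d, ok = propagate_points(pts_2d, corners_next)
        pts_2d, pts_3d = pts_2d[ok], pts_3d[ok]
        tracked.append((pts_2d, pts_3d))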
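
And a minimal sketch of option 2, again reusing frames and pts_2d from the labeling sketch, with one MedianFlow tracker per point (in recent opencv-contrib builds the constructor lives under cv2.legacy):

    import cv2

    def make_tracker():
        # MedianFlow moved into the legacy module in OpenCV >= 4.5.
        if hasattr(cv2, "TrackerMedianFlow_create"):
            return cv2.TrackerMedianFlow_create()
        return cv2.legacy.TrackerMedianFlow_create()

    # One tracker per clicked point, initialized with an 8x8 box centered on it.
    trackers = []
    for (x, y) in pts_2d:
        bbox = (int(x) - 4, int(y) - 4, 8, 8)   # (top-left x, top-left y, w, h)
        t = make_tracker()
        t.init(frames[0], bbox)
        trackers.append(t)

    # On every later frame, recover each point as the center of its box, and
    # mark points whose tracker reports failure.
    points_per_frame = []
    for frame in frames[1:]:
        pts = []
        for t in trackers:
            ok, (bx, by, bw, bh) = t.update(frame)
            pts.append((bx + bw / 2, by + bh / 2) if ok else None)
        points_per_frame.append(pts)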

The tracked points should look something like this:

[Video frame with the tracked keypoints overlaid]

Calibrating the Camera

Once you have the 2D image coordinates of the marked points and their corresponding 3D coordinates, use least squares to fit the camera projection matrix that maps 4-dimensional real-world coordinates (homogeneous coordinates) to 3-dimensional image coordinates (again, homogeneous coordinates). Perform this step separately for each frame of the video.
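
One common way to set up this least-squares fit is the direct linear transform: each 2D-3D correspondence contributes two linear equations in the 12 entries of the projection matrix, and the right singular vector with the smallest singular value of the stacked system is the solution. A sketch under that formulation (other formulations, e.g. fixing the last entry of the matrix to 1 and using np.linalg.lstsq, also work):

    import numpy as np

    def fit_projection_matrix(pts_2d, pts_3d):
        """Fit a 3x4 matrix P so that [u, v, 1] ~ P @ [X, Y, Z, 1] for all points."""
        rows = []
        for (u, v), (X, Y, Z) in zip(pts_2d, pts_3d):
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
        A = np.array(rows)
        # Homogeneous least squares: minimize ||A p|| subject to ||p|| = 1.
        _, _, Vt = np.linalg.svd(A)
        return Vt[-1].reshape(3, 4)

    def project(P, pts_3d):
        """Project (N, 3) world points to (N, 2) pixel coordinates using P."""
        homog = np.hstack([pts_3d, np.ones((len(pts_3d), 1))])
        proj = homog @ P.T
        return proj[:, :2] / proj[:, 2:3]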

Projecting a Cube into the Scene

Once you have the camera projection matrix, project the axes points defined at the end of [2] in Resources, and use the draw function (defined just above the axes points in [2]) to draw the cube on the image. Note that the draw function takes an unnecessary parameter corners which it doesn't use; passing only the projected points as imgpts should suffice. This will place a cube of size 1 unit at (0,0,0). You should translate and scale the coordinates in the axes points to place the cube at a suitable location. Once you have rendered the cube independently on each frame, you can combine the frames into a video and view the result.
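
Below is a sketch of the per-frame rendering loop. It reuses frames from the labeling sketch, fit_projection_matrix and project from the calibration sketch, and the per-frame 2D/3D point pairs collected during tracking (the list called tracked in the corner-detector sketch). Instead of the tutorial's axes points it uses a hypothetical cube of side s placed at offset, and instead of the tutorial's draw helper it draws the twelve edges directly with cv2.line; both choices are interchangeable with the ones described above.

    import cv2
    import numpy as np
    import skvideo.io

    # A cube of side s with one corner at offset; both values are placeholders
    # and should be adjusted so the cube sits nicely on the box.
    s, offset = 5.0, np.array([2.0, 2.0, 10.0])
    cube = offset + s * np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
                                  [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1]], float)
    edges = [(0, 1), (1, 2), (2, 3), (3, 0),      # bottom face
             (4, 5), (5, 6), (6, 7), (7, 4),      # top face
             (0, 4), (1, 5), (2, 6), (3, 7)]      # vertical edges

    out = []
    for frame, (p2d, p3d) in zip(frames, tracked):
        P = fit_projection_matrix(p2d, p3d)               # calibrate this frame
        imgpts = np.round(project(P, cube)).astype(int)   # (8, 2) pixel coords
        canvas = frame.copy()
        for a, b in edges:
            cv2.line(canvas, tuple(imgpts[a]), tuple(imgpts[b]), (0, 255, 0), 2)
        out.append(canvas)
    skvideo.io.vwrite("output.mp4", np.array(out))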

Deliverable

You need to show the input and the output videos with the cube. Note that one way to improve the result is to use more keypoints and make them as accurate as possible.

Bells & Whistles

You can also try placing an arbitrary mesh of your choice onto the scene using an off-the-shelf Python mesh renderer (pyrender [5]). However, note that you will need to further decompose the camera projection matrix into camera intrinsics, rotation, and translation; cv2.calibrateCamera implements this functionality. The renderer uses this decomposition to handle self-occlusion correctly.
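
If you have already fit a full 3x4 projection matrix P per frame as above, cv2.decomposeProjectionMatrix is one alternative to the suggested cv2.calibrateCamera for recovering the quantities a renderer needs; a sketch:

    import cv2
    import numpy as np

    # Decompose a fitted 3x4 projection matrix into intrinsics K, rotation R,
    # and the homogeneous camera center C expressed in world coordinates.
    K, R, C = cv2.decomposeProjectionMatrix(P)[:3]
    K = K / K[2, 2]                      # normalize so that K[2, 2] == 1
    center = (C[:3] / C[3]).ravel()      # camera center in world coordinates
    t = -R @ center                      # translation: x_cam = R @ x_world + t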

Resources

[1] Python cv2 feature matching tutorial
[2] Camera Calibration
[3] Blender
[4] Tracking Tutorial
[5] Python Rendering


This assignment was designed by Ashish Kumar and Alexei Efros.