Project 6: Allan Zhou

Part A

My Photos

I used my phone to take photos of each scene from (approximately) the same position, making sure that consecutive photos overlapped.

The engineering library:

library

The view outside my apartment complex:

library

My room:

room

Recovering Homographies

For each set of images, I marked 8 corresponding points using Matplotlib's ginput, trying to click corners in the overlapping regions.

When warping a set of images, I defined the destination plane's points artificially, as the average of the corresponding points from each image. Each image (the source) can then be warped to this destination using its own marked points. For notation, say a source point is $(x, y)$ and the destination point is $(x', y')$. We want to find a homography from the source plane to the destination plane:

$\begin{bmatrix}wx' \\ wy' \\ w\end{bmatrix} = H\begin{bmatrix}x \\ y \\ 1\end{bmatrix}, \quad \text{where } H=\begin{bmatrix}a & b & c\\ d & e & f\\ g & h & 1\end{bmatrix}$

$H$ is a $3\times 3$ matrix, but it is only defined up to a scale factor, so the bottom-right element can be fixed to $1$ and there are only $8$ parameters to recover. Dividing the first two rows by the third gives $x' = \frac{ax + by + c}{gx + hy + 1}$ and $y' = \frac{dx + ey + f}{gx + hy + 1}$. Multiplying through by the denominator and isolating $x'$ and $y'$ gives equations that are linear in the unknowns:

$\begin{bmatrix} x'\\ y' \end{bmatrix} = \begin{bmatrix}x & y & 1 & 0 & 0 & 0 & -xx' & -yx'\\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy'\end{bmatrix}\begin{bmatrix}a \\ b \\ c \\ d \\ e \\ f \\ g \\ h\end{bmatrix}$

This is convenient because we can "stack" these equations for each pair of points to form a linear system (over-constrained if you have more than 4 pairs of points) and then solve for the $8$ parameters of $H$ using least squares.
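A minimal NumPy sketch of the stacked least-squares solve, assuming the points come as (N, 2) arrays of (x, y) pixel coordinates. The names compute_homography and apply_homography are illustrative, not the project's actual function names.

```python
import numpy as np

def compute_homography(src, dst):
    """Least-squares homography mapping src -> dst.

    src, dst: (N, 2) arrays of (x, y) points with N >= 4.
    Returns a 3x3 matrix H with H[2, 2] fixed to 1.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    A = np.zeros((2 * n, 8))
    b = np.zeros(2 * n)
    for k, ((x, y), (xp, yp)) in enumerate(zip(src, dst)):
        # x' = a x + b y + c - g x x' - h y x'
        A[2 * k] = [x, y, 1, 0, 0, 0, -x * xp, -y * xp]
        b[2 * k] = xp
        # y' = d x + e y + f - g x y' - h y y'
        A[2 * k + 1] = [0, 0, 0, x, y, 1, -x * yp, -y * yp]
        b[2 * k + 1] = yp
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(params, 1.0).reshape(3, 3)

def apply_homography(H, pts):
    """Map (N, 2) points through H and divide by the homogeneous coordinate w."""
    pts = np.asarray(pts, dtype=float)
    homog = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]
```

With more than four clicked pairs the system is over-determined, and lstsq returns the parameters that minimize the squared residual of the stacked equations.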

Using this approach, I recovered homographies for mapping each image to my defined destination plane.

Warping the images

Using the recovered $H$s, I warped each image to the destination plane with inverse warping. I use $H$ to find the corners of the image in the destination plane and then find all the pixels inside the polygon defined by those corners. These destination pixels are the $(x', y')$ from the formula above, so I multiply each one (in homogeneous coordinates) by $H^{-1}$, divide by the third coordinate to get $(x, y)$, and then use bilinear interpolation to sample the value from the source.
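A minimal NumPy sketch of inverse warping with bilinear interpolation. One simplification relative to the description above: rather than rasterizing the polygon formed by the warped corners, it maps every pixel of the output grid back through $H^{-1}$ and keeps the pixels whose preimage lands inside the source image, which selects essentially the same region for a well-behaved warp. The function names and the (rows, cols, channels) array layout are assumptions.

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample img (rows x cols x channels) at float coordinates (xs, ys)."""
    h, w = img.shape[:2]
    # Top-left neighbour, clipped so that x0 + 1 and y0 + 1 stay inside the image.
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx = (xs - x0)[:, None]
    dy = (ys - y0)[:, None]
    top = img[y0, x0] * (1 - dx) + img[y0, x0 + 1] * dx
    bottom = img[y0 + 1, x0] * (1 - dx) + img[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bottom * dy

def inverse_warp(img, H, out_shape):
    """Warp img with homography H onto an out_shape = (rows, cols) grid.

    Returns the warped image and a boolean mask of pixels that received data.
    """
    out_h, out_w = out_shape
    img = np.atleast_3d(img).astype(float)
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    dest = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])  # 3 x N, (x', y', 1)
    src = np.linalg.inv(H) @ dest
    sx, sy = src[0] / src[2], src[1] / src[2]  # back from homogeneous coordinates
    h, w = img.shape[:2]
    inside = (sx >= 0) & (sx <= w - 1) & (sy >= 0) & (sy <= h - 1)
    out = np.zeros((out_h * out_w, img.shape[2]))
    out[inside] = bilinear_sample(img, sx[inside], sy[inside])
    return out.reshape(out_h, out_w, -1), inside.reshape(out_h, out_w)
```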

Image rectification

To rectify an image, I take a single image of a planar surface and apply the same warping as before to map it onto a fronto-parallel plane.

As a test, I use this picture of an intersection from a traffic camera (this is actually the background extracted from traffic cam video):

traffic-cam

I want to create a rectified (top-down) view of the center of the intersection, which is a planar surface. I selected four points at the corners of the intersection using ginput. Since I know the intersection is approximately square, I defined the corresponding destination points to be the corners of the square output image. Applying the warp function to this image gives the top-down view:
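With exactly four clicked points, the system from the homography section is square (8 equations, 8 unknowns), so it can be solved directly instead of by least squares. Below is a sketch of that, plus a helper that builds destination corners for a rectangle of a chosen size: for the intersection the width and height are equal, and for the rug the proportions could come from the dimensions printed in the ad. The clicked coordinates, the click order (top-left, top-right, bottom-right, bottom-left), and the function names are hypothetical.

```python
import numpy as np

def homography_from_4(src, dst):
    """Exact homography from exactly four (x, y) -> (x', y') correspondences."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.append(yp)
    params = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(params, 1.0).reshape(3, 3)

def rectangle_corners(width, height):
    """Destination corners in click order: top-left, top-right, bottom-right, bottom-left."""
    return np.array([[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]], float)

# Hypothetical clicked corners of the intersection, in the same order as above.
clicked = np.array([[120, 340], [610, 300], [700, 520], [60, 560]], float)
H = homography_from_4(clicked, rectangle_corners(500, 500))  # square output image
```

The resulting H would then be passed to the same inverse-warping step used for the mosaics.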

intersect-rectified

Another rectification example is an advertisement for a nice rug. The ad helpfully states the true dimensions of the rug, which makes it easier to choose the destination points.

rug

The rectified top-down view after warping (I rotated the image to conserve space on the page):

rug-rectified

Making mosaics

To make a mosaic, I first create an empty output canvas large enough to contain all the corners of the warped images. Then I warp each image (computing its mask at the same time) and place all the warped images on the canvas.
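One way to size the canvas, sketched in NumPy under the assumption that each image's homography maps it into one shared coordinate frame: push every image's four corners through its $H$, take the bounding box, and prepend a translation $T$ so that the box starts at (0, 0). Each image would then be warped with $TH$ onto a canvas of the returned size. The function names are illustrative.

```python
import numpy as np

def warp_corners(H, w, h):
    """Warped (x, y) positions of the four corners of a w x h image."""
    corners = np.array([[0, 0, 1], [w - 1, 0, 1], [w - 1, h - 1, 1], [0, h - 1, 1]], float).T
    p = H @ corners
    return (p[:2] / p[2]).T

def canvas_for(homographies, shapes):
    """shapes: list of (rows, cols). Returns (canvas_w, canvas_h, T), where T shifts
    warped coordinates so every warped corner has non-negative coordinates."""
    pts = np.vstack([warp_corners(H, w, h) for H, (h, w) in zip(homographies, shapes)])
    min_xy = np.floor(pts.min(axis=0))
    max_xy = np.ceil(pts.max(axis=0))
    T = np.array([[1, 0, -min_xy[0]], [0, 1, -min_xy[1]], [0, 0, 1]], float)
    w_out, h_out = (max_xy - min_xy + 1).astype(int)
    return w_out, h_out, T
```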

The mosaic will have regions where images overlap. One way to resolve this is to just choose one image's pixels. For example, here is what happens with the naive method of choosing the right image's pixels wherever the images overlap:

library-naive

If you look closely, there is a seam about one third of the way across the image from the left.

A better method is to blend the images by linearly interpolating between them across the overlapping region, from left to right.
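A sketch of the left-to-right ramp for two images already placed on the same canvas, assuming boolean masks from the warp and an overlap that is roughly a vertical strip. The weight of the left image falls from 1 to 0 across the columns spanned by the overlap; pixels covered by only one image are copied unchanged. The function name and argument layout are hypothetical.

```python
import numpy as np

def linear_blend(left, right, left_mask, right_mask):
    """Blend two canvas-sized float images (rows x cols x channels) with a column ramp."""
    overlap = left_mask & right_mask
    out = np.zeros_like(left, dtype=float)
    out[left_mask] = left[left_mask]
    only_right = right_mask & ~left_mask
    out[only_right] = right[only_right]
    cols = np.where(overlap.any(axis=0))[0]
    if cols.size:
        x0, x1 = cols.min(), cols.max()
        xs = np.arange(left.shape[1])
        # Weight of the left image: 1 at the left edge of the overlap, 0 at its right edge.
        alpha = np.clip((x1 - xs) / max(x1 - x0, 1), 0, 1)
        a = np.broadcast_to(alpha[None, :], overlap.shape)[overlap][:, None]
        out[overlap] = a * left[overlap] + (1 - a) * right[overlap]
    return out
```

Because the ramp is defined per column over the overlap's full horizontal extent, an overlap with slanted top and bottom edges still gets a consistent weight in every row, though non-rectangular overlaps are exactly where artifacts tend to show up.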

Linear interpolation results

Here are the completed mosaics after linearly interpolating overlapping regions:

library-mosaic

street-mosaic

Linear blending works much better than the naive method, but the boundary is still slightly visible. Additionally, with either method the images do not line up exactly; they are off by a couple of pixels. This is probably because my method of selecting corresponding points (clicking on things with ginput) is not very accurate.

Reflection

Getting the linear blending to work properly without creating wedge artifacts above and below the overlap is very tricky.
Creating rectified views of single images is surprisingly simple with homographies, and pretty cool.

Part B

Harris Corner Detection

Running the provided corner detector code on the grayscale version of each image returns hundreds of thousands of potential corners. Plotting all of them would completely cover the image, so instead I only show the top 5% of corners by response value $h$:
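Selecting the top 5% of corners for display is a single percentile cut. A sketch, assuming the detector's output has been arranged as an (N, 2) array of corner coordinates together with an (N,) array of responses $h$; the function name is hypothetical.

```python
import numpy as np

def top_fraction(coords, h, fraction=0.05):
    """Keep the corners whose response lies in the top `fraction` of all responses."""
    cutoff = np.quantile(h, 1.0 - fraction)
    keep = h >= cutoff
    return coords[keep], h[keep]
```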

library-mosaic

Adaptive Non-maximal Suppression

I implemented ANMS to choose corners that are spread relatively evenly over the image, while also bringing down the total number of corners in the image (to speed up later processing). To implement ANMS, we want to calculate a score $r_i$ for each corner $x_i$ :

$r_i = \min_{j:0.9h_j > h_i} ||x_i - x_j||_2$

Here $h_i$ is the response of corner $i$, so $r_i$ is the distance from corner $i$ to its nearest significantly stronger neighbor (a corner whose response, scaled by $0.9$, still exceeds $h_i$). A low value of $r_i$ means the corner sits close to a much stronger corner, so it can be discarded. I sort the corners by $r_i$ from high to low and then keep the top 500.
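A brute-force NumPy sketch of the $r_i$ computation and selection, with the $0.9$ factor from the formula exposed as c_robust. It builds the full N by N distance matrix, so memory grows quadratically; with hundreds of thousands of raw corners one would first threshold the responses or process the rows in chunks. The function name and signature are illustrative.

```python
import numpy as np

def anms(coords, h, n_keep=500, c_robust=0.9):
    """Adaptive non-maximal suppression.

    coords: (N, 2) corner positions, h: (N,) corner responses.
    Returns the n_keep corners with the largest suppression radius r_i, and those radii.
    """
    coords = np.asarray(coords, float)
    h = np.asarray(h, float)
    # Pairwise Euclidean distances between all corners (O(N^2) memory).
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # stronger[i, j] is True when corner j is significantly stronger than corner i.
    stronger = c_robust * h[None, :] > h[:, None]
    np.fill_diagonal(stronger, False)  # a corner never suppresses itself
    r = np.where(stronger, d, np.inf).min(axis=1)  # inf if nothing is stronger
    order = np.argsort(-r, kind="stable")  # largest radius first
    return coords[order[:n_keep]], r[order[:n_keep]]
```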

After suppression, we can see that there are now only 500 corners spread fairly evenly throughout the image:

library-mosaic

Feature Descriptor Extraction

I extract the $40\times 40$ patch surrounding each corner, then apply Gaussian smoothing and downsample to $8 \times 8$ . I also normalize each patch by subtracting its mean and dividing by the standard deviation. Here is what one of the descriptors looks like (zoomed in):
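A NumPy-only sketch of one descriptor: take the $40\times 40$ window around the corner, blur it with a Gaussian, keep every 5th pixel to get $8\times 8$, then normalize to zero mean and unit standard deviation. The blur width sigma=2, the extra margin (so the blur near the window edge uses real neighbors instead of zero padding), the axis-aligned window, and the function names are my assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur: convolve every row, then every column."""
    k = gaussian_kernel(sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)

def extract_descriptor(gray, x, y, patch=40, out=8, sigma=2.0):
    """8x8 normalized descriptor for the corner at column x, row y (None near borders)."""
    half = patch // 2
    margin = int(np.ceil(3 * sigma))  # same radius as the Gaussian kernel
    r = half + margin
    if y - r < 0 or x - r < 0 or y + r > gray.shape[0] or x + r > gray.shape[1]:
        return None  # corner too close to the image border
    window = gray[y - r:y + r, x - r:x + r].astype(float)
    smooth = blur(window, sigma)[margin:margin + patch, margin:margin + patch]
    step = patch // out  # 40 / 8 = 5
    small = smooth[step // 2::step, step // 2::step]
    return (small - small.mean()) / (small.std() + 1e-8)
```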

library-mosaic

Feature matching

I use dist2 to calculate the distance between every patch from the first image and every patch from the second image. For each patch from the first image, the 1-NN is the distance to the closest patch from the second image, and the 2-NN is the distance to the second-closest one. Following Lowe, I look at the ratio 1-NN / 2-NN and set the match cutoff at 0.4: if the ratio for a patch in the first image is below 0.4, I take the closest patch from the second image as its match. This gives matching pairs of corners between the two images.
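A sketch of the ratio test with descriptors flattened to rows (64 values for an $8\times 8$ patch). The sq_dists helper stands in for dist2 and computes squared Euclidean distances; applying the 0.4 cutoff to squared rather than plain distances is an assumption of this sketch. The ratio is checked as first < ratio * second to avoid dividing by zero, and desc2 needs at least two rows.

```python
import numpy as np

def sq_dists(a, b):
    """Squared Euclidean distance between every row of a and every row of b."""
    return (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
            - 2 * a @ b.T)

def match_features(desc1, desc2, ratio=0.4):
    """Return (i, j) index pairs where desc2[j] is desc1[i]'s unambiguous nearest neighbour."""
    d = np.maximum(sq_dists(desc1, desc2), 0)  # clamp tiny negative round-off
    nn = np.argsort(d, axis=1)[:, :2]  # indices of the 1-NN and 2-NN
    rows = np.arange(len(desc1))
    first, second = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
    keep = first < ratio * second
    return np.column_stack([rows[keep], nn[keep, 0]])
```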

RANSAC

I implement 4-point RANSAC as described in class. I repeatedly sample 4 matching pairs of corners and compute the homography $H$ from them using my part A code. Then I count the inliers among all matched pairs by checking the condition:

$||Hx_i^{(1)} - x_i^{(2)}||_2 < \epsilon$

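Here $x_i^{(1)}$ is the $i$-th matched corner in the first image, $x_i^{(2)}$ is its matching corner in the second image, $Hx_i^{(1)}$ is converted back from homogeneous coordinates by dividing by its third component, and $\epsilon=10$ pixels.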

I repeat this process 10000 times, keeping track of the biggest set of inliers I've found so far. After 10000 iterations, I take my biggest set of inliers and compute a final homography to do the mosaic creation (using code from part A).
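A compact NumPy sketch of the loop, assuming the matched corners are (N, 2) arrays of (x, y) pixel coordinates in matching order. The least-squares fit is repeated here so the block stands alone, and projected points are divided by their third homogeneous coordinate before measuring the error. The random seed and the function names are illustrative.

```python
import numpy as np

def fit_h(src, dst):
    """Least-squares homography (H[2, 2] = 1) from >= 4 correspondences."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.append(yp)
    h, *_ = np.linalg.lstsq(np.array(A, float), np.array(b, float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def project(H, pts):
    p = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(pts1, pts2, n_iter=10000, eps=10.0, seed=0):
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts1), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(pts1), 4, replace=False)
        H = fit_h(pts1[idx], pts2[idx])
        # Degenerate samples can send points to infinity; NaN errors count as outliers.
        with np.errstate(divide="ignore", invalid="ignore"):
            err = np.linalg.norm(project(H, pts1) - pts2, axis=1)
        inliers = err < eps
        if inliers.sum() > best.sum():
            best = inliers
    # Final homography from the largest inlier set.
    return fit_h(pts1[best], pts2[best]), best
```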

RANSAC vs Hand-labeled points

The RANSAC mosaics are significantly better aligned than the ones done by hand. For example, here is a close-up of the street mosaic with the RANSAC (left) and the hand-labels (right):

ransac-vs-hand

RANSAC Results

Below are the complete RANSAC-generated mosaics:

library-mosaic-ransac

street-mosaic-ransac

room-mosaic-ransac

Reflection

The 1-NN / 2-NN trick is very simple but very effective at eliminating "fake" pairings. RANSAC is fast and effective, even when the previous steps still leave some "fake" pairings in the data.

Bells and Whistles

Cylindrical Projection

I implemented the inverse cylindrical projection described in class. In particular:

$x = f \tan\left(\frac{x'}{s}\right), \qquad y = f\,\frac{y'}{s}\sec\left(\frac{x'}{s}\right)$

Although the true focal length $f$ could be recovered from the EXIF data and some calculations, tuning $f$ by hand produced decent results. I set $f=3000$ for my images, which are of resolution $4032\times 3024$ . I set the radius $s$ to the focal length.
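A sketch of the inverse mapping, assuming $(x', y')$ and $(x, y)$ are measured from the image center and the cylindrical output has the same size as the input. It returns, for every output pixel, where to sample in the original photo plus a validity mask; those coordinates would then go to a bilinear sampler like the one used for the homography warps. The function name is illustrative.

```python
import numpy as np

def inverse_cylindrical_map(h, w, f, s=None):
    """Source (x, y) in the planar image for each pixel of an h x w cylindrical output."""
    s = f if s is None else s  # radius s defaults to the focal length
    yc, xc = (h - 1) / 2.0, (w - 1) / 2.0
    yp, xp = np.mgrid[0:h, 0:w].astype(float)
    xp -= xc  # centred cylindrical coordinates (x', y')
    yp -= yc
    theta = xp / s
    x = f * np.tan(theta) + xc                 # x = f tan(x'/s)
    y = f * (yp / s) / np.cos(theta) + yc      # y = f (y'/s) sec(x'/s)
    valid = (np.abs(theta) < np.pi / 2) & (x >= 0) & (x <= w - 1) & (y >= 0) & (y <= h - 1)
    return x, y, valid
```

For the $4032\times 3024$ photos with $f=3000$, this would be called as inverse_cylindrical_map(3024, 4032, 3000.0).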

library-mosaic-ransac-cylindrical

street-mosaic-ransac-cylindrical

room-mosaic-ransac-cylindrical