CS194-26 Project 4: Image Warping and Mosaicing

In this project, I used projective transformations to warp images and stitch mosaics of images together.

Shooting the Pictures

To achieve proper blending, I needed to obtain sets of images that were different enough to represent different perspectives but similar enough to have correspondences I could use to map pixels between them. I ended up using the following three sets:

The first set consists of two images of the Campanile Esplanade at the UC Berkeley campus, photographed by me during the evening. The two images are taken from the same center of projection, but capture the Campanile at different heights.

campanile

campanile

The second set consists of two images of a poster of Japan, photographed by me. These are also taken from the same center of projection.

poster

poster

The third set consists of two images of Moffitt Library at the UC Berkeley campus, photographed by Isaac Gonzalez at night. The two images are taken from the same center of projection, but their fields of view are rotated from each other horizontally.

moffitt

moffitt

In addition to the three sets above, I took two more images to test that my images were being "rectified" properly.

The first image is of a kitchen floor, photographed by me. Knowing that the tiles on the kitchen floor are square, I defined a "rectification" accordingly.

kitchen-floor

The second image is of my laptop keyboard, photographed by me while coding this project. Knowing that the keys on my keyboard are approximately square, I defined a "rectification" accordingly.

keyboard

Recovering Homographies

To recover homographies, I needed a general function that could identify the projective transformation between sets of correspondence points. To do so, I used an approach similar to the one described in the following article:

https://towardsdatascience.com/estimating-a-homography-matrix-522c70ec4b2c

As described in the article, I represented projective transformations as 3x3 matrices H that transform pixels in homogeneous coordinates via the relation p' = Hp. To find the entries of H, I flattened the matrix into an 8x1 vector h (a 2D projective transformation has 8 degrees of freedom) and hard-coded the ninth entry to be 1. I then solved a least-squares system with a 2nx8 matrix A and a 2nx1 vector b that depend only on the coordinates of the correspondence points in the two images. The setup of the A matrix and b vector is the same as in the article above, except that additional rows are appended for each extra correspondence point, which improves robustness. To ensure the system was not underdetermined, I selected a minimum of 4 correspondence points in each image.
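A minimal sketch of this least-squares setup, assuming pts1 and pts2 are (n, 2) arrays of corresponding (x, y) points; computeH is a hypothetical name for the routine:

    import numpy as np

    def computeH(pts1, pts2):
        # Least-squares homography mapping pts1 -> pts2, with H[2, 2] fixed to 1.
        # pts1, pts2: (n, 2) arrays of corresponding (x, y) points, n >= 4.
        A, b = [], []
        for (x, y), (xp, yp) in zip(pts1, pts2):
            # Two rows per correspondence, derived from p' = Hp.
            A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
            A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
            b.extend([xp, yp])
        h, *_ = np.linalg.lstsq(np.array(A, dtype=float),
                                np.array(b, dtype=float), rcond=None)
        return np.append(h, 1).reshape(3, 3)

With exactly 4 points the system has 8 equations in 8 unknowns; extra points make it overdetermined, and least squares finds the best fit.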

Warping the Images and Image Rectification

Using prior knowledge that the tiles on the kitchen floor are square, I obtained the following warp:

kitchen-floor

To obtain this warp, I used nine correspondence points aligned as a 3x3 square grid in the output image.

The resulting image looks like it was taken from a bird's-eye view.

Using prior knowledge that the keys on my keyboard are square, I obtained the following warp:

keyboard

To obtain this warp, I used twelve correspondences. Each correspondence point marked a corner on either the A key, the G key, or the L key.

Consequently, the resulting image looks like a bird's-eye view of my laptop keyboard taken from directly above the G key.
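Both rectifications (and the mosaics in the next section) use the same inverse-warping routine. Here is a minimal sketch, assuming H maps source pixels to output-canvas pixels and using the floor-based sampling described in the blending section; warp_image is a hypothetical name:

    import numpy as np

    def warp_image(image, H, out_shape):
        # Inverse warp: for each output pixel, find its preimage under H.
        h_out, w_out = out_shape
        ys, xs = np.mgrid[0:h_out, 0:w_out]
        coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
        src = np.linalg.inv(H) @ coords
        src_x, src_y = src[0] / src[2], src[1] / src[2]
        # Sample by flooring the source coordinates.
        xi, yi = np.floor(src_x).astype(int), np.floor(src_y).astype(int)
        valid = (xi >= 0) & (xi < image.shape[1]) & \
                (yi >= 0) & (yi < image.shape[0])
        out = np.zeros((h_out * w_out,) + image.shape[2:], dtype=image.dtype)
        out[valid] = image[yi[valid], xi[valid]]
        return out.reshape((h_out, w_out) + image.shape[2:])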

Blending the Images into a Mosaic

To blend the images into a mosaic, I similarly had to annotate correspondences in each image in each set.

For the images of the Campanile, I annotated correspondence points along the tower itself, using its windows and bricks as a guide. Doing this, I found 12 correspondence points in the region of overlap between the images, and the result of the blending is as follows:

campanile

For the image of the poster of Japan, I annotated 12 correspondence points along the bridge railings, 1 correspondence point on the tip of one of the hats, and 2 correspondence points above the text on the bottom right with the red background, for a total of 15. I annotated these points because they were relatively easily identifiable and were in the region of overlap. The result is as follows:

poster

For the image of Moffitt Library, I annotated 8 correspondence points along the grills at the top, which have a rectangular structure. The points I annotated were in the region of overlap between the two images, and the result is as follows:

moffitt

To blend my images, I used an interpolation scheme that simply floors the x and y coordinates of each pixel to be sampled. This produced little, if any, visible aliasing, so I decided to move forward with the scheme.

Recalling the examples of blending from lecture, I decided not to use alpha blending or multiresolution blending. The photographs were relatively similar, and I feared that either scheme would add artificial smoothness to the blended region of the image.

What I've learned

I learned that correspondence points arranged in a rectangular pattern are important when trying to recover homographies. The correspondence points I used for the images of the Campanile and Moffitt Library were arranged rectangularly, whereas those I used for the poster of Japan were not. I suspect this was the main factor behind the difference in quality between the mosaics.

Detecting features

To automatically detect features, I used the Harris corner strength function to find corners in the images. Corners were defined as pixel coordinates whose Harris corner strength exceeded a threshold, which I determined experimentally for each image.
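A minimal sketch of this detection step, assuming a grayscale float image and skimage's corner_harris as the strength function; detect_corners is a hypothetical helper name:

    import numpy as np
    from skimage.feature import corner_harris

    def detect_corners(image, threshold):
        # Harris corner strength map, thresholded into (row, col) coordinates.
        strength = corner_harris(image)
        coords = np.argwhere(strength > threshold)
        return coords, strength[coords[:, 0], coords[:, 1]]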

campanile

Threshold: 1.0

campanile

Threshold: 1.0

poster

Threshold: 0.2

poster

Threshold: 0.2

moffitt

Threshold: 0.04

moffitt

Threshold: 0.04

To detect features more uniformly across the images, I implemented Adaptive Non-Maximal Suppression (a sketch of the computation follows the images below). This entailed finding the approximately 500 points with the greatest suppression radius, where the suppression radius is the radius over which a given point is a local maximum in Harris corner strength, up to a robustness factor c_robust. For the following images, I used a value of c_robust = 1.0, since the features it generated were adequate. The radius cutoff was chosen experimentally so that approximately 500 points were kept for each image. The features extracted via Adaptive Non-Maximal Suppression are shown below:

campanile

campanile

poster

poster

moffitt

moffitt
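As referenced above, here is a minimal sketch of the suppression-radius computation, assuming coords is an (n, 2) array of corner coordinates and strengths holds their Harris responses; anms is a hypothetical helper name:

    import numpy as np

    def anms(coords, strengths, n_points=500, c_robust=1.0):
        # Pairwise distances between all corner coordinates.
        dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
        # Point j suppresses point i when strengths[i] < c_robust * strengths[j].
        suppresses = strengths[:, None] < c_robust * strengths[None, :]
        # Suppression radius: distance to the nearest suppressing point
        # (infinite for points no one suppresses, e.g. the global maximum).
        radii = np.where(suppresses, dists, np.inf).min(axis=1)
        # Keep the n_points corners with the largest radii.
        return coords[np.argsort(-radii)[:n_points]]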

Extracting features

For each corner, I constructed a feature descriptor from the 40x40 neighborhood of pixels surrounding the sub-pixel coordinate of the corner of interest. This 40x40 neighborhood was sampled down to an 8x8 patch by reading the 8x8 pixel neighborhood surrounding the corner at a 5x-downscaled Gaussian pyramid level. The descriptors were then normalized for bias (by subtracting the mean) and gain (by dividing by the standard deviation).
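A minimal sketch of this extraction, assuming a grayscale image, integer corner coordinates at least 20 pixels from every border, and a Gaussian blur standing in for the pyramid level (the sigma value is an assumption):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def extract_descriptor(image, y, x):
        # Blur before subsampling so the 5x downsample does not alias;
        # this stands in for reading from a downscaled pyramid level.
        blurred = gaussian_filter(image, sigma=2.0)
        # 40x40 window around the corner, sampled every 5th pixel -> 8x8.
        patch = blurred[y - 20:y + 20:5, x - 20:x + 20:5]
        # Normalize for bias (mean) and gain (standard deviation).
        return (patch - patch.mean()) / patch.std()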

Matching features

I used the sum of squared differences (SSD) and Lowe's ratio test to find robust feature matches. This involved computing the SSD between every possible pair of features across the two images, recording each feature's two nearest neighbors, and taking the ratio of the nearest-neighbor distance to the second-nearest-neighbor distance. Feature pairs with ratios below 0.4 were then selected as candidates for the RANSAC algorithm (a sketch of the matching step follows the images below). The features that were paired for each image are shown below.

campanile

campanile

poster

poster

moffitt

moffitt
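As referenced above, here is a minimal sketch of the matching step, assuming desc1 and desc2 are (n1, 64) and (n2, 64) arrays of flattened descriptors; match_features is a hypothetical helper name:

    import numpy as np

    def match_features(desc1, desc2, ratio_thresh=0.4):
        # Sum of squared differences between every pair of descriptors.
        ssd = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(axis=-1)
        # Two nearest neighbors in the second image for each feature in the first.
        nn = np.argsort(ssd, axis=1)[:, :2]
        rows = np.arange(len(desc1))
        ratios = ssd[rows, nn[:, 0]] / ssd[rows, nn[:, 1]]
        # Lowe's ratio test: keep pairs whose best match is much closer
        # than the second-best.
        return [(i, nn[i, 0]) for i in rows if ratios[i] < ratio_thresh]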

Robust homographies

I used the RANSAC (Random Sample Consensus) algorithm to find homographies that are more robust to incorrect feature matches. The algorithm works as follows:

  1. First, four points are randomly selected from the matches identified by Lowe's method above.
  2. Next, a homography is calculated from the four randomly selected points. This homography is then scored by its number of "inlier" points, where "inliers" are matches whose reprojection error falls within an allowed margin epsilon.
  3. The process above is then iterated. The set of "inliers" for the highest-scoring 4-point homography is then used to compute the final homography.

I used 1000 iterations for each image.
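Here is a minimal sketch of this procedure, reusing the hypothetical computeH routine sketched in the homography section; pts1 and pts2 are the matched (n, 2) point arrays, and the epsilon default below is an assumption:

    import numpy as np

    def ransac_homography(pts1, pts2, n_iters=1000, eps=2.0):
        # eps: allowed reprojection error in pixels for a match to count
        # as an inlier; the default here is an assumption.
        best_inliers = np.array([], dtype=int)
        for _ in range(n_iters):
            sample = np.random.choice(len(pts1), 4, replace=False)
            H = computeH(pts1[sample], pts2[sample])
            # Project all of pts1 through H in homogeneous coordinates.
            proj = H @ np.vstack([pts1.T, np.ones(len(pts1))])
            proj = (proj[:2] / proj[2]).T
            errors = np.linalg.norm(proj - pts2, axis=1)
            inliers = np.flatnonzero(errors < eps)
            if len(inliers) > len(best_inliers):
                best_inliers = inliers
        # Refit with least squares on all inliers of the best 4-point model.
        return computeH(pts1[best_inliers], pts2[best_inliers])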

Mosaics

The results of hand-selected vs. automatically selected features are below:

campanile

Hand-selected

campanile

Automatically selected

poster

Hand-selected

poster

Automatically selected

moffitt

Hand-selected

moffitt

Automatically selected

What I've learned

I learned that error in the selection of feature points has a very large impact on homography calculations. An error of as little as 5 pixels can ruin a homography. In the mosaics above, homographies created from hand-selected features worked better for pictures with clearly visible corners, whereas homographies created from automatically selected features worked better for pictures where this was not the case.