Project 6 Part I: Stitching a Photo Mosaic

By Alex Jing (acw)

I. Introduction

Projective transformation models the image-capturing system as light rays converging to a center of projection, i.e., our camera. The image is obtained by placing a projection plane in front of the camera, i.e., the focal plane. However, nothing stops us from changing the focal plane and re-projecting the light rays onto a different plane, thereby achieving image rectification and photo mosaicing.

II. Obtain the homography

For the same scene, projecting the light rays onto different projection planes produces images from different perspectives. These images are related by a projective transformation, which is captured by a 3-by-3 matrix, defined only up to scale, called a homography. Hence, to re-project an image, the first task is to recover the homography. To do so, we need at least 4 corresponding points in the source image (the image to re-project) and the target image (the image whose projection plane we project onto). Four pairs of corresponding points give us 8 constraints, which is enough to recover the homography's 8 degrees of freedom. Once we have the corresponding points, we recover the homography with least squares, as sketched below.
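For reference, here is a minimal sketch of that least-squares setup in numpy (not my exact code; the function name and argument layout are just for illustration). Fixing the bottom-right entry of H to 1 removes the scale ambiguity and leaves 8 unknowns.

```python
import numpy as np

def compute_homography(src, dst):
    """Estimate a 3x3 homography H mapping src -> dst (N >= 4 point pairs).

    src, dst: (N, 2) arrays of (x, y) coordinates.
    With H[2,2] fixed to 1, each correspondence contributes two linear
    equations in the remaining 8 entries, and we solve A h = b by least squares.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.extend([u, v])
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```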

III. Image rectification

One application of homography is image rectification. The idea is that from the semantic content of the image, we know certain things should have right angles, e.g., tiles, frames, bricks, walls, even though they appear at other angles because of perspective projection. To rectify such content, we first obtain the homography: the source sample points are the corners of the right-angled content, and the target sample points can simply be the four corners of the result image (so that the right angles are recovered after re-projection). A small code sketch and one example follow.
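Here is a minimal sketch of the warping step, assuming the compute_homography routine sketched above; the corner coordinates, output size, and the variable img (the source photo as an (H, W, 3) array) are made up for illustration.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def warp_image(img, H, out_shape):
    """Inverse-warp: for every pixel of the output canvas, apply H^-1,
    then bilinearly sample the source image at that location."""
    H_inv = np.linalg.inv(H)
    h_out, w_out = out_shape
    xs, ys = np.meshgrid(np.arange(w_out), np.arange(h_out))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = H_inv @ pts
    src_x = (src[0] / src[2]).reshape(h_out, w_out)
    src_y = (src[1] / src[2]).reshape(h_out, w_out)
    channels = [map_coordinates(img[..., c], [src_y, src_x], order=1, cval=0)
                for c in range(img.shape[2])]
    return np.stack(channels, axis=-1)

# Hypothetical corner picks: the window's corners in the source image,
# ordered top-left, top-right, bottom-right, bottom-left.
window_corners = np.array([[412, 188], [951, 240], [938, 867], [405, 820]])
out_w, out_h = 600, 800
frame_corners = np.array([[0, 0], [out_w - 1, 0],
                          [out_w - 1, out_h - 1], [0, out_h - 1]])
# img is assumed to be the source photo loaded as an (H, W, 3) array.
H = compute_homography(window_corners, frame_corners)
rectified = warp_image(img, H, (out_h, out_w))
```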


San Francisco Morning
Green House
Image of rectified window.
Image of rectified window.

IV. Image mosaic

We can also apply the concept of homography to obtain a panoramic image. If we take a few images that share the same center of projection, e.g., images taken from a tripod while only rotating the camera, we can find the homography between each pair of images. Then we pick one image as the reference projection plane and re-project the other images onto it. Once re-projected, we can use a simple blending method (e.g., alpha blending) to put the images together.

First, we calculate the result image dimensions and warp the two images onto the result canvas accordingly.
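One common way to get the canvas size and offset, sketched here under the assumption that H maps the side image into the center image's frame, is to push the side image's corners through H and take a bounding box together with the center image:

```python
import numpy as np

def mosaic_canvas(H, src_shape, ref_shape):
    """Compute the output canvas size and the translation that keeps all
    warped pixels at positive coordinates.

    H maps source pixels into the reference image's coordinate frame."""
    h_s, w_s = src_shape[:2]
    corners = np.array([[0, 0, 1], [w_s, 0, 1], [w_s, h_s, 1], [0, h_s, 1]], float).T
    warped = H @ corners
    warped = warped[:2] / warped[2]
    h_r, w_r = ref_shape[:2]
    min_x, min_y = min(warped[0].min(), 0), min(warped[1].min(), 0)
    max_x, max_y = max(warped[0].max(), w_r), max(warped[1].max(), h_r)
    T = np.array([[1, 0, -min_x], [0, 1, -min_y], [0, 0, 1]])  # shift into view
    canvas = (int(np.ceil(max_y - min_y)), int(np.ceil(max_x - min_x)))
    return canvas, T

# The side image is then warped with T @ H; the center image with T alone.
```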

Left image warped onto the result image.
Center image warped onto the result image.

Notice that we only translate the center image to its final position in the result image, since we are re-projecting all other images onto the projection plane of the center image.

Once we have the warped images, we just blend them together. A little more on the details of blending: to avoid obvious ghosting, I chose to blend only the portion of the overlap that contains the feature points. I picked the leftmost and rightmost feature points to form a central blending band, used alpha blending inside this band, and kept the original pixel values from each image outside of it.
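A minimal sketch of such a linear mask, assuming the blending band is bounded by the canvas x-coordinates x_left and x_right:

```python
import numpy as np

def band_alpha_mask(height, width, x_left, x_right):
    """Alpha mask for the left image: 1 left of x_left, 0 right of x_right,
    and a linear ramp from 1 to 0 across the blending band in between."""
    ramp = np.clip((x_right - np.arange(width)) / float(x_right - x_left), 0.0, 1.0)
    return np.tile(ramp, (height, 1))

# blended = alpha[..., None] * left_warped + (1 - alpha[..., None]) * right_warped
```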

Linear mask for the left warped image.
Two images blended together.

Finally, the stitched image is clipped to remove the blank areas left over by the projection. By repeating the above process with the stitched result and one new image on the right, we obtain a mosaic.

Monterey, CA, 2017 (the black dot is dirt on my lens)

source:

left
middle
right

A few more examples:

Light post (3-image mosaic; slight ghosting on the floor on the right)

source:

left
middle
right
Light post (taken with camera's panorama function for reference)
Beach (4-image mosaic)

source:

one
two
three
four
Beach (taken with camera's panorama function for reference)

V. Bells and whistles: Cylindrical projection

The previous application assumes a planar projection plane. However, if we assume a cylindrical projection surface, the transformation between images reduces to a mere translation. This is because when we rotate the camera, we are essentially projecting onto a rotated projection plane; by putting these projection planes together, we can approximate a cylinder. By translating images based on feature points, we can produce rather amazing results. However, because projection is no longer modeled exactly, straight lines in the scene are no longer straight in the result.
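For reference, a minimal sketch of the cylindrical warp, assuming a focal length f in pixels and a color image as an (H, W, 3) array (an illustration of the idea, not necessarily my exact code):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def cylindrical_warp(img, f):
    """Project img onto a cylinder of radius f (the focal length in pixels).

    For each output pixel we compute the angle theta and height h on the
    cylinder, unproject back to planar image coordinates, and sample there."""
    h_img, w_img = img.shape[:2]
    yc, xc = h_img / 2.0, w_img / 2.0
    xs, ys = np.meshgrid(np.arange(w_img), np.arange(h_img))
    theta = (xs - xc) / f          # angle around the cylinder axis
    h = (ys - yc) / f              # height on the cylinder
    # unproject the cylinder point back onto the original image plane
    x_plane = f * np.tan(theta) + xc
    y_plane = f * h / np.cos(theta) + yc
    channels = [map_coordinates(img[..., c], [y_plane, x_plane], order=1, cval=0)
                for c in range(img.shape[2])]
    return np.stack(channels, axis=-1)
```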

Cylindrical projection works really well if the objects are far away: a distant object subtends only a small arc on the cylinder, so a mere translation of the warped images gives really good results. It is also easier than homography estimation since fewer feature points are required.

Beach on cylindrical projection, more aesthetically pleasing than homography result

On the other hand, when there are many straight lines in the scene and the objects are closer, the lack of a full projective model produces more obvious artifacts.

Light post on cylindrical projection, light post cut in half; weird lines on the floor

VI. Reflection

I really enjoyed combining artistic creation with the computer science knowledge we learned in class. I also found applying linear algebra in homography very powerful and amazing.

Project 6 Part II: Automatic Feature Detection

In the previous part, the correspondence points were manually defined, which greatly restricts the extensibility of the algorithm. In this part, we implement the algorithm described in the paper to find the correspondence points automatically.

I. Feature detection

As described in the paper, we use the Harris corner detector as our feature detector. Its advantage is that it is rotation invariant, which is important in our application since we are dealing with projected images.

To find Harris corners, we compute the corner response R over the image and take the points with the highest R values as our initial interest points. When selecting these points, we threshold the minimum value of R as a ratio of the maximum response in the image to avoid selecting too many interest points.
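A minimal sketch of the response computation, using image gradients and a Gaussian-smoothed structure tensor (the sigma and k values here are illustrative defaults, not necessarily what I used):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(gray, sigma=1.0, k=0.04):
    """Harris corner response R = det(M) - k * trace(M)^2, where M is the
    gradient structure tensor smoothed by a Gaussian."""
    gray = gray.astype(float)
    Ix = sobel(gray, axis=1)   # horizontal gradient
    Iy = sobel(gray, axis=0)   # vertical gradient
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - k * trace ** 2

# keep only strong responses, e.g. R > 0.01 * R.max(), before further filtering
```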

Original image and the Harris Corner Response

Even with thresholding, there are still too many interest points to process. To subsample further, we apply the method described in the paper: Adaptive Non-Maximal Suppression (ANMS). The idea is to keep interest points that are as spread out as possible. My implementation defines a minimum allowed radius and iteratively sweeps radii from the maximum down to that minimum; at each radius step, a point is picked if it is at least that radius away from every already-picked point. The following is the result.

Interesting points before and after ANMS
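The paper formulates ANMS in terms of per-point suppression radii, which spreads points out in the same spirit as the radius sweep described above; a minimal sketch of that formulation (the robustness constant c_robust and the number of kept points are illustrative) could be:

```python
import numpy as np

def anms(points, responses, n_keep=500, c_robust=0.9):
    """Adaptive Non-Maximal Suppression: for each point, find the distance to
    the nearest point with a significantly stronger response, then keep the
    n_keep points with the largest such suppression radii."""
    points = np.asarray(points, dtype=float)
    responses = np.asarray(responses, dtype=float)
    n = len(points)
    radii = np.full(n, np.inf)
    for i in range(n):
        stronger = responses > responses[i] / c_robust
        if np.any(stronger):
            d = np.linalg.norm(points[stronger] - points[i], axis=1)
            radii[i] = d.min()
    keep = np.argsort(radii)[::-1][:n_keep]
    return points[keep]
```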

After ANMS, we convert each of the remaining interest points into a feature descriptor. We take the 8-by-8 block of the grayscale image around each interest point, subtract the minimum value, and rescale so the maximum is 1. Here are some sample patches.

Sample image descriptors
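A minimal sketch of this descriptor extraction, assuming integer point coordinates and a grayscale image gray:

```python
import numpy as np

def extract_descriptors(gray, points, size=8):
    """Take a size x size patch of the grayscale image around each point and
    normalize it to [0, 1] (subtract the min, rescale the max to 1)."""
    half = size // 2
    descriptors = []
    for x, y in points.astype(int):
        patch = gray[y - half:y + half, x - half:x + half].astype(float)
        if patch.shape != (size, size):
            continue  # skip points too close to the image border
        patch -= patch.min()
        rng = patch.max()
        if rng > 0:
            patch /= rng
        descriptors.append(patch.ravel())
    return np.array(descriptors)
```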

II. Feature matching

Once we have a number of feature patches from each image, we have to match them. We use the L2 norm to measure the similarity between each pair of patches from the two images, and for each patch in image 1 we pick the best-matching patch in image 2. However, we do not want to settle for the most similar patch among a group of patches that are all dissimilar, so we threshold the ratio between the best match and the second-best match. If the ratio is small, the best match is much more similar than the second-best match, which means the best match is probably "the one".
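A minimal sketch of this matching with the ratio test (the threshold value is illustrative):

```python
import numpy as np

def match_descriptors(desc1, desc2, ratio_thresh=0.6):
    """Match each descriptor in desc1 to its nearest neighbour in desc2 by
    L2 distance, keeping a match only if the best distance is sufficiently
    smaller than the second-best distance."""
    matches = []
    for i, d in enumerate(desc1):
        dists = np.linalg.norm(desc2 - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] / dists[second] < ratio_thresh:
            matches.append((i, best))
    return matches
```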

III. RANSAC

Once we have some reasonable pairings, we want to pick the best few to recover the transformation. We do this by randomly selecting the required number of point pairs and computing the transformation from them; we then transform all the source points and compare them to the actual target points. If the difference is within a threshold, we call the pair an inlier. At each iteration of RANSAC, if we find more inliers than before, we update our pick. We run RANSAC for 100 iterations and keep the iteration that gives the most inliers; these inliers become the final feature points.
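A minimal sketch of this RANSAC loop, assuming the compute_homography routine from Part I (the pixel threshold is illustrative):

```python
import numpy as np

def ransac_homography(src_pts, dst_pts, n_iter=100, thresh=3.0):
    """Robustly fit a homography: repeatedly fit on 4 random pairs, count
    the pairs whose reprojection error is below thresh (inliers), and keep
    the largest inlier set."""
    src_pts = np.asarray(src_pts, dtype=float)
    dst_pts = np.asarray(dst_pts, dtype=float)
    n = len(src_pts)
    best_inliers = np.array([], dtype=int)
    for _ in range(n_iter):
        sample = np.random.choice(n, 4, replace=False)
        H = compute_homography(src_pts[sample], dst_pts[sample])
        proj = H @ np.hstack([src_pts, np.ones((n, 1))]).T
        proj = (proj[:2] / proj[2]).T
        errors = np.linalg.norm(proj - dst_pts, axis=1)
        inliers = np.where(errors < thresh)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # refit on all inliers for the final homography
    return compute_homography(src_pts[best_inliers], dst_pts[best_inliers]), best_inliers
```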

Final feature points between two images. We can observe that these points correspond well and are all located at corner-like locations.

IV. Automated stitching examples

Lamp post (Automated)
Lamp post (Manual)
Stone seat (Automated)
Stone seat (Manual)
Beach (Automated)
Beach (Manual)

We can observe that the automatic stitching produces pretty good results in terms of matching. However, since the algorithm has no notion of semantic content, the result is not always as aesthetically pleasing. For example, in the lamp post scene, the building on the right is slanted since the algorithm has no idea that the building should be upright. Also, in the beach scene, I could not get the algorithm to match the last building on the right, since there are too many noisy data points from both the building and the sea. In retrospect, if the third image had been taken with part of that building in the frame, it would have been much easier. On the other hand, producing the beach scene manually is pretty hard, as shown in the manual result above.

V. Reflection

I learnt a lot from this project. The first thing I learnt was implementing an algorithm described in a published paper, which really requires you to understand what the authors were thinking at the time of writing. Having the correct algorithm is only half of the challenge; the other half is tuning the parameters. In this project, various parameters control different aspects of the algorithm, and being able to interpret the effect of each one helps in finding the best combination. I also learnt about the importance of robustness: for now, I have to tweak the parameters differently for different image inputs, so there is much to be desired in terms of robustness in my implementation.