CS 194 Project 5

Creating Image Mosaics

Introduction

In this project, we will be creating mosaics automatically by stitching images together according to features at keypoints. There are three parts to this project. First we will be doing image rectification. Then we will stitch images together using manually selected corresponding points. Finally, we will show how to automatically detect corresponding points in images. All code for this project can be found at this Colab Notebook here.
The drive folder for this data can be found here.

Image Rectification and Stitching

We have two images of the same scene, where the camera is rotated to produce one versus the other. What is amazing is that solely through linear algebra manipulation, we can actually stitch these images together.

Computing a Homography

In order to do this properly, we need to compute a mapping between corresponding points of the two images. This mapping can be represented as a 3×3 matrix acting on homogeneous coordinates. In particular, if we have two corresponding points $(x, y)$ and $(x', y')$, we want to be able to find a matrix that maps each point onto its correspondence. This transformation may need to rotate, scale, translate, and project; each of these operations is linear in homogeneous coordinates! Thus, we can use the following matrix to compute a projective homography from one image to the other.

$$ \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \cdot \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} $$

Since we have eight variables to solve for, we need exactly 4 pairs of points to solve for this homography! Unfortunately, the homography is very sensitive to the choice of points, so we want our solution to be robust to the number and quality of the points we use. As a result, we will use a least-squares methodology to solve our system. But wait! Right now our system is not in least-squares form. We need it in the form $Ax = b$, where $x$ is the vector of unknown variables we want. Luckily, algebraic manipulation comes to our rescue! Expanding the product above and eliminating $w$ gives the following equation $Ax = b$ for just a single point.

$$ \begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \cdot \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix} $$

Adding more points simply means adding two more rows to $A$ and two more entries to $b$. Thus, we have our least-squares formulation! We now solve for $x$ using the pseudo-inverse; that is, we compute $x = (A^TA)^{-1}A^Tb$ and we are done.
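For concreteness, here is a minimal NumPy sketch of this least-squares setup. The function name and argument conventions are my own for illustration, not necessarily those in the notebook:

```python
import numpy as np

def compute_homography(pts1, pts2):
    """Least-squares homography mapping pts1 -> pts2.

    pts1, pts2: (n, 2) arrays of corresponding (x, y) points, n >= 4.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(pts1, pts2):
        # Two rows of A and two entries of b per correspondence.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    A, b = np.array(A, dtype=float), np.array(b, dtype=float)
    # lstsq solves the overdetermined system, i.e. h = (A^T A)^{-1} A^T b.
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(h, 1).reshape(3, 3)  # eight unknowns plus the fixed 1
```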

Image Warping

Once we have computed the homography matrix $H$ from above, we need only finish by warping the images. To do this, we first compute the bounding box for the warped image by applying $H$ to the corners of the source image. Then, we enumerate all of the pixel coordinates within the bounding box; these are the output pixels whose values we have to fill in. We use inverse warping: we apply the inverse transformation $H^{-1}$ to each output pixel, which lands it back on the original image, and we interpolate among the source pixels around that landing point to get the value of the output pixel.
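A rough, vectorized sketch of that loop follows. For brevity it samples with nearest-neighbor rather than averaging neighbors, and fills a fixed output canvas rather than a computed bounding box:

```python
import numpy as np

def warp_image(img, H, out_shape):
    """Inverse-warp img (H x W x C) into a canvas of shape out_shape,
    where H maps source coordinates to output coordinates."""
    H_inv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:out_shape[0], 0:out_shape[1]]
    # Homogeneous output coordinates, one column per output pixel.
    out_pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = H_inv @ out_pts
    src = src[:2] / src[2]                 # divide out w
    sx, sy = np.round(src).astype(int)     # nearest-neighbor sampling
    valid = (0 <= sx) & (sx < img.shape[1]) & (0 <= sy) & (sy < img.shape[0])
    out = np.zeros((*out_shape, img.shape[2]), dtype=img.dtype)
    out.reshape(-1, img.shape[2])[valid] = img[sy[valid], sx[valid]]
    return out
```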

Image Rectification

In order to do image rectification, we need only map points in the image to a pre-defined set. This works best when there is a known square or rectangular feature in the image. In our case, we will rectify this image so the bookshelves in the second image above face us head-on! Doing so, we have:
The second image is what happens when we set our mapping so that the bookshelves are flipped. In this case we still see the bookshelf head-on, but the entire image is flipped over! Let us try another example, this time on a picture in my living room!
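In code, rectification just reuses the two sketches above with hand-picked target points. The filename and all coordinates below are made up for illustration:

```python
import numpy as np
from skimage.io import imread

img = imread("bookshelf.jpg")  # hypothetical filename

# Hand-clicked corners of the bookshelf face (x, y), and the axis-aligned
# rectangle we want them to map onto.
src = np.array([[410, 120], [980, 95], [1010, 830], [395, 860]])
dst = np.array([[400, 100], [1000, 100], [1000, 850], [400, 850]])

H = compute_homography(src, dst)             # least-squares sketch above
rectified = warp_image(img, H, (900, 1400))  # inverse-warp sketch above
```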

Image Stitching

We can also attempt to stitch images together by manually finding keypoints and mapping them to each other. For this, we use 8 keypoints on the bookshelf in the image. Doing so for both pictures, we see the following:
Now, we can compute the homography of the former onto the latter. Once we have that, we blend the two images using a Laplacian stack in order to get our final output.
As you can see, even with manually defined points, this homography could be improved; a small mistake in a correspondence can make the result look a little off. Here is another example, with better results. Even so, the picture in the top left looks very distorted.
One final example is shown here.
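For the blending step, here is a compact sketch of the Laplacian-stack idea. The number of levels and blur width are assumptions for illustration, not the notebook's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blend(im1, im2, mask, levels=4, sigma=2.0):
    """Blend two aligned H x W x C images with a Laplacian stack.
    mask has the same shape as the images and is 1 where im1 should win."""
    sig = (sigma, sigma, 0)  # blur spatially, not across channels
    g1, g2, gm = im1.astype(float), im2.astype(float), mask.astype(float)
    out = np.zeros_like(g1)
    for _ in range(levels):
        b1, b2 = gaussian_filter(g1, sig), gaussian_filter(g2, sig)
        # Laplacian band = current level minus its blurred version,
        # combined with a progressively smoother version of the mask.
        out += gm * (g1 - b1) + (1 - gm) * (g2 - b2)
        g1, g2, gm = b1, b2, gaussian_filter(gm, sig)
    return out + gm * g1 + (1 - gm) * g2  # add back the lowest frequencies
```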

Takeaways

I think the biggest takeaway from this part of the project was that linear algebra is powerful, and a great tool for keeping the homography matrix well-behaved. I love how we are able to combine ideas from other parts of the course (i.e., inverse warping and Laplacian pyramids) to produce a really powerful mosaic-blending application. The mosaics look really good, even with manually chosen keypoints.

Automated Feature Detection

Harris Corners

In order to find features that we can map from image to image, we try to find as many corners as possible in any two images. To do this we use Harris corners. A Harris filter assigns each pixel in an image a "Harris value," or H-value, which describes, in some sense, the "cornerness" of that pixel; Harris corners are the strong responses of this filter. We also require our corners to be at least 20 pixels away from the image edge, since we will later extract patches around these corners. Here are the Harris corners for the first image above. The first image is hard to see, so we randomly downsampled the points to 500 for the second image:
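A sketch of this step built on scikit-image (the course starter code may differ in how it selects peaks):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max

def get_harris_corners(img, edge_discard=20):
    """Harris corners away from the image border, for an RGB image."""
    h = corner_harris(rgb2gray(img), sigma=1)   # per-pixel H-value map
    coords = peak_local_max(h, min_distance=1)  # (row, col) local maxima
    # Discard corners near the border so a 40x40 patch always fits.
    keep = np.all((coords >= edge_discard) &
                  (coords < np.array(img.shape[:2]) - edge_discard), axis=1)
    return coords[keep], h
```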

Adaptive Non-Maximal Suppression

As we saw in the last image, we have a lot of corners, and this can get very expensive to work with, especially for large images. Thus, we apply adaptive non-maximal suppression (ANMS) to find the best corners such that the corners are relatively spread out. Unfortunately, we cannot just take the corners with the highest H-values, since they tend to be clustered together. Instead, we solve a sort of minimax problem. First, we sort the points by H-value. Then, for each point $p_i$ we compare it to the set of points $P_i$ that satisfy the following relation:

$$ P_i = \{ p_j : f(p_i) < c_{\text{robust}} f(p_j) \} $$

where $f(p_i)$ is the H-value associated with $p_i$. For each point we compute the suppression radius $r_i$, the minimum distance from $p_i$ to any point in its feasible set $P_i$:

$$ r_i = \min_{j \in P_i} \lVert p_i - p_j \rVert $$

We then take the top $n_{\text{ip}}$ points with the highest $r_i$. This finds points with large H-values that are also relatively far away from other points with comparable H-values; often these are local maxima of the H-value in their neighborhood. The result is a more evenly distributed set of strong corners. In our case, $c_{\text{robust}} = 0.9$ and $n_{\text{ip}} = 500$. Doing so results in the following image of the "best corners" in the image.

As you can see, these points represent a better set of corners than the previous set.
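A direct O(n²) implementation sketch of ANMS under these definitions:

```python
import numpy as np
from scipy.spatial.distance import cdist

def anms(coords, h_map, n_ip=500, c_robust=0.9):
    """Keep the n_ip corners with the largest suppression radius r_i.
    coords: (n, 2) array of (row, col) corners; h_map: per-pixel H-values."""
    f = h_map[coords[:, 0], coords[:, 1]]  # H-value of each corner
    dists = cdist(coords, coords)          # all pairwise distances
    # p_j dominates p_i when f(p_i) < c_robust * f(p_j).
    dominated = f[:, None] < c_robust * f[None, :]
    dists[~dominated] = np.inf             # only dominating points count
    r = dists.min(axis=1)                  # suppression radius r_i
    best = np.argsort(r)[::-1][:n_ip]
    return coords[best]
```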

Automated Feature Mapping

Extracting and matching features

To extract features, we simply take a 40 by 40 patch around each point in the final set for each image. We then blur and downsample the 40 by 40 patch to an 8 by 8 descriptor, which removes noise and the high-frequency detail that would otherwise distinguish matching points. Finally, we flatten the descriptor into a 64-dimensional vector for each point in the image. We will use these features to compare points between images. To match features, we first compute the distance between every pair of descriptors across the two images, giving a distance matrix. We then find the nearest neighbor for each point by sorting this matrix along one axis. We keep a point only if its first nearest neighbor is significantly better than its second nearest neighbor; this ratio is a measure of the confidence a point has in its match. Since we have so many corners, we can afford to throw away even good matches if there is a lack of confidence. Doing so yields very surprising results:
It is clear that there are a significant number of correct matches in each image, which at first seems counterintuitive.
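Here is a sketch of both steps. The per-descriptor normalization is an extra assumption on my part, and the ratio threshold of 0.6 is illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial.distance import cdist

def describe(img_gray, coords, patch=40, out=8):
    """8x8 descriptors: blur, subsample a 40x40 window, flatten to 64-d."""
    blurred = gaussian_filter(img_gray.astype(float), sigma=patch / out / 2)
    half, step = patch // 2, patch // out
    feats = []
    for r, c in coords:
        window = blurred[r - half:r + half:step, c - half:c + half:step]
        v = window.ravel()
        feats.append((v - v.mean()) / (v.std() + 1e-8))  # bias/gain normalize
    return np.array(feats)

def match(f1, f2, ratio=0.6):
    """Lowe-style ratio test: keep pairs whose 1-NN distance is much
    smaller than their 2-NN distance."""
    d = cdist(f1, f2)               # all pairwise descriptor distances
    nn = np.argsort(d, axis=1)      # columns sorted by distance
    rows = np.arange(len(f1))
    first, second = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
    keep = first / second < ratio
    return np.stack([np.nonzero(keep)[0], nn[keep, 0]], axis=1)
```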

RANSAC

With any pair of images, there are going to be a variety of mismatched points, and as you can see there are a lot of them in the above image. Some points aren't even in the overlapping field of view of the images. Thus, we use RANSAC in order to find the best matching points. Since computing a homography only requires 4 points, we choose 4 matches at random, compute the homography they define, and apply it to the rest. We count the points whose mapped locations land within some epsilon (we set it to 5 pixels) of their corresponding feature. Our final set of points is the largest such set found over 1000 iterations of this process. This is really effective: the probability of sampling a good matching at least once in 1000 iterations is very high, so we are almost always guaranteed a good feature set to run least squares on afterwards. After using RANSAC, our feature mapping is as follows:
As you can see, it's basically a perfect matching between the correct points. We are now ready to plug it into our previous algorithm to generate our final output.
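A sketch of this RANSAC loop, reusing the hypothetical compute_homography from earlier:

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iter=1000, eps=5.0):
    """4-point RANSAC: keep the largest inlier set over n_iter trials,
    then refit with least squares on all of those inliers."""
    pts1_h = np.hstack([pts1, np.ones((len(pts1), 1))]).T  # 3 x n homogeneous
    best_inliers = np.zeros(len(pts1), dtype=bool)
    rng = np.random.default_rng()
    for _ in range(n_iter):
        idx = rng.choice(len(pts1), 4, replace=False)
        H = compute_homography(pts1[idx], pts2[idx])  # exact fit on 4 pairs
        proj = H @ pts1_h
        proj = (proj[:2] / proj[2]).T                 # back to (x, y)
        inliers = np.linalg.norm(proj - pts2, axis=1) < eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Final homography: least squares over every inlier, not just 4 points.
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```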

Final Output

Computing our final output, we have the following:
This is MUCH better! Let's check out some other examples. Here is my living room.
Now let us do a three-way mosaic. Here is an image of a painting in my family room.

Takeaways

I think the biggest takeaway from this part of the project is that redundancy is key to a lot of photo algorithms. We used a lot of points when homographies only require 4, but it was those extra points that gave us a very robust algorithm. It is unreasonable to expect that all images will be perfectly aligned all the time, and brightness changes a lot between shots; but there will always be at least some points that match well with others, and using a ton of points let us exploit that. At each stage of the algorithm we trimmed the points down more and more until we had a really good mapping; RANSAC in particular helped filter out the bad apples. Feature matching was also interesting. I found it counterintuitive that points with the best matches might not be the best points if they match well with multiple parts of the scene, but we had enough redundancy to simply get rid of those points. This project definitely changed my perspective on image processing: it does not have to be perfect, because enough redundancy makes the algorithms robust.