Image Warping and Mosaicing Part B

Phillip Kuznetsov

cs194-26-aea

For this part of the project, we extended the image warping system built in the previous part so that, instead of relying on hand-annotated images, it can automatically identify correspondences between images, removing the human from the loop. This method is much closer to the approach employed in the iPhone camera and Google Photos. The key ideas are covered in the paper Multi-Image Matching using Multi-Scale Oriented Patches. First, we find keypoints using a Harris feature detector. Next, we use adaptive non-maximal suppression to get a nice distribution of these points throughout the image. From these points, we extract 8x8 feature descriptors that we use for identifying correspondences. We identify good matches using the nearest neighbors of the patches between each image, and finally we use RANSAC to find the set of correspondences that produces the best homography between the images.

Harris Feature Detector

The Harris corner detector is a nifty algorithm that finds points in an image that correspond to corners. These points have the property that shifting the patch of pixels around the point either vertically or horizontally causes a large change in its appearance. This can easily be measured using properties of a matrix composed of the derivatives around each point. For a point $(x, y)$, the matrix $H$ is the outer product of the image gradient, smoothed by a Gaussian filter. To measure the "strength" of the corner, we take the harmonic mean of $H$'s eigenvalues: $$ f_{HM} = \frac{\text{det}\, H(x, y) } {\text{tr}\, H(x, y)} = \frac{\lambda_1\lambda_2}{\lambda_1 + \lambda_2}$$ You can compute Harris responses easily using the skimage.feature.corner_harris function.
Harris corners with no thresholding
Harris corners with harmonic mean of eigenvalues $\geq 0.5$.
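The harmonic-mean measure above can be sketched directly from the gradients. This is a minimal illustration, not the project's exact implementation; the Sobel derivative filter and the smoothing sigma are assumptions here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_harmonic_mean(im, sigma=1.0):
    """Harris strength f_HM = det(H) / tr(H) for a grayscale float image."""
    # Image gradients
    Ix = sobel(im, axis=1)
    Iy = sobel(im, axis=0)
    # Entries of the gradient outer product, smoothed by a Gaussian
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    det = Ixx * Iyy - Ixy ** 2
    tr = Ixx + Iyy
    return det / (tr + 1e-10)  # harmonic mean of H's eigenvalues (up to a factor of 2)
```

At an edge one eigenvalue is near zero, so the harmonic mean is near zero; only true corners, where both eigenvalues are large, score highly.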

Adaptive Non-maximal Suppression

Simple thresholding produces a number of usable points; however, many of them cluster in high-frequency areas such as the texture of leaves. This can lead to catastrophic failure, since high-frequency detail is confusing during our matching process. To get a better homography, we would instead like a feature-finding algorithm that identifies features that are well distributed throughout the image. That's where adaptive non-maximal suppression (ANMS) comes in. ANMS ensures that we only choose points that lie at least a certain distance away from other strong points. The basic premise of the algorithm is to first calculate the suppression radius for every point: the distance to the closest neighbor whose Harris measure, scaled by $c \leq 1$, is greater than that of the point we are evaluating. We then sort the points by their radii and select the top $n$.
Regular 250 corners
ANMS 250 corners
Regular 500 corners
ANMS 500 corners
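The radius computation and selection described above can be sketched as follows. This is a simple O(N²) version for clarity; the `c` and `n` defaults are illustrative assumptions.

```python
import numpy as np

def anms(coords, strengths, n=500, c=0.9):
    """Adaptive non-maximal suppression.

    coords: (N, 2) array of corner positions; strengths: (N,) Harris measures.
    Returns the n points with the largest suppression radii.
    """
    N = len(coords)
    radii = np.full(N, np.inf)
    for i in range(N):
        # Neighbors whose strength, scaled by c, dominates point i
        stronger = strengths[i] < c * strengths
        if stronger.any():
            dists = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = dists.min()
    keep = np.argsort(radii)[::-1][:n]
    return coords[keep]
```

The globally strongest point has an infinite radius and is always kept, while a strong point crowded next to an even stronger one gets a tiny radius and is suppressed.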

Feature Matching

Feature Descriptors

After identifying the important points, the next task is to identify the matching correspondences between the images. We need to create a feature descriptor for every single point and use these to build correspondences between them. We achieve this by first grabbing the 40x40 neighborhood around each point, then subsampling from several levels up a Gaussian pyramid to get an 8x8 downsampled version of the window. Finally, we standardize the bias and gain of the descriptor by subtracting the mean and dividing by the standard deviation.
The source pixel
The 40x40 sample window
The 8x8 sampled from a Gaussian pyramid of the 40x40 window
The same sample normalized for bias and gain.
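The descriptor extraction can be sketched as below. For simplicity, a single Gaussian blur stands in for the Gaussian pyramid before subsampling; the window and output sizes match the text, but the blur sigma is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptor(im, y, x, window=40, out=8):
    """8x8 bias/gain-normalized descriptor from a 40x40 window around (y, x)."""
    half = window // 2
    patch = im[y - half:y + half, x - half:x + half]
    step = window // out  # subsample one pixel per 5x5 cell
    # Blur before subsampling to avoid aliasing (stand-in for the pyramid)
    blurred = gaussian_filter(patch, sigma=step / 2)
    sub = blurred[step // 2::step, step // 2::step]
    # Normalize for bias and gain
    return (sub - sub.mean()) / (sub.std() + 1e-10)
```

The bias/gain normalization makes the descriptor invariant to additive and multiplicative lighting changes between the two photos.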

Identifying Good Matches

Using the feature descriptors, we need to identify the similar matching points between images. A simple comparison approach has a set of problems. There isn't a guaranteed one-to-one correspondence between the points in each image. Some points that were recognized in one image might not be recognized in the other due to changes in lighting or distortion from changes in perspective. Additionally, some points identified in each image will never appear in the other because they don't appear in the frame at all. Instead of simply taking the feature descriptor in image A that matches most closely to a feature descriptor in image B, we look towards a technique that Professor Efros calls the "Russian Granny Principle." The basic idea comes from a story of a woman who was looking for a husband. She soon narrowed her search down to two men, but she couldn't decide: both had proposed, and she considered them equally good. Searching for an answer, she confided in her Russian grandmother, who delivered this shocking statement: "If you can't decide between two suitors, then neither is good for you." The same principle applies here as well. Instead of always accepting the first nearest neighbor, we keep only the feature descriptors that are most similar and have no clear competitor. We measure this by thresholding on the ratio between the distances to the first and second nearest neighbors. This ensures that we keep only the confident matches while eliminating the ambiguous ones.
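The ratio test above can be sketched with a brute-force distance computation. The 0.6 threshold is an illustrative assumption, not necessarily the value used in the project.

```python
import numpy as np

def ratio_match(descA, descB, thresh=0.6):
    """Keep matches whose 1-NN distance is well below the 2-NN distance."""
    A = descA.reshape(len(descA), -1)
    B = descB.reshape(len(descB), -1)
    # Pairwise squared distances between flattened descriptors
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    matches = []
    for i, row in enumerate(d2):
        nn1, nn2 = np.partition(row, 1)[:2]  # two smallest distances
        if nn1 / (nn2 + 1e-10) < thresh:     # no clear competitor -> keep
            matches.append((i, int(np.argmin(row))))
    return matches
```

A descriptor that is nearly equidistant to its two best candidates fails the test (ratio near 1), exactly the "two equally good suitors" case the story describes.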

RANSAC

The final piece of the puzzle is to identify the set of matching points consistent with a single homography. Luckily, there is a neat randomized algorithm known as RANSAC that we can use to our advantage. The premise of this algorithm is that you repeatedly sample just enough pairs to calculate a homography, then observe how many of the candidate points properly line up under it. You repeat this sampling process for some set number of iterations (chosen so that, with high probability, at least one sample contains only inliers) and choose the homography that produces the most inliers. Then, using the least-squares technique for solving the homography, you use all of the inlier candidate points to calculate a final homography.
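The loop above can be sketched as follows. A minimal DLT least-squares solver is included so the sketch is self-contained; it stands in for the homography solver from Part A, and the iteration count and inlier threshold `eps` are illustrative assumptions.

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares (DLT) homography from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of the constraint matrix
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(3, 3)

def ransac_homography(ptsA, ptsB, n_iters=1000, eps=2.0):
    """Sample 4 pairs, fit, count inliers; refit on the best inlier set."""
    best_inliers = np.zeros(len(ptsA), dtype=bool)
    ones = np.ones((len(ptsA), 1))
    homogA = np.hstack([ptsA, ones])
    for _ in range(n_iters):
        idx = np.random.choice(len(ptsA), 4, replace=False)
        H = fit_homography(ptsA[idx], ptsB[idx])
        # Project ptsA through H and measure reprojection error
        proj = (H @ homogA.T).T
        proj = proj[:, :2] / proj[:, 2:3]
        inliers = np.linalg.norm(proj - ptsB, axis=1) < eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Final least-squares fit on all inliers
    return fit_homography(ptsA[best_inliers], ptsB[best_inliers]), best_inliers
```

Because a single gross outlier ruins a least-squares fit, the 4-point random samples let RANSAC find a consensus set first and only then hand the clean correspondences to the solver.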

Results

Hand Annotated Mosaic at Hearst Mining Circle.
Automatic Mosaic at Hearst Mining Circle.
Hand Annotated Mosaic at Leconte.
Automatic Mosaic at Leconte.
Hand Annotated Mosaic at the Bayfair BART station
Automatic Mosaic at the Bayfair BART station. You'll notice that the automatic method fails tremendously here. This is likely a result of the scale of the panning used in creating this image: all of the scenery is much farther away from where I shot the photos than in the other scenes. Even in the hand-annotated mosaic, you can see artifacts that emerge despite the careful annotation.
You can see that the corresponding points used to create the homography between the images appear to be correct. However, the high variation in distance from the observer means that there is a lot of error in the resulting transformation.

What I learned

I thought this was a really cool project that used some old-school detection techniques to determine important correspondence points between images. In this heyday of neural networks, it's easy to get caught up in the idea that every problem must be solved via learning. However, simple classical techniques can clearly outperform those methods, especially in a situation like this. Definitely a very eye-opening result.