Project 4 Report

(Part A)

depth=2 depth=5 depth=5
Some nice-looking results from Part A.

1 Register Point Correspondences

Since a projective transformation has 8 degrees of freedom, we need at least 4 point pairs to determine the transformation matrix. With only four points, though, the recovered homography is very unstable and prone to noise, so more than 4 correspondences should be provided, producing an overdetermined system. However, manually annotated point pairs in high-resolution images may not be optimal correspondences, so I used an approach similar to project 1 to fine-tune the points. Concretely, for each point pair, two patches centered at the two points are cropped from the two images, and the left patch is shifted within a search range to find the optimal displacement. Below is an example of the keypoints defined in the two images, where crosses of different colors mark different correspondences.

keypoints
Fig1. Keypoint correspondence.
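As a rough sketch of this refinement step (the function name, patch half-size, and search range below are placeholder values, not necessarily the ones I used):

```python
import numpy as np

def refine_point(img_left, img_right, pt_left, pt_right, half=20, search=10):
    """Refine a manually clicked pair by finding the displacement of the left
    patch that best matches the right patch under SSD.
    Points are (row, col); images are grayscale float arrays."""
    r, c = pt_right
    ref = img_right[r - half:r + half, c - half:c + half]

    best_ssd, best_pt = np.inf, pt_left
    r0, c0 = pt_left
    for dr in range(-search, search + 1):
        for dc in range(-search, search + 1):
            r1, c1 = r0 + dr, c0 + dc
            cand = img_left[r1 - half:r1 + half, c1 - half:c1 + half]
            if cand.shape != ref.shape:   # skip windows that fall off the image
                continue
            ssd = np.sum((cand - ref) ** 2)
            if ssd < best_ssd:
                best_ssd, best_pt = ssd, (r1, c1)
    return best_pt
```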

2 Recover Homographies

The projective transformation matrix contains 8 unknowns, so solving it from more than 4 point correspondences gives an overdetermined system, which we solve with least squares. The formulation is shown below.

math for projective transformation
Fig2. Computing projective transformation matrix via least-square.
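A minimal sketch of this least-squares solve, fixing h33 = 1 and stacking two equations per correspondence as in the formulation above (the function name is illustrative):

```python
import numpy as np

def compute_homography(src, dst):
    """Least-squares homography mapping src -> dst.
    src, dst: (N, 2) arrays of (x, y) points, N >= 4; h33 is fixed to 1."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    A, b = np.asarray(A, float), np.asarray(b, float)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)   # 8 unknowns, 2N equations
    return np.append(h, 1.0).reshape(3, 3)
```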

3 Warp the Image

Given the transformation matrix H, we first project the image corners to decide the size of the resulting image. The resulting canvas is usually enlarged, so an x and y offset must be added to the projected coordinates. Then, given the projected corners, we find all the pixels lying inside the polygon they define. For each such pixel, we apply the inverse transformation to find the corresponding coordinate in the original image. That coordinate is usually fractional, so bilinear interpolation is used to determine the pixel value. Finally we obtain the warped image and the projected keypoint coordinates. The result is shown below.

left_porj
Fig3. Warped left image.
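A sketch of this inverse-warping step is below. For simplicity it sweeps the whole output canvas and masks pixels that map outside the source image, rather than enumerating only the pixels inside the projected polygon; the offset convention and function name are illustrative.

```python
import numpy as np

def warp_image(img, H, out_shape, offset=(0, 0)):
    """Inverse-warp a float image with homography H (source -> destination, (x, y) coords).
    out_shape = (height, width) of the output canvas; offset = (ox, oy) shift so
    that projected coordinates with negative values still land on the canvas."""
    h_out, w_out = out_shape
    ox, oy = offset
    Hinv = np.linalg.inv(H)

    # Every destination pixel, shifted back by the offset, in homogeneous coords.
    xs, ys = np.meshgrid(np.arange(w_out) - ox, np.arange(h_out) - oy)
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])

    # Map destination pixels back into the source image (inverse transform).
    src = Hinv @ pts
    sx, sy = src[0] / src[2], src[1] / src[2]

    # Bilinear interpolation at the (generally fractional) source coordinates.
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    wx, wy = sx - x0, sy - y0
    valid = (x0 >= 0) & (y0 >= 0) & (x0 + 1 < img.shape[1]) & (y0 + 1 < img.shape[0])
    x0, y0, wx, wy = x0[valid], y0[valid], wx[valid], wy[valid]
    if img.ndim == 3:                      # broadcast weights over color channels
        wx, wy = wx[:, None], wy[:, None]
    vals = (img[y0, x0] * (1 - wx) * (1 - wy) + img[y0, x0 + 1] * wx * (1 - wy)
            + img[y0 + 1, x0] * (1 - wx) * wy + img[y0 + 1, x0 + 1] * wx * wy)

    out = np.zeros((h_out * w_out,) + img.shape[2:])
    out[valid] = vals
    return out.reshape((h_out, w_out) + img.shape[2:])
```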

4 Image Rectification

As a sanity check for the image warping function, I tested it on an image of my laptop. I selected the four corners of the keyboard and set the target polygon to be a rectangle. The original image and the result are shown below. The rectified image appears to be taken from above (though the view is highly distorted), which makes sense, since the keyboard has to become a rectangle in the projected image.

laptop rect_laptop
Fig4. Rectified laptop.
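The same sanity check can also be reproduced in a few lines with OpenCV instead of my own warp (the corner coordinates, file names, and target size below are placeholders):

```python
import cv2
import numpy as np

# Four keyboard corners clicked in the photo (placeholder values), ordered
# top-left, top-right, bottom-right, bottom-left, as (x, y).
src = np.float32([[421, 610], [1380, 655], [1302, 1088], [350, 1010]])

# Target rectangle: force the keyboard to become, e.g., 900 x 300 pixels.
dst = np.float32([[0, 0], [900, 0], [900, 300], [0, 300]])

img = cv2.imread("laptop.jpg")
H = cv2.getPerspectiveTransform(src, dst)       # exactly 4 points -> exact homography
rectified = cv2.warpPerspective(img, H, (900, 300))
cv2.imwrite("rect_laptop.jpg", rectified)
```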

5 Blend the Images into a Mosaic

Given the correspondence between the points in the projected left image and the points originally annotated in the right image, we can translate the right image so that corresponding points land at the same positions. However, simply adding the two images together, as shown below, produces a poor result. A very straightforward alpha channel, which is 0 over the area covered by the right image and 1 everywhere else, produces a slightly better result, but the edge of the right image is still apparent.

direct add straight forward
Fig5. Results obtained by direct addition without an alpha channel and by a straightforward alpha channel.

Instead, we need to come up with an alpha channel that decides how much each image contributes to the overlapping area. Ideally, inside the overlap, an image's contribution should drop to 0 at that image's own border (so there is no visible seam where it ends) and be 1 at the other image's border. I tried several ideas for creating such an alpha channel.

The result obtained with a linearly decreasing alpha channel is somewhat better, and the cosine one is better still, because the cosine function is smoother than the linear ramp, so the proportion of contribution changes more gently at the ends of the transition.

linear alpha cosine alpha
Fig6. Results obtained by a linearly decreasing alpha channel and a cosine one.
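A sketch of the two feathering schemes, assuming the two images are already placed on a common canvas and the overlap is a vertical band between two known columns (function names are illustrative):

```python
import numpy as np

def horizontal_alpha(width, overlap_start, overlap_end, mode="cosine"):
    """Per-column weight for the left image: 1 left of the overlap, 0 right of it,
    and a linear or raised-cosine ramp in between."""
    alpha = np.ones(width)
    t = np.linspace(0.0, 1.0, overlap_end - overlap_start)
    if mode == "linear":
        ramp = 1.0 - t
    else:  # cosine: zero slope at both ends of the overlap, hence a softer seam
        ramp = 0.5 * (1.0 + np.cos(np.pi * t))
    alpha[overlap_start:overlap_end] = ramp
    alpha[overlap_end:] = 0.0
    return alpha

def blend(left, right, alpha):
    """Blend two aligned images (same canvas size) with a per-column weight for `left`."""
    a = alpha[None, :, None] if left.ndim == 3 else alpha[None, :]
    return a * left + (1.0 - a) * right
```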

However, both images show strange vertical edge artifacts, since the alpha channel does not take horizontal borders into consideration. Thus we need a more detailed alpha channel: a distance-based one. It is hard to define which borders of the intersection polygon belong to the left image; however, since every border comes from either the left or the right image, we can instead use each pixel's distance to the nearest border of each image. Thanks to Python libraries such as GeoPandas, shapely, and cv2.distanceTransform, this can be implemented efficiently without worrying much about the geometry. The alpha channel and the results are shown below.

alpha channel alpha channel result
Fig7. Distance-based alpha channel and its result.
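A sketch of the distance-based alpha using cv2.distanceTransform alone, given binary masks marking where each warped image has content (blending with this alpha then works exactly as before):

```python
import cv2
import numpy as np

def distance_alpha(mask_left, mask_right):
    """Weight for the left image at every pixel of the mosaic canvas.
    mask_left / mask_right: uint8 masks (255 where each warped image has content).
    Each pixel in the overlap is weighted by its distance to the nearest border of
    each image, so a weight falls to 0 exactly at that image's own boundary."""
    d_left = cv2.distanceTransform(mask_left, cv2.DIST_L2, 5)
    d_right = cv2.distanceTransform(mask_right, cv2.DIST_L2, 5)

    overlap = (mask_left > 0) & (mask_right > 0)
    alpha = np.zeros(mask_left.shape, np.float64)
    alpha[(mask_left > 0) & ~overlap] = 1.0            # left-only region
    denom = d_left[overlap] + d_right[overlap]
    alpha[overlap] = d_left[overlap] / np.maximum(denom, 1e-8)
    return alpha
```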

Further, we can use the multi-resolution blending technique from project 2 to enhance the results. I tried pyramids with depth=2 and depth=5. The results are slightly better than plain alpha feathering, but a depth=5 pyramid is too costly, so my final configuration uses a pyramid with depth=2.

depth=2 depth=5
Fig8. Adding multi-resolution blending (depth=2 and depth=5, respectively).
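A sketch of the multi-resolution blending step using OpenCV's pyramid helpers (treating the inputs as images in [0, 255] and blending the coarsest Gaussian level directly are assumptions of this sketch):

```python
import cv2
import numpy as np

def pyramid_blend(left, right, alpha, depth=2):
    """Multi-resolution blending: blend each Laplacian-pyramid level of the two
    images with a correspondingly downsampled (smoothed) alpha mask."""
    gp_a = [alpha.astype(np.float32)]
    gp_l = [left.astype(np.float32)]
    gp_r = [right.astype(np.float32)]
    for _ in range(depth):
        gp_a.append(cv2.pyrDown(gp_a[-1]))
        gp_l.append(cv2.pyrDown(gp_l[-1]))
        gp_r.append(cv2.pyrDown(gp_r[-1]))

    blended = None
    for i in range(depth, -1, -1):
        if i == depth:   # coarsest level: blend the Gaussian images directly
            lap_l, lap_r = gp_l[i], gp_r[i]
        else:            # finer levels: blend the Laplacian (detail) bands
            up_l = cv2.pyrUp(gp_l[i + 1], dstsize=gp_l[i].shape[1::-1])
            up_r = cv2.pyrUp(gp_r[i + 1], dstsize=gp_r[i].shape[1::-1])
            lap_l, lap_r = gp_l[i] - up_l, gp_r[i] - up_r
        a = gp_a[i] if gp_a[i].ndim == lap_l.ndim else gp_a[i][..., None]
        level = a * lap_l + (1 - a) * lap_r
        if blended is None:
            blended = level
        else:   # collapse: upsample the coarser reconstruction and add this band
            blended = cv2.pyrUp(blended, dstsize=level.shape[1::-1]) + level
    return np.clip(blended, 0, 255).astype(np.uint8)
```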

* What I've Learned

Creating a good alpha channel can be really tricky. It is also important to define good point correspondences. Manually annotated ones are usually not good enough, since there is always some suboptimal displacement in a high-resolution image, and it can be hard to fix it with simple criteria like SSD or NCC. I'm looking forward to seeing how the results improve with automatic keypoint detection.

(Part B)

6 Harris Corner Detection

To reduce the time needed for adaptive NMS later, I modified the Harris corner detector wrapper function to first apply plain NMS in a local region within a distance of 3 pixels. Since showing all the detected corners would simply fill the image with points, I show only the top 300 points with the highest "cornerness".

harris
Fig9. Harris corners.
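A comparable detector can be sketched with skimage, where min_distance plays the role of the local NMS described above (the function name and the way the top 300 are picked are illustrative):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import corner_harris, corner_peaks

def get_harris_corners(img_rgb, num=300, min_distance=3):
    """Return the `num` strongest Harris corners, with local non-maximum
    suppression so that no two corners are closer than `min_distance` pixels."""
    gray = rgb2gray(img_rgb)
    response = corner_harris(gray)                               # cornerness map
    coords = corner_peaks(response, min_distance=min_distance)   # (row, col) peaks
    strength = response[coords[:, 0], coords[:, 1]]
    order = np.argsort(-strength)[:num]                          # keep the strongest
    return coords[order], strength[order]
```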

7 Adaptive NMS

My implementation of adaptive NMS basically follows the original paper. First, pairwise distances are computed. Then, for each point pair (point1, point2), if point1's cornerness is larger than r times point2's, the distance between them is set to inf; in my implementation, r is set to 0.9. Setting this distance to inf discards that pair when computing the radius of a given point, which is the smallest distance to another point with significantly larger cornerness. Finally, I picked the top 300 corners by radius. These corners are clearly more uniformly distributed than in the image of vanilla Harris corners above.

nms
Fig10. Harris corners after adaptive NMS.
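A sketch of this adaptive NMS with the pairwise-distance trick described above (r = 0.9):

```python
import numpy as np
from scipy.spatial.distance import cdist

def adaptive_nms(coords, strength, num=300, robustness=0.9):
    """Adaptive non-maximum suppression: keep the `num` corners with the largest
    suppression radius, i.e. the largest distance to the nearest corner that is
    significantly stronger."""
    dists = cdist(coords, coords)                       # pairwise distances
    # Point j can only suppress point i if strength[i] < robustness * strength[j];
    # all other pairs are set to inf so they are ignored by the minimum below.
    suppresses = strength[:, None] < robustness * strength[None, :]
    dists[~suppresses] = np.inf
    radius = dists.min(axis=1)                          # suppression radius per point
    keep = np.argsort(-radius)[:num]
    return coords[keep]
```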

8 MOPS Feature Descriptor

In my implementation, I used a 33*33 Gaussian kernel and convolution without padding to smoothly downsample a 40*40 patch into an 8*8 patch (40 - 33 + 1 = 8). The patch is then flattened into a vector as the feature descriptor. The 8*8 patches and their corresponding positions are shown below.

mops
Fig11. MOPS features and their corresponding positions.
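A sketch of the descriptor extraction; the Gaussian sigma and the bias/gain normalization at the end are assumptions (the standard MOPS choice), not details stated above:

```python
import numpy as np
from scipy.signal import convolve2d

def mops_descriptor(gray, corner, sigma=8.0):
    """Axis-aligned MOPS descriptor: crop a 40x40 patch around the corner and
    convolve it with a 33x33 Gaussian without padding ('valid'), which both
    smooths and downsamples it to 8x8. The patch is then flattened."""
    r, c = corner                                        # corner not too close to the edge
    patch = gray[r - 20:r + 20, c - 20:c + 20]           # 40x40 patch

    ax = np.arange(33) - 16
    g = np.exp(-ax**2 / (2 * sigma**2))
    kernel = np.outer(g, g)
    kernel /= kernel.sum()

    small = convolve2d(patch, kernel, mode="valid")      # (40-33+1) x (40-33+1) = 8x8
    vec = small.ravel()
    return (vec - vec.mean()) / (vec.std() + 1e-8)       # bias/gain normalization (assumption)
```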

9 1NN/2NN Trick

After obtaining the points and their features in the two images, one way to remove outliers is to compute the ratio between the feature-space distances of the nearest neighbor and the second nearest neighbor. A small ratio indicates a distinctive match, which is likely not an outlier. The ratios computed from one of the images are shown below. Empirically, I found the corners to be fairly reliable, so I chose a relatively high 1NN/2NN ratio threshold (0.5) so that there are more candidate matching points, which makes the homography estimation better.

nn_ratio
Fig12. 1NN/2NN ratio curve.
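A sketch of the 1NN/2NN test on the descriptor sets of the two images (threshold 0.5 as above; the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def ratio_test_matches(desc1, desc2, ratio=0.5):
    """Match descriptors with the 1NN/2NN trick: accept a match only when the
    nearest neighbour is clearly better than the second nearest."""
    d = cdist(desc1, desc2)                      # (N1, N2) feature-space distances
    nn = np.argsort(d, axis=1)[:, :2]            # indices of 1st and 2nd nearest neighbour
    d1 = d[np.arange(len(desc1)), nn[:, 0]]
    d2 = d[np.arange(len(desc1)), nn[:, 1]]
    keep = d1 / (d2 + 1e-12) < ratio
    return np.stack([np.flatnonzero(keep), nn[keep, 0]], axis=1)   # (i, j) index pairs
```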

10 RANSAC

As the final step of outlier filtering, RANSAC is based on the assumption that the homographies fitting different outliers can be very different, while the homography fitting the inliers is almost the same. So we randomly select 4 correspondences, compute the homography, and count how many points it matches well. Repeating this several times, we obtain the largest set of matched points and compute a better homography from them. In my implementation, I set the maximum number of RANSAC iterations to 1000 and stop early when more than 100 points, or 70% of the points, are matched well. A correspondence is considered a bad match if the projected point is more than 20 pixels away from its counterpart in the other image. The final matching results are shown below; points with the same color indicate a matching pair.

match match
Fig13. Final matching keypoints.
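A sketch of the RANSAC loop with the parameters above (1000 iterations, 20-pixel threshold, early stop at 100 points or 70%); compute_homography is the least-squares solver from Part A:

```python
import numpy as np

def ransac_homography(pts1, pts2, compute_homography,
                      iters=1000, thresh=20.0, min_inliers=100, min_frac=0.7):
    """4-point RANSAC: repeatedly fit a homography to a random sample and keep the
    largest set of correspondences it explains to within `thresh` pixels."""
    n = len(pts1)
    best_inliers = np.zeros(n, dtype=bool)
    ones = np.ones((n, 1))
    for _ in range(iters):
        sample = np.random.choice(n, 4, replace=False)
        H = compute_homography(pts1[sample], pts2[sample])

        # Project all left points and measure the reprojection error.
        proj = (H @ np.hstack([pts1, ones]).T).T
        proj = proj[:, :2] / proj[:, 2:3]
        err = np.linalg.norm(proj - pts2, axis=1)

        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
        if inliers.sum() > min_inliers or inliers.sum() > min_frac * n:
            break   # early stop: enough points are matched well

    # Refit on the full inlier set for a better final estimate.
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```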

Here are the results of automatic keypoint matching. The automatic approach performs as well as the tedious manual matching, and sometimes even better. But for the image of the desk with the laptop and monitor, a human can label more points based on small details near the screens (e.g., the "MacBook Pro" text under the laptop's screen), so those parts match better with manual annotation than with the automatic approach.

vine auto desk auto apt auto

11 B&W: Rotation-invariant MOPS

I implemented rotation-invariant MOPS features by uniformly sampling a grid of 40 points per side over the rotated patch, after which the same process as above is applied. The resulting MOPS features and the final matching result are shown below. Empirically, rotation-invariant MOPS does not help much here, basically because the camera was not rotated drastically when taking the pictures, so the features at corresponding locations appear at almost the same angle in the two images.

mops rotate vine rotate
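A sketch of the rotated sampling; taking the rotation angle from the smoothed gradient orientation at the corner is the choice in the original MOPS paper and an assumption here:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def rotated_patch(gray, corner, theta, size=40, spacing=1.0):
    """Sample a size x size grid of points around `corner`, with the grid axes
    rotated by `theta` (e.g. the smoothed gradient orientation at the corner).
    Bilinear sampling via map_coordinates; the patch then goes through the same
    Gaussian smoothing and flattening as the axis-aligned descriptor."""
    r, c = corner
    ax = (np.arange(size) - size / 2 + 0.5) * spacing
    xx, yy = np.meshgrid(ax, ax)
    rows = r + np.sin(theta) * xx + np.cos(theta) * yy
    cols = c + np.cos(theta) * xx - np.sin(theta) * yy
    return map_coordinates(gray, [rows.ravel(), cols.ravel()], order=1).reshape(size, size)
```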