To get the data for this project, I took pairs of pictures on my iPhone throughout the day, whether of my Berkeley neighborhood, apartment building, or living room.
We start from the following equations, which we can rewrite as a linear system and solve for the homography matrix $H$ using least squares. $$\begin{bmatrix}x_a \\ y_a \\ z_a \end{bmatrix} = H \begin{bmatrix}x_1 \\ y_1 \\ 1 \end{bmatrix}, \quad \begin{bmatrix}\hat{x_1} \\ \hat{y_1} \\ 1 \end{bmatrix} = \frac{1}{z_a} \begin{bmatrix}x_a \\ y_a \\ z_a \end{bmatrix}$$
We can rewrite these equations to get the following $$ \begin{bmatrix}\hat{x_1} z_a \\ \hat{y_1} z_a \\ z_a \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \\\end{bmatrix} \begin{bmatrix} x_1 \\ y_1 \\ 1 \end{bmatrix} $$
We can then substitute $z_a = h_{31} x_1 + h_{32} y_1 + h_{33}$ into the equations $\hat{x_1} z_a = h_{11} x_1 + h_{12} y_1 + h_{13}$ and $\hat{y_1} z_a = h_{21} x_1 + h_{22} y_1 + h_{23}$. We can repeat this process for all of the correspondence points we've defined $(x_i, y_i, \hat{x_i}, \hat{y_i})$. Since $H$ is only defined up to scale, we can set $h_{33} = 1$, leaving eight unknowns and two equations per correspondence. This gives the following system of equations: $$ \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 \hat{x_1} & -y_1 \hat{x_1} \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 \hat{y_1} & -y_1 \hat{y_1} \\ x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2 \hat{x_2} & -y_2 \hat{x_2} \\ 0 & 0 & 0 & x_2 & y_2 & 1 & -x_2 \hat{y_2} & -y_2 \hat{y_2} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & 1 & 0 & 0 & 0 & -x_n \hat{x_n} & -y_n \hat{x_n} \\ 0 & 0 & 0 & x_n & y_n & 1 & -x_n \hat{y_n} & -y_n \hat{y_n} \\ \end{bmatrix} \begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \\ \end{bmatrix} = \begin{bmatrix} \hat{x_1} \\ \hat{y_1} \\ \hat{x_2} \\ \hat{y_2} \\ \vdots \\ \hat{x_n} \\ \hat{y_n} \end{bmatrix} $$
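To make this concrete, here is a minimal sketch of the least-squares solve, assuming the correspondence points are stored as $(n, 2)$ NumPy arrays (the function name and array conventions are my own):

```python
import numpy as np

def compute_homography(pts1, pts2):
    """Solve for H such that pts2 ~ H @ pts1 via least squares.

    pts1, pts2: (n, 2) arrays of corresponding (x, y) points, n >= 4.
    """
    A, b = [], []
    for (x, y), (xh, yh) in zip(pts1, pts2):
        A.append([x, y, 1, 0, 0, 0, -x * xh, -y * xh])  # x-equation row
        A.append([0, 0, 0, x, y, 1, -x * yh, -y * yh])  # y-equation row
        b.extend([xh, yh])
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)  # append h_33 = 1
```

With exactly four correspondences the system is square and the solve is exact; with more points, np.linalg.lstsq returns the least-squares fit.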
After defining the homography matrix, we can warp the images to be in the same projection space, letting us combine them down the road. Below are the resized im2's, the warped im1's, and the original im1's.
For image rectification, I define correspondence points on a square-like object in the image and compute the homography matrix that maps them to the manually defined target points $[[0, 0], [w, 0], [w, w], [0, w]]$. This flattens the square and "rectifies" the image.
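As an illustration, the rectification step might look like the sketch below, reusing the compute_homography sketch from above; the corner values, the side length w, and the warp_image helper are hypothetical placeholders:

```python
import numpy as np

# Corners of a square-like object, clicked clockwise from the top-left
# (placeholder values; in practice these come from manual annotation).
corners = np.array([[120, 85], [430, 90], [445, 400], [110, 395]], dtype=float)
w = 300  # side length of the target square, in pixels (assumed value)
target = np.array([[0, 0], [w, 0], [w, w], [0, w]], dtype=float)

H = compute_homography(corners, target)
# rectified = warp_image(im, H)  # hypothetical warping routine from this part
```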
For this part, I programmatically generated alpha masks to blend the resized im2 images with the warped im1 images. The masks are generated such that zones where only im2 pixels exist are set to $\alpha=1$, zones where only im1 pixels exist are set to $\alpha=0$, and zones where both images have pixels (contested regions) are set to $\alpha$ values that are negatively correlated with the distance to the center of the correspondence points.
$$\text{mask}[i, j] = \begin{cases} 0 & (i,j) \in (\text{im1} \setminus \text{im2}) \\ 1 & (i,j) \in (\text{im2} \setminus \text{im1}) \\ \max\!\left(0,\ 1 - \gamma \sqrt{(i - c_r)^2 + (j - c_c)^2}\right) & (i,j) \in (\text{im2} \cap \text{im1}) \end{cases}$$ for some falloff constant $\gamma > 0$, where $(c_r, c_c)$ is the center of the correspondence points.
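Here is a minimal sketch of how such a mask could be generated, assuming boolean coverage maps for each warped image; the function name, arguments, and the specific gamma value are my own choices, consistent with the formula above:

```python
import numpy as np

def make_alpha_mask(im1_cov, im2_cov, center, gamma=0.002):
    """im1_cov, im2_cov: boolean maps of where each warped image has valid
    pixels; center = (c_r, c_c); gamma: falloff rate (assumed value)."""
    rr, cc = np.indices(im1_cov.shape)
    dist = np.sqrt((rr - center[0]) ** 2 + (cc - center[1]) ** 2)
    mask = np.zeros(im1_cov.shape)           # im1-only zones stay at alpha = 0
    mask[im2_cov & ~im1_cov] = 1.0           # im2-only zones get alpha = 1
    overlap = im1_cov & im2_cov              # contested regions fade with distance
    mask[overlap] = np.clip(1.0 - gamma * dist[overlap], 0.0, 1.0)
    return mask

# Blend: result = mask[..., None] * im2 + (1 - mask[..., None]) * im1
```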
This project was extremely interesting, and I learned how painstakingly difficult it is to define quality correspondence points. Furthermore, the importance of masking really stood out to me, because even if the correspondence points were good, stitching the images together into a seamless photo is still a nontrivial task. While I only used simple alpha masking with a naive mask-creation algorithm, a better implementation would use multiresolution blending with a more sophisticated mask-creation strategy. Lastly, learning about how homography matrices can project images into the same plane was really cool!
In this section, I used the starter code provided in harris.py. However, I modified it so that it only keeps points with $f_{HM}$ values higher than a threshold parameter threshold_abs, since the original code only looks for local maxima with no thresholding.
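A sketch of the modification, assuming harris.py builds on skimage's corner_harris and peak_local_max (the threshold value is illustrative):

```python
from skimage.feature import corner_harris, peak_local_max

def get_harris_corners(im, min_distance=1, threshold_abs=0.1):
    """Return the Harris response map and thresholded corner coordinates."""
    h = corner_harris(im, sigma=1)                        # f_HM response map
    coords = peak_local_max(h, min_distance=min_distance,
                            threshold_abs=threshold_abs)  # drops weak peaks
    return h, coords
```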
As the paper suggests, I set c_robust = 0.9, and I keep the top 250 points with the largest suppression radii, where a point's suppression radius is its distance to the nearest point with a sufficiently stronger corner response ($f_i < c_{\text{robust}} \cdot f_j$). Using adaptive non-maximal suppression allows us to find interest points that are spaced out well, allowing us to better define the homography.
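A minimal sketch of adaptive non-maximal suppression under these parameters (the function name and array conventions are mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def anms(coords, strengths, c_robust=0.9, n_keep=250):
    """coords: (n, 2) interest-point locations; strengths: (n,) Harris values.
    Keeps the n_keep points with the largest suppression radii."""
    d = cdist(coords, coords)                              # pairwise distances
    stronger = strengths[None, :] * c_robust > strengths[:, None]
    d[~stronger] = np.inf                 # only sufficiently stronger points suppress
    radii = d.min(axis=1)                 # suppression radius per point
    keep = np.argsort(-radii)[:n_keep]    # largest radii first
    return coords[keep]

# strengths = h[coords[:, 0], coords[:, 1]]  # Harris values at each corner
# kept = anms(coords, strengths)
```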
To generate the feature descriptors, I extracted 45x45 image patches around each interest point and then downsampled them into 9x9 feature patches; a sketch of the extraction step and some example descriptors are below.
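A minimal sketch of the extraction, assuming a grayscale image and interest points far enough from the border (the function name is my own, and it omits any normalization):

```python
import numpy as np
from skimage.transform import resize

def extract_descriptors(im, coords, patch=45, out=9):
    """Extract axis-aligned patch x patch windows, downsample to out x out."""
    half = patch // 2
    descs = []
    for r, c in coords:
        window = im[r - half : r + half + 1, c - half : c + half + 1]
        descs.append(resize(window, (out, out), anti_aliasing=True).ravel())
    return np.array(descs)  # (n, out * out) flattened descriptors
```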
By extracting feature descriptors from each of the ANMS interest points, we can compare how similar two points are using SSD as the metric. For each interest point $i$ in the first image, we measure the ratio $\frac{E_{NN_{i1}}}{E_{NN_{i2}}}$, where $E_{NN_{ij}}$ is the $j$th smallest SSD error between interest point $i$'s descriptor and the descriptors in the other image. If this ratio is below a predetermined threshold (Lowe's ratio test), then we accept the nearest neighbor as a match.
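A sketch of this matching step (the function name and threshold value are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_features(desc1, desc2, ratio_thresh=0.5):
    """Lowe-style ratio test: accept a match when the best SSD error is much
    smaller than the second-best (E_NN1 / E_NN2 < ratio_thresh)."""
    errs = cdist(desc1, desc2, metric='sqeuclidean')  # pairwise SSD errors
    matches = []
    for i, row in enumerate(errs):
        nn1, nn2 = np.argsort(row)[:2]                # two nearest neighbors
        if row[nn1] / row[nn2] < ratio_thresh:
            matches.append((i, nn1))
    return matches
```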
Given our various matches, we can use a robust method like RANSAC to find true matches and compute our homography. In my RANSAC implementation, I randomly select four matches, compute the corresponding homography matrix $H$, and check how many other pairs satisfy this homography (the inliers). I repeat this process iters times and keep the largest set of inliers, which corresponds to the most accurate homography matrix. To determine whether a pair satisfies the homography, we use $H$ to project the first point and measure the projection's error against the second point. If the projection error is smaller than a parameter eps, then the pair is an inlier.
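A minimal sketch of this loop, reusing the compute_homography sketch from Part A (parameter values are illustrative, and degenerate four-point samples are not handled):

```python
import numpy as np

def ransac_homography(pts1, pts2, iters=1000, eps=2.0):
    """pts1, pts2: (n, 2) matched points. Returns H fit to the best inlier set."""
    best_inliers = np.array([], dtype=int)
    pts1_h = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous coords
    for _ in range(iters):
        sample = np.random.choice(len(pts1), 4, replace=False)
        H = compute_homography(pts1[sample], pts2[sample])
        proj = pts1_h @ H.T
        proj = proj[:, :2] / proj[:, 2:3]                # de-homogenize
        err = np.linalg.norm(proj - pts2, axis=1)        # projection error
        inliers = np.nonzero(err < eps)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # refit on all inliers of the best model
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```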
With our auto-calculated homography matrices and correspondence points (from RANSAC), we can generate auto-stitched mosaic images by recycling code from Project 4A. The results are below.
It was a really great learning experience to implement auto-stitching. I learned how to digest research papers and implement the described methods and algorithms, and I also learned about the practicality and power of RANSAC.