Assignment 3-2

CS 194-26: Image Manipulation and Computational Photography

Kenny Chen

Project 5: [Auto]Stitching Photo Mosaics

Overview

We first describe a method for manual panoramic image stitching, laying the groundwork for automatic photo mosaicing: recovering homographies from user-selected correspondences between two images, then warping and manually compositing the mosaic. After presenting these preliminary results, we describe the auto-stitching method, which implements feature matching and outlier rejection to determine point correspondences automatically. These points then define the homography as in the manual case, and the mosaic is created in the same way.



Part A: Image Warping and Mosaicing

We take two or more photographs and create an image mosaic by registering, projective warping, resampling, and compositing them. Along the way, we compute homographies and warp images.

Shooting the Pictures


We shoot two or more images so that the transforms between them are projective. We do this by shooting from the same point but in different view directions, with overlapping fields of view. We can also get images of pretty good quality by photographing planar surfaces or very distant scenes from different views. The images used in this project were acquired with a Google Pixel 3 XL.

Recovering Homographies


We want to solve for the homography matrix $H$, which in this case maps image coordinates from one image to the other. That is, we solve $p^\prime=Hp$, where $H$ is 3x3 with 8 degrees of freedom, $p$ holds point correspondences in the image being warped, and $p^\prime$ holds the corresponding points in the image being warped to. We have two $n\times2$ matrices containing the $(x, y)$ locations of the $n$ point correspondences from the two input images. If we have $n=4$ points, we can solve for $H$ exactly, without least squares. If $n>4$, we must use least squares because the system is overdetermined. We opt to use more correspondences because this reduces noise in the recovered homography.


We want $$ \begin{bmatrix} x^\prime \\ y^\prime \\ 1 \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} $$

With some algebra we end up with $$ \begin{bmatrix} x_1^\prime \\ y_1^\prime \\ x_2^\prime \\ y_2^\prime \\ x_3^\prime \\ y_3^\prime \\ x_4^\prime \\ y_4^\prime \\ \vdots \end{bmatrix} = \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x_1^\prime & -y_1x_1^\prime \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y_1^\prime & -y_1y_1^\prime \\ x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2x_2^\prime & -y_2x_2^\prime \\ 0 & 0 & 0 & x_2 & y_2 & 1 & -x_2y_2^\prime & -y_2y_2^\prime \\ x_3 & y_3 & 1 & 0 & 0 & 0 & -x_3x_3^\prime & -y_3x_3^\prime \\ 0 & 0 & 0 & x_3 & y_3 & 1 & -x_3y_3^\prime & -y_3y_3^\prime \\ x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4x_4^\prime & -y_4x_4^\prime \\ 0 & 0 & 0 & x_4 & y_4 & 1 & -x_4y_4^\prime & -y_4y_4^\prime \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \end{bmatrix} \begin{bmatrix} h_{11} \\ h_{12} \\ h_{13} \\ h_{21} \\ h_{22} \\ h_{23} \\ h_{31} \\ h_{32} \end{bmatrix} $$
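A minimal sketch of setting up and solving this system with least squares is shown below. The helper name compute_homography and its exact interface are illustrative assumptions, not the project's actual code.

```python
import numpy as np

def compute_homography(pts, pts_prime):
    """Estimate H such that pts_prime ~ H @ pts, solved by least squares.

    pts, pts_prime: (n, 2) arrays of corresponding (x, y) points, n >= 4.
    """
    rows, b = [], []
    for (x, y), (xp, yp) in zip(pts, pts_prime):
        # Two rows of the design matrix per correspondence (see the equation above).
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    A = np.array(rows)
    b = np.array(b)
    # Solve the (possibly overdetermined) system A h = b in the least-squares sense.
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(h, 1).reshape(3, 3)  # h_33 is fixed to 1
```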

Warping the Images

To warp the images, we use inverse warping, which avoids holes and other noticeable artifacts. We get all the coordinates of the output image using the rectangle function in the skimage.draw module and map them through the homography $H$ (inverting it so we sample from the source image). We also add a translation matrix $T$ that shifts the image, since the transformed image may have coordinates that fall outside the frame of the original image dimensions. To find the amount to translate, we pass the original image corners through the computed homography; the translation values are the absolute values of the minimum transformed $x$ and $y$ corners (if the minimum is negative). We then fill each point of a blank output image with its corresponding point in the old image.
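The following is a rough sketch of this warp under the assumptions above. It is a hypothetical helper using nearest-neighbor sampling rather than the exact interpolation and coordinate generation used in the project.

```python
import numpy as np

def warp_image(im, H):
    """Inverse-warp `im` with homography H (source -> target).

    Returns the warped image and the (tx, ty) translation that keeps all
    transformed coordinates inside the output frame.
    """
    h, w = im.shape[:2]
    # Forward-map the source corners to find the output extent and translation.
    corners = np.array([[0, 0, 1], [w - 1, 0, 1], [0, h - 1, 1], [w - 1, h - 1, 1]], dtype=float).T
    mapped = H @ corners
    mapped /= mapped[2]
    tx, ty = max(0.0, -mapped[0].min()), max(0.0, -mapped[1].min())
    out_w = int(np.ceil(mapped[0].max() + tx)) + 1
    out_h = int(np.ceil(mapped[1].max() + ty)) + 1

    out = np.zeros((out_h, out_w) + im.shape[2:], dtype=im.dtype)
    Hinv = np.linalg.inv(H)
    # For each output pixel, map back into the source image and sample (nearest neighbor).
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    pts = np.stack([xs.ravel() - tx, ys.ravel() - ty, np.ones(xs.size)])
    src = Hinv @ pts
    src /= src[2]
    sx, sy = np.round(src[0]).astype(int), np.round(src[1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out[ys.ravel()[valid], xs.ravel()[valid]] = im[sy[valid], sx[valid]]
    return out, (tx, ty)
```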

Image Rectification

We take some images of planar surfaces and rectify them using the above process such that the plane is frontal-parallel.



Blending the Images into a Mosaic

Usually, after warping an image its shape changes, so we have to make sure the final mosaic canvas fits both images. To combine the two images, we use a feathered mask: a matrix of zeros on one side and ones on the other, with the transition line placed somewhere in the overlap of the two images. We apply a Gaussian blur to the mask so that the seam where the images meet is less noticeable. The following outlines the process (a sketch of the feathered blend follows the steps):

  1. Define point correspondences:

  2. Compute homography, warp, fit each image to output size:

  3. Combine images using feathered mask:
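As referenced above, here is a minimal sketch of the feathered-mask blend, assuming both images have already been warped and padded to the same output size. The seam column and blur sigma are illustrative choices, not the values used in the project.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def feather_blend(im1, im2, seam_col, sigma=20):
    """Blend two aligned, same-size images with a feathered vertical mask.

    seam_col: a column index somewhere inside the overlap region.
    sigma: controls how soft the transition between the images is.
    """
    h, w = im1.shape[:2]
    mask = np.zeros((h, w))
    mask[:, :seam_col] = 1.0             # ones on the im1 side, zeros on the im2 side
    mask = gaussian_filter(mask, sigma)   # soften the hard edge between the two halves
    if im1.ndim == 3:
        mask = mask[..., None]            # broadcast the mask over color channels
    return mask * im1 + (1 - mask) * im2
```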



We show two more mosaics below:





The results actually don't look half bad. Looking closely, there are definitely some artifacts that could be corrected with a better mask, but the result looks fine when not scrutinized too closely.



Part B: Feature Matching for Autostitching

Picking correspondence points manually is tedious, time-consuming, and can be inaccurate. We now implement a way to do this automatically, following the method described in Multi-Image Matching using Multi-Scale Oriented Patches.

Harris Interest Point Detection

The general idea behind automatic image stitching is to find interest points in the two input images and match them; the matched points will hopefully serve as good point correspondences between the two images. So, we first need to define these interest points. For our features, we find Harris corners in each image. The idea behind a corner is that shifting a small patch around the corner in any direction results in a significant change in intensity. We essentially compute a corner response over the image and return all points whose response exceeds some threshold value.


To implement the Harris corner detector, we use the skimage.feature functions corner_harris, which computes the corner response at each pixel of the image, and peak_local_max, which returns the coordinates of each corner. We see an example of the detected Harris corners below. We set the minimum distance between corners to $(w+h)/200$, where $w$ and $h$ are the width and height of the image, respectively; this value was chosen somewhat arbitrarily. We do this to make sure correspondences aren't too close together and to speed up the program a bit.
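A minimal sketch of this detection step is below. The threshold passed to peak_local_max is an assumption for illustration; the project's exact parameters may differ.

```python
from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max

def get_harris_corners(im):
    """Detect Harris corners roughly as described above.

    Returns the per-pixel corner-strength map and an (n, 2) array of
    (row, col) corner coordinates, using the (w + h) / 200 spacing heuristic.
    """
    gray = rgb2gray(im)
    strength = corner_harris(gray)          # corner response at every pixel
    h, w = gray.shape
    min_dist = int((w + h) / 200)           # minimum spacing between detected corners
    coords = peak_local_max(strength, min_distance=min_dist, threshold_rel=0.01)
    return strength, coords
```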


Adaptive Non-Maximal Suppression

The goal of ANMS is to reduce the number of interest points so that feature matching runs more quickly, since the cost of matching is superlinear in the number of interest points. ANMS also ensures that the interest points are distributed evenly across the image.


For each interest point, we calculate $r_i$, the minimum suppression radius. This is given by $$r_i=\min_j|x_i-x_j|\, :\, f(x_i)<c_{\textrm{robust}}f(x_j),\, x_j\in\mathcal{I}$$ where $x_i$ is an interest point coordinate and $\mathcal{I}$ is the set of all interest point coordinates (excluding $x_i$ itself). $f(x_i)$ is the corner strength at coordinate $x_i$, which we read from the corner strength matrix returned by corner_harris. We use $c_{\textrm{robust}}=0.9$ so that a neighbor must have a substantially higher corner strength in order to suppress a point. We then take the top $n_{ip}$ radii and keep the corresponding interest points. In our implementation, we use $n_{ip}=0.4m$, where $m$ is the original number of interest points; this value was chosen somewhat arbitrarily.
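A straightforward O(n^2) sketch of ANMS under these definitions, assuming the coordinates and strength map come from the detection step above:

```python
import numpy as np

def anms(coords, strength, keep_frac=0.4, c_robust=0.9):
    """Adaptive non-maximal suppression.

    coords: (n, 2) array of (row, col) interest points.
    strength: corner-strength map from corner_harris.
    Keeps the keep_frac fraction of points with the largest suppression radii.
    """
    f = strength[coords[:, 0], coords[:, 1]]   # corner strength at each interest point
    n = len(coords)
    radii = np.full(n, np.inf)
    for i in range(n):
        # Points whose strength sufficiently dominates point i.
        stronger = f[i] < c_robust * f
        if stronger.any():
            d = np.linalg.norm(coords[stronger] - coords[i], axis=1)
            radii[i] = d.min()                  # distance to the nearest dominating point
    keep = np.argsort(-radii)[: int(keep_frac * n)]
    return coords[keep]
```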


We can see that the output has fewer interest points after running ANMS (points in yellow).


Feature Descriptor Extraction

We use a simplified version of the feature descriptor described in the original paper.


Some example features.

We take a 40x40 image patch around each interest point and downscale it to 8x8. We then bias/gain-normalize the patch by subtracting its mean and dividing by its standard deviation. We do not worry about rotation invariance; this assumption only works well when there is very little rotation of the camera between shots.
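A minimal sketch of this extraction step, assuming a grayscale image and the (row, col) coordinates produced by ANMS; border handling here is an illustrative choice.

```python
import numpy as np
from skimage.transform import resize

def extract_descriptors(im_gray, coords, patch_size=40, desc_size=8):
    """Extract axis-aligned 8x8 descriptors from 40x40 patches (no rotation invariance)."""
    half = patch_size // 2
    descs, kept = [], []
    for r, c in coords:
        # Skip interest points whose patch would fall outside the image.
        if r - half < 0 or c - half < 0 or r + half > im_gray.shape[0] or c + half > im_gray.shape[1]:
            continue
        patch = im_gray[r - half:r + half, c - half:c + half]
        small = resize(patch, (desc_size, desc_size), anti_aliasing=True)
        small = (small - small.mean()) / (small.std() + 1e-8)   # bias/gain normalization
        descs.append(small.ravel())
        kept.append((r, c))
    return np.array(descs), np.array(kept)
```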

Feature Matching

We now match feature descriptors that look similar and are thus likely to be good matches. For every feature descriptor in one image, we compute the error between it and each feature descriptor in the other image using the sum of squared differences (SSD). We take the two smallest SSD values, $e_{1-NN}$ and $e_{2-NN}$. Following the threshold suggested by Lowe, we accept a match if $$e_{1-NN}/e_{2-NN} < \epsilon$$ where $\epsilon$ is some threshold; we use $\epsilon=0.4$. This removes many of the incorrect matches. The intuition behind Lowe's technique is that a feature should have only one good match, so the ratio between $e_{1-NN}$ and $e_{2-NN}$ should be low: the error to the true match should be small, while the error to every non-match should be large.
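A brute-force sketch of this matching step, assuming the descriptor arrays from the extraction step above:

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.4):
    """Match descriptors with the Lowe ratio test on SSD.

    Returns a list of index pairs (i, j) where desc1[i] matches desc2[j].
    """
    matches = []
    for i, d in enumerate(desc1):
        ssd = np.sum((desc2 - d) ** 2, axis=1)   # SSD to every descriptor in the other image
        nn = np.argsort(ssd)
        e1, e2 = ssd[nn[0]], ssd[nn[1]]
        if e1 / (e2 + 1e-12) < ratio:            # accept only if clearly better than the 2nd-NN
            matches.append((i, nn[0]))
    return matches
```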


We can see the output has quite a bit fewer points (in blue) after feature matching.


Random Sample Consensus (RANSAC)

We can see that after feature matching, many of the matches are still incorrect. To get rid of these outliers, we use 4-point RANSAC to compute a robust homography estimate. The process is detailed as follows:

  1. Select 4 feature pairs at random.
  2. Compute the exact homography $H$ between the pairs.
  3. Compute the inliers, which are the feature pairs that satisfy $$SSD(p_i^\prime,\, Hp_i)<\epsilon$$

We repeat this process for $n$ steps, keeping track of the largest set of inliers found so far. The largest set of inliers is then used to compute the final homography matrix used in the warp. We run $n=10,000$ iterations and use $\epsilon=0.5$.
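A sketch of this loop, reusing the compute_homography helper sketched earlier; the input points are assumed to be the matched (x, y) pairs from the feature-matching step.

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=10000, eps=0.5):
    """4-point RANSAC for a robust homography estimate.

    pts1, pts2: (n, 2) matched (x, y) points.
    Returns the homography fit to the largest inlier set, plus the inlier indices.
    """
    best_inliers = np.array([], dtype=int)
    n = len(pts1)
    ones = np.ones((n, 1))
    for _ in range(n_iters):
        idx = np.random.choice(n, 4, replace=False)
        H = compute_homography(pts1[idx], pts2[idx])    # exact fit to 4 random pairs
        proj = H @ np.hstack([pts1, ones]).T
        proj = (proj[:2] / proj[2]).T
        err = np.sum((proj - pts2) ** 2, axis=1)        # SSD between H p_i and p_i'
        inliers = np.where(err < eps)[0]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the full inlier set for the final estimate.
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```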


We can see that our output interest point pairs (in cyan) all match.



Final correspondence points.

Some Results

Here are some results comparing manual and automatic point correspondence selection:


As we can see, the two mosaics are pretty similar. On closer inspection, around where the two images meet, the manual-selection mosaic shows some artifacts because the two images are slightly misaligned, while the automatic mosaic has no noticeable artifacts. This is likely because hand-selected point correspondences can be off by a few pixels.


We compare a few more mosaics:



The two images look very similar, but if we zoom into the manual case, the farther bar is slightly misaligned.



Similarly, here we can see the sidewalk is misaligned in the manual selection case, while in the automatic selection there aren't any noticeable artifacts.


Panorama with 4 images.


Challenges

Part A was more annoying than difficult: I spent numerous days trying to get my transformed image to fit fully without getting clipped, and it was very confusing and tedious to keep track of whether I was working with $x$ or $y$ coordinates.


A problem I ran into in Part B was that I kept getting distinct features in one image that mapped to the same feature in the other image. I was unfortunately not able to solve this issue, but I found that this would occur in an image with elements that were very similar in different areas of the image. I ended up just not using the image that had this bug. I think I could have fixed this by messing around with the threshold for the feature matching portion.


Jupyter also kept having memory errors, because I was running rather large images. I ended up downscaling some of my images.

What did I learn?

I really enjoyed this project, though it was rather long and frustrating at times. I learned the most about outlier detection. I had essentially never learned about this subject before this class, but the algorithms I implemented really shed a lot of light on how simple these methods are. It was interesting looking at how the outliers kept being removed at each step, and how the correct matches were ultimately chosen.