Project 4 - Alec Li

Stitching Photo Mosaics


Part A: Image Warping and Mosaicing

Taking Images

Below are some images used in this project.

Images in 6th Floor Soda:

Images from the Space Needle in Seattle:

Images of the Campanile:

Recovering Homographies

Firstly, we need to define correspondence points between pairs of images. With these correspondence points, suppose we have points \(\vec{p}_i\) in image 1 which correspond to points \(\vec{p}_i'\) in image 2.

For example, we have the following correspondence points for the Soda images:

We want to recover a projective transformation such that \[ \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} \]

If we expand this out, we have the following system of equations: \[ \begin{align*} &\begin{cases} ax + by + c = wx' \\ dx + ey + f = wy' \\ gx + hy + 1 = w \end{cases} \\ \implies &\begin{cases} ax + by + c = (gx + hy + 1) x' \\ dx + ey + f = (gx + hy + 1) y' \end{cases} \\ \implies &\begin{cases} ax + by + c - gxx' - hyx' = x' \\ dx + ey + f - gxy' - hyy' = y' \end{cases} \\ \implies &\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \\ \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix} \end{align*} \]

With more than four correspondences, the system is overconstrained, so we solve for an approximate solution using least squares; we simply stack the two rows for each correspondence into the matrix and the constant vector.
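As a rough sketch of this setup (function and variable names are illustrative, not the exact implementation), the least-squares solve might look like the following, using NumPy:

```python
import numpy as np

def compute_homography(pts1, pts2):
    """Estimate the homography mapping pts1 -> pts2 via least squares.

    pts1, pts2: (N, 2) arrays of corresponding (x, y) points, N >= 4.
    (A sketch; names are illustrative.)
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(pts1, pts2):
        # Two rows per correspondence, matching the derivation above.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)

    # Solve the overconstrained system for (a, b, c, d, e, f, g, h).
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```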

Warping Images

Once we have the estimated homography \(\mathbf{H}\), we want to re-project one image onto another to stitch them together. Here, suppose we want to project image1 onto image2.

We'd like to use an inverse warp to ensure the reprojection is smooth, so we first need to compute the bounding box of the final image. We take the four corners of image1 and apply the homography to re-project them onto image2; this gives the polygon in which the warped image will be placed. The next step is to interpolate, for each pixel of the final image, the value it should take from image1. To do this, we use scipy.interpolate.RegularGridInterpolator on the pixels of image1, and use the interpolated values to set the corresponding pixels in the final image.
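A simplified sketch of the inverse warp (ignoring the bounding-box offset and alpha handling; names are illustrative):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def inverse_warp(im1, H, out_shape):
    """Inverse-warp im1 (rows x cols x channels) into an output canvas via H.

    A simplified sketch: assumes the canvas shares image2's coordinate frame.
    """
    Hinv = np.linalg.inv(H)
    rows, cols = out_shape

    # Homogeneous coordinates of every output pixel.
    ys, xs = np.mgrid[0:rows, 0:cols]
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])

    # Map output pixels back into image1's coordinate frame.
    src = Hinv @ coords
    src_x, src_y = src[0] / src[2], src[1] / src[2]

    # Sample image1 at the fractional source locations, channel by channel.
    grid = (np.arange(im1.shape[0]), np.arange(im1.shape[1]))
    out = np.zeros((rows * cols, im1.shape[2]))
    for c in range(im1.shape[2]):
        interp = RegularGridInterpolator(grid, im1[..., c],
                                         bounds_error=False, fill_value=0)
        out[:, c] = interp(np.stack([src_y, src_x], axis=-1))
    return out.reshape(rows, cols, -1)
```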

The next issue arises when we have multiple images but no direct homography between a given pair of them. However, if we have a homography from image A to image B, and from image B to image C, then we can simply multiply the two homographies to compute the homography from image A to image C. In this way, we only need to give pairs of correspondences between images in a chain, and we can always recover the homography between any two given images.

Graph Traversals

Shortest Paths

An additional issue with chaining homographies is that sometimes we don't know which homographies to use in order to chain from image A to image B if there is no direct correspondence. This boils down to a classic graph traversal problem, where we want to find the shortest path between two images, and we can follow this shortest path to decide which homographies to chain.

The graph of images can be constructed by adding edges between images that have correspondence points specified (and no edge otherwise), and where each image is a vertex in the graph.

The edges are unweighted, so the shortest path corresponds to the smallest number of homographies we need to chain to transform one image into another.

Centrality

As a minor optimization (and automation), an additional issue with the image warping is in deciding which image to use as the base image for other images to project onto. We could hard-code this in (or use an additional CLI argument), but a simple way to automatically detect the best center image is to find the "center" of the graph of images (as constructed earlier).

There are many possible metrics for this in graph theory, but the one chosen here is a shortest-path betweenness centrality measure. In essence, this measures the fraction of all-pairs shortest paths that pass through a given node; the most "central" node is the node with the largest betweenness centrality.

Implementation

We use networkx.betweenness_centrality to compute the centrality value of each vertex, and take the vertex with the maximum centrality as the base image.

With this base image finalized, we can then compute a single source shortest paths tree starting from this base image; we don't need an all-pairs shortest path lookup here, since we only care about reprojecting onto the base image. Whenever we want to reproject a new image, we can simply look up the shortest path to the base image and multiply all the homographies along the way.
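A sketch of how the graph construction, centrality, and homography chaining could fit together, using networkx and assuming a dictionary of pairwise homographies (names are illustrative):

```python
import numpy as np
import networkx as nx

def chain_homographies(pair_homographies, num_images):
    """Pick a base image by betweenness centrality and chain homographies to it.

    pair_homographies: dict mapping (i, j) -> homography from image i to image j
    for pairs with annotated correspondences.  A sketch with illustrative names.
    """
    G = nx.Graph()
    G.add_nodes_from(range(num_images))
    G.add_edges_from(pair_homographies.keys())

    # Most "central" image becomes the base that everything projects onto.
    centrality = nx.betweenness_centrality(G)
    base = max(centrality, key=centrality.get)

    # Single-source shortest paths from the base (unweighted BFS).
    paths = nx.single_source_shortest_path(G, base)

    # Chain homographies along each path to get H(img -> base).
    to_base = {}
    for img, path in paths.items():      # path goes base -> ... -> img
        H = np.eye(3)
        for u, v in zip(path, path[1:]):
            # We need the step v -> u; invert if only u -> v is stored.
            step = pair_homographies.get((v, u))
            if step is None:
                step = np.linalg.inv(pair_homographies[(u, v)])
            H = H @ step
        to_base[img] = H
    return base, to_base
```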

As an additional optimization, we can also process images based on their distance to the base image; this allows for the image padding to be as minimal as possible at each iteration, so we don't need to work with gigantic images every single time.

Warp Results

If we perform this naive warp directly on a few images, we have the following result:

Naive warping on first two images
Naive warping on first three images
Naive warping on all images

With the warping implemented, we can also rectify an image. Taking the first image of Soda Hall and the Campanile as examples, we can select four points to map to a rectangle (with appropriate dimensions to ensure the output image is at a good resolution). Below are the results.

Points selected for rectification
Rectified image
Points selected for rectification
Rectified image

One thing to notice here is that these results are very sensitive to tiny errors in the point selection; since the points were selected manually, the large distortion in the top-right corner is due to a very slight misalignment in the correspondence points, combined with the perspective distortion in the images.

The rectified results may also seem like the points aren't actually rectangles, due to the perspective context in the rest of the image (i.e. the sides of the cabinet and the Campanile are visible), but the points selected do actually form a rectangle if isolated.

Blending Images

The next step is to actually blend the images correctly. Here, we can use an alpha channel to determine where some source image actually lies in the accumulated result, and compute a distance transform to aid in blending two projected images together.

The distance transform sets each pixel inside a region (here, inside a projected image's bounding box) to be the distance to the nearest edge. We'll be using Euclidean distance as the measurement, and the final result is normalized to range from 0 to 1.
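A minimal sketch of this step, assuming scipy.ndimage.distance_transform_edt (which gives each pixel's Euclidean distance to the nearest zero/background pixel):

```python
from scipy.ndimage import distance_transform_edt

def normalized_distance_transform(mask):
    """Euclidean distance transform of a binary mask, normalized to [0, 1].

    mask: boolean array that is True inside the warped image's footprint.
    (A minimal sketch of this step.)
    """
    dist = distance_transform_edt(mask)
    return dist / dist.max() if dist.max() > 0 else dist
```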

Below is a visualization of the distance transform for two of the Soda Hall images:

These two distance transforms will then be used as weights for two-band blending. In particular, we first split each image into a low-pass band and a high-pass band, and blend each band separately.

The high-pass images are blended depending on the maximum value of the corresponding distance transforms. That is, for a given pixel, its value in image A will be taken if its distance transform in image A is larger than its distance transform in image B (and vice versa).

The low-pass images are blended according to a weighted linear combination of the two images, with the distance transforms acting as the weights for the two images.
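Putting the two bands together, a sketch of the blending step might look like the following (names and the Gaussian sigma are illustrative; the low-pass band here is a simple Gaussian blur):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def two_band_blend(im_a, im_b, dist_a, dist_b, sigma=5):
    """Two-band blend of two aligned images using distance-transform weights.

    im_a, im_b: float arrays (rows x cols x channels) on the same canvas;
    dist_a, dist_b: their normalized distance transforms.  A sketch.
    """
    # Split each image into a low-pass band and a high-pass (detail) band.
    low_a = gaussian_filter(im_a, sigma=(sigma, sigma, 0))
    low_b = gaussian_filter(im_b, sigma=(sigma, sigma, 0))
    high_a, high_b = im_a - low_a, im_b - low_b

    wa = dist_a[..., None]
    wb = dist_b[..., None]

    # Low frequencies: weighted average with the distance transforms as weights.
    denom = np.maximum(wa + wb, 1e-8)
    low = (wa * low_a + wb * low_b) / denom

    # High frequencies: winner-take-all on the larger distance transform.
    high = np.where(wa >= wb, high_a, high_b)
    return low + high
```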

The final results with full image blending are shown below. (Source images are at the top of the webpage.)

The image stitching for this part has been quite interesting to implement; the homography estimation and automatic stitching and blending were quite satisfying to see work, and the graph theory connections that popped up were fun to integrate as well.


Part B: Feature Matching and Autostitching

The second part of the project involves automatic feature matching and auto-stitching.

Corner Detection

Firstly, in order to find interest points, we focus on the "corners" of the image. To find these corners, we use the Harris interest point detector, taking peaks in the corner strength matrix as the corner points.
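A minimal sketch of this detection step, assuming scikit-image's corner_harris and peak_local_max (the detector parameters are illustrative):

```python
from skimage.feature import corner_harris, peak_local_max

def harris_points(im_gray, min_distance=3):
    """Harris corner strength map and its local peaks (a sketch using skimage).

    im_gray: 2-D grayscale image in [0, 1].
    """
    h = corner_harris(im_gray)                              # corner strength matrix
    coords = peak_local_max(h, min_distance=min_distance)   # (row, col) peaks
    return h, coords
```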

Below is an example of the Harris interest points for a Soda Hall image:

Image with labeled Harris points
Harris matrix

Adaptive Non-maximal Suppression (ANMS)

One thing to notice with the Harris interest points is that they're very clustered together, with a lot of redundant information. Adaptive non-maximal suppression (ANMS) is an algorithm to more evenly distribute the interest points around the image, so that feature matching is a lot more effective.

In implementation, ANMS takes in a radius, and we use a \(k\)-d tree to find the set of neighboring points within this radius, suppressing a point whenever a neighbor within the radius has a larger corner strength. If we instead have a target number of points for ANMS to filter down to, we can use binary search to find the radius that yields the desired number of interest points. The algorithms used here are derived from those described in "Efficient adaptive non-maximal suppression algorithms for homogeneous spatial keypoint distribution" by Bailo et al.
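A sketch of the suppression step for a fixed radius (the binary search over the radius would wrap around this function; names are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

def anms(coords, strengths, radius):
    """Suppress points that have a stronger neighbor within `radius`.

    coords: (N, 2) interest point locations; strengths: (N,) corner strengths.
    A sketch of the suppression step only.
    """
    order = np.argsort(-strengths)          # strongest first
    coords, strengths = coords[order], strengths[order]

    tree = cKDTree(coords)
    keep = np.ones(len(coords), dtype=bool)
    for i in range(len(coords)):
        if not keep[i]:
            continue
        # Suppress weaker points inside the radius around a surviving point.
        for j in tree.query_ball_point(coords[i], r=radius):
            if j != i and strengths[j] < strengths[i]:
                keep[j] = False
    return coords[keep]
```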

ANMS whittles down the Harris interest points as shown in the below examples:

ANMS with \(r = 16\)
ANMS with \(r = 24\)

Feature Descriptor Extraction

After we have our interest points, we want to extract features for each point. Here, we use axis-aligned features, where the points are sampled from a patch of radius \(r=20\) (size \(40 \times 40\)) with spacing \(s=5\) (feature size \(8 \times 8\)). The samples are taken from a low-pass filtered image (a Gaussian blur equivalent to a downscale for the same sampling frequency) to avoid aliasing.

After sampling, the features are flattened and normalized (to have \(\mu = 0\) and \(\sigma = 1\)) as well.
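A sketch of the descriptor extraction, assuming grayscale input and interest points far enough from the image border (names and the blur sigma are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_features(im_gray, coords, radius=20, spacing=5):
    """Extract axis-aligned 8x8 descriptors from a 40x40 patch around each point.

    coords: integer (row, col) interest points.  A sketch of this step.
    """
    # Low-pass filter to avoid aliasing when sampling every `spacing` pixels.
    blurred = gaussian_filter(im_gray, sigma=spacing / 2)

    feats = []
    for r, c in coords:
        patch = blurred[r - radius:r + radius:spacing,
                        c - radius:c + radius:spacing]   # 8 x 8 samples
        vec = patch.ravel().astype(float)
        # Bias/gain normalization: zero mean, unit standard deviation.
        feats.append((vec - vec.mean()) / (vec.std() + 1e-8))
    return np.array(feats)
```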

Features from the Soda Hall image are shown below (without normalization so the features are clearer):

Feature Matching

With features extracted, the next step is to match features together between two images to form correspondences. Here, we use two \(k\)-d trees (one for each image) for efficient nearest-neighbors queries, and iterate through each feature in one image to find the best matching feature in the second image. We also cross-check features (to ensure the best match is in both directions) to reduce outliers.

To reduce outliers further, we also use Lowe's trick, thresholding the ratio between the nearest and second-nearest neighbors; the intuition is that the closest neighbor should be much better than the second-closest neighbor if it is a real match.
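A sketch of the matching step with both the ratio test and cross-checking, using SciPy \(k\)-d trees (names are illustrative):

```python
from scipy.spatial import cKDTree

def match_features(feats1, feats2, ratio=0.8):
    """Match descriptors with Lowe's ratio test and symmetric cross-checking.

    feats1, feats2: (N, d) descriptor arrays.  Returns index pairs (i, j).
    A sketch with illustrative names.
    """
    tree1, tree2 = cKDTree(feats1), cKDTree(feats2)

    # For each feature in image 1, its two nearest neighbors in image 2.
    dists, idx = tree2.query(feats1, k=2)
    # Nearest neighbor in image 1 for each feature in image 2 (for cross-check).
    _, back = tree1.query(feats2, k=1)

    matches = []
    for i in range(len(feats1)):
        j = idx[i, 0]
        # Lowe's ratio test: best match must be clearly better than second best.
        if dists[i, 0] / (dists[i, 1] + 1e-12) > ratio:
            continue
        # Cross-check: the match must also be the best in the reverse direction.
        if back[j] != i:
            continue
        matches.append((i, j))
    return matches
```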

Below is an illustration of the result of performing feature matching on the ANMS points; Harris interest points are in blue, while ANMS points are in red. Here, cross-checking is disabled to emphasize the effect of RANSAC in the next part.

Matches with threshold \(t = 0.8\)
Matches with no threshold, i.e. \(t = 1\)

RANSAC

Although Lowe's trick does reduce the number of outliers, there will still inevitably be several incorrect pairs remaining. The issue with these incorrect pairs in the correspondence points is that least-squares is not very robust against outliers; a single pair of incorrect points can skew the least-squares result wildly, which also skews the computed homography.

To increase the robustness of least-squares when finding the desired homography, we can use RANSAC; we randomly sample 4 pairs of points, and compute the exact homography through these points, keeping track of the inliers (points that are correctly transformed by the homography) as we go. At the very end, we keep the homography that produces the most inliers; this essentially ignores the correspondence points that do not follow the majority transformation, increasing the robustness against outliers.
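A sketch of the RANSAC loop, reusing the compute_homography least-squares solver sketched earlier (the iteration count and inlier threshold are illustrative):

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=1000, eps=2.0):
    """Estimate a homography robustly with RANSAC.

    pts1, pts2: (N, 2) arrays of matched points.  eps is the inlier threshold
    in pixels.  A sketch with illustrative names.
    """
    rng = np.random.default_rng(0)
    best_inliers = np.zeros(len(pts1), dtype=bool)

    for _ in range(n_iters):
        sample = rng.choice(len(pts1), size=4, replace=False)
        H = compute_homography(pts1[sample], pts2[sample])  # exact fit on 4 pairs

        # Project all of pts1 and measure reprojection error against pts2.
        ones = np.ones((len(pts1), 1))
        proj = H @ np.hstack([pts1, ones]).T
        proj = (proj[:2] / proj[2]).T
        inliers = np.linalg.norm(proj - pts2, axis=1) < eps

        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers

    # Refit on all inliers of the best model.
    return compute_homography(pts1[best_inliers], pts2[best_inliers]), best_inliers
```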

Below is an illustration of the result of performing RANSAC on the matched features. Here, cross-checking is disabled for emphasis on the effect of RANSAC.

Matches with RANSAC and threshold \(t = 0.8\)
Matches with RANSAC and no threshold, i.e. \(t = 1\)

Here, we can see that even without the thresholding, RANSAC very effectively removes the outliers and the incorrect pairs in the image.

Results

Below are the results using automatic feature matching and RANSAC. (Differences can be a lot clearer when opening them individually and zooming in.)

Soda Hall panorama with manual correspondences
Soda Hall panorama with automatic correspondences
Seattle panorama with manual correspondences
Seattle panorama with automatic correspondences
Campanile panorama with manual correspondences
Campanile panorama with automatic correspondences

The automatic image stitching in this part has been very satisfying to implement, automating many of the manual tasks from the previous part. RANSAC and ANMS were interesting to learn about and implement as well, and making visualizations throughout the process helped show how these algorithms work and how effective they are.


Further Additions

Automatic Panorama Recognition and Stitching

Given a directory of images, the goal of automatic panorama recognition is to pick the best groups of images and stitch them into a panorama without any human feedback. The brute force way to do this would simply be to iterate through every single possible grouping of images, but this is exponential in the number of images and would take far too long.

A smarter way would be to construct a graph on images in the folder, where each vertex is an image, and there are edges between every pair of images. Each edge would have a weight equal to the number of detected correspondence points between the images---a pair of images with not much similarity would have a small weight, and a pair of images with high similarity would have a large weight.

These edges will then be filtered further by a threshold value (a hyperparameter chosen based on the number of final correspondence points in the graph), so that we disconnect the graph of images into connected components of highly similar images.

For each one of these connected components, we then compute a maximum spanning tree, using the prior edge weights. This maximum spanning tree will give us the minimum number of correspondence pairs required to stitch all the images in the connected component together, while following pairs with the highest similarity.
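A sketch of this grouping step, assuming a dictionary of pairwise match counts (the threshold value and names are illustrative):

```python
import networkx as nx

def group_panoramas(match_counts, threshold=20):
    """Group images into panoramas and pick the correspondence pairs to stitch.

    match_counts: dict mapping (i, j) -> number of verified matches between
    image i and image j.  A sketch with illustrative names.
    """
    G = nx.Graph()
    for (i, j), n in match_counts.items():
        if n >= threshold:                 # drop weakly-similar pairs
            G.add_edge(i, j, weight=n)

    groups = []
    for component in nx.connected_components(G):
        sub = G.subgraph(component)
        # Maximum spanning tree: fewest pairs, keeping the strongest matches.
        mst = nx.maximum_spanning_tree(sub, weight="weight")
        groups.append(list(mst.edges))
    return groups
```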

We can apply this panorama recognition to a directory of 50 images (all from when I was in Seattle a couple of years ago). Below is a visualization of the graph of connected components.

After computing the maximum spanning trees, and stitching the images together, we have the following panoramas, ordered in increasing size of the connected component of images:

Batch 0
Batch 1
Batch 2
Batch 3
Batch 4
Batch 5

There are some artifacts in these panoramas, mostly due to the inconsistencies across the images; they were not taken with panorama stitching in mind, so they have varying camera intrinsics, zooms, etc. This causes a lot of trouble with the homography matching (especially apparent in batch 1, though the other batches also have some misalignments), since there are a lot of similarities between images, with no possible projective transform to align them.

Further, due to the large distortions at the left and right edges of the panoramas, some images were left out of the stitch, as they made the result far too large (e.g. 20k pixels along one side). Quite a few images were left out, especially in the later batches with more images stitched together.

Cylindrical Projections

Rectilinear projections have a major drawback of large distortions at the edges; as the field of view becomes wider, the distortions become more and more extreme. This can be seen in the Soda Hall panoramas shown earlier.

A cylindrical projection solves this issue; we can reproject the images onto a cylinder, and stitch these projections together instead. To compute this cylindrical projection, we have the following equations: \[ \begin{align*} \theta &= f \arctan\mathopen{}\left(\frac{x}{f}\right) \\ h &= f \cdot \frac{y}{\sqrt{x^2 + f^2}} \end{align*} \] The projected image will use \((\theta, h)\) coordinates as the new axes, after shifting and resizing the image bounds.

To warp the image pixels, we use an inverse warp, projecting the pixels in the image bounds from \((\theta, h)\) coordinates back into \((x, y)\) coordinates to interpolate the pixels. Here, the inverse warp can be computed with: \[ \begin{align*} \hat{x} &= \sin\mathopen{}\left(\frac{\theta}{f}\right) & x &= f \cdot \frac{\hat{x}}{\hat{z}} + c_x \\ \hat{y} &= \frac{h}{f} & y &= f \cdot \frac{\hat{y}}{\hat{z}} + c_y \\ \hat{z} &= \cos\mathopen{}\left(\frac{\theta}{f}\right) \end{align*} \] Here, \((c_x, c_y)\) marks the translation to the center of the image.
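A sketch of this inverse map from cylindrical pixel coordinates back to image coordinates, directly following the equations above (names are illustrative):

```python
import numpy as np

def cylindrical_inverse_map(theta, h, f, cx, cy):
    """Map cylindrical (theta, h) coordinates back to image (x, y) coordinates.

    theta, h: arrays of cylindrical coordinates (shifted so the image center
    maps to theta = h = 0); f: focal length in pixels; (cx, cy): image center.
    """
    x_hat = np.sin(theta / f)
    y_hat = h / f
    z_hat = np.cos(theta / f)

    x = f * x_hat / z_hat + cx
    y = f * y_hat / z_hat + cy
    return x, y
```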

To ensure the inverse warping is of the highest quality, we perform all of the transformations at once; in particular, we first reproject the original image into cylindrical coordinates, and then apply the homography to project to the second image. When computing the homographies, we use the reprojected points (without projecting the image, since we don't need the actual pixels to compute the homographies), so that the homographies can be generalized to other kinds of projections.

With this cylindrical projection, we have the following results:

Soda hall, \(f=700\)
Campanile, \(f=1000\)
Seattle batch 0, \(f=600\)
Seattle batch 1, \(f=600\)
Seattle batch 2, \(f=500\)
Seattle batch 3, \(f=1000\)
Seattle batch 4, \(f=600\)
Seattle batch 5, \(f=500\)

An observation to make here is that the distortion at the left and right edges is no longer present; we instead have a more even distortion across the entire panorama, making a much more visually appealing image. However, there is still some distortion at the bottom and top of the image, especially evident with tall photographs. This is the drawback of using cylindrical projections; wide images work great, but tall images do not.

The focal lengths were tuned manually by looking at the outputs, since the units are tricky to match and the EXIF data for most of the images is not available.

Spherical Projections

Spherical projections are another way to combat the distortions at the edges of rectilinear projections, with the additional advantage that tall images have less distortion at the bottom and top. With spherical projections, we have a similar process as we did for cylindrical projections; we first reproject the image into a sphere, and stitch these projections together instead.

Computing the spherical projection is a little more involved, due to the nuances in the inverse trigonometric functions. Firstly, we need to project our image point from \((x, y, f)\) onto the unit sphere: \[ \begin{align*} x' &= \frac{x}{\sqrt{x^2 + y^2 + f^2}} \\ y' &= -\frac{f}{\sqrt{x^2 + y^2 + f^2}} \\ z' &= \frac{y}{\sqrt{x^2 + y^2 + f^2}} \end{align*} \] Here, we orient the image plane in the \(xz\)-plane; spherical coordinates are defined with an angle from the \(xy\)-plane, so placing the image plane in the \(xy\) plane causes issues. The plane is also placed in the \(-y\) side of the axis, and flipped vertically, so that the computations work out better (the final image tends to be flipped across certain axes otherwise; some of these angles are computed with respect to a particular axis, and these hacky flips are necessary to keep it consistent).

After projecting onto the unit sphere, we can now convert these Cartesian coordinates into spherical coordinates: \[ \begin{align*} \theta &= f \arctan\mathopen{}\left(\frac{y'}{x'}\right) \\ \phi &= f \arccos(z') \end{align*} \] Here, since we're on the unit sphere, this conversion is a lot simpler than the generic one. The factor of \(f\) rescales the coordinates to our final image size. After shifting and resizing the image bounds, we have our spherical reprojection; the resulting image will have axes in \((\theta, \phi)\) rather than in \((x, y)\).

To warp the image pixels, we can perform an inverse warp, with the following transformations: \[ \begin{align*} \theta' &= \frac{\theta}{f} & \hat{x} &= \cos(\theta') \sin(\phi') & x &= f \cdot \frac{\hat{x}}{\hat{z}} + c_x \\ \phi' &= \frac{\phi}{f} & \hat{y} &= -\cos(\phi') & y &= f \cdot \frac{\hat{y}}{\hat{z}} + c_y \\ & & \hat{z} &= -\sin(\theta') \sin(\phi') \end{align*} \] Again, here we have some hacky sign flipping to ensure the image ends up in the correct orientation, due to how spherical coordinates are defined.
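A corresponding sketch of the spherical inverse map, again following the equations above (names are illustrative):

```python
import numpy as np

def spherical_inverse_map(theta, phi, f, cx, cy):
    """Map spherical (theta, phi) coordinates back to image (x, y) coordinates.

    Follows the inverse-warp equations above, including the sign flips needed
    to keep the image orientation consistent.  f: focal length in pixels;
    (cx, cy): image center.
    """
    theta_p = theta / f
    phi_p = phi / f

    x_hat = np.cos(theta_p) * np.sin(phi_p)
    y_hat = -np.cos(phi_p)
    z_hat = -np.sin(theta_p) * np.sin(phi_p)

    x = f * x_hat / z_hat + cx
    y = f * y_hat / z_hat + cy
    return x, y
```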

With this spherical projection, we have the following results:

Soda hall, \(f=700\)
Campanile, \(f=1000\)
Seattle batch 0, \(f=600\)
Seattle batch 1, \(f=600\)
Seattle batch 2, \(f=500\)
Seattle batch 3, \(f=1000\)
Seattle batch 4, \(f=600\)
Seattle batch 5, \(f=500\)

Here, notice that the images are quite similar to the cylindrical projections, except for smaller distortions at the top and bottom of the images. In particular, the Campanile panorama (a vertical panorama) looks a lot better with a spherical projection compared to a cylindrical projection.