CS180 Project 4

Theory

$q = Hp$, with $q$, $p$ both in homogeneous coordinates

$$\left(\begin{array}{c} wq_1 \\ wq_2 \\ w \end{array}\right) = \left(\begin{array}{ccc} a & b & c \\ d & e & f \\ g & h & 1 \end{array}\right) \left(\begin{array}{c} p_1 \\ p_2 \\ 1 \end{array}\right)$$

This implies that $w = gp_1 + hp_2 + 1$. We can substitute to get the relations

$(gp_1 + hp_2 + 1)q_1 = ap_1 + bp_2 + c,$

$(gp_1 + hp_2 + 1)q_2 = dp_1 + ep_2 + f,$

or equivalently

$g(p_1q_1) + h(p_2q_1) + q_1 = ap_1 + bp_2 + c$

$g(p_1q_2) + h(p_2q_2) + q_2 = dp_1 + ep_2 + f$

These equations are not linear in $p_i$, $q_i$, but they are linear in the terms $p_i$, $q_i$, $p_iq_j$; with the correspondences known, each equation is therefore linear in the eight unknowns $a$ through $h$.

Stacking this into matrix form, we get

$$\left(\begin{array}{cccccccc} p_1 & p_2 & 1 & 0 & 0 & 0 & -p_1q_1 & -p_2q_1 \\ 0 & 0 & 0 & p_1 & p_2 & 1 & -p_1q_2 & -p_2q_2 \end{array}\right)\left(\begin{array}{c} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{array}\right) = \left(\begin{array}{c} q_1 \\ q_2 \end{array}\right)$$

where the data matrix on the left and the vector on the right are extended by 2 rows for each additional correspondence. For the system not to be underdetermined, we require at least 4 correspondences between the two images. When there are more than 4 correspondences (an overdetermined system), we can use least squares to find a good approximate solution.
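As a sanity check of the construction above, here is a minimal numpy sketch of assembling the stacked system and solving it with least squares (the function name and interface are my own, not the project starter code):

```python
import numpy as np

def compute_homography(p, q):
    """Estimate the homography H mapping points p -> q (both Nx2, N >= 4).

    Builds the 2N x 8 data matrix described above and solves it with least
    squares; a sketch, not the exact project code.
    """
    A, b = [], []
    for (p1, p2), (q1, q2) in zip(p, q):
        A.append([p1, p2, 1, 0, 0, 0, -p1 * q1, -p2 * q1])
        A.append([0, 0, 0, p1, p2, 1, -p1 * q2, -p2 * q2])
        b.extend([q1, q2])
    h, *_ = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float), rcond=None)
    # Reassemble the 8 unknowns into a 3x3 homography with the bottom-right entry fixed at 1.
    return np.append(h, 1).reshape(3, 3)
```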

Preliminary Results - Rectification

For rectification, a flat planar surface in the image was warped onto a rectangular region (hand-chosen coordinates). One issue that I faced, particularly with the iPad box, was numerical instability: the data matrix was very ill-conditioned and led to a nonsensical projective transform. I verified that even scipy's own ProjectiveTransform would not give a good result in this situation. To remedy this, I tried regularizing the least squares problem, but all it did was distort the good projections while having almost no effect on the bad ones. The ultimate fix was to restrict the domain: instead of warping the corners of the image and using those to generate a mask, I cropped the region to be inverse-warped, which helped tremendously.

The approach taken to actually perform the warp was very similar to that of the previous project: the corners of the image were warped using the computed homography to size the output region and draw a mask, then the inverse transform was applied to each output pixel and interpolation was used to sample the source before blitting back onto the canvas for the final result.
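A minimal sketch of that inverse-warping step, assuming a single-channel image and a homography `H` that maps source pixel coordinates to output canvas coordinates (the helper name and details are illustrative, not the exact project code):

```python
import numpy as np
from scipy.ndimage import map_coordinates

def inverse_warp(im, H, out_shape):
    """Inverse-warp single-channel `im` through homography H onto an out_shape canvas."""
    rows, cols = out_shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    # Pull every output pixel back into source coordinates with H^{-1}.
    src = np.linalg.inv(H) @ pts
    src_x, src_y = src[0] / src[2], src[1] / src[2]
    # Bilinear interpolation at the (fractional) source locations; out-of-range samples become 0.
    warped = map_coordinates(im, [src_y, src_x], order=1, cval=0.0)
    # Mask of output pixels that actually landed inside the source image.
    mask = (src_x >= 0) & (src_x < im.shape[1]) & (src_y >= 0) & (src_y < im.shape[0])
    return warped.reshape(out_shape), mask.reshape(out_shape)
```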

Preliminary Results - Mosaic

The warping mechanism here is the same as in rectification, but the homography is computed from correspondences rather than from a hardcoded target region in the final image. For blending, I decided to go with a distance-transform-based approach, where the alpha in the overlapping region is $\frac{d_1}{d_1 + d_2}$, with $d_1$ the distance to the closest non-overlap point in image 1 and $d_2$ the same for image 2. This yielded OK results (slight artifacting due to misalignment); I think I might try multiresolution blending for stage 2 of the project. Most issues were avoided by keeping the lighting uniform, manually holding exposure constant, and being careful when taking the images to maintain a constant center of projection relative to the distances of the objects being photographed.
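A rough sketch of that blending rule using `scipy.ndimage.distance_transform_edt`, assuming both warped images already sit on a shared canvas (zero outside their footprints) with boolean masks marking each footprint; single channel for brevity:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_blend(im1, mask1, im2, mask2):
    """Feather two warped images with the d1 / (d1 + d2) weighting described above."""
    d1 = distance_transform_edt(mask1)  # distance to the nearest pixel outside image 1's footprint
    d2 = distance_transform_edt(mask2)  # same for image 2
    alpha = d1 / np.maximum(d1 + d2, 1e-12)  # ~1 deep inside image 1, ~0 deep inside image 2
    return alpha * im1 + (1 - alpha) * im2
```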

Touched a little bit of grass

A most cool room in San Diego

Visualization of distance transform from right image.

Part 2: Automatic Correspondence Detection

Corner Detection

For this part, we were allowed to use a pre-implemented version of the Harris corner detector. The theory behind it is not too complex: by computing a sum of products of gradients over a patch of the image, then computing the eigenvalues of that matrix, we can determine whether the sum-of-squared-intensity-difference metric is sufficiently convex around that patch to consider it a corner (both eigenvalues small → probably a flat region; one eigenvalue small → probably something like an edge). All points above a certain threshold (and not too close to each other) are marked as interest points and returned by the starter code. This ends up creating an extremely dense set of interest points, which is both slow and impractical for running the later feature matching steps.
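Roughly, this step behaves like the following skimage-based sketch (parameter values are illustrative, and the actual starter code differs):

```python
from skimage.feature import corner_harris, peak_local_max

def harris_interest_points(im_gray, min_distance=1, threshold_rel=0.01):
    """Harris response plus local-maximum interest points (a sketch of the provided step)."""
    h = corner_harris(im_gray)                    # corner response from the structure tensor
    coords = peak_local_max(h, min_distance=min_distance, threshold_rel=threshold_rel)
    return h, coords                              # coords are (row, col) interest points
```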

Output of the Harris corner detector without any suppression

Adaptive non-maximal suppression

This part implements the adaptive non-maximal suppression (ANMS) algorithm described in the MOPS paper to select interest points that are sufficiently far from any other interest point that would suppress them by having a substantially larger corner response. In particular, it calculates $r_i = \min_j \|\vec{x}_i - \vec{x}_j\|$ over all $\vec{x}_j$ with $h(\vec{x}_i) < c_{\mathrm{robust}}\,h(\vec{x}_j)$ for every interest point (i.e., the distance to the nearest interest point that is substantially stronger), then picks the top-$K$ by $r_i$ as the new interest points, thereby culling many nearby interest points and distributing the survivors relatively evenly throughout the image.
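A compact sketch of ANMS under those definitions (the function and its parameters are illustrative, not the project starter code):

```python
import numpy as np
from scipy.spatial.distance import cdist

def anms(coords, strengths, n_ip=500, c_robust=0.9):
    """Keep the n_ip points with the largest suppression radii r_i.

    coords: (N, 2) interest point locations; strengths: (N,) Harris responses h(x_i).
    """
    d = cdist(coords, coords)                              # pairwise distances
    # Point j can suppress point i only if h(x_i) < c_robust * h(x_j).
    suppresses = strengths[:, None] < c_robust * strengths[None, :]
    d[~suppresses] = np.inf
    r = d.min(axis=1)                                      # suppression radius for each point
    keep = np.argsort(-r)[:n_ip]                           # largest radii survive
    return coords[keep]
```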

Result of suppression with $n_{ip} = 50$, $c_{\mathrm{robust}} = 0.9$ (in accordance with the paper)

Feature Extraction

As suggested in the paper, subsampling a larger patch around a given interest point to form a feature descriptor improves the performance of the descriptors. However, to avoid aliasing, the image should be appropriately blurred before performing this step, such that frequencies above the Nyquist limit are removed prior to subsampling. For this basic implementation, an axis-aligned patch is used, and the interest points / features are only detected at a single Gaussian pyramid level.
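A sketch of that descriptor step, assuming the MOPS choices of a 40x40 window subsampled to an 8x8 descriptor, integer interest-point coordinates, and the paper's bias/gain normalization (the blur sigma is a rough guess, not a tuned value):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptors(im_gray, coords, patch=40, desc=8):
    """Axis-aligned descriptors: blur (anti-alias), crop, subsample, normalize."""
    blurred = gaussian_filter(im_gray, sigma=patch / desc / 2)  # low-pass before 5x subsampling
    step, half = patch // desc, patch // 2
    kept, descriptors = [], []
    for r, c in coords:
        window = blurred[r - half:r + half:step, c - half:c + half:step]
        if window.shape != (desc, desc):
            continue  # skip interest points too close to the image border
        window = (window - window.mean()) / (window.std() + 1e-8)  # bias/gain normalization
        kept.append((r, c))
        descriptors.append(window.ravel())
    return np.array(kept), np.array(descriptors)
```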

Feature Matching

Once these basic features have been extracted, the pairwise distances between features in one image and features in the other are computed. The ratio of the distance to the first nearest neighbor over the distance to the second nearest neighbor is computed and thresholded (Lowe's trick), as this provides better discriminatory power than thresholding on the 1-nearest-neighbor distance alone.
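A minimal sketch of that matching step (the ratio threshold below is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_features(desc1, desc2, ratio=0.6):
    """Return (i, j) index pairs of matches that pass Lowe's ratio test."""
    d = cdist(desc1, desc2)                            # pairwise descriptor distances
    nn = np.argsort(d, axis=1)[:, :2]                  # 1st and 2nd nearest neighbors in image 2
    rows = np.arange(len(desc1))
    d1, d2 = d[rows, nn[:, 0]], d[rows, nn[:, 1]]
    keep = d1 / (d2 + 1e-12) < ratio                   # accept only clearly unambiguous matches
    return np.stack([rows[keep], nn[keep, 0]], axis=1)
```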

Example result for matching; lines connect corresponding features. You can clearly see some spurious correspondences near the blinds and hinges due to the symmetry of these objects making them somewhat nondistinctive, but the next step will resolve this

RANSAC for robust homography estimation

The correspondences calculated in the previous step are quite solid; however, there are still some incorrect correspondences / outliers which would seriously degrade the quality of a least-squares solution computed directly from them. To fix this, a randomized consensus algorithm is used: a subset of the matches is picked, a homography (least-squares solution) is calculated from it, and the points that are "well-explained" by that homography (i.e., the L2 distance between the transformed point and its match is below a threshold) are counted as inliers. The homography with the largest number of inliers is chosen as the correct one; its inliers are then gathered and a final least-squares homography is calculated from all of them. The warping and stitching then proceed the same as in the first part.
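A sketch of that RANSAC loop, reusing the hypothetical `compute_homography` helper from the theory section (iteration count and inlier threshold are placeholder values):

```python
import numpy as np

def ransac_homography(p, q, n_iters=1000, eps=2.0, seed=0):
    """Fit a homography p -> q (both Nx2 matched points) robustly to outliers."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(p), dtype=bool)
    for _ in range(n_iters):
        idx = rng.choice(len(p), size=4, replace=False)      # minimal sample
        H = compute_homography(p[idx], q[idx])
        ph = np.column_stack([p, np.ones(len(p))]) @ H.T     # transform all of p
        proj = ph[:, :2] / ph[:, 2:3]
        inliers = np.linalg.norm(proj - q, axis=1) < eps     # "well-explained" points
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on every inlier of the best model.
    return compute_homography(p[best_inliers], q[best_inliers]), best_inliers
```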

The result of automatic stitching. This is almost indistinguishable from the one done using manual correspondences without some serious pixel peeping; some of the area around the medicine bottle on the right is a tiny bit misaligned but that was the only immediately visible defect compared to the manually annotated version.

More Results

Results of ANMS on the bedroom. It doesn't appear to distribute features especially evenly, though many areas (e.g. carpet, wallpaper) are essentially featureless, so that might be expected

Result of correspondence matching. The threshold I set was somewhat low, which is probably why many more of these are correct than was demonstrated in lecture. RANSAC will have an easy time finding a good homography
Again, almost indistinguishable from manually setting correspondences. This one actually appeared to do better on the slats of the bed's headboard; I was a bit sloppy when annotating those manually.

Much better spread of keypoints throughout the image

Most correspondences seem to be among the branches of the tree, which makes sense…
Image is quite aggressively skewed due to different views, but there isn’t any obvious artifacting in the overlapping regions.

Bells and Whistles: Purely Rotation Transformation Model

Camera 1 has extrinsics $TR$, where $R$ is a block matrix with the 3x3 rotation in the upper left and 0's elsewhere except for a 1 in the bottom right

$$T = \left(\begin{array}{cccc} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \end{array}\right)$$

$T$ performs the projection (implicitly) by setting $w'$ equal to the (rotated and translated) $z$:

$x' = x + t_x$

$y' = y + t_y$

$w' = z + t_z$

$u = f_x x' + o_x$

$v = f_y y' + o_y$

where $f$ is the focal length of the camera; the conversion from physical distance to pixel distance is based on the sensor (pixels / mm). In theory, for a pinhole camera $f_x$ should equal $f_y$, but this doesn't hold in reality due to imperfect sensors having non-square pixels, lens / optics issues, etc. $o_x$, $o_y$ account for the fact that the center of the image is not at (0, 0); by convention, the top left of the image is at (0, 0).

The intrinsics matrix is thus just $K = \left(\begin{array}{ccc} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{array}\right)$ with 4 DoF (we ignore skew)

$K$ is trivially invertible (full rank), so inverting it then applying it to a (homogeneous) image coordinate gives us $x'$, $y'$, and $w'$ (which will always just be 1)

This corresponds to a point in space which is exactly one focal length in front of the pinhole (i.e. on the virtual image plane). We don't know the actual depth due to the fundamental depth ambiguity, but we can just do the rotation directly assuming a distance of 1 away.

Rotate that point in 3D space using a normal rotation matrix, then project back using $K$ (assuming the same camera took both images)

Overall, $H = KRK^{-1}$ is our model; we need to fit it to correspondences using a technique similar to the one used for the general homography

Solving for focal length and rotation simultaneously is somewhat difficult. More tractable is estimating $f$ separately, then solving for $R$ alone using an SVD-based approach. In particular, the Kabsch algorithm provides an analytic solution for the optimal rotation matrix.
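A sketch of that pipeline, assuming a known focal length $f$ and principal point $(o_x, o_y)$; the Kabsch step is the standard SVD construction rather than code from the project:

```python
import numpy as np

def rotation_homography(p, q, f, ox, oy):
    """Fit H = K R K^{-1} to matched points p -> q (both Nx2) with a fixed K."""
    K = np.array([[f, 0, ox], [0, f, oy], [0, 0, 1.0]])
    Kinv = np.linalg.inv(K)

    def to_rays(pts):
        rays = (Kinv @ np.column_stack([pts, np.ones(len(pts))]).T).T
        return rays / np.linalg.norm(rays, axis=1, keepdims=True)  # unit viewing directions

    P, Q = to_rays(p), to_rays(q)
    # Kabsch: SVD of the covariance of corresponding directions gives the best rotation P -> Q.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return K @ R @ Kinv
```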

I attempted to find the focal length of my camera by using the image below and doing trigonometry (the distance between the marks was known, and the distance between the camera and the board was known; both are 11 inches, the length of one sheet of A4 paper), and I got a value of roughly 3000 pixels for my Samsung Galaxy S21 phone.
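Presumably (my reconstruction of the setup, not a quoted derivation), the similar-triangles relation is that a mark separation $X$ at distance $Z$ from the camera spans $d_{\mathrm{px}}$ pixels, so

$$f \approx d_{\mathrm{px}} \cdot \frac{Z}{X},$$

and with $Z = X$ the focal length in pixels is just the measured pixel separation between the marks.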

Applying this approach with the above calculated $f$ to the outside scene, this result was obtained. It is very obviously incorrect.

Choosing a focal length of roughly 1.3 times the original focal length solved the issue (though the artifacting is much worse than that of using a standard homography); this demonstrates that this approach is extremely sensitive to the focal length.

Bells and Whistles: Cylindrical Projection

The goal is to project from the plane whose center is $f$ away from the pinhole onto a radius-$f$ cylinder around the pinhole. Given an image point $(u, v)$, inverting $K$ (the intrinsics matrix) and applying it to $(u, v, 1)$ gives $(x', y', 1)$, which corresponds to a point in space one focal length in front of the pinhole.

Then, projecting that camera-space coordinate onto the cylinder (with unit radius) gives $(\frac{x'}{r}, \frac{y'}{r}, \frac{1}{r})$, where $r = \sqrt{(x')^2 + 1^2}$ (every coordinate is scaled by $\frac{1}{r}$ so the point lands on the cylinder).

Converting to cylindrical coordinates, this is $(\sin\theta, h, \cos\theta)$ where $h = \frac{y'}{r}$, $\theta = \mathrm{atan2}(b, a)$, $b = \frac{x'}{r}$, $a = \frac{1}{r}$.

Finally, when converting to image coordinates, notice that the ratio of arc lengths / vertical distances between the unit cylinder and the cylinder of radius $f$ is just $f$ by similar triangles: $\tilde{x} = f\theta + o_x$, $\tilde{y} = fh + o_y$.

The inverse warp process is actually in some sense simpler. The domain to consider is just $\tilde{x} \in [0, 2\pi f)$ and $\tilde{y} \in [0, H)$, where $H$ is the height of the source image. Then, to get the equivalent position in rectilinear image-space coordinates, compute $\theta = \frac{\tilde{x} - o_x}{f}$ and $h = \frac{\tilde{y} - o_y}{f}$ for all points in the output, compute $(x, y, z) = (\sin\theta, h, \cos\theta)$, then "unproject" back onto the imaging plane by computing $(\frac{x}{z}, \frac{y}{z}, 1)$ and mapping back through $K$. This gives the point to sample from in the source image.
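A sketch of that inverse warp, assuming a single-channel source whose own dimensions are reused for the output canvas (rather than the full $[0, 2\pi f)$ domain), with $K$ applied at the end to return to pixel coordinates:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def cylindrical_project(im_gray, f, ox, oy):
    """Inverse-warp a rectilinear image onto a radius-f cylinder (single channel)."""
    rows, cols = im_gray.shape
    ys, xs = np.mgrid[0:rows, 0:cols]
    theta = (xs - ox) / f                      # angle around the cylinder for each output pixel
    h = (ys - oy) / f                          # height on the unit cylinder
    x, y, z = np.sin(theta), h, np.cos(theta)
    # Unproject onto the z = 1 image plane, then map back to pixels through K.
    with np.errstate(divide="ignore", invalid="ignore"):
        u = f * x / z + ox
        v = f * y / z + oy
    bad = z <= 0
    u[bad], v[bad] = -1, -1                    # points behind the camera sample nothing
    # Bilinear sampling of the source; out-of-range locations fill with 0.
    return map_coordinates(im_gray, [v, u], order=1, cval=0.0)
```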

Cylindrical projection is extremely dependent on $f$; this is $f = 50$ pixels,
while this is $f = 100$ pixels.

Bells and Whistles: Cylindrical Panorama

This is basically just doing the above over and over to gradually build up a large image. In particular, the pairwise correspondences between an ordered set of images are found. The approach I took was to first find correspondences in the unwarped images, then forward-warp the coordinates of those points into cylindrical coordinates, though I suppose there is nothing wrong with finding correspondences between the warped images directly.

Computing a homography between cylindrically projected images is simple, since rotation corresponds to translation in cylindrical coordinates. Thus, it is nearly trivial: just averaging the translation vectors between corresponding points is enough (though I did run RANSAC just to be safe against outliers). Leveraging the code from the first part, each warped image is bolted onto the previous result to extend the panorama. Focal length is still a problem, but I got the following result. It is also quite difficult to muster the computational power for the panorama due to the extreme slowness of the provided Harris corner detection function, and the focal length, the accumulation of error across the image, and my inability to maintain a consistent center of projection (no tripod) lead to some strange-looking artifacting throughout. Overall, though, for such a simple approach, I'm actually pretty happy with the result.
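A sketch of that translation estimate under those assumptions (matched points already expressed in cylindrical image coordinates; thresholds are illustrative):

```python
import numpy as np

def estimate_translation(p, q, n_iters=500, eps=3.0, seed=0):
    """Translation aligning warped image 1 onto warped image 2 from matches p -> q (Nx2)."""
    rng = np.random.default_rng(seed)
    diffs = q - p                                    # each match proposes a translation
    best_inliers = np.zeros(len(p), dtype=bool)
    for _ in range(n_iters):
        t = diffs[rng.integers(len(p))]              # hypothesize from a single match
        inliers = np.linalg.norm(diffs - t, axis=1) < eps
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return diffs[best_inliers].mean(axis=0)          # average the inlier translations
```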

Cool stuff I learned

Application of linear algebra to computer graphics is always a welcome thing to learn about.