CS180 Project 4
Theory
We model the mapping between the two images as $p' = Hp$, with

$$p = \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \quad p' = \begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix}, \quad H = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix},$$

with $p$, $p'$ both in homogeneous coordinates (fixing the bottom-right entry of $H$ to 1 removes the scale ambiguity). This implies that $w = gx + hy + 1$. We can substitute to get the relations

$$x' = \frac{ax + by + c}{gx + hy + 1},$$
$$y' = \frac{dx + ey + f}{gx + hy + 1},$$

or equivalently

$$ax + by + c - gxx' - hyx' = x',$$
$$dx + ey + f - gxy' - hyy' = y'.$$

In the first form, these constraints are not linear in the unknowns $g$, $h$ (they appear in the denominator), but after rearranging they are linear in all of $a, b, \ldots, h$.
Stacking this into matrix form, we get

$$\begin{bmatrix} x & y & 1 & 0 & 0 & 0 & -xx' & -yx' \\ 0 & 0 & 0 & x & y & 1 & -xy' & -yy' \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} x' \\ y' \end{bmatrix}$$
where the left data matrix and the right-hand-side vector are extended by 2 rows for each additional correspondence. For this system not to be underdetermined, we require at least 4 correspondences between the two images. When there are more than 4 correspondences (an overdetermined system), we can use least squares to find a good approximate solution.
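As a concrete sketch, the least-squares setup above can be implemented as follows (the function name is my own; it assumes $(x, y) \to (x', y')$ pixel correspondences as rows of two arrays):

```python
import numpy as np

def compute_homography(src, dst):
    """Estimate the 3x3 homography mapping src -> dst via least squares.

    src, dst: (N, 2) arrays of corresponding points, N >= 4.
    The bottom-right entry of H is fixed to 1, giving 8 unknowns.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # two rows per correspondence, matching the stacked system above
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b += [xp, yp]
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```

With exactly 4 correspondences this solves the system exactly; with more, it returns the least-squares fit.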
Preliminary Results - Rectification
For rectification, a flat planar surface in the image was warped onto a rectangular region (hand-chosen coordinates). One issue I faced, particularly with the iPad box, was numerical instability: the data matrix was very ill-conditioned and led to a nonsensical projective transform. I verified that even scipy's own ProjectiveTransform would not give a good result in this situation. To remedy this, I tried regularizing the least-squares problem, but all that did was distort the good projections while having almost no effect on the bad ones. The ultimate fix was to restrict the domain: instead of warping the corners of the image and using those to generate a mask, I cropped the region to be inverse-warped, which helped tremendously.
The approach taken to actually perform the warp was very similar to that of the previous project. Basically, the corners of the image were warped using the calculated homography, a mask was drawn, then the inverse transform was applied and interpolation was used before blitting back onto the canvas for the final result.
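A minimal sketch of that inverse-warp step, using nearest-neighbor sampling for brevity (the actual implementation uses proper interpolation; names and the output-shape convention are my own):

```python
import numpy as np

def inverse_warp(img, H, out_shape):
    """Inverse-warp img onto a canvas of out_shape (rows, cols).

    H maps source coordinates to destination coordinates, so each
    destination pixel is pulled back through H^{-1} and sampled.
    """
    Hinv = np.linalg.inv(H)
    out_h, out_w = out_shape
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    dest = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = Hinv @ dest
    src = src[:2] / src[2]          # dehomogenize
    sx = np.round(src[0]).astype(int)
    sy = np.round(src[1]).astype(int)
    # keep only destination pixels whose preimage lands inside the source
    valid = (0 <= sx) & (sx < img.shape[1]) & (0 <= sy) & (sy < img.shape[0])
    out = np.zeros((out_h, out_w) + img.shape[2:], dtype=img.dtype)
    out[ys.ravel()[valid], xs.ravel()[valid]] = img[sy[valid], sx[valid]]
    return out
```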
Preliminary Results - Mosaic
The warping mechanism here is the same as in rectification, but the homography is computed from correspondences rather than from a hardcoded target region in the final image. For blending, I went with a distance-transform-based approach: the alpha in the overlapping region is $d_1 / (d_1 + d_2)$, where $d_1$ is the distance to the closest non-overlap point in image 1, and likewise $d_2$ for image 2. This yielded OK results (slight artifacting due to misalignment); I might try multiresolution blending for stage 2 of the project. Most issues were avoided by keeping lighting uniform, manually holding exposure constant, and being careful when taking the images to maintain a constant center of projection relative to the distances of the objects being photographed.
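The alpha computation can be sketched with scipy's Euclidean distance transform (function name and exact masking conventions are my own):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def blend_weights(mask1, mask2):
    """Alpha for image 1: 1 inside its exclusive region, 0 outside it,
    and d1 / (d1 + d2) in the overlap, where d_i is the distance to the
    nearest pixel outside image i's footprint."""
    d1 = distance_transform_edt(mask1)
    d2 = distance_transform_edt(mask2)
    overlap = mask1 & mask2
    alpha = np.zeros(mask1.shape)
    alpha[mask1] = 1.0
    alpha[overlap] = d1[overlap] / (d1[overlap] + d2[overlap])
    return alpha
```

The final mosaic pixel in the overlap is then `alpha * img1 + (1 - alpha) * img2`.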
Touched a little bit of grass
A most cool room in San Diego
Part 2: Automatic Correspondence Detection
Corner Detection
For this part, we were allowed to use a pre-implemented version of the Harris corner detector. The theory is not too complex, though: by computing a sum of products of gradients over a patch of the image and then computing the eigenvalues of the resulting matrix, we can determine whether the sum-of-squared-intensity-differences metric is sufficiently convex around that patch to consider it a corner (both eigenvalues small → probably a flat region; one eigenvalue small → probably something like an edge). All points above a certain threshold (and not too close to each other) are marked as interest points and returned by the starter code. This creates an extremely dense set of interest points, which is both slow and impractical for the later feature matching steps.
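The response computation described above can be sketched with the standard $\det(M) - k\,\mathrm{trace}(M)^2$ scoring (the starter code's implementation may differ in its gradient filters and constants):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def harris_response(img, sigma=1.5, k=0.05):
    """Harris corner response map.

    Smooths the products of image gradients (the entries of the
    structure tensor M) with a Gaussian window, then scores each pixel
    by det(M) - k * trace(M)^2.
    """
    Ix = sobel(img.astype(float), axis=1)
    Iy = sobel(img.astype(float), axis=0)
    Sxx = gaussian_filter(Ix * Ix, sigma)
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2
```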
Adaptive non-maximal suppression
This part implements the adaptive non-maximal suppression (ANMS) algorithm described in the MOPS paper, which selects interest points that are sufficiently far from any other interest point that would suppress them by having a substantially larger corner response. In particular, it calculates a suppression radius $r_i$ for each interest point (the distance to the nearest interest point whose response is substantially larger), then picks the top $K$ by radius as the new interest points, thereby culling many nearby interest points and distributing the survivors relatively evenly throughout the image.
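A brute-force sketch of the radius computation (quadratic in the number of points; names and the robustness constant are assumptions on my part):

```python
import numpy as np

def anms(coords, strengths, num_keep, c_robust=0.9):
    """Adaptive non-maximal suppression.

    For each point, find the distance to the nearest point whose
    robustified strength dominates it, then keep the num_keep points
    with the largest suppression radii.
    """
    # pairwise squared distances, shape (n, n)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    # entry [i, j] is True if point j suppresses point i
    dominates = c_robust * strengths[None, :] > strengths[:, None]
    d2 = np.where(dominates, d2, np.inf)
    radii = d2.min(axis=1)           # squared radius to nearest dominator
    order = np.argsort(-radii)       # largest radius first
    return coords[order[:num_keep]]
```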
Feature Extraction
As suggested in the paper, subsampling a larger patch around a given interest point to form a feature descriptor improves the performance of the descriptors. However, to avoid aliasing, the image should be appropriately blurred before performing this step, such that frequencies above the Nyquist limit are removed prior to subsampling. For this basic implementation, an axis-aligned patch is used, and the interest points / features are only detected at a single Gaussian pyramid level.
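A sketch of the descriptor extraction, using the MOPS numbers (an 8x8 descriptor sampled at a spacing of 5 pixels from a 40x40 window) and bias/gain normalization; blurring in place rather than sampling from a coarser pyramid level is a simplification, and the names are my own:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def extract_descriptor(img, y, x, patch=8, spacing=5):
    """Axis-aligned 8x8 descriptor around interest point (y, x).

    The image is blurred first so that spacing-5 subsampling does not
    alias; the descriptor is then normalized to zero mean and unit std
    (bias/gain invariance). Assumes (y, x) is far enough from the border.
    """
    blurred = gaussian_filter(img.astype(float), sigma=spacing / 2)
    half = patch * spacing // 2
    window = blurred[y - half:y + half:spacing, x - half:x + half:spacing]
    desc = window.ravel()
    return (desc - desc.mean()) / (desc.std() + 1e-8)
```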
Feature Matching
Once these basic features have been extracted, the pairwise distances between features in one image and features in the other are computed. The ratio of the distance to the first nearest neighbor over the distance to the second nearest neighbor is computed and used as the quantity to threshold (Lowe's trick), as this provides better discriminatory power than thresholding on the 1-nearest-neighbor distance alone.
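The ratio test can be sketched as follows (the 0.8 default threshold and function name are my own choices):

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.8):
    """Match rows of desc1 to their nearest neighbors in desc2,
    keeping only matches whose 1-NN/2-NN distance ratio is below
    `ratio` (Lowe's trick). Returns a list of (i, j) index pairs."""
    # pairwise Euclidean distances, shape (n1, n2)
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)
    matches = []
    for i in range(len(desc1)):
        d1, d2 = d[i, nn[i, 0]], d[i, nn[i, 1]]
        if d1 / (d2 + 1e-12) < ratio:
            matches.append((i, nn[i, 0]))
    return matches
```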
RANSAC for robust homography estimation
The correspondences calculated in the previous step are quite solid; however, there are still some incorrect correspondences / outliers that would seriously degrade a least-squares solution computed directly from them. To fix this, a randomized consensus algorithm (RANSAC) is used: subsets of the matches are sampled, a least-squares homography is fit to each subset, and the points that are well explained by that homography (i.e., the L2 distance between the transformed point and its correspondence is below a threshold) are counted as inliers. The homography with the largest number of inliers is chosen as the correct one; its inliers are then gathered and a final least-squares homography is computed from them. The warping and stitching then proceed the same as in the first part.
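A self-contained sketch of this loop (iteration count, threshold, and names are my own; the least-squares fit fixes the bottom-right entry of $H$ to 1 as in the theory section):

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares homography from >= 4 correspondences."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b += [xp, yp]
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)

def ransac_homography(src, dst, n_iters=500, thresh=2.0, seed=0):
    """Sample 4 matches, fit, count inliers; refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    ones = np.ones((len(src), 1))
    for _ in range(n_iters):
        idx = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        proj = np.hstack([src, ones]) @ H.T
        with np.errstate(divide="ignore", invalid="ignore"):
            proj = proj[:, :2] / proj[:, 2:3]   # may be inf/nan if degenerate
        err = np.linalg.norm(proj - dst, axis=1)
        inliers = err < thresh                  # nan/inf compare False
        if inliers.sum() > best.sum():
            best = inliers
    # final least-squares fit on the consensus set
    return fit_homography(src[best], dst[best]), best
```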
More Results
Results of ANMS on the bedroom. The features don't appear to be distributed very evenly, though many areas (e.g. the carpet and wallpaper) are essentially featureless, so that might be expected.
Much better spread of keypoints throughout the image
Bells and Whistles: Purely Rotation Transformation Model
Camera 1 has extrinsics of

$$E = \begin{bmatrix} R & 0 \\ 0^\top & 1 \end{bmatrix},$$

a block matrix with the 3x3 rotation in the upper left and 0's elsewhere, except for a 1 in the bottom right (no translation, since we model pure rotation).

The intrinsics matrix $K$ performs projection (implicitly) by setting the homogeneous coordinate $w$ equal to the $z$-coordinate of the (rotated and translated) point, so that dehomogenization carries out the perspective divide.

Here $f_x$, $f_y$ encode the focal length of the camera; the conversion from physical distance to pixel distance is based on the sensor (pixels / mm). In theory, for pinhole cameras, $f_x$ should be equal to $f_y$, but this doesn't occur in reality due to imperfect sensors having non-square pixels, lens / optics issues, etc. $c_x$, $c_y$ account for the fact that the center of the image is not at (0, 0); rather, the top left of the image is at (0, 0) by convention.

The intrinsics matrix is thus just

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

with 4 DoF (we ignore skew).

$K$ is trivially invertible (full rank), so inverting it, then applying it to a (homogeneous) image coordinate, gives us $x/z$, $y/z$, and $w$ (which will always just be 1).
This corresponds to a point in space which is exactly one focal length in front of the pinhole (i.e. on the virtual image plane). We don't know the actual depth due to the fundamental depth ambiguity, but we can just do the rotation directly assuming a distance of 1 away.
Rotate that point in 3D space using an ordinary rotation matrix $R$, then project back using $K$ (assuming the same camera took both images).

Overall, $H = K R K^{-1}$ is our model; we need to fit it to correspondences using a technique similar to the one used for homographies.
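A small sketch of the model $H = K R K^{-1}$ for a rotation about the vertical axis, with simple intrinsics ($f_x = f_y = f$, no skew); the function name is my own:

```python
import numpy as np

def rotation_homography(f, theta_deg, cx, cy):
    """Homography induced by a pure camera rotation (about the y axis)
    under the model H = K R K^{-1}."""
    K = np.array([[f, 0, cx],
                  [0, f, cy],
                  [0, 0, 1.0]])
    t = np.deg2rad(theta_deg)
    # rotation about the vertical (y) axis
    R = np.array([[np.cos(t), 0, np.sin(t)],
                  [0, 1, 0],
                  [-np.sin(t), 0, np.cos(t)]])
    return K @ R @ np.linalg.inv(K)
```

As a sanity check, a zero rotation gives the identity, and the principal point moves horizontally by $f \tan\theta$ under a yaw of $\theta$.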
Solving for the focal length and rotation simultaneously is somewhat difficult. More tractable is estimating $f$ separately, then solving for $R$ alone using an SVD-based approach. In particular, the Kabsch algorithm provides an analytic solution for the optimal rotation matrix.
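A minimal Kabsch sketch for the rotation-fitting step, assuming the correspondences have already been back-projected to 3D direction vectors (columns of the inputs); only the standard determinant correction against reflections is included:

```python
import numpy as np

def kabsch(P, Q):
    """Rotation R minimizing ||R P - Q||_F for 3xN point/direction sets.

    SVD of the covariance Q P^T gives U S V^T; the optimal rotation is
    U diag(1, 1, d) V^T, where d corrects an improper (reflective) fit.
    """
    U, _, Vt = np.linalg.svd(Q @ P.T)
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt
```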
I attempted to find the focal length of my camera using the image below and some trigonometry (the distance between the marks was known, and the distance between the camera and the board was known; both are 11 inches, the length of one sheet of A4 paper). I got a value of roughly 3000 pixels for my Samsung Galaxy S21 phone.
Applying this approach with the above calculated $f$ to the outside scene, this result was obtained. It is very obviously incorrect.
Choosing a focal length of roughly 1.3 times the original focal length solved the issue (though the artifacting is much worse than that of using a standard homography); this demonstrates that this approach is extremely sensitive to the focal length.
Bells and Whistles: Cylindrical Projection
The goal is to project from the image plane, whose center is $f$ away from the pinhole, onto a cylinder of radius $f$ around the pinhole. Given an image point $(u, v)$, inverting $K$ (the intrinsics matrix) and applying it to $(u, v, 1)$ gives $(x, y, 1)$, which corresponds to a point in space one focal length in front of the pinhole.

Then, projecting that camera-space coordinate onto the cylinder (with unit radius) is equivalent to dividing by $\rho$, where $\rho = \sqrt{x^2 + 1}$.

Converting to cylindrical coordinates: $\sin\theta = x/\rho$, $\cos\theta = 1/\rho$, so $\theta = \arctan(x)$, and the height along the cylinder is $h = y/\rho$.

Finally, when converting to image coordinates, notice that the ratio of arc lengths / vertical distances between the unit cylinder and the cylinder with radius $f$ is just $f$, by similar triangles. So $u' = f\theta + c_x$ and $v' = fh + c_y$.
The inverse warp process is actually in some sense simpler. The domain to consider is just $u' \in [0, W)$ and $v' \in [0, H)$, where $H$ is the height of the source image (and $W$ its width). Then, to get the equivalent position in rectilinear image space coordinates, you compute $\theta = (u' - c_x)/f$ and $h = (v' - c_y)/f$ for all points in the image, compute the cylinder point $(\sin\theta,\ h,\ \cos\theta)$, then "unproject" back onto the imaging plane by computing $K \cdot (\tan\theta,\ h/\cos\theta,\ 1)^\top$. This is the point to sample from in the source image.
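The inverse mapping can be sketched as follows, returning for each cylindrical-image pixel the rectilinear source coordinates to sample (names are mine, and $f_x = f_y = f$ is assumed):

```python
import numpy as np

def cylindrical_inverse_map(out_h, out_w, f, cx, cy):
    """For each pixel (u', v') of the cylindrical image, compute the
    (x, y) rectilinear source coordinates to sample:
    theta = (u' - cx)/f, h = (v' - cy)/f, then unproject the cylinder
    point (sin t, h, cos t) to the image plane by dividing by cos t."""
    v, u = np.mgrid[0:out_h, 0:out_w].astype(float)
    theta = (u - cx) / f
    h = (v - cy) / f
    x = f * np.tan(theta) + cx
    y = f * h / np.cos(theta) + cy
    return x, y
```

These coordinate maps can be fed directly to an interpolating sampler to produce the warped image.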
Bells and Whistles: Cylindrical Panorama
This is basically just doing the above over and over to gradually build up a large image. In particular, pairwise correspondences between an ordered set of images are found. The approach I took was to first find correspondences in the unwarped images, then forward-warp the coordinates of those points into cylindrical coordinates, though I suppose there is nothing wrong with finding correspondences between the warped images directly.
Computing a homography between cylindrically projected images is simple, since rotation corresponds to translation in cylindrical coordinates. Thus it is nearly trivial: just averaging the translation vectors between corresponding points is enough (though I did use RANSAC to be safe against outliers). Leveraging the code from the first part, each warped image is bolted onto the previous result to extend the panorama. The focal length is still a problem, but I got the following result. It is also quite computationally demanding to build the panorama due to the extreme slowness of the provided Harris corner detection function; the focal length issue, the accumulation of error across the image, and my inability to maintain a consistent center of projection without a tripod also lead to some strange-looking artifacting throughout. Overall, though, for such a simple approach, I'm actually pretty happy with the result.
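The robust translation estimate can be sketched with a simple consensus loop (a lightweight stand-in for full RANSAC; the threshold and names are my own):

```python
import numpy as np

def estimate_translation(pts1, pts2, thresh=3.0):
    """Rotation between cylindrical images is (nearly) pure translation,
    so estimate it robustly: treat each match's offset as a candidate,
    count how many other offsets agree within `thresh` pixels, and
    average the winning consensus set."""
    offsets = pts2 - pts1                       # (N, 2) candidate translations
    best = None
    for cand in offsets:
        agree = np.linalg.norm(offsets - cand, axis=1) < thresh
        if best is None or agree.sum() > best.sum():
            best = agree
    return offsets[best].mean(axis=0)
```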
Cool stuff I learned
Application of linear algebra to computer graphics is always a welcome thing to learn about.