CS 194-26: Intro to Computer Vision and Computational Photography

Project 1: Images of The Russian Empire

Michelle Mao, CS194-26 | Fall 2020

Overview

I ended up with two functions: align_naive and align. The former is the basic approach outlined in the problem description: an exhaustive search over a window of displacements for the R/G channels to find the optimal displacement for alignment onto the B channel. I found that a [-10, 10] window in both the horizontal and vertical directions worked well. I implemented both the L2 norm (SSD) and normalized cross-correlation (NCC) as metrics; running both produced no visible difference in output, so I arbitrarily chose NCC for my final code. An extra step I took was to crop both images (the image-to-align and the image-being-aligned-to) by 10% on each side, which excludes the noisy borders from the alignment metric. I did not crop the final image, however, because I thought the high-contrast, unclean edges looked cool.
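As a rough sketch of what align_naive might look like under these choices (the helper names ncc and crop_borders and the (row, col) shift convention are my assumptions, not details from the write-up):

```python
import numpy as np

def ncc(a, b):
    # Normalized cross-correlation: dot product of the mean-centered,
    # unit-norm images. Higher is better.
    a = a - a.mean()
    b = b - b.mean()
    return np.sum((a / np.linalg.norm(a)) * (b / np.linalg.norm(b)))

def crop_borders(im, frac=0.10):
    # Drop `frac` of the image on each side so the plate borders
    # don't skew the metric.
    h, w = im.shape
    dh, dw = int(h * frac), int(w * frac)
    return im[dh:h - dh, dw:w - dw]

def align_naive(im, ref, window=10):
    # Exhaustively search shifts in [-window, window]^2 and return the
    # (dy, dx) shift of `im` that maximizes NCC against `ref`.
    im_c, ref_c = crop_borders(im), crop_borders(ref)
    best_score, best_shift = -np.inf, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            score = ncc(np.roll(im_c, (dy, dx), axis=(0, 1)), ref_c)
            if score > best_score:
                best_score, best_shift = score, (dy, dx)
    return best_shift
```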

To process the larger images (the .tifs), I implemented an image pyramid in the align function. This function scales both input images by a user-determined alpha value (default 0.5) via skimage's rescale function, which performs the downsampling, and recursively aligns the downsampled images. It then pre-shifts the image-to-align by the recursive estimate (scaled back up), so that it is roughly aligned, crops both images by a little less than the alpha value, and runs align_naive on the cropped pair to refine the alignment; this refinement adds precision while staying much faster than a naive full-resolution search. Finally, it returns the sum of the two displacements as the final displacement. Zoomed out, the algorithm makes one incremental shift per level of recursion, each closer to the optimal alignment and each with increasingly smaller input images to run NCC on. The result is a very fast (under 20 seconds) alignment process on large (70 MB) images.
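A minimal sketch of the recursive pyramid, reusing align_naive and crop_borders from the sketch above; the exact crop fraction and refinement window size are my guesses at "a little less than the alpha value", not values from the write-up:

```python
import numpy as np
from skimage.transform import rescale

def align(im, ref, alpha=0.5, min_size=100):
    # Base case: the image is small enough for a plain exhaustive search.
    h, w = im.shape
    if h < min_size or w < min_size:
        return align_naive(im, ref)
    # Recurse on alpha-scaled copies. The coarse shift is in downsampled
    # coordinates, so scale it back up by 1/alpha.
    coarse = align(rescale(im, alpha), rescale(ref, alpha), alpha, min_size)
    coarse = (round(coarse[0] / alpha), round(coarse[1] / alpha))
    shifted = np.roll(im, coarse, axis=(0, 1))
    # Refine on heavily cropped full-resolution images: the residual error
    # after the coarse shift is small, so a tight window suffices and NCC
    # only runs on a small central patch.
    fine = align_naive(crop_borders(shifted, frac=0.45),
                       crop_borders(ref, frac=0.45),
                       window=2)
    return (coarse[0] + fine[0], coarse[1] + fine[1])
```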

I toyed with the stopping condition for the recursion, considering either a maximum depth or a minimum image size. The former was meant to prevent large images from processing for too long, but this turned out not to matter, as the minimum-size case was always hit first. The base case I ended up using (the min_size guard in the sketch above) is when either the height or the width is less than 100 pixels. I eyeballed this value and, like many of the other guessed values, was lucky to have chosen a number that produced good results.

One tricky thing: near the end of the results below there are two versions of emir.tif; the first used the same image processing technique as the others. With a nudge from a friend (shoutout Amy Hung), I learned that the bad alignment is due to the prominence of blue in Emir's photo. This means that instead of stacking R/G onto B (which had a significantly brighter-looking raw channel), it was better to stack R/B onto G. Looking at the raw plates, this makes intuitive sense, as the middle G channel looks like an "in-between" of the R and B channels above and below it. This was the only issue I ran into.
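For concreteness, here is a rough sketch of how that channel swap might look, reusing align from the sketch above. The split_plate helper, the BGR top-to-bottom plate order, and the exact stacking calls are my assumptions about the pipeline, not details from the write-up:

```python
import numpy as np
import skimage.io as skio

def split_plate(path):
    # Hypothetical loader: a Prokudin-Gorskii scan stacks the three
    # exposures vertically (B on top, then G, then R).
    plate = skio.imread(path).astype(np.float64)
    h = plate.shape[0] // 3
    return plate[:h], plate[h:2*h], plate[2*h:3*h]

b, g, r = split_plate('emir.tif')

# Default pipeline: align G and R onto the B base.
dg, dr = align(g, b), align(r, b)

# For emir.tif, align B and R onto the G base instead and stack
# the shifted channels into an RGB image around green.
db, dr = align(b, g), align(r, g)
result = np.dstack([np.roll(r, dr, axis=(0, 1)),
                    g,
                    np.roll(b, db, axis=(0, 1))])
```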

Results

Displacements of the G and R channels (relative to the B channel unless noted otherwise):

castle:             G [2, 32],    R [4, 97]
cathedral:          G [0, 52],    R [-12, 105]
workshop:           G [0, 52],    R [-12, 105]
harvesters:         G [18, 60],   R [18, 124]
icon:               G [18, 40],   R [24, 88]
lady:               G [7, 48],    R [10, 112]
melons:             G [26, 93],   R [13, 179]
monastery:          G [2, -3],    R [2, 3]
onion_church:       G [26, 50],   R [37, 108]
self_portrait:      G [29, 77],   R [37, 175]
three_generations:  G [16, 48],   R [10, 108]
tobolsk:            G [3, 3],     R [3, 7]
train:              G [7, 42],    R [32, 85]
emir (aligned to B; failed):  G [23, 48],    R [-563, 280]
emir (aligned to G):          B [-23, -48],  R [17, 58]
extra1:             G [27, 50],   R [28, 103]
extra2:             G [18, 19],   R [29, 92]
extra3:             G [29, 77],   R [37, 175]