CS 294-26: Intro to Computer Vision and Computational Photography, Fall 2022

Project 1: Images of the Russian Empire: Colorizing the Prokudin-Gorskii photo collection

Katherine Song (cs-194-26-acj)



Overview

In this project, we applied image alignment and automatic contrast techniques to generate a fully colored image from three separate exposures of a scene taken through red, green, and blue filters. Even though the exposures are of the same scene, they are naturally slightly offset from one another, and the goal of this project was to write a program that calculates the displacement of the red and green channels with respect to the blue channel needed to produce a correctly aligned color image (we ignore rotation for now). The dataset comprised selected images from the Library of Congress's Prokudin-Gorskii collection, which contains digitized RGB glass plate negatives from Sergei Mikhailovich Prokudin-Gorskii's photographs from the early 1900s documenting scenes from the end of the Russian Empire.

Approach

Single-Scale Implementation

As suggested by the project description, I first wrote a single-scale image alignment function using for loops and applied it to the red and green channels to align them with the blue channel. (Later experimentation suggested that aligning to blue is perhaps not always the most reliable choice when using raw RGB pixel data on this image set; however, I wasn't sure whether that was a parameter I was allowed to change, and after switching to edge-based alignment as described later, it wasn't much of an issue anyway.) The function takes two color-channel images and aligns the first to the second. A window input parameter specifies the width, in pixels, of the window of candidate (x, y) displacements. For the single-scale implementation on the smaller JPEG images, this window needed to be [-15, 15] for proper alignment.

First, 10% of the pixels on all sides of each of the two input images are discarded for alignment. Without this step, noise in the borders of the images (i.e., parts that aren't "core" to the image content, such as artifacts from the edges of the color filters) would sometimes throw off the calculated displacements. Next, we loop over every possible (x, y) displacement within the given window and score each displacement with a metric. I tried both the sum of squared differences (SSD) and normalized cross-correlation (NCC) and found that SSD was actually the better metric for this image set. SSD is computed as the sum of the squared differences in raw pixel values between the shifted image (red or green channel) and the reference image (blue channel). Finally, the displacement with the smallest SSD is chosen as the final displacement, and the red or green channel is shifted accordingly. The final color image is obtained by simply stacking the shifted red and green channels with the original blue channel.
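In rough Python, the single-scale search looks like the sketch below. The function and parameter names (including the `start` seed, which the multi-scale version uses later) are illustrative assumptions, not necessarily the original code.

```python
import numpy as np

def align_single_scale(channel, reference, window=15, start=(0, 0)):
    """Exhaustive search over displacements within +/- `window` pixels of `start`;
    returns the (x, y) shift of `channel` that minimizes SSD against `reference`."""
    h, w = reference.shape
    # Ignore a 10% border on every side so plate-edge artifacts don't skew the metric.
    ch, cw = int(0.1 * h), int(0.1 * w)
    ref_crop = reference[ch:h - ch, cw:w - cw]

    best_ssd, best_shift = np.inf, start
    for dx in range(start[0] - window, start[0] + window + 1):
        for dy in range(start[1] - window, start[1] + window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            ssd = np.sum((shifted[ch:h - ch, cw:w - cw] - ref_crop) ** 2)
            if ssd < best_ssd:
                best_ssd, best_shift = ssd, (dx, dy)
    return best_shift  # the (x, y) displacement, not the shifted image
```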

One major problem with this approach is that running it on the larger TIFF images would require a window on the order of [-150, 150] in x and y, which would take a prohibitively long time; the multi-scale implementation below solves this. On a practical note, I chose to have the function return the (x, y) displacement rather than the processed image so that I could conveniently reuse it in my multi-scale implementation.

Multi-Scale Implementation

For the larger images (though it works for the small images as well), I wrote a multi-scale alignment function that speeds up alignment by using image pyramids. An image pyramid is a list containing multiple scales of a given image. I chose a scaling factor of 2, meaning that a single pyramid is composed of the original image in a given color channel, the original image scaled by 0.5, the original image scaled by 0.25, etc. I stopped building the pyramid when either the rows or columns became fewer than 32 pixels, a threshold I determined by trial and error. If the top of the pyramid is too small, too many features are lost, and the reduced images cease to accurately represent the original image; if the top of the pyramid is too big, we start to lose the benefit of this approach because the search window has to be made bigger to maintain accuracy.

The power of this approach is that the window of candidate displacements at each level of the pyramid can be made very small, drastically reducing the number of operations needed for a given image. Instead of making comparisons over a huge pixel window (150x150 for some images in this set!) on the huge original image, we start with comparisons on very small, ~32x32 pixel images, for which a very small [-2, 2] window is a sufficient search area. Once the best displacement is estimated for the smallest image in the pyramid with the single-scale alignment function described previously, that displacement is used as the starting point for the next image in the pyramid (i.e., an image 2x the size of the smallest, so the estimated displacement is also scaled by 2). Because the displacement carried over from the previous level is already close, a small [-2, 2] window is again sufficient to refine the estimate. This process continues until we have the best estimated displacement for the original image size; even for images with dimensions 4096x4096, the process is repeated only 8 times.
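A minimal recursive sketch of this coarse-to-fine scheme, building on the single-scale function above, is shown below. The names, the `min_size` cutoff, and the use of skimage.transform.rescale for downsampling are assumptions for illustration.

```python
import skimage.transform as sktr

def align_pyramid(channel, reference, window=2, min_size=32):
    """Coarse-to-fine alignment; returns the (x, y) displacement of `channel`."""
    # Stop recursing once a further 2x downscale would drop below ~min_size pixels.
    if min(reference.shape) < 2 * min_size:
        return align_single_scale(channel, reference, window=window)

    # Estimate the displacement on half-resolution copies of both channels...
    dx, dy = align_pyramid(sktr.rescale(channel, 0.5, anti_aliasing=True),
                           sktr.rescale(reference, 0.5, anti_aliasing=True),
                           window, min_size)
    # ...then double it and refine within a small window at the current resolution.
    return align_single_scale(channel, reference, window=window,
                              start=(2 * dx, 2 * dy))
```

The full-resolution red and green channels can then be rolled by their recovered displacements and stacked with the blue channel (e.g., with np.roll and np.dstack) to form the color image.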

Time for aligning both red and green channels to the blue channel for the JPEG images is ~0.1s (compared to ~0.5s for the single-scale alignment), and time for aligning both channels for the TIFF images is ~8s (compared to ~5 hours for single-scale implementation with a [-150 150] window). As a sanity check, I ran both the single-scale implementation with a window size of [-150 150] and the multi-scale implementation with a window size of [-2 2] on a couple of large TIFF images and confirmed that they yielded the same displacements.

Bells and Whistles

Automatic Contrasting

The first extra feature I implemented was basic automatic contrasting. I did this by first rescaling image intensities so that the darkest pixel is 0 and the lightest pixel is 1, i.e., image = (image - min(image)) / (max(image) - min(image)). I then applied a gamma correction with a factor of 1.5 using skimage's built-in skimage.exposure.adjust_gamma function. I selected 1.5 empirically, as higher values led to images that in my opinion had a little too much contrast, with detail in shadows in particular being lost (especially noticeable in images like melons.tif). With more time, I would ideally also have corrected for color, as a few images, such as church.tif, were clearly a little unbalanced. A few before (left) and after (right) images are below, followed by a sketch of this step:
harvesters.tif
self_portrait.tif
icon.tif
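The contrast step described above amounts to a linear stretch followed by a gamma curve; a short sketch is below (the function name and ordering are assumptions on my part).

```python
from skimage import exposure

def auto_contrast(image, gamma=1.5):
    """Stretch intensities to [0, 1], then apply gamma correction."""
    # Linear rescale: darkest pixel maps to 0, lightest pixel maps to 1.
    stretched = (image - image.min()) / (image.max() - image.min())
    # gamma > 1 darkens mid-tones; 1.5 was chosen empirically as described above.
    return exposure.adjust_gamma(stretched, gamma)
```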

Edge Detection and Alignment

The second extra feature that I implemented was alignment based on edges instead of raw RGB pixel values. For this, I simply used skimage's built-in Scharr filter and then used the filtered images as input to my alignment function. This was sufficient to fix the two images that grossly failed under the raw-pixel multi-scale alignment algorithm described above (a sketch of the edge-based variant follows the comparison images):
church.tif, RGB raw pixel value alignment.
church.tif, recolored + edge alignment.
emir.tif, RGB raw pixel value alignment.
emir.tif, recolored + edge alignment.
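Because the edge maps are just another pair of 2D images, they can be fed straight into the pyramid search sketched earlier. The wrapper below is a hypothetical illustration of that idea, using skimage.filters.scharr.

```python
from skimage.filters import scharr

def align_on_edges(channel, reference, window=2):
    """Align using Scharr edge maps instead of raw intensities."""
    # The edge images are passed to the same coarse-to-fine search as before.
    return align_pyramid(scharr(channel), scharr(reference), window=window)
```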
For church.tif and emir.tif, the red and green channel pixel values are not actually well correlated with those in the blue channel. For example, in emir.tif, the emir's jacket, a prominent feature in the image, is very blue, corresponding to very high values in the blue channel but low values in the red and green channels, so trying to minimize the error between the red/green and blue channels is a poor approach. Edge detection, on the other hand, looks for changes in pixel values, and in emir.tif the red, green, and blue channels all feature prominent edges that correspond to the same image features, as seen below:
emir.tif. Red channel, Scharr edge filter
emir.tif. Green channel, Scharr edge filter
emir.tif. Blue channel, Scharr edge filter

Similarly, church.tif is a fairly blue-heavy image, and some light features in the red channel in particular actually should be aligned with dark features in the blue channel, which is the opposite of what alignment based on raw pixel values optimizes for. In theory, edge detection is a more reliable approach.

However, I did notice that for the images for which both alignment schemes "worked," there were slight differences (up to +/- 3 pixels) in the final displacements between edge-based alignment and raw-pixel alignment. The results were largely indistinguishable. For some images, raw-pixel alignment was slightly better, indicating that the raw pixel values in the different color channels were more highly correlated than the edges. For example, below are two zoomed-in captures from three_generations.tif. In the end, neither alignment scheme completely removes all color artifacts at the edges; my hypothesis is that a simple (x, y) translation isn't sufficient to eliminate all visual artifacts.

RGB pixel value-aligned. R: [14 53], G: [11 112]
Edge aligned. R: [12 54], G: [8 111]
I would say that the only image for which raw-pixel alignment clearly outperformed edge-based alignment was train.tif (below), for which the raw brightness values in the three channels are similar to one another. I also noted that the edges in the blue channel of this image are weak everywhere except between the sky and the body of the train and the trees, which makes it plausible that aligning by edges would slightly underperform here. In the end, although the edge-based alignment is only a couple of pixels off, the error is visually more obvious because of the misaligned text on the train (our eyes are more sensitive to artifacts in text than in textured natural content like trees).

RGB pixel value-aligned. R: [6 43], G: [32 87]
Edge aligned. R: [8 42], G: [29 85]

In the next section, I present my results using edge-based alignment because, in my opinion, the occasional slight artifact from edge-based alignment is far less objectionable than the gross misalignment of some of the raw-pixel alignments. Possible future improvements include falling back to raw-pixel alignment when the SSD on the raw pixel values between the two input channels falls below a certain threshold, using a combination of raw pixel values and edges for alignment, or simply using a better edge-enhancement algorithm.

Results: Example Images

Below are the results of my algorithm (including the bells and whistles) on the provided example images. Note that displacements are reported for the red and green channels in the format [x y], corresponding to the [column row] shift.
cathedral.jpg. R: [2 5], G: [3 12]
church.tif. R: [4 25], G: [-4 58]
emir.tif. R: [24 49], G: [40 107]
harvesters.tif. R: [17 60], G: [14 123]
icon.tif. R: [17 42], G: [23 90]
lady.tif. R: [9 56], G: [13 119]
melons.tif. R: [10 80], G: [13 176]
monastery.tif. R: [2 -3], G: [2 3]
onion_church.tif. R: [26 51], G: [35 107]
sculpture.tif. R: [-10 33], G: [-26 140]
self_portrait.tif. R: [29 78], G: [37 175]
three_generations.tif. R: [12 54], G: [8 111]
tobolsk.jpg. R: [3 3], G: [3 6]
train.tif. R: [8 42], G: [29 85]

Results: Self-Chosen Images

Peonies. R: [3 52], G: [-5 104]
Sunset. R: [-41 75], G: [-69 114]
Vase. R: [-2 24], G: [-2 113]
On the Ordezh River near the Siverskaia Station. R: [-1 39], G: [-7 151]