Colorizing the Prokudin-Gorskii Collection

Phillip Kuznetsov

cs194-26-aea

In this project, we attempt to build a method that automatically colorizes the Prokudin-Gorskii photo collection. Sergei Mikhailovich Prokudin-Gorskii was a Russian photographer who was given the Tsar's permission to take photographs across the Russian Empire. As early as 1907, Sergei was particularly excited about the potential of color photography. He pioneered a technique that captured three black-and-white images simultaneously using three stacked cameras, each fitted with a red, green, or blue filter on its lens. We can therefore approximately reconstruct the original colors of the subjects by treating each of the black-and-white images as one channel of a color image.
Although we have all the information needed to properly set up the color pictures, the real problem is that the three channels are not perfectly aligned, as you can see in the composite below.
We could manually align the channels, but why not be fancy and automatically align them instead?

Baseline - Brute force alignment

I built a baseline alignment method to use for comparison against other methods. The baseline checks a neighborhood of possible shifts between color channels: I translated each channel over a range of [-20, 20] pixels in both the x and y coordinates, aligning the green and blue channels to the red channel. I measured alignment quality with two different metrics: Normalized Cross Correlation (NCC) and the L2 norm, also known as the Sum of Squared Differences (SSD). At first, I calculated these metrics across the entire image. After a few failed alignments, I realized that the borders of the image occasionally hurt the quality of the metric, so I evaluated the metric on a window that keeps the center 2/3 of the image and ignores the borders. This method provides a solid baseline and runs very quickly on the smaller images, each of which is approximately 400x300 pixels.
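A minimal sketch of this brute-force search, assuming single-channel images of the same shape; the helper names (crop_center, brute_force_align) are mine, not the original code:

```python
import numpy as np

def crop_center(img, frac=2/3):
    """Keep only the central `frac` of the image in each dimension."""
    h, w = img.shape
    dh, dw = int(h * (1 - frac) / 2), int(w * (1 - frac) / 2)
    return img[dh:h - dh, dw:w - dw]

def brute_force_align(channel, reference, window=20):
    """Return the (dy, dx) shift in [-window, window]^2 that best aligns
    `channel` to `reference`, scoring SSD on the center crop only."""
    channel, reference = channel.astype(float), reference.astype(float)
    best_shift, best_score = (0, 0), np.inf
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            score = np.sum((crop_center(shifted) - crop_center(reference)) ** 2)
            if score < best_score:
                best_score, best_shift = score, (dy, dx)
    return best_shift
```

The green and blue channels are each aligned to red this way, then stacked with the reference channel to form the final color image.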

Experimental Details

I analyzed two different translation techniques and two different alignment metrics for this portion of the project. For translation, I tried the np.roll function as well as np.pad and observed the resulting images.

Focus on center of images

Originally, I did not crop the image while calculating any of the metrics. This led to unfortunate effects where the edges of the image, which sometimes contain extra noise, writing, and inconsistent weathering, would cause faulty alignments. To fix this issue, I used only the center 2/3 of the image, as displayed below.
Cathedral
Cropped Area
Edges Included
Edges cropped out
Close-up of channel misalignment caused by using the edges for alignment. The blue shadow in the second image tells us that the blue channel is slightly misaligned.

Translation

The np.roll function takes a matrix and a translation vector and shifts the matrix accordingly. Any elements pushed out of the original matrix frame by the translation wrap around to the opposite side. This is the method that was originally suggested; however, I wondered if it would be better to pad with zeros instead of rolling elements off the edges, so I implemented an np.pad-based translation. The motivation is that np.pad fills the vacated region with zeros rather than with the wrapped-around elements. I believed this would help the alignment procedure because it wouldn't add spurious information at the sides of the image that could degrade the metric. On testing, however, I found that this did not actually affect the metric, especially after adding the center crop mentioned above.
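A rough sketch of the two translation variants, with illustrative names; `translate_roll` wraps pixels around while `translate_pad` discards them and fills the vacated region with zeros:

```python
import numpy as np

def translate_roll(img, dy, dx):
    """Shift with wrap-around: pixels pushed off one edge reappear on the other."""
    return np.roll(img, (dy, dx), axis=(0, 1))

def translate_pad(img, dy, dx):
    """Shift with zero padding: pixels pushed off the edge are dropped and the
    vacated region is filled with zeros."""
    h, w = img.shape
    padded = np.pad(img, ((max(dy, 0), max(-dy, 0)),
                          (max(dx, 0), max(-dx, 0))), mode='constant')
    y0, x0 = max(-dy, 0), max(-dx, 0)   # crop back to the original size
    return padded[y0:y0 + h, x0:x0 + w]
```

With the center crop in place, both variants gave the same metric values, since for these small shifts the disputed border pixels never enter the score.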

Metrics

As mentioned above, I employed two metrics for this first baseline: the Sum of Squared Differences and Normalized Cross Correlation.
Sum of Squared Differences
The sum of squared differences is the simpler of the two metrics and measures the raw difference between pixels in an image. Given two vectorized channel representations $c_1$ and $c_2$, the sum of squared differences is simply $$ \text{SSD}(c_1, c_2) = \sum_{i} (c_1[i] - c_2[i])^2,$$ which is the square of the L2 norm of their difference. To find an optimal alignment, our objective is to minimize the SSD between the two target channels. An SSD of 0 means that the pixel intensities are identical.
Normalized Cross Correlation
The normalized cross correlation (NCC) instead uses properties of inner products to measure similarity. NCC is defined as $$\text{NCC}(c_1, c_2) = \left(\frac{c_1}{ \|c_1\|}\right )^T \cdot \frac{c_2}{ \|c_2\|}.$$ The basic idea is that two identical vectors have an NCC of 1, while two completely dissimilar (i.e., orthogonal) images have an NCC of 0. The objective thus becomes to maximize the NCC. I tweaked the original NCC formulation slightly to reuse the minimization components that I built for SSD: I simply took the reciprocal of NCC, yielding an objective to be minimized rather than maximized. $$ \text{NCC}_{tweak}(c_1, c_2) = \frac 1{\text{NCC}(c_1, c_2)}$$
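Both metrics are only a few lines of numpy; a sketch with illustrative names:

```python
import numpy as np

def ssd(c1, c2):
    """Sum of squared differences; lower is better, 0 means identical channels."""
    return np.sum((c1 - c2) ** 2)

def ncc(c1, c2):
    """Normalized cross correlation of the flattened channels; 1 means identical."""
    v1, v2 = c1.ravel(), c2.ravel()
    return np.dot(v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2))

def ncc_tweak(c1, c2):
    """Reciprocal of NCC, so both metrics can be fed to the same minimizer."""
    return 1.0 / ncc(c1, c2)
```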
Comparison
SSD
Green Alignment: (-7, -1) Blue Alignment: (-12, -3)
NCC
Green Alignment: (-7, -1) Blue Alignment: (-12, -3)
As you can see from the results above, both metrics performed identically on this smaller image.

Results

Now we showcase the results for the remaining small images. We compare each image aligned with SSD on the left against NCC on the right.
Settlers
Green Alignment: (-8, 1) Blue Alignment: (-15, 1)
Green Alignment: (-8, 1) Blue Alignment: (-15, 1)
Nativity
Green Alignment: (-4, 1) Blue Alignment: (-8, 0)
Green Alignment: (-5, 1) Blue Alignment: (-8, 0)
Monastery
Green Alignment: (-6, -1) Blue Alignment: (-3, -2)
Green Alignment: (-6, -1) Blue Alignment: (-3, -2)

Pyramid Alignment

Brute-force alignment works very well for all of the small images above; however, for the higher-quality images, which are roughly 3700x3000 px, this method falls apart. We'd have to search over a set of translations far larger than [-20, 20], and the roughly 10x larger image (in each dimension) also increases the time needed to compute the metric at every candidate shift.
Example of image pyramid, ripped from Wikipedia

Therefore, I took a different approach by utilizing what is known as an image pyramid, a technique commonly used for many different image-processing purposes. In my implementation, I began by running brute-force alignment at a very coarse scale (in other words, with the image resized to be rather small). This gave me a rough estimate of the translation. Then I took the image at the next finer scale, rescaled the translation found at the previous scale, and used this rescaled translation as the starting point for another brute-force alignment. I repeated this process, doubling the scale each time, until I reached the original image and ran a final brute-force alignment.

I set the base scale so that the coarsest image was at least 128 px in height. I played around with smaller scales, but images resized below 128 px often led to terrible alignments, likely because key features became too pixelated to be useful and simply added noise to the system. The brute-force search at each scale of the pyramid was also much smaller than in the baseline: instead of a 20-pixel neighborhood, I searched only the [-4, 4] pixel neighborhood of the initialization point. Despite this, pyramid alignment ends up covering a much larger area of the image. Given a 3700x3000 px image, the algorithm traverses 5 scales, meaning that it can recover a translation of up to $$ 2^4 \cdot 4 + 2^3 \cdot 4 + 2^2 \cdot 4 + 2 \cdot 4 + 4 = 124 \text{ pixels}$$ at a fraction of the cost of searching the same range with the brute-force algorithm. Overall this method provides a fast and scale-invariant way of automatically aligning images.
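A minimal sketch of the recursion, reusing the brute_force_align helper sketched earlier; skimage.transform.rescale is my choice of resizing routine here, not necessarily what the original code used:

```python
import numpy as np
from skimage.transform import rescale

def pyramid_align(channel, reference, min_height=128, window=4):
    """Estimate the (dy, dx) shift aligning `channel` to `reference` by
    recursing to a coarser scale, doubling that estimate, then refining it
    with a small brute-force search."""
    if channel.shape[0] <= 2 * min_height:
        # Coarsest level: plain brute-force search around (0, 0).
        return brute_force_align(channel, reference, window=window)
    # Recurse on half-resolution copies and scale the estimate back up.
    coarse_dy, coarse_dx = pyramid_align(rescale(channel, 0.5),
                                         rescale(reference, 0.5),
                                         min_height, window)
    init = (2 * coarse_dy, 2 * coarse_dx)
    # Refine: apply the initial shift, then search a small neighborhood around it.
    refine = brute_force_align(np.roll(channel, init, axis=(0, 1)),
                               reference, window=window)
    return (init[0] + refine[0], init[1] + refine[1])
```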

Cost function Evaluation

I ran pyramid alignment with both the NCC and SSD cost functions, and both appeared to perform very well. On Piazza there were many mentions that the alignment algorithm had issues with the Emir photograph, displayed below. However, on my runs the two metrics produce no visible differences in the images, and barely any difference in the channel shifts, if any at all.
Three Generations
SSD
Green Alignment: (-111, -11) Blue Alignment: (-59, 3)
NCC
Green Alignment: (-111, -11) Blue Alignment: (-59, 3)
Emir
SSD
Blue Alignment: (-67, -44) Green Alignment: (-57, -17)
NCC
Blue Alignment: (-61, -44) Green Alignment: (-57, -17)
Turkmen
SSD
Green Alignment: (-60, -7) Blue Alignment: (-116, -28)
NCC
Green Alignment: (-60, -7) Blue Alignment: (-116, -28)

Results

Here we output the aligned images using NCC as the cost function and an Image Pyramid for alignment.
Emir
Blue Alignment: (-61, -44) Green Alignment: (-57, -17)
Harvesters
Green Alignment: (-65, 3) Blue Alignment: (-124, -13)
Icon
Green Alignment: (-48, -5) Blue Alignment: (-90, -23)
Lady
Green Alignment: (-62, -4) Blue Alignment: (-116, -12)
Three Generations
Green Alignment: (-59, 3) Blue Alignment: (-111, -11)
Self Portrait
Green Alignment: (-98, -8) Blue Alignment: (-155, -32)
Train
Green Alignment: (-43, -27) Blue Alignment: (-87, -32)
Turkmen
Green Alignment: (-60, -7) Blue Alignment: (-116, -28)
Village
Green Alignment: (-73, -10) Blue Alignment: (-128, -22)
Wharf
Green Alignment: (-98, -11) Blue Alignment: (-80, -25)
Oldman
Green Alignment: (-55, 33) Blue Alignment: (-107, 56)
Riverboat
Green Alignment: (-120, 6) Blue Alignment: (-133, 13)
Factory
Green Alignment: (-18, 2) Blue Alignment: (-26, -8)

Bells and Whistles

Edge Detection

For my bells and whistles, I focused on edge detection for alignment. As a baseline, I used a Sobel filter, which was suggested on Piazza and was rather easy to implement. For the fun of it, and buying into the hype, I also wanted to try a completely different edge detector: the first layer of a convolutional neural network.

Sobel Filter

The Sobel edge detector simply convolves an image with two different $3\times 3$ filters, reproduced below:
$$ G_x= \begin{bmatrix} 1 &0 &-1 \\ 2& 0 &-2 \\ 1& 0 & -1 \end{bmatrix} $$
$$ G_y = \begin{bmatrix} 1 &2 &1 \\ 0& 0 &0 \\ -1& -2 & -1 \end{bmatrix} $$
Above, $G_x$ is the filter that responds to vertical edges (horizontal intensity changes) and $G_y$ the filter that responds to horizontal edges.
The Sobel operator acts as a discrete differentiation operator: it highlights the areas of the image where intensity changes most rapidly between neighboring pixels.
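A sketch of how the edge maps can be produced and fed into the same alignment machinery as before; scipy.ndimage is my choice of convolution routine:

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_X = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]])
SOBEL_Y = np.array([[1, 2, 1], [0, 0, 0], [-1, -2, -1]])

def sobel_edges(channel):
    """Gradient magnitude of the two Sobel responses; alignment is then run on
    these edge maps instead of the raw pixel intensities."""
    gx = convolve(channel.astype(float), SOBEL_X)
    gy = convolve(channel.astype(float), SOBEL_Y)
    return np.hypot(gx, gy)
```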

Sobel Results

Emir
SSD - pixels
Blue Alignment: (-67, -44) Green Alignment: (-57, -17)
SSD - Sobel
Blue Alignment: (-107, -40) Green Alignment: (-58, -17)
This is definitely one of the best improvements. If you look at the Emir's turban, you'll see that a yellow hue, an artifact of improper channel alignment, goes away in the Sobel-aligned image.
Factory
SSD - pixels
Green Alignment: (-18, 2) Blue Alignment: (-26, -8)
SSD - Sobel
Green Alignment: (-18, 2) Blue Alignment: (-26, -8)
The Factory image does not improve at all after adding Sobel. Clearly, many of the images didn't have the same issues as the Emir image.

Learned filter

As a challenge for myself, I wanted to see if I could ride the deep learning hype train into my CS194 homework. Convolutional neural networks are the workhorse of the deep learning revolution. A convolutional neural network is composed of a series of convolutional layers, each a set of convolutional kernels whose weights are adjusted during training to maximize an objective, say classifying a wide range of possible images, as in the ILSVRC competition. If you're interested in learning more about CNNs, I recommend checking out the Stanford course on deep learning and enrolling in the Deep Learning Decal next semester! If you analyze the filter responses of the first layer of a trained network, you find that it has learned filters that are essentially specialized edge detectors. That is why I thought a neural network might be an interesting tool for extracting edges: I figured the filters learned by the network would be much more encompassing than a simple Sobel filter.

For this experiment, I used the first layer of a VGG net trained on ImageNet. This model was trained by the developers of the Keras deep learning library and is available as part of its API. One catch is that first-layer filters normally span all the channels of a color image; to resolve this, I grabbed the slice of each filter corresponding to the specific channel I wished to align. There are 64 filters in this first convolutional layer. When I timed the runtime, I was not satisfied with the roughly 16 seconds needed for the convolutions. Since I was running out of time before submission, I skipped optimizing beyond the scipy.convolve routine and instead used only a subset of the filters to align the images.
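A rough sketch of pulling the first-layer weights out of Keras and applying them to a single channel; the layer name block1_conv1 follows the keras.applications VGG16 model, while the helper name and the choice of filter subset are illustrative assumptions:

```python
import numpy as np
from scipy.signal import convolve2d
from keras.applications.vgg16 import VGG16

# First conv layer of ImageNet-trained VGG16; kernel shape is
# (3, 3, 3, 64) = (height, width, input channel, filter index).
kernel, _bias = VGG16(weights='imagenet',
                      include_top=False).get_layer('block1_conv1').get_weights()

def learned_edges(channel, channel_idx, n_filters=8):
    """Stack the responses of a subset of first-layer filters, using only the
    filter slice that corresponds to this channel (0=R, 1=G, 2=B)."""
    responses = [convolve2d(channel, kernel[:, :, channel_idx, k], mode='same')
                 for k in range(n_filters)]
    return np.dstack(responses)
```

The stacked responses can then be compared with SSD just like a single edge map.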

Results

Emir
SSD - Sobel
Blue Alignment: (-107, -40) Green Alignment: (-58, -17)
SSD - Learned
Blue Alignment: (108, -48) Green Alignment: (-57, -17)
This was definitely a negative result. The Learned filter performed much worse for the Emir.
Factory
SSD - Sobel
Green Alignment: (-18, 2) Blue Alignment: (-26, -8)
SSD - Learned
Green Alignment: (-18, 2) Blue Alignment: (-25, -7)
This result shows no visible change and hardly any actual change in comparison to the Sobel filter.
Although the method had promise, my experimentation failed to produce an alignment method that was as efficient and as good as the Sobel operator. However, I don't believe I've fully explored the possibilities of this method. I think it'd be worth converting the implementation to run inside Keras itself rather than just loading the weights out of the framework; that way we could reduce the runtime by using a GPU to parallelize the computation. I also did not have a chance to experiment with different numbers of filters, or with the full set of filters, so there's still room for a follow-up. Additionally, I'm curious what would happen if we used the filter responses of conv2, explored how the bias affects the detection, and so on.