CS 194-26: Final Project

Seam Carving

Background

For effective image resizing, there exist geometric as well as content constraints. Most methods, like cropping, only address the former; as a result, many popular resizing methods lose important content. To address this shortcoming, Shai Avidan and Ariel Shamir introduced a new method called seam carving.

Method

Seam carving, rather than simply removing pixels on the periphery of an image, carves out pixels according to their importance (as calculated by an energy function). One can think of pixel 'importance' as a measure of how much content a pixel contributes and how much its absence will be noticed. Intuitively, this means you want to remove pixels that blend in with their surroundings. For example, in a vertical portrait photo, the pixels that compose a uniformly colored sky above the subject would have little importance.

The algorithm itself is simple and is presented as the following optimization problem $$s^{*} = \min_{\mathbf{s}} E(\mathbf{s}) = \min_{\mathbf{s}} \sum_{i=1}^{n} e(\mathbf{I}(s_{i}))$$ $s^{*}$ is the optimal seam to remove, $\mathbf{s}$ is the path of a seam, $\mathbf{I}(s_{i})$ is the pixel at the $i$-th position along the seam, and $e$ is the energy function. To identify the optimal seam we use dynamic programming. Starting from the second row of the image, we compute the cumulative minimum energy $M$ over all possible connected seams for each entry $(i, j)$, using the recurrence: $$M(i, j) = e(i, j) + \min(M(i-1, j-1),\ M(i-1, j),\ M(i-1, j+1))$$ Once the optimal seam has been found, we remove it and repeat the calculate/remove process until the image has shrunk to the desired dimensions.
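To make the recurrence concrete, here is a minimal sketch in Python/NumPy of finding and removing one vertical seam. It assumes a precomputed per-pixel `energy` map (discussed in the next section) and is illustrative rather than my exact implementation.

```python
import numpy as np

def find_vertical_seam(energy):
    """Return, for each row, the column index of the minimum-energy
    connected vertical seam, following the DP recurrence above."""
    H, W = energy.shape
    M = energy.astype(np.float64)             # cumulative minimum energy
    backtrack = np.zeros((H, W), dtype=np.int64)

    for i in range(1, H):
        for j in range(W):
            lo, hi = max(j - 1, 0), min(j + 1, W - 1)
            offset = int(np.argmin(M[i - 1, lo:hi + 1]))
            backtrack[i, j] = lo + offset
            M[i, j] += M[i - 1, lo + offset]

    # Trace the optimal seam upward from the cheapest entry in the last row.
    seam = np.empty(H, dtype=np.int64)
    seam[-1] = int(np.argmin(M[-1]))
    for i in range(H - 2, -1, -1):
        seam[i] = backtrack[i + 1, seam[i + 1]]
    return seam

def remove_vertical_seam(img, seam):
    """Drop one pixel per row, shrinking the image by one column."""
    H, W = img.shape[:2]
    mask = np.ones((H, W), dtype=bool)
    mask[np.arange(H), seam] = False
    return img[mask].reshape((H, W - 1) + img.shape[2:])
```

Horizontal carving follows by transposing the image, carving vertically, and transposing back.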

Energy Function

Presented above is the formulation of the seam carving algorithm, but we haven't yet touched on the energy function. Remember, it is the energy function that determines the importance of a pixel; a good energy function and a well-carved result go hand in hand! For my implementation, I used the $L_{2}$ norm of the image gradient, which, in effect, uses a pixel's contribution to horizontal and vertical edges to determine its importance.
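A sketch of such an energy map, assuming SciPy's Sobel filters as the derivative operator (any discrete gradient would do):

```python
import numpy as np
from scipy.ndimage import sobel

def energy_l2(img):
    """Gradient-magnitude energy: the L2 norm of the image gradient.
    Sobel filters stand in for whichever derivative operator is used."""
    gray = img.astype(np.float64)
    if gray.ndim == 3:              # collapse RGB to a single luminance channel
        gray = gray.mean(axis=2)
    dx = sobel(gray, axis=1)        # horizontal derivative
    dy = sobel(gray, axis=0)        # vertical derivative
    return np.sqrt(dx ** 2 + dy ** 2)
```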

Results

Machu Picchu

Vertical Carve 60 pixels

Horizontal Carve 60 pixels

Sunset

Vertical Carve 60 pixels

Horizontal Carve 100 pixels

The Shard

Vertical Carve 40 pixels

Horizontal Carve 100 pixels

Eiffel

Vertical Carve 30 pixels

Horizontal Carve 80 pixels

Downtown Vancouver

Vertical Carve 50 pixels

Horizontal Carve 60 pixels

Amsterdam

Vertical Carve 50 pixels

Horizontal Carve 150 pixels

Failures

Sometimes failures are unavoidable. Take, for example, my photo of the Arc de Triomphe. For the horizontal carve, not only was part of the arch cut off at the top, but look at the ground :O.

Arc de Triomphe

Vertical Carve 100 pixels

Horizontal Carve 150 pixels

Another failure case was close-up portrait photos: there's not much that can be removed. Vertical seam removal was fine (the edges were trimmed); horizontal seam removal was not. Here are some awkward photos of Sahai for your enjoyment.

Sahai

Vertical Carve 100 pixels

Horizontal Carve 100 pixels

More failures were seen in photos with centered subjects that seemingly extend to infinity. Again, vertical carving was fine, but horizontal? Not so much.

Dock

Vertical Carve 100 pixels

Horizontal Carve 100 pixels

Bells and Whistles: Seam Insertion

I also implemented Seam Insertion. Similar to the removal process, we want to add low-energy seams. To do this, we duplicate the pixels of our optimal seam, averaging each with its neighboring pixels (top/bottom for horizontal insertion, left/right for vertical insertion). This works well for inserting a single seam; however, if we were to insert multiple seams this way, we would very likely keep inserting seams in the same location, which could lead to very bad artifacts. To remedy this, to insert $k$ seams we first find the $k$ seams that we would've removed, then use those $k$ seams for duplication; see the sketch below. In the results section, I've included both successes and failures. Perspective can play a big part in how well an image can be stretched (as seen in the horizontally stretched Arc de Triomphe). The fraction of the image occupied by relevant content can also make seam insertion difficult (as seen in Sahai's lovely portraits). A better energy function might lead to better results.
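Here is a sketch of that two-pass strategy, reusing the hypothetical `find_vertical_seam`, `remove_vertical_seam`, and `energy_l2` helpers from the carving sketches above. The index shift after each insertion is a common bookkeeping trick, since later seams were found on a progressively narrower image.

```python
import numpy as np

def duplicate_seam(img, seam):
    """Insert a new column just right of the seam, averaging each seam
    pixel with its right-hand neighbor."""
    H, W = img.shape[:2]
    out = np.empty((H, W + 1) + img.shape[2:], dtype=img.dtype)
    for i in range(H):
        j = seam[i]
        right = img[i, min(j + 1, W - 1)]
        out[i, :j + 1] = img[i, :j + 1]
        out[i, j + 1] = (img[i, j] + right) / 2.0
        out[i, j + 2:] = img[i, j + 1:]
    return out

def insert_vertical_seams(img, k):
    # Pass 1: record the k seams we *would* have removed.
    tmp, seams = img.copy(), []
    for _ in range(k):
        seam = find_vertical_seam(energy_l2(tmp))
        seams.append(seam)
        tmp = remove_vertical_seam(tmp, seam)

    # Pass 2: duplicate the recorded seams in order. After each insertion,
    # shift every later seam right wherever it sits at or past this one.
    out = img.astype(np.float64)
    for n, seam in enumerate(seams):
        out = duplicate_seam(out, seam)
        for later in seams[n + 1:]:
            later[later >= seam] += 2   # inserted column plus its copy
    return out
```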

Results

Arch

Vertical Insert 100 pixels

Horizontal Insert 100 pixels

Sahai

Vertical Insert 100 pixels

Horizontal Insert 100 pixels

What I Learned

There is no one-size-fits-all mold for appropriately resizing pictures. Sometimes, if there is little to no content in a photo, cropping might be the best tool for the job; other times, if there's no clearly identifiable subject, seam carving might be the best way to go. Further, better-tuned energy functions would likely address a few of the failure cases I ran into; seam carving is a great tool, but it's not a magic algorithm: a lot of design and intentionality should go into the selection of an energy function.

A Neural Algorithm of Artistic Style

Background

Convolutional Neural Networks (CNNs) are powerful tools that can learn useful representations of images that go beyond raw pixel values. In their A Neural Algorithm of Artistic Style, Gatys, Ecker, and Bethge use this property of CNNs to create a network that can effectively transfer the style of one image onto another.

Network Architecture

As done in the paper, I used a 19-layer VGG network with Average Pooling in place of Max Pooling layers; each pooling layer marks the end of one of the net's 5 "blocks". For the convolutional layers, I used a stride of $1$, padding of $1$, and $3 \times 3$ kernels. The number of channels per block was $64, 128, 256, 512, \text{ and } 512$, respectively.
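A sketch of that architecture in PyTorch, assuming torchvision's pretrained VGG-19 weights; the pooling swap mirrors the paper's average-pooling choice:

```python
import torch.nn as nn
from torchvision.models import vgg19

def build_style_net():
    """VGG-19 feature stack with every MaxPool2d swapped for AvgPool2d.
    A sketch using torchvision's ImageNet weights, not my exact setup."""
    features = vgg19(weights="IMAGENET1K_V1").features.eval()
    for i, layer in enumerate(features):
        if isinstance(layer, nn.MaxPool2d):
            features[i] = nn.AvgPool2d(kernel_size=2, stride=2)
    for p in features.parameters():
        p.requires_grad_(False)   # the net is fixed; only the image is optimized
    return features
```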

Loss and Optimization

As briefly mentioned in the background, there are two things our network must capture: a style representation and a content representation. As such, the network's overall loss is a linear combination of two loss functions. For content loss, we use the squared error between the feature representations of the original content image and the generated image. For style loss, we use a weighted squared error between the Gram matrices of the features of the original style image and the generated image. The loss equations are as follows $$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i, j}(F^{l}_{ij} - P^{l}_{ij})^{2}$$ $$G_{ij}^{l} = \sum_{k}F_{ik}^{l} F_{jk}^{l}$$ $$\mathbf{E}_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i, j}(G_{ij}^{l} - A_{ij}^{l})^{2}$$ $$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_{l} \mathbf{E}_{l}$$ $$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$$

$\vec{p}$ and $\vec{x}$ are the original content image and the generated image, and $P^{l}$ and $F^{l}$ are their feature representations in convolutional layer $l$. $G^{l}$ is the Gram matrix whose entry $G^{l}_{ij}$ is the inner product between the vectorized feature maps $i$ and $j$ in layer $l$. $\vec{a}$ is the original style image; $A^{l}$ and $G^{l}$ are the style representations (in Gram matrix form) of the style image and the generated image in layer $l$. $\mathbf{E}_{l}$ is the unweighted squared error between those two matrices, normalized by the number of feature maps $N_{l}$ and their size $M_{l}$. $\alpha$ and $\beta$ are two scalar hyperparameters that control how much the content and style losses contribute to the overall loss of the network.
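Here is how those pieces might look in PyTorch; a sketch, with the Gram matrix computed per batch of feature maps:

```python
import torch

def gram_matrix(F):
    """G^l = F F^T over vectorized feature maps: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = F.shape
    F = F.view(b, c, h * w)
    return F @ F.transpose(1, 2)

def content_loss(F, P):
    """(1/2) * sum of squared differences between feature maps."""
    return 0.5 * ((F - P) ** 2).sum()

def style_layer_loss(F, A_gram):
    """E_l with the paper's 1 / (4 N_l^2 M_l^2) normalization
    (N_l = number of feature maps, M_l = spatial positions)."""
    b, c, h, w = F.shape
    G = gram_matrix(F)
    return ((G - A_gram) ** 2).sum() / (4.0 * c ** 2 * (h * w) ** 2)
```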

For my optimizer, I used the L-BFGS algorithm with a learning rate of $1$. I ran the optimization for $6$ epochs with $50$ iterations per epoch. To maximize style transfer, I used a content weight of $1$ and a style weight of $1000000$.
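Putting it together, the optimization loop might look like the sketch below. Everything named in the leading comment (`x0`, `extract_features`, `P_target`, `A_grams`) is a hypothetical stand-in, and the loss helpers are the ones sketched above.

```python
import torch

# Hypothetical pieces: `x0` (the content image tensor), `extract_features`
# (runs the VGG stack and returns content- and style-layer activations),
# and the precomputed targets `P_target` and `A_grams`.
x = x0.clone().requires_grad_(True)          # we optimize the image itself
optimizer = torch.optim.LBFGS([x], lr=1)
alpha, beta = 1, 1_000_000                   # content / style weights

for epoch in range(6):
    for _ in range(50):
        def closure():
            optimizer.zero_grad()
            F_content, F_styles = extract_features(x)
            loss = alpha * content_loss(F_content, P_target) \
                 + beta * sum(style_layer_loss(f, a)
                              for f, a in zip(F_styles, A_grams))
            loss.backward()
            return loss
        optimizer.step(closure)
```

L-BFGS calls the closure multiple times per step, which is why the loss and gradients are recomputed inside it.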

Results

Training my network for $20$ epochs with $300$ iterations per epoch, with a content weight of $0.5$ and a style weight of $20000000000$, yielded even better results. Unlike the earlier results, which took only about 5 minutes each to create, the following three images took around an hour. I applied the styles of Van Gogh's Starry Night, Picasso's Seated Nude, and Turner's The Shipwreck of the Minotaur.

Here are some other fun blends:

Failure/Sub-Optimal Result

In the image below, I tried to transfer Picasso's Guitar Player onto a photo of Slash. However, unlike in Picasso's painting, where the contours actually give form to a guitar player, in the style-transferred photo we just get a very sharp and blocky image.

What I Learned

The representations that convolutional neural networks learn can be used for so much more than dense encodings of images. Further, the use of the Gram matrix in computing the style loss had me wondering whether other matrix forms might capture even more information. For instance, what if we had used the Fisher matrix instead? Overall, I really enjoyed this project, and it has made me even more fascinated by computer vision.