Effective image resizing must respect both geometric and content constraints. Most methods, like cropping, only address the former; as a result, many popular resizing methods lose important content. To address this shortcoming, Shai Avidan and Ariel Shamir introduced a new method called seam carving.
Seam carving, rather than simply removing pixels on the periphery of an image, carves out pixels according to their importance, as calculated by an energy function. One can think of a pixel's 'importance' as a measure of how much content it contributes and how much its absence will be noticed. Intuitively, this means you want to remove pixels that blend in with their surroundings. For example, in a vertical portrait photo, the pixels that compose a uniformly colored sky above the subject would have little importance.
The algorithm itself is simple and is presented as the following optimization problem: $$s^{*} = \min_{\mathbf{s}} E(\mathbf{s}) = \min_{\mathbf{s}} \sum_{i=1}^{n} e(\mathbf{I}(s_{i}))$$ Here $s^{*}$ is the optimal seam to remove, $\mathbf{s}$ is the path of a seam, $\mathbf{I}(s_{i})$ is the pixel at position $i$ along the seam, and $e$ is the energy function. To identify the optimal seam, we use dynamic programming. Starting from the second row of the image, we compute the cumulative minimum energy $M$ over all possible connected seams for each entry $(i, j)$. Here is the recurrence relation: $$M(i, j) = e(i, j) + \min(M(i-1, j-1),\ M(i-1, j),\ M(i-1, j+1))$$ Once the optimal seam has been found, we remove it and repeat the calculate/remove process until the image has shrunk to the desired dimensions.
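The dynamic-programming step above can be sketched in a few lines of NumPy. This is a minimal illustration, not my full implementation; the function name and the backtracking-table layout are my own choices:

```python
import numpy as np

def find_vertical_seam(energy):
    """Find the minimum-energy vertical seam via dynamic programming.

    energy: 2-D array of e(i, j) values; returns one column index per row.
    """
    h, w = energy.shape
    M = energy.astype(float).copy()        # cumulative minimum energy M(i, j)
    back = np.zeros((h, w), dtype=int)     # parent column for backtracking

    for i in range(1, h):
        for j in range(w):
            # Connected seams: parent is one of columns j-1, j, j+1 (clamped).
            lo, hi = max(j - 1, 0), min(j + 1, w - 1)
            k = lo + int(np.argmin(M[i - 1, lo:hi + 1]))
            back[i, j] = k
            M[i, j] += M[i - 1, k]

    # Backtrack from the minimum entry in the last row.
    seam = np.zeros(h, dtype=int)
    seam[-1] = int(np.argmin(M[-1]))
    for i in range(h - 2, -1, -1):
        seam[i] = back[i + 1, seam[i + 1]]
    return seam
```

Removing the seam then amounts to deleting one pixel per row at the returned column indices and repeating until the target width is reached.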
Presented above is the formulation of the seam carving algorithm, but we haven't yet touched on the energy function. Remember, it is the energy function that determines the importance of a pixel! As such, a good energy function and a well-carved result go hand in hand! For my implementation, I used the $L_{2}$ norm of the image gradient, which, in effect, uses a pixel's horizontal and vertical edge contribution to determine its importance.
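As a sketch, the gradient-magnitude energy can be computed with NumPy. The function name and the choice to sum color channels before differentiating are my own simplifications:

```python
import numpy as np

def gradient_energy(img):
    """L2 norm of the image gradient as the energy map e(i, j).

    img: H x W (grayscale) or H x W x 3 (color) float array.
    """
    img = img.astype(float)
    if img.ndim == 3:                 # collapse color channels first
        img = img.sum(axis=2)
    dy = np.gradient(img, axis=0)     # vertical derivative
    dx = np.gradient(img, axis=1)     # horizontal derivative
    return np.sqrt(dx ** 2 + dy ** 2)
```

Flat regions (like a uniform sky) get near-zero energy under this function, while edges score highly, which is exactly the 'importance' behavior described above.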
Machu Picchu
Vertical Carve 60 pixels
Horizontal Carve 60 pixels
Sunset
Vertical Carve 60 pixels
Horizontal Carve 100 pixels
The Shard
Vertical Carve 40 pixels
Horizontal Carve 100 pixels
Eiffel
Vertical Carve 30 pixels
Horizontal Carve 80 pixels
Downtown Vancouver
Vertical Carve 50 pixels
Horizontal Carve 60 pixels
Amsterdam
Vertical Carve 50 pixels
Horizontal Carve 150 pixels
Sometimes failures are unavoidable. Take, for example, my photo of the Arc de Triomphe. For the horizontal carve, not only was some of the arch cut off at the top, but look at the ground :O
Arc de Triomphe
Vertical Carve 100 pixels
Horizontal Carve 150 pixels
Another failure case was close-up portrait photos: there's not much that can be removed. Vertical seam removal was fine, since only the edges were trimmed; horizontal seam removal was not. Here are some awkward photos of Sahai for your enjoyment.
Sahai
Vertical Carve 100 pixels
Horizontal Carve 100 pixels
More failures appeared in photos with centered subjects that seemingly extend to infinity. Again, vertical carving was fine, but horizontal? Not so much.
Dock
Vertical Carve 100 pixels
Horizontal Carve 100 pixels
I also implemented seam insertion. Similar to the removal process, we want to add low-energy seams. To do this, we duplicate the pixels of our optimal seam, averaging each one with its neighboring pixels (top/bottom for horizontal insertion and left/right for vertical insertion). This works well for inserting one seam; however, if we were to insert multiple seams this way, we would very likely keep inserting seams in the same location, which could lead to very bad artifacts. To remedy this, to insert $k$ seams we first find the $k$ seams that we would have removed, and then use those $k$ seams for duplication. In the results section, I've included both successes and failures. Perspective can play a big part in how well an image can be stretched (as seen in the case of the horizontally stretched Arc de Triomphe). The percentage of relevant content can also make seam insertion difficult (as seen in the case of Sahai's lovely portraits). A better energy function might lead to better results.
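The single-seam duplication step can be sketched as follows. This is an illustrative version, assuming a color image of shape H x W x C and a precomputed seam (one column index per row); averaging the seam pixel with its two horizontal neighbors is one reasonable reading of the scheme described above:

```python
import numpy as np

def insert_vertical_seam(img, seam):
    """Widen img by one column, duplicating the given vertical seam.

    The inserted pixel is the average of the seam pixel and its
    left/right neighbors (clamped at the image border).
    """
    h, w = img.shape[:2]
    out = np.empty((h, w + 1) + img.shape[2:], dtype=float)
    for i, j in enumerate(seam):
        left = img[i, max(j - 1, 0)].astype(float)
        right = img[i, min(j + 1, w - 1)].astype(float)
        out[i, :j + 1] = img[i, :j + 1]                      # pixels up to seam
        out[i, j + 1] = (left + img[i, j] + right) / 3.0     # inserted pixel
        out[i, j + 2:] = img[i, j + 1:]                      # remaining pixels
    return out
```

For $k$-seam insertion, you would first run removal $k$ times while recording each seam, then replay those recorded seams through this function, so that successive insertions don't pile up in the same low-energy region.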
Arch
Vertical Insert 100 pixels
Horizontal Insert 100 pixels
Sahai
Vertical Insert 100 pixels
Horizontal Insert 100 pixels
There is no one-size-fits-all mold for resizing pictures. Sometimes cropping might be the best tool for the job, if there is little to no important content at a photo's edges; other times, seam carving might be the way to go, if there's no clearly identifiable subject. Further, better-tuned energy functions would likely address a few of the failure cases that I ran into; seam carving is a great tool, but it's not a magic algorithm: a lot of design and intentionality should go into the selection of an energy function.
As done in the paper, I used a 19-layer VGG network with average pooling in place of max pooling layers; each pooling layer marks the end of one of the 5 "blocks" in the network. For the convolutional layers, I used a stride of $1$, padding of $1$, and $3 \times 3$ kernels. The number of channels per block was $64, 128, 256, 512, \text{ and } 512$, respectively.
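The block structure can be written down compactly as a configuration list, in the style PyTorch's torchvision uses for its VGG builders. This is a dependency-free sketch of the layout only; in an actual implementation each number would become a $3 \times 3$ Conv2d (stride 1, padding 1) followed by ReLU, and each `'P'` (my own marker) would become an AvgPool2d instead of the usual MaxPool2d:

```python
# VGG-19 convolutional configuration: numbers are conv output channels,
# 'P' marks the pooling layer that closes each of the 5 blocks.
VGG19_CFG = [64, 64, 'P',
             128, 128, 'P',
             256, 256, 256, 256, 'P',
             512, 512, 512, 512, 'P',
             512, 512, 512, 512, 'P']

def block_channels(cfg):
    """Return the output channel count of each block (delimited by 'P')."""
    blocks, cur = [], None
    for v in cfg:
        if v == 'P':
            blocks.append(cur)
        else:
            cur = v
    return blocks
```

Counting the numeric entries gives the 16 convolutional layers of VGG-19 (the other 3 layers in the "19" are the fully connected classifier layers, which style transfer doesn't use).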
As briefly mentioned in the background, there are two things our network must learn: a style representation and a content representation. As such, we use a linear combination of two loss functions to form the overall loss of the network. For the content loss, we use the squared error between the feature representations of the original content image and the semi-transformed image. For the style loss, we use a weighted squared error between the featurized Gram matrix forms of the original style image and the semi-transformed image. The loss equations are as follows: $$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i, j}(F^{l}_{ij} - P^{l}_{ij})^{2}$$ $$G_{ij}^{l} = \sum_{k}F_{ik}^{l} F_{jk}^l$$ $$\mathbf{E}_{l} = \frac{1}{4 N_{l}^2 M_{l}^2} \sum_{i,j}(G_{ij}^{l} - A_{ij}^l)^{2}$$ $$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_{l} \mathbf{E}_{l}$$ $$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta\,\mathcal{L}_{style}(\vec{a}, \vec{x})$$
$\vec{p}$ and $\vec{x}$ represent the original content image and the generated image, and $P^{l}$ and $F^{l}$ represent their feature representations in convolutional layer $l$. $G^{l}$ is the Gram matrix whose entry $G^{l}_{ij}$ is the inner product between the vectorized feature maps $i$ and $j$ in convolutional layer $l$. $\vec{a}$ and $\vec{x}$ are the original style image and the generated image; $A^{l}$ and $G^{l}$ are their style representations (in Gram matrix form) in convolutional layer $l$. $\mathbf{E}_{l}$ represents the unweighted squared error between the two aforementioned matrices. $\alpha$ and $\beta$ are two scalar hyperparameters that control how much the content and style losses contribute to the overall loss of the network.
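The Gram matrix and the two per-layer losses translate almost directly into NumPy. This is a plain-array sketch of the equations, not the autograd version used in training; $F$ is assumed to already be reshaped to $N_l \times M_l$ ($N_l$ feature maps, each flattened to $M_l$ values):

```python
import numpy as np

def gram_matrix(F):
    """G^l_ij = sum_k F^l_ik F^l_jk for an N_l x M_l feature matrix."""
    return F @ F.T

def content_loss(F, P):
    """Half the squared error between feature representations."""
    return 0.5 * np.sum((F - P) ** 2)

def layer_style_loss(F, A_gram):
    """E_l: normalized squared error between Gram matrices."""
    N, M = F.shape
    G = gram_matrix(F)
    return np.sum((G - A_gram) ** 2) / (4.0 * N ** 2 * M ** 2)
```

The full style loss is then the $w_l$-weighted sum of `layer_style_loss` over the chosen layers, and the total loss is $\alpha$ times the content loss plus $\beta$ times the style loss, as in the equations above.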
For my optimizer, I used the L-BFGS algorithm with a learning rate of $1$. I trained my network for $6$ epochs with $50$ iterations per epoch. To maximize style transfer, I used a content weight of $1$ and a style weight of $1000000$.
Training my network for $20$ epochs with $300$ iterations per epoch, with a content weight of $0.5$ and a style weight of $20000000000$, yielded even better results. Unlike the results before, which took only 5 minutes to create, the following three images took around an hour. I applied the styles of Van Gogh's Starry Night, Picasso's Seated Nude, and The Shipwreck of the Minotaur.
Here are some other fun blends.
In the image below, I tried to transfer Picasso's Guitar Player onto a photo of Slash. However, unlike in Picasso's painting, where the contours actually give form to a guitar player, in the stylistically transferred photo we just get a very sharp and blocky image.
The representations that convolutional neural networks learn can be used for so much more than dense representations of images. Further, the use of the Gram matrix in computing the style loss had me wondering whether other matrix forms store more information. For instance, what if we had used the Fisher matrix instead? Overall, I really enjoyed this project, and it's made me even more fascinated by computer vision.