A History of Fashion

Through the Lens of Machine Learning

Overview

In this project, I trained three neural networks to extract information from fashion photographs. Then I created a morph timeline video from the photographs and predicted labels, which shows the progression of fashion through each decade alongside historical context. The historical context was extracted using Natural Language Processing techniques. In the final video, the extracted clothing information, the morph of each photograph into the next, and the historical data are presented side by side, so that viewers can draw their own conclusions about whether history has any effect on fashion. More broadly, the video presents both the changes in fashion and major historical events throughout the decades.

The paper describing the technical details of the project can be viewed here.

Presentation slides accompanying the video can be seen here.

The complete morph video can be viewed on YouTube.

Part 1: Neural Networks

For the first part of this project, I trained three neural networks to classify clothing garments and automatically identify their key-points. Given a photograph of a person wearing a clothing item, these networks return:

1. A text label for the main item (e.g. Shirt, Dress, Shorts, etc.)
2. Multiple item attributes (e.g. abstract, silk, sheer, etc.)
3. A set of landmarks (key-points) of the main item (e.g. left sleeve, right sleeve, etc.)

The clothing label predicted by network (1) belongs to the main item in the photograph. Attributes predicted by network (2) relate to the main clothing item's texture, fabric, shape, part, or style. The landmarks predicted by network (3) outline the shape of the clothing item. The datasets, training process, and results are described in Part 1 below.

Part 2: Fashion Photographs Data

I used an existing publicly available fashion dataset to train the networks and then ran them on my own data. Fashion data collection was done via web scraping and is described in Part 2 below.

Part 3: Historical Events

For the third part of this project, I created a timeline of historical events from the same time period as the fashion photographs described in Part 2. To extract historical events, I used existing web data sources such as Wikipedia to collect data on the historical context of the 20th and 21st centuries. Historical events were also collected via web scraping, and the main summaries were extracted with Natural Language Processing. The details of obtaining historical events data and extracting main events are described in Part 3 below.

Part 4: Morphing Photos

Having the photographs with corresponding clothing labels, attributes, and key-points, I created a sequence of photograph morphs. Details are in Part 4 below.

Part 5: Timeline Video

Lastly, I put the morph sequence of photographs with labels and categories next to the extracted historical events and created a video timeline from the 1970s until 2022. The timeline video shows one photograph morphing into the next, alongside summaries of historical articles covering the same time period as the fashion photographs. The process of making the video is described in Part 5 below.

Part 1. Neural Networks

I trained three deep neural networks to classify clothing garments, based on the ResNet34 and ResNet18 architectures. The first network was trained to classify fashion garments, described below in Part 1A. The second network was trained to produce multiple clothing tags or attributes of a garment, described below in Part 1B. The third network was trained to produce 8 clothing key-points, described in Part 1C below.

Hardware

I used my own machines for this project: a Windows 10 PC with an Nvidia GeForce GTX 980 Ti GPU for training and testing, and a MacBook Air for web scraping and analysis.

To monitor GPU usage during model training, I used the nvitop tool. Running nvitop -m continuously monitors the GPU resources being used during training.

Training Dataset

For this project, I used the DeepFashion dataset for training models that can categorize and label different aspects of fashion photographs.

The dataset is a subset of the larger DeepFashion dataset, used for clothing category and attribute prediction tasks. The dataset was collected and published by the Multimedia Lab at the Chinese University of Hong Kong.

The set contains 289,222 images with bounding boxes and labels, covering 50 clothing categories and 26 clothing attributes. I used version 1.1 of the dataset, released on December 22, 2016.

A. Clothing Categories

The dataset contains 50 distinct clothing categories. The task of predicting clothing categories is a 1-of-K classification problem. The labels are given. There are 48 unique categories (numbered 1 to 48) because category 49 (Shirtdress) and category 50 (Sundress) have been merged with category 43 (Dress).
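
The category merge described above can be sketched as a small lookup. This is only an illustration: the category numbers are taken from the text, while the helper name `remap_category` and the sample labels are hypothetical.

```python
# Category ids that were folded into "Dress" (43), as described in the text.
MERGED_INTO_DRESS = {49: 43, 50: 43}


def remap_category(category_id: int) -> int:
    """Return the category id after merging Shirtdress and Sundress into Dress."""
    if not 1 <= category_id <= 50:
        raise ValueError(f"unexpected category id: {category_id}")
    return MERGED_INTO_DRESS.get(category_id, category_id)


# Hypothetical sample labels, for illustration only.
labels = [1, 43, 49, 50, 48]
remapped = [remap_category(c) for c in labels]
print(remapped)  # [1, 43, 43, 43, 48]
```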

For the Clothing Categories model, I converted the evaluation file into three CSV files with training data (209,222 images), validation data (40,000 images), and test data (40,000 images) for ease of loading the data.
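
A minimal sketch of that conversion is shown below. It assumes the evaluation file lists one image path and one split name (train, val, or test) per line, possibly preceded by a count line and a header line; that layout, the function names, and the single-column CSV are assumptions for illustration, not the exact script used in the project.

```python
import csv
import io
from collections import defaultdict

SPLIT_NAMES = {"train", "val", "test"}


def split_partition(lines):
    """Group 'image_path split' lines into a dict keyed by split name."""
    splits = defaultdict(list)
    for line in lines:
        parts = line.split()
        # Skip count lines, header lines, and blank lines.
        if len(parts) != 2 or parts[1] not in SPLIT_NAMES:
            continue
        splits[parts[1]].append(parts[0])
    return splits


def to_csv(image_paths):
    """Serialize one split as CSV text with a single 'image_name' column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image_name"])
    for path in image_paths:
        writer.writerow([path])
    return buffer.getvalue()


# Hypothetical sample content, for illustration only.
sample = [
    "4",
    "image_name  evaluation_status",
    "img/a/001.jpg train",
    "img/a/002.jpg val",
    "img/b/001.jpg test",
    "img/b/002.jpg train",
]
splits = split_partition(sample)
print(to_csv(splits["train"]))
```

In practice each CSV row would also carry the category label for the image; it is omitted here to keep the sketch short.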

B. Clothing Attributes

The dataset contains 26 distinct clothing attributes. The task of predicting clothing attributes is a multi-label tagging problem. The attributes and images were in file pairs, one file with images belonging to training, test or validation split, and one file with attributes. Each image contains 26 flags, where 1 means a positive label.

For the Clothing Attributes model, I converted the evaluation files into three CSV files with training data (14,000 images), validation data (2,000 images), and test data (4,000 images) for ease of loading the data.
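
To illustrate the flag encoding, the sketch below turns one row of 26 flags into a list of tag names. The placeholder attribute names (`attr_0` to `attr_25`) and the helper name are hypothetical; the real names come from the dataset's attribute list.

```python
# Placeholder names; the dataset supplies the real 26 attribute names.
ATTRIBUTE_NAMES = [f"attr_{i}" for i in range(26)]


def flags_to_tags(flags, attribute_names=ATTRIBUTE_NAMES):
    """Return the names of all attributes whose flag is 1 (a positive label)."""
    if len(flags) != len(attribute_names):
        raise ValueError("expected one flag per attribute")
    return [name for name, flag in zip(attribute_names, flags) if flag == 1]


# Hypothetical row: attributes 2 and 7 are positive, all others are 0.
row = [0] * 26
row[2] = 1
row[7] = 1
tags = flags_to_tags(row)
print(" ".join(tags))  # attr_2 attr_7
```

Joining the tags with a delimiter, as in the last line, gives a single label column that a multi-label data loader can split again.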

Dataloader

From the CSV files, I loaded the data using the FastAI Python library: ImageDataLoaders for the Clothing Categories model and the DataBlock API for the Clothing Attributes model.

I first transformed each image in the data loader to be square (300x300 pixels). Then, the image was augmented by a random set of augmentations, including rotation, resize, crop, and zoom.

A sample data batch for the Clothing Categories model (with labels as numbers) is shown in the image below.

Sample data batch for clothing category

Sample single augmented image for the Clothing Categories model (with labels as numbers) is shown in the image below.

Sample data augmentation for clothing category

Sample data batch for the Clothing Attributes model is shown in the image below.

Sample data batch for clothing attributes

As evident from the image and labels above, not all labels are correct. For example, the dress in the bottom left corner has a label "no_dress". This means that the trained model will predict attributes with a margin of error.

Training Process

A. Clothing Categories Model

The Clothing Categories model was based on the ResNet34 architecture with pre-trained weights. To train the model, I used the Python FastAI library. After loading the data, I used lr_find to find the optimal learning rate. The plot of loss versus learning rate is shown in the image below.

I used a variant of the Cross-Entropy Loss function (Flattened Loss).

Finding optimal learning rate with FastAI

Knowing the optimal learning rate, I trained the model. The suggested optimal learning rate was around 0.0036308. I first fine-tuned the model with this learning rate for 2 epochs. Then, I used a variable learning rate between 1e-7 and 1e-2 and trained the tuned model for 6 more epochs.

The training and validation loss for the model training are shown in the image below.

Categories model training and validation loss curve over time (seconds)

The model accuracy during training is shown in the image below.

Categories model accuracy curve during training epochs

To test whether I should train the model for longer, I doubled the number of epochs in the fine-tuning and fitting stages. Overall, the model accuracy didn't get much better. After fine-tuning the model for 4 epochs and training for 12, the maximum accuracy achieved was 69.6109%. The training loss went down to 0.701990, but validation loss went up to 1.132544, which means that the model is over-fitted.

I kept the model with around 69.5% accuracy, shown above, to run on my own data.

Model Evaluation

For model evaluation, I used the model trained for 6 epochs instead of 12, as described above.

Examples of evaluating the trained model are shown below.

Categories model evaluation test

In the random batch of 9 images shown above, the model correctly categorized 8 items (green = correct, red = incorrect).

B. Clothing Attributes Model

Similarly to the Clothing Categories model, the model for Clothing Attributes was based on the ResNet34 architecture with pre-trained weights. I used a similar data loading and training technique. The only difference between this model and the category one was the loss function. For the Clothing Attributes model, I used a smoothed version of the default loss function for multi-label tagging problems, BCEWithLogitsLossFlat. The final model accuracy was 87.0256%.
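
For readers unfamiliar with multi-label losses, here is a plain-Python sketch of binary cross-entropy computed from raw logits, together with a thresholded per-label accuracy. It does not reproduce the smoothing mentioned above, and whether the reported accuracy was computed with a 0.5 threshold is an assumption; the function names and sample values are illustrative only.

```python
import math


def sigmoid(x):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def multilabel_accuracy(logits, targets, threshold=0.5):
    """Fraction of labels whose thresholded prediction matches the target flag."""
    hits = sum((sigmoid(z) > threshold) == bool(y) for z, y in zip(logits, targets))
    return hits / len(logits)


# Hypothetical logits and flags for four attributes of one image.
logits = [2.0, -1.0, 0.5, -3.0]
targets = [1, 0, 0, 0]
print(bce_with_logits(logits, targets))
print(multilabel_accuracy(logits, targets))  # 0.75
```

Each attribute is scored independently, which is what distinguishes this setup from the 1-of-K cross-entropy used for the category model.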

The training and validation loss for the attributes model are shown in the image below. The model was trained for 10 epochs.

Attributes model training and validation loss curve over time (seconds)

When I trained the model for longer (~20 epochs), the training and validation losses diverged, which is an indication that the model was over-fitting. The losses started to diverge at around 10 epochs. For evaluation and testing, I used the model fine-tuned for 10 epochs, where the training and validation losses converged at a low value (shown above).
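
One simple way to locate the point where the losses start to diverge is to find the epoch with the lowest validation loss and look at the gap between the two curves. The sketch below uses made-up loss values purely for illustration (they are not the project's measurements), and the helper names are hypothetical.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1


def generalization_gap(train_losses, val_losses):
    """Return validation loss minus training loss for each epoch."""
    return [v - t for t, v in zip(train_losses, val_losses)]


# Made-up curves: training loss keeps falling while validation loss turns upward.
train_losses = [1.2, 0.9, 0.7, 0.6, 0.5, 0.4]
val_losses = [1.3, 1.0, 0.85, 0.8, 0.9, 1.0]

print(best_epoch(val_losses))  # 4
print(generalization_gap(train_losses, val_losses))
```

A gap that keeps growing after the best epoch is the over-fitting pattern described above.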

C. Landmarks Model

To train a key-points neural network on the DeepFashion dataset, I loaded 14,000 training images, 2,000 validation images, and 4,000 test images, along with their bounding boxes and key-points. Each image has at most 8 key-points. Missing or occluded key-points were marked with zeros.

I trained a neural network using the existing ResNet18 architecture and starting with pre-trained weights for 100 epochs with a learning rate of 0.0001. For data augmentation, I applied a random color jitter and a random rotation between -5.0 and 5.0 degrees. I used an MSELoss for the problem of predicting key-points.

The images were converted to grayscale, cropped, augmented, and resized to 224x224 pixels.
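
The mean squared error used for key-point regression can be written out in a few lines. In this sketch, missing key-points (marked with zeros) are treated as ordinary targets at (0, 0); that handling, the function name, and the sample coordinates are assumptions for illustration.

```python
def keypoint_mse(predicted, target):
    """Mean squared error over all flattened (x, y) key-point coordinates."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of points")
    flat_pred = [c for point in predicted for c in point]
    flat_true = [c for point in target for c in point]
    return sum((p - t) ** 2 for p, t in zip(flat_pred, flat_true)) / len(flat_pred)


# Hypothetical example with two key-points; the second is missing, so it is (0, 0).
predicted = [(1.0, 2.0), (0.0, 0.0)]
target = [(1.0, 4.0), (0.0, 0.0)]
print(keypoint_mse(predicted, target))  # 1.0
```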

The augmented training samples are shown in the images below. Missing key-points are shown in the top-left corner.

Landmarks model: random augmented sample

Landmarks model: random augmented sample

Landmarks model: random augmented sample

The training and validation loss are shown in the image below.

Landmarks model training and validation loss curve over time

Model Evaluation

Successful predictions are shown below, where most key-points are close to the clothing point of interest (e.g. "left collar", "right collar", "left sleeve", "right sleeve", "left hem", "right hem").

Landmarks model successful evaluation

Landmarks model successful evaluation

Unsuccessful example predictions are shown below.

Landmarks model unsuccessful evaluation

Landmarks model unsuccessful evaluation

Models Discussion and Limitations

Both the attributes and category models achieved reasonably high accuracy on the test set (above 69%). However, the collected fashion photograph dataset described in Part 2 varied significantly from the modern-day photographs in the DeepFashion dataset. Another limitation of the training dataset was its fixed set of 26 attributes, which did not cover more nuanced photographs or less commonplace items. Nevertheless, the models could quite accurately predict the category of the main item and get at least some attributes of the photograph right, if the photo was similar enough for the model to recognize. In general, the models performed well on more recent photographs and poorly on old magazines that were illustrated by hand or scanned from physical copies.

Testing the landmarks network produced mixed results. The predictions were poor on photographs of models facing away from the camera, and on photos where the clothing has a white or light color, perhaps because the color of the clothes is similar to the background.

Part 2. Fashion Photographs Data

I acquired my own fashion photographs online. I wrote a web scraper to programmatically download all Vogue America covers from https://archive.vogue.com. The time period covered was from the start of Vogue magazine in 1892 until 2022. Each of the 131 years has between 10 and 30 magazine covers, for a total of 2875 images.

Additionally, I scraped images from the Fashion History Timeline website (https://fashionhistory.fitnyc.edu) from 1900 until 2019. Each decade had between 25 and 45 images, for a total of 445 images.

My test dataset contained a total of 3320 images.

I prepared the test dataset by writing each image file path as a line in a CSV file. Running the Clothing Categories model on the prepared fashion dataset produced a CSV file with predicted categories. Running the Clothing Attributes model on the prepared fashion dataset produced a CSV file with predicted attributes. Running the Clothing Key-points model on the prepared fashion dataset produced a folder of image files with the predicted key-points drawn on top of the images, and a CSV file with file paths, bounding boxes, and predicted key-point coordinates.

Prior to running the trained landmarks model on my own dataset, I used the Python cvlib library to automatically detect people and extract bounding boxes around them in each test photograph. This library uses a YOLOv3 model trained on the COCO dataset (https://cocodataset.org/) underneath. A bounding box was extracted and passed to the clothing key-points model only if the detected object had the label "person".
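
The label filter can be sketched as follows. cvlib's detect_common_objects returns three parallel lists (boxes, labels, confidences); the sketch takes such lists as input so it runs without the library, and the helper name and sample detections are hypothetical.

```python
def person_boxes(bboxes, labels):
    """Keep only the bounding boxes whose detected label is 'person'."""
    return [box for box, label in zip(bboxes, labels) if label == "person"]


# Hypothetical detector output for one photograph: boxes are [x1, y1, x2, y2].
bboxes = [[40, 10, 300, 480], [310, 200, 400, 330], [5, 5, 120, 460]]
labels = ["person", "handbag", "person"]

for box in person_boxes(bboxes, labels):
    print(box)
```

Photographs with no "person" detection yield an empty list and can be skipped before the key-points model is run.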

Sample Vogue magazine covers and Fashion History Timeline photos are shown below.

Vogue Cover, January 1970

Vogue Cover, March 2015

Fashion History Timeline photo (2005)

Part 3. Historical Events

To obtain a list of historical events to correlate with the fashion photographs, I used web scraping and Natural Language Processing techniques.

First, I compiled a list of relevant Wikipedia articles describing each decade, from the 1970s until the 2020s. For each article, I extracted the main content, excluding the footer, references, and side navigation menus. Then, I used the Python NLTK package and its pre-trained data to produce a summary of the article's content. I weighted each word by its frequency and ranked each sentence by the weights of the words it contains, obtaining an automatic "summary" of the webpage, composed of the sentences that had the most frequently used words. I kept the top 5 sentences to make up the summary. The extracted summaries are the historical text that accompanies the fashion photograph morph in the final video, shown above.
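
The frequency-based ranking described above can be sketched without any external packages. This version swaps NLTK's tokenizers and stop-word list for a regular expression and a tiny hand-written stop-word set, so it is an approximation of the approach rather than the project's code; returning the chosen sentences in their original order is also an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NLTK ships a much larger one.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "after", "was", "is"}
WORD_PATTERN = re.compile(r"[a-z']+")


def summarize(text, top_n=5):
    """Return the top_n sentences ranked by the summed weights of their words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in WORD_PATTERN.findall(text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    if not counts:
        return []
    highest = max(counts.values())
    # Normalize so the most frequent word has weight 1.0.
    weights = {word: count / highest for word, count in counts.items()}
    scored = []
    for index, sentence in enumerate(sentences):
        score = sum(weights.get(w, 0.0) for w in WORD_PATTERN.findall(sentence.lower()))
        scored.append((score, index))
    best = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:top_n]
    # Present the selected sentences in their original order.
    return [sentences[index] for _, index in sorted(best, key=lambda pair: pair[1])]


text = (
    "The war ended. The war changed the economy. "
    "Cats sleep. The economy grew after the war."
)
for sentence in summarize(text, top_n=2):
    print(sentence)
```

Because scores are sums rather than averages, longer sentences tend to rank higher; dividing by sentence length is a common variation.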

Sample extracted content from the Wikipedia 2020s article (https://en.wikipedia.org/wiki/2020s) is shown in the image below.

Extracted Decade Summary from a Wikipedia article (The 2020s)

Part 4. Morphing Photographs

Morphing the photographs followed the same principles as the Morphing Faces project (https://inst.eecs.berkeley.edu/~cs194-26/fa22/upload/files/proj3/cs194-26-ady/). The correspondences were defined by the 8 landmarks returned by the neural network model described in Part 1C, along with 30 points on the bounding box of the person and 4 corner points of the image.
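
As a rough sketch of how such a correspondence set could be assembled, the code below combines 8 landmarks, 30 points on the bounding box, and the 4 image corners into 42 points. Spacing the 30 points evenly along the box perimeter is an assumption, as are the function names and sample values.

```python
def bbox_perimeter_points(x1, y1, x2, y2, count=30):
    """Return `count` points evenly spaced along the rectangle's perimeter."""
    width, height = x2 - x1, y2 - y1
    perimeter = 2 * (width + height)
    points = []
    for i in range(count):
        d = perimeter * i / count  # distance travelled clockwise from (x1, y1)
        if d < width:
            points.append((x1 + d, y1))  # top edge
        elif d < width + height:
            points.append((x2, y1 + (d - width)))  # right edge
        elif d < 2 * width + height:
            points.append((x2 - (d - width - height), y2))  # bottom edge
        else:
            points.append((x1, y2 - (d - 2 * width - height)))  # left edge
    return points


def correspondence_points(landmarks, bbox, image_size):
    """Combine landmarks, bounding-box points, and image corners into one list."""
    width, height = image_size
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    return list(landmarks) + bbox_perimeter_points(*bbox) + corners


# Hypothetical inputs: 8 landmarks, a person box, and a 400x500 image.
landmarks = [(150 + 10 * i, 120 + 15 * i) for i in range(8)]
points = correspondence_points(landmarks, (10, 20, 110, 220), (400, 500))
print(len(points))  # 42
```

Including the image corners ensures the triangulation covers the whole frame, so background pixels are warped along with the clothing.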

An example of the 8 predicted key-points is shown in the image below.

Landmarks predictions on top of a sample fashion photograph

Out of 3320 test photographs, spanning the years 1892 until 2022, I chose to focus on the period from the latter half of the 20th century until the present, because the earlier photographs were very different in appearance from the training data and the three models often didn't return valid results for them. From 1970 until 2022, I had 532 photographs to make a morphing video. After discounting invalid bounding boxes and photographs where key-points landed outside the photograph, there were around 300 photographs left. Each pair of photographs was then processed as follows:

Resize each photograph to be the same height (500px).
Crop the wider photograph such that the widths of two photographs are the same and all key-points fall inside the photograph.
Create a triangular Delaunay mesh of the average (mid) photograph key-points.
For each triangle in the mesh, calculate the reverse affine transform to get from source to target triangle.
Warp each triangle and the pixels inside it to get the new shape.
Shade each triangle with linear interpolation.
Repeat for each frame where frame 1 is the source image and frame 45 is the target image, blending both shape and color in each frame proportionally.

If any step above failed, I discarded the photo from the morph sequence. Running the morphing algorithm on the remaining 300+ photos left over 100 photographs with a valid boundary. I made the final morph video from these photographs.
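
The per-triangle affine step and the frame-by-frame shape blending from the list above can be sketched in plain Python. The affine coefficients are solved with Cramer's rule; mapping frame 1 to t = 0 and frame 45 to t = 1 follows the text, while the function names and sample triangles are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve the 3x3 linear system m * x = rhs with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate triangle")
    solution = []
    for col in range(3):
        replaced = [[rhs[r] if c == col else m[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(replaced) / d)
    return solution


def affine_from_triangles(src, dst):
    """Return rows (a, b, c), (d, e, f) with u = a*x + b*y + c and v = d*x + e*y + f."""
    m = [[x, y, 1.0] for x, y in src]
    return solve3(m, [u for u, _ in dst]), solve3(m, [v for _, v in dst])


def apply_affine(affine, point):
    """Apply an affine transform to a single (x, y) point."""
    (a, b, c), (d, e, f) = affine
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


def frame_fraction(frame, total_frames=45):
    """Blend fraction t: 0.0 at frame 1 (source) and 1.0 at the last frame (target)."""
    return (frame - 1) / (total_frames - 1)


def interpolate_points(src_pts, dst_pts, t):
    """Linearly blend two point sets to get the intermediate shape at fraction t."""
    return [((1 - t) * xs + t * xd, (1 - t) * ys + t * yd)
            for (xs, ys), (xd, yd) in zip(src_pts, dst_pts)]


src_tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst_tri = [(2.0, 3.0), (4.0, 3.0), (2.0, 6.0)]
mid_tri = interpolate_points(src_tri, dst_tri, frame_fraction(23))
# Inverse warp: map a pixel of the intermediate triangle back into the source image.
mid_to_src = affine_from_triangles(mid_tri, src_tri)
print(apply_affine(mid_to_src, mid_tri[1]))
```

Warping in the inverse direction, from the intermediate frame back to each input photograph, means every output pixel gets a value and no holes appear in the result.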

An example of the triangular mesh is shown in the image below.

Sample triangular mesh (step 4 above)

Part 5. Timeline Video

Putting all previous parts together, I connected each pair of images, morphing one into the next in 45 frames, and sped the sequence up to 3 seconds per image pair automatically using Python. Then, I combined the historical context data and the morph video sequences for each decade in iMovie to produce a 4-minute video showing the changes in Vogue and other popular magazines from 1970 until 2022. The video can be viewed on YouTube.

References

Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. "DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.