CS194-26 Final Project: Video Action Anticipation

Clark Wang1,   Harshayu Girase1,   Karttikeya Mangalam1,   Prof. Jitendra Malik1
1UC Berkeley

Abstract


In this project, we explore the task of action anticipation: given frames from previously observed video clips, predict the actions that will occur later in the video. We use an MViT model to extract attention-based features from the observed clips, then apply a transformer-encoder-based classifier on top of those features to predict future actions. This design lets the model reason sequentially over the observed clips while attending to the most important features within each frame. We evaluate on the popular action anticipation benchmark Epic-Kitchens-100 (EK100), which consists of first-person (egocentric), unscripted recordings of daily kitchen activities annotated with verb-noun action labels. We report top-1 accuracy and top-5 class-mean recall for action anticipation, visualize frame-level attention, and discuss next steps.
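
For concreteness, the two metrics reported above can be summarized with a minimal PyTorch sketch. This is an illustrative reimplementation, not the official EK100 evaluation code, and the tensor names and shapes are our own assumptions:

```python
import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of clips whose highest-scoring class matches the label.

    logits: (N, C) predicted action scores; labels: (N,) ground-truth class ids.
    """
    preds = logits.argmax(dim=1)
    return (preds == labels).float().mean().item()

def top5_class_mean_recall(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Top-5 recall computed per class, then averaged over the classes that
    appear in the ground truth (in the spirit of the EK100 class-mean protocol)."""
    top5 = logits.topk(5, dim=1).indices                     # (N, 5)
    hit = (top5 == labels.unsqueeze(1)).any(dim=1).float()   # (N,) 1 if label in top-5
    recalls = [hit[labels == c].mean() for c in labels.unique()]
    return torch.stack(recalls).mean().item()
```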


Key Ideas



Our approach is motivated, in part, by the powerful attention mechanisms of Transformer architectures. Recently, there has been a surge of interest in replacing convolutional networks with transformers in vision. In our work, we adopt an MViT backbone and experiment with different configurations, alongside a transformer encoder head. This architecture enables the model to accurately summarize frames from the past and surface the features most useful for predicting the upcoming action. A sketch of this two-stage design follows below.
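
The sketch below illustrates the two-stage design described above, assuming the MViT backbone has already produced one feature vector per observed clip. The module structure matches the description, but the dimensions, layer counts, and class-token pooling are illustrative assumptions rather than our exact training configuration:

```python
import torch
import torch.nn as nn

class AnticipationHead(nn.Module):
    """Transformer encoder over per-clip MViT features, followed by a linear
    classifier that predicts the future action class."""

    def __init__(self, feat_dim=768, num_classes=3806, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, feat_dim))
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):             # clip_feats: (B, T, D), T observed clips
        B = clip_feats.size(0)
        tokens = torch.cat([self.cls_token.expand(B, -1, -1), clip_feats], dim=1)
        encoded = self.encoder(tokens)          # attend across the observed clips
        return self.classifier(encoded[:, 0])   # classify from the [CLS] position

# Hypothetical usage: `feats` holds frozen MViT features of shape (batch, num_clips, D).
# logits = AnticipationHead()(feats)
```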


Main Results



We compare our results to RULSTM and AVT, the previous and current state-of-the-art methods, respectively. While our model outperforms RULSTM on top-1 accuracy, we still lag behind AVT's top-5 recall scores. We believe this stems from our current inability to train end-to-end due to compute limitations. Above is a visualization of where our backbone MViT network focuses its attention. Across frames, the model's attention is scattered: it does not latch on to the same objects in the scene as we move from frame to frame. In comparison, the attention visualizations in AVT's results show that their backbone learned to focus spatial attention on the same relevant pixels across frames. This suggests that end-to-end training, which would give the backbone a signal about its downstream label, should help it focus its attention and improve our scores.
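
For readers curious how heatmaps like the one above can be produced, the sketch below turns a single block's attention matrix into per-frame spatial maps. How the raw attention weights are exposed depends on the specific MViT implementation, so the function signature and shapes here are assumptions rather than the exact visualization code we used:

```python
import torch
import torch.nn.functional as F

def cls_attention_heatmaps(attn, thw, frame_size):
    """Turn one block's attention matrix into per-frame spatial heatmaps.

    attn:       (num_heads, num_tokens, num_tokens) attention weights from one
                MViT block (how these are exposed depends on the implementation).
    thw:        (T, H, W) token grid at that block, so T*H*W + 1 == num_tokens
                when a class token is prepended.
    frame_size: (height, width) of the input frames, used for upsampling.
    """
    T, H, W = thw
    # Average over heads and take the class token's attention to all patch tokens.
    cls_to_patches = attn.mean(dim=0)[0, 1:]                  # (T*H*W,)
    maps = cls_to_patches.reshape(T, 1, H, W)                 # one coarse map per frame
    maps = F.interpolate(maps, size=frame_size, mode="bilinear", align_corners=False)
    # Normalize each frame's map to [0, 1] so it can be overlaid on the RGB frames.
    flat = maps.flatten(1)
    flat = (flat - flat.min(dim=1, keepdim=True).values) / \
           (flat.max(dim=1, keepdim=True).values - flat.min(dim=1, keepdim=True).values + 1e-8)
    return flat.reshape(T, *frame_size)                       # (T, height, width)
```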


Future Directions


Currently, we are working on obtaining more compute on another GPU cluster and setting up distributed training code, so that end-to-end training can finish in a reasonable amount of time (see the sketch below). In the future, we also plan to address long-term anticipation. This is a harder problem regime, in part because it requires efficiently remembering and summarizing past information, and because the space of possible (multimodal) futures grows as the anticipation horizon increases. Overall, we are excited by the trends in our results. MViT (and the transformer models to come) offers the capability to truly explore action anticipation, and hopefully to properly tackle long-term anticipation as well.
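
As a starting point for the distributed setup mentioned above, here is a minimal PyTorch DistributedDataParallel training loop launched with torchrun; the tiny linear model and random tensors stand in for the real MViT-plus-head model and EK100 data loader, and all dimensions are illustrative:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets LOCAL_RANK and WORLD_SIZE; one process is spawned per GPU.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-ins for the real MViT + anticipation head and the EK100 loader.
    model = DDP(torch.nn.Linear(768, 3806).cuda(local_rank), device_ids=[local_rank])
    dataset = TensorDataset(torch.randn(256, 768), torch.randint(0, 3806, (256,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(2):
        sampler.set_epoch(epoch)             # reshuffle differently each epoch
        for feats, labels in loader:
            feats, labels = feats.cuda(local_rank), labels.cuda(local_rank)
            loss = F.cross_entropy(model(feats), labels)
            optimizer.zero_grad()
            loss.backward()                   # DDP all-reduces gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
```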


Acknowledgements

I'd like to thank the CS194-26 course staff for a wonderful semester! This project was an intensely difficult undertaking, with many of the difficulties stemming from working within such a large codebase and coding with transformers. I'm extremely grateful for the mentorship and teamwork of Harshayu Girase, Karttikeya Mangalam, and Professor Malik. Additionally, this webpage template was adapted from this wonderful source.