MAST: A Memory-Augmented Self-supervised Tracker

Zihang Lai, Erika Lu, Weidi Xie

Visual Geometry Group, Department of Engineering Science, University of Oxford

CVPR 2020


DAVIS-2017 Video Segmentation


YouTube-VOS 2018 Video Segmentation

Abstract

Recent interest in self-supervised dense tracking has yielded rapid progress, but performance still lags far behind that of supervised methods. We propose a dense tracking model trained on videos without any annotations that surpasses previous self-supervised methods on existing benchmarks by a significant margin (+15%) and achieves performance comparable to supervised methods. In this paper, we first reassess the traditional choices used for self-supervised training and the reconstruction loss by conducting thorough experiments that elucidate the optimal choices. Second, we further improve on existing methods by augmenting our architecture with a crucial memory component. Third, we benchmark on large-scale semi-supervised video object segmentation (a.k.a. dense tracking) and propose a new metric: generalizability. Our first two contributions yield a self-supervised network that, for the first time, is competitive with supervised methods on standard evaluation metrics of dense tracking. When measuring generalizability, we show that self-supervised approaches are actually superior to the majority of supervised methods. We believe this new generalizability metric better captures the real-world use-cases for dense tracking, and will spur new interest in this research direction.
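
The core self-supervised signal is a reconstruction loss: the current frame is reconstructed by soft-copying colours from a memory bank of reference frames via feature-space attention. The snippet below is a minimal PyTorch-style sketch of this copy operation, not the released implementation; the function and variable names (reconstruct_from_memory, memory_feats, the temperature value, etc.) are illustrative assumptions.

import torch

def reconstruct_from_memory(query_feat, memory_feats, memory_colors, temperature=0.07):
    """Soft-copy colours from memory frames onto the query frame.

    query_feat:    (C, H, W)    features of the frame to reconstruct
    memory_feats:  (T, C, H, W) features of T reference (memory) frames
    memory_colors: (T, D, H, W) colour channels of the memory frames (e.g. Lab ab)
    """
    C, H, W = query_feat.shape
    T = memory_feats.shape[0]

    q = query_feat.reshape(C, H * W)                                                 # (C, N)
    k = memory_feats.reshape(T, C, H * W).permute(1, 0, 2).reshape(C, T * H * W)     # (C, T*N)
    v = memory_colors.reshape(T, -1, H * W).permute(1, 0, 2).reshape(-1, T * H * W)  # (D, T*N)

    # Affinity between every query pixel and every pixel of every memory frame.
    affinity = torch.softmax(q.t() @ k / temperature, dim=1)  # (N, T*N)

    # Reconstruct the query colours as an affinity-weighted copy of memory colours.
    return (v @ affinity.t()).reshape(-1, H, W)               # (D, H, W)

# Training signal (hypothetical usage): regress the reconstruction onto the true colours.
# loss = torch.nn.functional.smooth_l1_loss(
#     reconstruct_from_memory(q_feat, mem_feats, mem_colors), target_colors)

At inference, memory_colors is replaced by the instance masks of the annotated first frame and of selected past frames, so the same affinity propagates segmentation labels through the video.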

@inproceedings{Lai20,
  title={MAST: A Memory-Augmented Self-supervised Tracker},
  author={Lai, Zihang and Lu, Erika and Xie, Weidi},
  booktitle={CVPR},
  year={2020}
}
Video
Downloads
Please contact zihang.lai at gmail.com if you have any questions.
Results


Video segmentation results on the YouTube-VOS 2018 dataset.

Video segmentation results on the DAVIS-2017 dataset. Higher values are better.
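
Below is a minimal sketch of the generalizability measure proposed in the abstract, assuming it is computed as the seen-to-unseen performance gap on the YouTube-VOS category split (a self-supervised tracker sees no category labels, so a small gap suggests it has not overfit to a fixed vocabulary). The function name and the example scores are illustrative, not taken from the released evaluation code.

def generalization_gap(j_seen, f_seen, j_unseen, f_unseen):
    """Drop in mean J&F from seen to unseen categories (smaller is better).

    j_*, f_*: region-similarity (J) and contour-accuracy (F) scores in percent.
    """
    seen_mean = (j_seen + f_seen) / 2.0
    unseen_mean = (j_unseen + f_unseen) / 2.0
    return seen_mean - unseen_mean

# Hypothetical scores, for illustration only:
# generalization_gap(80.0, 82.0, 70.0, 74.0)  ->  9.0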

Acknowledgements
The authors would like to thank Andrew Zisserman for helpful discussions, Olivia Wiles, Shangzhe Wu, Sophia Koepke and Tengda Han for proofreading. Financial support for this project is provided by EPSRC Seebibyte Grant EP/M013774/1. Erika Lu is funded by the Oxford-Google DeepMind Graduate Scholarship.