Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better
TL;DR: A novel architectural component designed to enhance temporal consistency in video models.
How it works
(1) We first sample video features along point tracks, converting them into a set of track tokens. (2) These tokens are then processed by the Track Transformer module, which updates them over time by propagating and smoothing information along the tracks, enabling robust video representations. (3) Finally, the refined track features are splatted back into the video feature space. By explicitly incorporating motion information through point tracks, our approach improves temporal alignment, effectively captures complex object movements, and ensures stable feature representations across time. A minimal code sketch of this pipeline is given below.
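The PyTorch sketch below illustrates the three steps under stated assumptions: the class name TracktentionLayer, the use of cross-attention for sampling and splatting, and all shapes and hyperparameters are illustrative choices, not the authors' reference implementation.

```python
# Minimal sketch of the Tracktention pipeline (assumptions, not the official code):
# (1) sample video features into track tokens, (2) refine them along time with a
# Track Transformer, (3) splat the refined tokens back onto the video features.
import torch
import torch.nn as nn


class TracktentionLayer(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_heads: int = 4, depth: int = 2):
        super().__init__()
        # (1) Attentional sampling: track queries attend to per-frame video tokens.
        self.sample_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # (2) Track Transformer: temporal self-attention along each track.
        layer = nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True)
        self.track_transformer = nn.TransformerEncoder(layer, depth)
        # (3) Attentional splatting: video tokens attend back to the track tokens.
        self.splat_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_feats: torch.Tensor, track_emb: torch.Tensor) -> torch.Tensor:
        # video_feats: (B, T, N, C) per-frame spatial tokens
        # track_emb:   (B, T, K, C) embeddings of K point-track locations per frame
        B, T, N, C = video_feats.shape
        K = track_emb.shape[2]
        v = video_feats.reshape(B * T, N, C)
        q = track_emb.reshape(B * T, K, C)
        # (1) Sample: one token per track per frame.
        tracks, _ = self.sample_attn(q, v, v)
        # (2) Refine along time: treat each track as a length-T sequence.
        tracks = tracks.reshape(B, T, K, C).transpose(1, 2).reshape(B * K, T, C)
        tracks = self.track_transformer(tracks)
        tracks = tracks.reshape(B, K, T, C).transpose(1, 2).reshape(B * T, K, C)
        # (3) Splat: video tokens query the refined track tokens; residual update.
        out, _ = self.splat_attn(v, tracks, tracks)
        return video_feats + out.reshape(B, T, N, C)


# Toy usage: 2 clips, 8 frames, 196 tokens/frame, 32 tracks, 64-dim features.
layer = TracktentionLayer(dim=64)
feats = torch.randn(2, 8, 196, 64)
tracks = torch.randn(2, 8, 32, 64)
print(layer(feats, tracks).shape)  # torch.Size([2, 8, 196, 64])
```

In this sketch, cross-attention stands in for both the sampling and splatting operators, and the residual connection lets the layer act as a drop-in temporal refiner on top of an existing image backbone.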
Comparisons to other methods
Compare the results of our method, Tracktention (right), with baseline methods (left).
Video Depth Estimation
Automatic Video Colorization
Method overview
Video Introduction
Acknowledgements
The authors of this work were supported by ERC 101001212-UNION.