TL;DR: Tracktention is an architectural component that uses point tracks to improve temporal consistency in video models.


How it works

(1) We first sample video features along point tracks, converting them into a set of track tokens (Attentional Sampling). (2) These tokens are then processed by the Track Transformer module, which updates them over time by propagating and smoothing information along the tracks, yielding robust video representations. (3) Finally, the refined track features are splatted back into the video feature space (Attentional Splatting). By explicitly incorporating motion information through point tracks, our approach improves temporal alignment, captures complex object movements, and keeps feature representations stable across time.
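The sketch below illustrates these three stages in PyTorch. It is a minimal, illustrative version only: the module and parameter names (TracktentionLayer, dim, num_heads, the linear positional encoding of track coordinates) are assumptions for exposition, not the authors' exact implementation.

```python
# Minimal sketch of the three stages: Attentional Sampling -> Track Transformer
# -> Attentional Splatting. Names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class TracktentionLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Encode (x, y) track coordinates into query tokens (assumed encoding).
        self.track_pos_enc = nn.Linear(2, dim)
        # (1) Attentional Sampling: track tokens query the frame's image tokens.
        self.sample_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # (2) Track Transformer: temporal self-attention along each track.
        self.track_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, dim * 4, batch_first=True),
            num_layers=2,
        )
        # (3) Attentional Splatting: image tokens query the refined track tokens.
        self.splat_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T, N, C) per-frame image tokens
        # tracks:       (B, T, K, 2) point-track coordinates per frame
        B, T, N, C = video_tokens.shape
        K = tracks.shape[2]

        frames = video_tokens.reshape(B * T, N, C)
        track_q = self.track_pos_enc(tracks).reshape(B * T, K, C)

        # (1) Pool image features onto the tracks, one frame at a time.
        track_tokens, _ = self.sample_attn(track_q, frames, frames)

        # (2) Propagate and smooth information over time, independently per track.
        track_tokens = (
            track_tokens.reshape(B, T, K, C).permute(0, 2, 1, 3).reshape(B * K, T, C)
        )
        track_tokens = self.track_transformer(track_tokens)
        track_tokens = (
            track_tokens.reshape(B, K, T, C).permute(0, 2, 1, 3).reshape(B * T, K, C)
        )

        # (3) Splat the refined track features back into the video feature space.
        out, _ = self.splat_attn(frames, track_tokens, track_tokens)
        return video_tokens + out.reshape(B, T, N, C)  # residual update
```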





Comparisons to other methods

We compare the results of our method, Tracktention (right), with baseline methods (left).

Video Depth Estimation


Automatic Video Colorization




Method overview

Left: the Tracktention architecture comprises Attentional Sampling, which pools information from the images onto the tracks; the Track Transformer, which processes this information temporally; and Attentional Splatting, which moves the processed information back to the images. Right: Tracktention is easily integrated into ViTs and ConvNets to turn image networks into video networks. A sketch of this integration follows below.
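The following sketch shows the integration pattern suggested by the figure: a Tracktention layer inserted after a per-frame block of an image backbone, turning it into a video network. The VideoBlock wrapper, the reuse of the TracktentionLayer sketch above, and the assumption that point tracks are precomputed by an off-the-shelf tracker are all illustrative, not the authors' exact code.

```python
# Hedged sketch: wrapping an existing per-frame (image) block with Tracktention
# so the resulting module operates on videos. TracktentionLayer refers to the
# sketch above; image_block stands in for any pretrained ViT/ConvNet block.
import torch.nn as nn


class VideoBlock(nn.Module):
    """Wraps a pretrained image block with a Tracktention layer."""

    def __init__(self, image_block: nn.Module, dim: int = 256):
        super().__init__()
        self.image_block = image_block              # frozen or fine-tuned image block
        self.tracktention = TracktentionLayer(dim)  # temporal mixing along tracks

    def forward(self, video_tokens, tracks):
        # video_tokens: (B, T, N, C); the image block sees each frame independently.
        B, T, N, C = video_tokens.shape
        x = self.image_block(video_tokens.reshape(B * T, N, C)).reshape(B, T, N, C)
        # Tracktention then exchanges information across time along the point tracks.
        return self.tracktention(x, tracks)
```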



Video Introduction



Acknowledgements

The authors of this work were supported by ERC 101001212-UNION.