DELTA: Dense Efficient Long-range
3D Tracking for Any video

Tuan Duc Ngo1,2 Peiye Zhuang1 Chuang Gan2 Evangelos Kalogerakis2,3
Sergey Tulyakov1 Hsin-Ying Lee1 Chaoyang Wang1
1 Snap Inc. 2 UMass Amherst 3 TU Crete
ICLR 2025






DELTA captures dense, long-range 3D trajectories from casual videos in a feed-forward manner.

[Paper]      [arXiv]      [Code]     [BibTeX]

Abstract

Tracking dense 3D motion from monocular videos remains challenging, particularly when aiming for pixel-level precision over long sequences. We introduce DELTA, a novel method that efficiently tracks every pixel in 3D space, enabling accurate motion estimation across entire videos. Our approach leverages a joint global-local attention mechanism for reduced-resolution tracking, followed by a transformer-based upsampler to achieve high-resolution predictions. Unlike existing methods, which are limited by computational inefficiency or sparse tracking, DELTA delivers dense 3D tracking at scale, running over 8x faster than previous methods while achieving state-of-the-art accuracy. Furthermore, we explore the impact of depth representation on tracking performance and identify log-depth as the optimal choice. Extensive experiments demonstrate the superiority of DELTA on multiple benchmarks, achieving new state-of-the-art results in both 2D and 3D dense tracking tasks. Our method provides a robust solution for applications requiring fine-grained, long-term motion tracking in 3D space.
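
The abstract highlights log-depth as the best depth parameterization for tracking. As a rough illustration (a minimal sketch of our own, not the released code), the conversion below shows why: an additive update in log-space corresponds to a relative, scale-invariant change in depth, which balances near and far points.

import torch

def depth_to_log(depth, eps=1e-6):
    # Map metric depth to log-depth; a network could regress increments in this
    # space (illustrative assumption).
    return torch.log(depth.clamp(min=eps))

def log_to_depth(log_depth):
    # Invert the parameterization to recover metric depth.
    return torch.exp(log_depth)

# An update of +0.1 in log-space is roughly a +10% relative change at any depth.
d = torch.tensor([0.5, 2.0, 10.0])
d_updated = log_to_depth(depth_to_log(d) + 0.1)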

Motivation


Existing motion prediction methods are limited to short-term, sparse predictions and often fail to deliver accurate 3D motion estimates, while optimization-based approaches require substantial time to process a single video. DELTA is the first method capable of efficiently tracking every pixel in 3D space over hundreds of frames of a monocular video, and it achieves state-of-the-art accuracy on 3D tracking benchmarks.

Method




DELTA takes RGB-D videos as input and achieves efficient dense 3D tracking using a coarse-to-fine strategy, beginning with coarse tracking through a spatio-temporal attention mechanism at reduced resolution, followed by an attention-based upsampler for high-resolution predictions.
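
A schematic view of this coarse-to-fine flow is sketched below (hypothetical helper modules coarse_tracker and upsampler; this illustrates the described pipeline, not the released implementation):

import torch
import torch.nn.functional as F

def track_dense_3d(rgbd_video, coarse_tracker, upsampler, down=8):
    # rgbd_video: (T, 4, H, W) tensor of RGB-D frames.
    # coarse_tracker: module applying joint spatio-temporal attention at reduced resolution.
    # upsampler: attention-based module lifting coarse tracks to full resolution.
    T, _, H, W = rgbd_video.shape
    # Stage 1: dense tracking for an (H/down) x (W/down) grid of query points.
    coarse_input = F.interpolate(rgbd_video, scale_factor=1.0 / down)
    coarse_tracks = coarse_tracker(coarse_input)        # (T, H*W / down**2, 3): (u, v, log-depth)
    # Stage 2: attention-based upsampling to one 3D trajectory per pixel.
    fine_tracks = upsampler(coarse_tracks, rgbd_video)  # (T, H*W, 3)
    return fine_tracks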

More qualitative results



Comparison with 3D point trackers




More results can be found here.

Comparison with 2D point trackers (lifted to 3D with depth)




More results can be found here.

Occlusion problem when using 2D point trackers + depth


Frontal view


Side view

(Note that all methods use the same video depth input.)

The baseline (2D tracker + depth) struggles with inconsistent video depth estimates, which results in noticeable jitter. It also fails to track objects through occlusion, as seen from the side view when the ball rolls behind the tree.
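
For reference, the lifted baseline simply unprojects each 2D track with the per-frame depth map and camera intrinsics, roughly as in the sketch below (standard pinhole unprojection; variable names are ours). Because depth is sampled independently per frame, any inconsistency in the estimated depth appears directly as jitter in 3D, and an occluded point inherits the depth of whatever occludes it.

import numpy as np

def lift_track_to_3d(uv_track, depth_maps, K):
    # uv_track:   (T, 2) pixel coordinates of one tracked point over T frames.
    # depth_maps: (T, H, W) per-frame depth estimates.
    # K:          (3, 3) camera intrinsics.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    points = []
    for t, (u, v) in enumerate(uv_track):
        z = depth_maps[t, int(round(v)), int(round(u))]  # per-frame depth lookup
        points.append(((u - cx) / fx * z, (v - cy) / fy * z, z))
    return np.asarray(points)  # (T, 3); noisy depth yields a jittery 3D trajectory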


More results can be found here.

Application: Non-rigid structure from motion



We first densely track pixels across multiple keyframes in the video to obtain pairwise correspondences. Using these correspondences, we jointly estimate per-keyframe depth maps and camera poses via the global alignment procedure of DUSt3R and MonST3R.
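
One standard building block of such an alignment (our illustrative choice, not necessarily the exact solver used by DUSt3R/MonST3R) is a least-squares rigid fit between corresponding 3D points from two keyframes:

import numpy as np

def rigid_align(P, Q):
    # Least-squares rigid transform (R, t) mapping points P onto Q (Kabsch algorithm).
    # P, Q: (N, 3) corresponding 3D points lifted from two keyframes.
    mu_p, mu_q = P.mean(0), Q.mean(0)
    H = (P - mu_p).T @ (Q - mu_q)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ S @ U.T
    t = mu_q - R @ mu_p
    return R, t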

More results of video pose estimation can be found here.

Application: Dynamic Reconstruction with Deformable Gaussian Splatting




After densely tracking pixels across multiple keyframes and estimating camera poses, we initialize a static, pixel-aligned 3D Gaussian splatting representation. The Gaussians are then deformed along our tracked trajectories to a chosen bullet time and rendered from novel viewpoints.
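
A minimal sketch of the deformation step (assuming pixel-aligned Gaussians whose centers simply follow the tracked 3D trajectories; the interpolation scheme and names are illustrative):

import math
import torch

def deform_gaussians(means_init, trajectories, t_query):
    # means_init:   (N, 3) Gaussian centers at the reference keyframe.
    # trajectories: (T, N, 3) tracked 3D positions of the same N pixels over T frames.
    # t_query:      continuous "bullet time" in [0, T-1].
    t0 = int(math.floor(t_query))
    t1 = min(t0 + 1, trajectories.shape[0] - 1)
    w = t_query - t0
    # Linearly interpolate each trajectory between the two nearest tracked frames.
    offset = (1 - w) * trajectories[t0] + w * trajectories[t1] - trajectories[0]
    return means_init + offset  # deformed centers, ready to be rendered from a novel view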

Application: Consistent video editing in 3D space




Click here for comparison with SceneTracker and SpatialTracker on 3D dense tracking.

Click here for comparison with SpatialTracker on the TAP-Vid3D benchmark.

Detailed quantitative results can be found in our paper.