
CoTracker3: A Point Tracker Using Real Videos

What's this blog post about?

CoTracker3 is a point tracking model designed to predict the trajectories of individual points throughout a video sequence, even when those points are occluded or move out of the camera's view. It stands out for its ability to leverage real-world videos during training, resulting in state-of-the-art performance on the point tracking task. Whereas other SOTA point trackers rely heavily on synthetic datasets for training, CoTracker3 uses a semi-supervised approach that incorporates unlabeled real-world videos: multiple existing point trackers (trained on synthetic data) act as "teachers" that generate pseudo-labels for the unlabeled videos, allowing CoTracker3 to bridge the gap between synthetic and real-world data distributions. The model comes in two flavors, online and offline, with the latter offering better performance, especially for tracking occluded points, because it can interpolate trajectories through occlusions using the entire video context.

To run inference with CoTracker3 and parse the output into FiftyOne format, the post walks through these steps:

1. Install the required libraries (FiftyOne and imageio[ffmpeg]).
2. Download a dataset from Hugging Face or another source.
3. Explore the dataset in the FiftyOne App.
4. Choose between the online and offline CoTracker3 modes; the tutorial uses the offline model.
5. Download the model from Torch Hub.
6. Prepare a video for inference by converting it to a tensor.
7. Load the video as a tensor and move it to the GPU if available.
8. Set the grid_size parameter, which determines the number of points tracked within a video frame.
9. Run inference with the model to obtain two tensors: pred_tracks (point coordinates) and pred_visibility (point visibility).
10. Visualize the model output using the visualization code from the CoTracker repository.
11. Parse the model output into FiftyOne format by converting each frame's predictions into keypoint objects and storing them on the sample as an fo.Keypoints object in a "tracked_keypoints" field for each frame (see the sketch below).
12. Run inference on the whole dataset with appropriate batch size and grid_size parameters.
13. Configure the FiftyOne App to visualize the output, for example by coloring by instance and looping videos.

By following these steps, one can use CoTracker3 with FiftyOne to track points across video sequences and visualize the results in the FiftyOne App.
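A rough sketch of the core steps (loading the offline CoTracker3 model from Torch Hub, reading a video with imageio, running inference, and writing per-frame keypoints into FiftyOne) might look like the following. The torch.hub entry point and call signature follow the CoTracker repository's documented usage; the video path, the "tracked_keypoints" field name, and the pixel-to-relative coordinate normalization are illustrative assumptions rather than the post's exact code.

```python
import imageio.v3 as iio
import torch
import fiftyone as fo

device = "cuda" if torch.cuda.is_available() else "cpu"

# Offline CoTracker3 model from Torch Hub (entry point per the CoTracker repo)
cotracker = torch.hub.load("facebookresearch/co-tracker", "cotracker3_offline").to(device)

# Load a video and convert it to a (B, T, C, H, W) float tensor
frames = iio.imread("video.mp4", plugin="FFMPEG")  # (T, H, W, C) uint8
video = torch.tensor(frames).permute(0, 3, 1, 2)[None].float().to(device)

# grid_size controls how many points are tracked (a grid_size x grid_size grid of points)
grid_size = 30
pred_tracks, pred_visibility = cotracker(video, grid_size=grid_size)
# pred_tracks:     (B, T, N, 2) pixel coordinates
# pred_visibility: (B, T, N)    per-point visibility

# Parse the output into FiftyOne keypoints (assumed field name: "tracked_keypoints")
height, width = frames.shape[1:3]
tracks = pred_tracks[0].cpu().numpy()
visible = pred_visibility[0].cpu().numpy()

sample = fo.Sample(filepath="video.mp4")  # or a sample from an existing dataset
for t in range(tracks.shape[0]):
    keypoints = [
        fo.Keypoint(
            points=[(float(x) / width, float(y) / height)],  # FiftyOne expects [0, 1] coords
            index=i,
        )
        for i, (x, y) in enumerate(tracks[t])
        if visible[t, i]  # optionally skip points CoTracker marks as not visible
    ]
    # FiftyOne video frames are 1-indexed
    sample.frames[t + 1]["tracked_keypoints"] = fo.Keypoints(keypoints=keypoints)

dataset = fo.Dataset("cotracker3-demo")
dataset.add_sample(sample)
session = fo.launch_app(dataset)
```

Because the offline variant processes the entire clip in one pass, very long or high-resolution videos may need to be trimmed or downsampled to fit in GPU memory.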

Company
Voxel51

Date published
Oct. 22, 2024

Author(s)
Harpreet Sahota

Word count
2233

Hacker News points
None found.

Language
English

