Stanford Seminar - Open-world Segmentation and Tracking in 3D
29 Oct 2024
Introduction and Motivation
- The focus is on segmentation and tracking in 3D, particularly in an open-world vocabulary fashion, aiming to identify any object in a scene rather than just specific categories like cars or people. (40s)
- The work is centered around three pillars essential for enabling autonomous agents: perception to understand the environment, tracking the movement of any object around the robot, and localization, although the talk primarily focuses on perception and tracking. (1m16s)
- The overarching goal is to understand the dynamics of a scene, with a special emphasis on how objects move, which is crucial for autonomous systems. (2m16s)
From 2D to 3D Segmentation and Tracking
- The approach involves framing tasks from simple to complex, starting with semantic segmentation, which assigns a semantic class to each pixel, and can be done in a supervised manner with existing datasets. (2m53s)
- The discussion includes the progression from semantic segmentation to more complex tasks like panoptic segmentation, which requires a deeper understanding of the scene. (3m49s)
- The task involves identifying and segmenting different people in a scene, assigning semantic labels to pixels, and tracking these instances over time, which requires maintaining consistent IDs across frames. This process becomes complex in crowded scenes with occlusions and similar appearances. (4m1s)
- For a limited number of objects like pedestrians and cars, there are sufficient datasets available to train models in a supervised manner, which works well. However, handling thousands of classes presents a different challenge. (4m53s)
- The goal is to extend segmentation and tracking into the 3D domain, as most current work remains in 2D. Understanding the environment in 3D, and even 4D with the temporal aspect, is crucial for autonomous agents. (5m18s)
- Preliminary results show the method's predictions projected onto a LiDAR point cloud captured by a car, segmenting objects such as traffic lights and roads, which then need to be tracked. (5m57s)
- The ultimate aim is to achieve 4D panoptic segmentation in an open-world setting, which is not feasible with supervised methods due to the vast amount of data required. (6m20s)
Challenges of 4D Open-World Annotation
- Annotating videos for tracking involves repetitive tasks, such as annotating the same person across a thousand frames, which offers little variability in appearance. This makes video annotation significantly more demanding than image annotation. (7m7s)
- The discussion addresses the challenges of collecting extensive data for open-world 4D environments, proposing the use of existing annotated data to create pseudo labels for 3D segmentation and tracking. (8m7s)
Appearance-Based and Motion-Based Methods
- Two methods are introduced: one focuses on object appearance and the other on motion. The appearance-based method, called "Segment Anything in LiDAR," aims to localize objects in 3D and assign semantic meanings using Vision Foundation models. (8m42s)
- The motion-based method involves clustering objects based on motion, similar to classic motion clustering, but in a learned fashion. (9m17s)
- A demonstration shows the model's ability to identify objects like trash cans and fire hydrants, even with limited training data, highlighting the difficulty of training supervised methods with high accuracy due to sparse data. (9m54s)
LiDAR Foundation Approach and Zero-Shot Segmentation
- The model uses a LiDAR Foundation approach, taking LiDAR point clouds and text prompts to detect and classify objects, focusing on panoptic segmentation. (10m31s)
- The approach leverages existing 2D annotations and models to enhance 3D segmentation, using a pseudo label engine called "SAL" that works with two Foundation models: SAM for segmentation and CLIP for semantics. (11m30s)
- The discussion focuses on transferring 2D Vision Foundation model information to 3D labels using a pseudo label engine, enabling the SAL model to perform zero-shot segmentation via text prompting in the LiDAR domain without needing images at test time. (12m14s)
- The model architecture involves processing LiDAR point clouds and text prompts, using a Transformer decoder to decode object queries into instances, producing masks and objectness values. (12m55s)
- The model is enhanced by predicting a CLIP token from object queries, aligning it with embeddings obtained from text prompts, facilitating zero-shot classification by matching 3D masks with CLIP embeddings. (13m36s)
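The zero-shot classification step described above can be sketched as a nearest-prompt lookup: each predicted mask carries a CLIP-like token, and the label is the text prompt whose embedding is most similar. This is a minimal sketch, not the speaker's implementation. The function names, the toy 3-dimensional vectors, and the use of cosine similarity are assumptions for illustration; a real system would take the tokens from the model and the prompt embeddings from a CLIP text encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def zero_shot_classify(mask_tokens, text_embeddings):
    """Assign each predicted mask the prompt whose embedding is most similar.

    mask_tokens: list of vectors, one predicted CLIP-like token per 3D mask.
    text_embeddings: dict mapping a prompt string to its text embedding.
    Both inputs are hypothetical placeholders for model outputs.
    """
    labels = []
    for token in mask_tokens:
        best = max(text_embeddings, key=lambda p: cosine(token, text_embeddings[p]))
        labels.append(best)
    return labels

# Toy embeddings, made up purely for illustration.
prompts = {"car": [1.0, 0.0, 0.0], "pedestrian": [0.0, 1.0, 0.0]}
tokens = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
labels = zero_shot_classify(tokens, prompts)
```

Because the class list is just a dictionary of prompts, new categories can be queried at test time without retraining, which is the point of the text-prompted setup.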
Pseudo Label Engine and 3D Projection
- Training the model requires labels, achieved through a pseudo label engine that distills 2D information from Foundation models using images and LiDAR data, which must be calibrated for accurate 3D projection. (14m40s)
- The pseudo label engine addresses challenges like bleeding effects in naive unprojection by ensuring correspondences between 3D masks and their corresponding CLIP embeddings, enabling supervised training. (16m13s)
- Projecting 2D segmentations onto 3D point clouds can produce messy results: points that belong to the background fall inside an object's 2D mask, which makes it difficult to segment objects accurately in 3D space. (16m36s)
- To address this, DBSCAN is applied to the 3D point cloud to cluster points by their relative distances, which cleans up the projection artifacts and separates objects from the background. (17m9s)
- DBSCAN is effective at reducing noise and improving the segmentation of objects such as cars and bushes in 3D space. (17m49s)
- Many papers on 3D segmentation, especially for indoor scenes, stop after projecting 2D segmentations onto 3D point clouds, but training a model on top of these pseudo labels was found to improve results significantly. (18m20s)
- The model distills the useful signal from the noisy pseudo labels and achieves higher accuracy, with the Panoptic Quality (PQ) metric improving from 42 to 70. (19m8s)
- Although the pseudo labels are noisy, the model learns from them and improves on them; the pseudo labels alone do not reach the same level of accuracy. (19m36s)
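The lifting-and-cleanup step described above can be sketched as follows: project each LiDAR point into the image, keep the points that land inside a 2D mask, then run DBSCAN and discard points outside the dominant cluster. This is a rough sketch under stated assumptions. The pinhole camera model, the tiny pure-Python DBSCAN, the `lift_mask` helper, and the rule of keeping only the largest cluster are all illustrative choices, not details confirmed by the talk.

```python
import math

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to an integer pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        return None  # behind the camera
    return round(fx * x / z + cx), round(fy * y / z + cy)

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:
                queue.extend(nb)
    return labels

def lift_mask(points, mask, intrinsics, eps=0.5, min_pts=3):
    """Keep points projecting into `mask` (a set of pixels), then the largest cluster."""
    fx, fy, cx, cy = intrinsics
    hits = [p for p in points if project(p, fx, fy, cx, cy) in mask]
    labels = dbscan(hits, eps, min_pts)
    kept = [l for l in labels if l != -1]
    if not kept:
        return []
    largest = max(set(kept), key=kept.count)
    return [p for p, l in zip(hits, labels) if l == largest]

# Four object points at about 5 m plus one background point at 20 m that
# falls inside the same 2D mask (the "bleeding" case).
cloud = [(0.0, 0.0, 5.0), (0.1, 0.0, 5.0), (0.0, 0.1, 5.0), (0.1, 0.1, 5.1),
         (0.2, 0.2, 20.0)]
mask = {(u, v) for u in range(48, 55) for v in range(48, 55)}
cleaned = lift_mask(cloud, mask, intrinsics=(100.0, 100.0, 50.0, 50.0))
```

In this toy scene the background point shares the mask in 2D but sits far away in depth, so the distance-based clustering removes it.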
Evaluation and Semantic Accuracy
- Segmentation and semantics are also evaluated separately: using a semantic oracle to evaluate the class-agnostic segmentation gives interesting results. (20m19s)
- Using CLIP tokens also affects the evaluation results and was found to improve the model's performance. (20m43s)
- The accuracy of segmentation in 3D without labels is significantly lower compared to supervised models, especially in terms of semantics, which is a known issue similar to what has been observed in 2D models. (20m49s)
- The current approach handles the localization aspect effectively, while the semantic part may require different methods, such as retrieval methods with enhanced features, as current methods using CLIP features are insufficient. (21m19s)
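For reference, the Panoptic Quality (PQ) metric mentioned in the talk is commonly defined per class as the sum of IoUs over matched segments divided by TP + 0.5·FP + 0.5·FN. The sketch below computes that standard definition for a single class; the function name and the toy numbers are illustrative, and the figures reported in the talk are not reproduced here.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """Per-class PQ = sum(IoU over true positives) / (TP + 0.5*FP + 0.5*FN).

    matched_ious: IoU of each predicted/ground-truth segment pair counted as a
    true positive (by convention, pairs with IoU > 0.5).
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Toy example: three matched segments, one spurious and one missed segment.
pq = panoptic_quality([0.9, 0.8, 0.7], num_false_positives=1, num_false_negatives=1)
```

The overall score is usually the average of this value across classes, so it penalizes both poor mask overlap and missed or spurious segments.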
SAL Model Demonstration and Limitations
- A demonstration of the SAL (Segment Anything in LiDAR) model shows its ability to segment various objects like traffic signs, streetcars, cars, trees, and buildings in single frames, indicating an understanding of context. (21m42s)
- The SAL model operates on LiDAR point clouds, and during testing, it can segment objects like cars and pedestrians by using prompts, although some predictions may have inaccuracies. (22m53s)
- The model can also detect traffic lights and signs, but there can be confusion due to the reliance on geometric information from LiDAR, which makes poles of traffic lights and signs appear similar. (23m31s)
- The model is capable of detecting construction-related objects like barriers and traffic cones accurately, but it cannot handle conditional queries or understand relationships between objects in a scene. (24m45s)
Dynamic Scene Graphs and Prompt Engineering
- The concept of a dynamic scene graph is discussed, where each object is not just a point cloud but includes interactions that can be learned, although prompt engineering is required to ensure consistent results. (25m23s)
- There are challenges with prompt engineering, such as different results when using singular versus plural terms, but consistent results can be achieved with the right approach. (25m41s)
- Errors can occur in object detection, such as detecting a road instead of a sidewalk, which are addressed by cross-checking queries and using intersection and union methods. (26m37s)
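One way to picture the cross-checking described above is with plain set operations on the point indices returned for each prompt. The sketch below is an assumption about how such a check might look, not the procedure used in the talk: variants of one prompt (singular and plural) are merged by union, and points also claimed by a competing prompt are removed. All function names and the toy index sets are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two sets of point indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_prompt_variants(masks):
    """Union the point sets returned for variants of one prompt."""
    merged = set()
    for m in masks:
        merged |= m
    return merged

def remove_conflicts(target, competitor):
    """Drop points from `target` that a competing prompt also claims."""
    return target - competitor

# Toy point-index sets, made up for illustration.
sidewalk = merge_prompt_variants([{1, 2, 3}, {2, 3, 4}])  # "sidewalk", "sidewalks"
road = {4, 5, 6}
overlap = iou(sidewalk, road)
sidewalk_clean = remove_conflicts(sidewalk, road)
```

A high overlap between two prompts that should be disjoint, such as road and sidewalk, is a cheap signal that one of the predictions needs review.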
Future Work: Open-World Tracking and 360-Degree Coverage
- The motivation for future work includes detecting any obstacle in a scene, which currently requires specifying the object, such as a salmon on the road, to achieve open-world tracking. (27m43s)
- There is a need to segment objects not visible from cameras, which is addressed by creating a "Franken" point cloud that overlaps different point clouds to provide pseudo-labeled 360-degree coverage for model training. (28m53s)
- The discussion involves using a tool for labeling objects in scenes, particularly for applications like self-driving cars, to understand obstacles, sidewalks, and objects that might cross the road. This tool is not intended to be used directly in cars but for labeling and finding interesting data, such as trees in the middle of the road. (30m2s)
Object Relationships and 4D Understanding
- The process of identifying relationships between objects, such as trees and roads, is not automatic and requires rule-based methods. Dynamic scene graphs and graph neural networks are suggested as ways to learn these relationships. (31m26s)
- The goal is to move beyond single-frame predictions to 4D understanding, which involves obtaining temporally coherent instances and tracks of objects, such as cars, over time. This requires accumulating point clouds into a single canonical frame and involves complex processes due to the temporal domain. (32m22s)
- An approach was developed to obtain data in an easy and scalable manner, which involves training a model rather than just using predictions. This method showed significant benefits, although some instances, like debris, remain unrecognizable with specific pattern recognition. (33m33s)
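Accumulating point clouds into a single canonical frame, as mentioned above, amounts to applying each frame's sensor pose to its points. The sketch below assumes the poses are known rigid transforms (rotation about the vertical axis plus translation) and keeps the frame index so the temporal dimension is not lost. The helper names and the pose parameterization are illustrative assumptions.

```python
import math

def make_pose(yaw, tx, ty, tz):
    """Rigid transform as (R, t): rotation about z by `yaw`, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]], [tx, ty, tz]

def transform(point, pose):
    """Apply a sensor-to-world pose to one 3D point."""
    R, t = pose
    return tuple(sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3))

def accumulate(frames, poses):
    """Map every frame's points into one canonical frame.

    Returns (x, y, z, frame_index) tuples so time is preserved as a 4th value.
    """
    cloud = []
    for idx, (points, pose) in enumerate(zip(frames, poses)):
        for p in points:
            cloud.append(transform(p, pose) + (idx,))
    return cloud

# A static object 10 m ahead, seen from two sensor positions 2 m apart.
frames = [[(10.0, 0.0, 0.0)], [(8.0, 0.0, 0.0)]]
poses = [make_pose(0.0, 0.0, 0.0, 0.0), make_pose(0.0, 2.0, 0.0, 0.0)]
cloud_4d = accumulate(frames, poses)
```

After accumulation, static objects overlap across frames while moving objects leave a trail, which is what makes temporally coherent instances harder than single-frame masks.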
Detecting Dynamic Objects: "What Moves Together Belongs Together"
- The focus shifted to detecting dynamic objects, which are crucial for autonomous agents, by identifying moving objects even if their specific identity is unknown. This led to the development of a method called "What Moves Together Belongs Together." (34m15s)
- The method involves finding clusters of LiDAR point clouds that move together and using these clusters as object instances for training a model. This process is termed motion-inspired pseudo labeling, which requires some labeled LiDAR streams but primarily utilizes large amounts of unlabeled LiDAR data. (34m31s)
- The goal is to train an object detector, such as a car detector, using observed moving objects. The pseudo label generation process involves pre-processing to compute scene flow on the LiDAR point cloud, creating short trajectories for LiDAR points, and clustering these trajectories. (35m15s)
- Clustering is done in a learned fashion rather than heuristically, which is complex due to varying object sizes like cars and people. Once trajectories are clustered, objects can be identified, and bounding boxes can be placed on them for detection. (36m10s)
- The object detector training uses the pseudo-labeled data to train an off-the-shelf detector, applying the method to unseen data and using it as pseudo labels. (37m0s)
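Two pieces of the pipeline above lend themselves to a small sketch: chaining per-step scene-flow vectors into a short point trajectory, and turning clustered points into boxes that can serve as pseudo labels. The code assumes flow vectors and cluster ids are already available, and it uses axis-aligned boxes as a simplification; detectors typically use oriented boxes. The function names are hypothetical.

```python
def build_trajectory(start, flows):
    """Chain per-step scene-flow vectors into a short trajectory of positions."""
    traj = [tuple(start)]
    for f in flows:
        last = traj[-1]
        traj.append(tuple(a + b for a, b in zip(last, f)))
    return traj

def bounding_boxes(points, cluster_ids):
    """Axis-aligned 3D box (min corner, max corner) per cluster; -1 is noise."""
    boxes = {}
    for p, cid in zip(points, cluster_ids):
        if cid == -1:
            continue
        if cid not in boxes:
            boxes[cid] = (list(p), list(p))
        else:
            lo, hi = boxes[cid]
            for k in range(3):
                lo[k] = min(lo[k], p[k])
                hi[k] = max(hi[k], p[k])
    return {cid: (tuple(lo), tuple(hi)) for cid, (lo, hi) in boxes.items()}

# Toy inputs, made up for illustration.
traj = build_trajectory((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (1.0, 0.5, 0.0)])
boxes = bounding_boxes([(0.0, 0.0, 0.0), (1.0, 2.0, 1.0), (5.0, 5.0, 0.0)],
                       [0, 0, -1])
```

The resulting boxes play the role of ground truth when training an off-the-shelf detector on the pseudo-labeled data.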
Learned Clustering with Graph Neural Networks
- The goal is to learn to cluster point trajectories using graph neural networks, rather than clustering manually or with baselines like DBSCAN. (37m31s)
- A graph is created where the nodes are point trajectories and the edges connect trajectories that may or may not belong together, based on pre-processed LiDAR sequences and scene flow. (37m48s)
- Message passing is used to learn which connections are true and which are not, by sharing features between nodes and edges through a learnable step with an MLP. (38m23s)
- The graph neural network is trained for edge classification, where an active edge means two point trajectories belong to the same object and an inactive edge means they do not. (39m23s)
- Correlation clustering is used to clean up the result, after which bounding boxes can be extracted. (39m38s)
- The method shows a significant improvement in precision compared to using DBSCAN, with a smaller gap between ground truth and the trained model. (40m14s)
- The approach is more general and can be transferred between datasets, such as from Waymo to Argoverse, and can even detect new classes. (41m7s)
- Results on the Waymo dataset show the detected cars as red bounding boxes, demonstrating the effectiveness of the method. (41m35s)
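The final grouping step above can be illustrated with a deliberately simplified stand-in. Given per-edge scores from an edge classifier, the sketch thresholds them and takes connected components with union-find. This is not correlation clustering, which also weighs conflicting edges; it only shows how active edges turn into object instances. The function name, the scores, and the threshold are made-up values.

```python
def cluster_from_edges(num_nodes, edge_scores, threshold=0.5):
    """Group nodes connected by edges whose score exceeds `threshold`.

    edge_scores: dict mapping (node_a, node_b) to a predicted probability
    that both point trajectories belong to the same object.
    Returns one cluster id per node, numbered in order of first appearance.
    """
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for (a, b), score in edge_scores.items():
        if score > threshold:  # edge is considered active
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra

    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(num_nodes)]

# Five trajectories; the weak edge (2, 3) separates two objects.
scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1, (3, 4): 0.7}
clusters = cluster_from_edges(5, scores)
```

A single wrongly active edge would merge two objects in this simplified version, which is one reason a cleanup step such as correlation clustering is useful in practice.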
Pseudo-Labeling and Generalization
- The approach discussed involves using pseudo-labeling on large datasets to train models, which can yield decent results or serve as a good starting point, even with noisy data. (41m42s)
- The method demonstrates good generalization to unseen classes and open-world vocabulary, despite challenges in clustering certain objects like large trucks or buses. (42m13s)
Key Takeaways and Future Directions
- Key takeaways include the power of pseudo-labeling, leveraging 2D foundation models without retraining for 3D tasks, and the potential of geometric and 3D motion cues for further exploration. (42m40s)
- The aim is to enable 3D segmentation and tracking without requiring labeled 3D data, as labeling in 3D is particularly challenging and should be avoided. (43m31s)
SAM 2 and Limitations
- The inter-frame mask tracking feature of SAM 2 is noted for its effectiveness, but it struggles with long-term tracking, occlusions, and distinguishing between similar objects. (44m11s)
- SAM 2 is used as a pseudo-labeling method in the presented results, although it has slightly lower semantic accuracy compared to previous versions. (45m5s)
- Sensors are reportedly poor at distinguishing between objects like roadkill and live animals, which may not be a significant concern for vehicle applications. (45m35s)
- The discussion addresses the challenge of distinguishing between a stationary and a moving object, such as an animal, on the road in the context of autonomous driving. It is suggested that the primary focus should be on the localization of obstacles rather than determining if they are alive or dead. (46m0s)
- The integration of motion features with appearance features is considered for detecting moving objects, such as a cat, to ensure that autonomous vehicles avoid them regardless of their state. (47m11s)
- The impact of different types of LiDAR technology on model training is discussed, highlighting that new LiDAR types, like frequency-modulated continuous-wave (FMCW) LiDAR, can measure both velocity and position, which could enhance model performance. However, the method may not transfer well across different LiDAR datasets due to variations in point distribution. (47m40s)
Pseudo Labeling and Real-Time Applications
- The concept of pseudo labeling is introduced, which involves using automatic model outputs to generate labels without manual intervention, effectively obtaining labels for free. This method is not designed for real-time applications but aims to improve data understanding and model training. (48m42s)
- The goal is to achieve a better understanding of the data to train faster models, with the potential for optimization by companies like Nvidia, although real-time application is not the primary objective. (49m10s)
3D Model Creation and Simulation
- There is interest in creating 3D models from robot views, which can be used in simulations like Isaac Sim, although this process is not yet fully developed. (49m51s)
- Current work involves reconstructing scenes and adding semantic information, but these are not yet full 3D models. This information can be converted into simulations, and creators using OpenUSD could benefit from this process. (50m21s)
4D Panoptic Segmentation and Future Predictions
- The concept of 4D panoptic segmentation involves two prediction avenues: completing incomplete objects in 3D and predicting future states over time. This includes methods for forecasting and improving predictions by adding semantic information. (51m22s)
- There is a challenge in accurately modeling objects that may be damaged or altered, such as those on the road. A suggested approach is to use retrieval in image space with appearance and localization techniques, which could improve the accuracy of 3D localization. (52m32s)
Appreciation and Conclusion
- The session closes with thanks to the speaker, Laura, followed by applause. (53m53s)