Stanford Seminar - Living Scenes: Creating and updating 3D representations of evolving indoor scenes
16 Nov 2024
Introduction and Motivation
The speaker is from the Civil and Environmental Engineering department, but their team consists mainly of roboticists and computer scientists working to solve problems in the building industry using computer vision (22s).
The team's work focuses on understanding what exists in a space, how it was constructed, and how it changes over time, with the goal of creating sustainable, inclusive, and adaptive built environments that prioritize human needs (46s).
The team is also exploring the intersection of physical and digital spaces, including immersive technologies like VR and AR, and designing physical spaces that allow interaction between the physical and virtual worlds, which they call "gradient realities" (1m12s).
The team's lab is called "Gradient Spaces" and is working on creating and updating digital replicas of evolving indoor scenes, which they call "living scenes" (1m31s).
Creating and Updating 3D Representations of Evolving Indoor Scenes
The concept of living scenes is based on the idea that buildings are like living organisms that evolve over time, and the team is working on developing methods to realistically make, maintain, and update their representations throughout their lifespan (1m53s).
Agents that exist in the environment can navigate, act, and interact with their surroundings, but need to be able to map and understand the environment to do so (2m4s).
The team is working on developing methods to align and merge spatial and temporal data collected by agents to create evolving representations of indoor environments (2m53s).
By acquiring data from agents performing repetitive tasks within a single scene, the team can create a cumulative scene understanding and representation that improves in geometric completeness and accuracy over time (3m20s).
This can enhance interaction with objects within the scene by having a more accurate geometry, and is particularly useful for scenes or parts of scenes that have not been seen before (3m51s).
Creating and updating 3D representations of evolving indoor scenes involves understanding how objects move within the scene and having a foundational understanding of the scene's geometry and semantics (4m7s).
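As an illustration of the accumulation idea, the sketch below merges point clouds captured on repeated traversals of the same scene into one cumulative map. This is an assumed workflow using Open3D, not the speaker's exact pipeline; the file names and voxel size are illustrative, and the scans are assumed to already be registered in a common world frame.

```python
import open3d as o3d

def accumulate_scans(scan_paths, voxel_size=0.02):
    """Fuse per-traversal point clouds into one cumulative scene representation."""
    cumulative = o3d.geometry.PointCloud()
    for path in scan_paths:
        scan = o3d.io.read_point_cloud(path)
        # Assumes each scan is already expressed in a common world frame,
        # i.e., registration has been solved upstream.
        cumulative += scan
        # Downsample so the map stays compact as observations accumulate.
        cumulative = cumulative.voxel_down_sample(voxel_size)
    return cumulative

fused = accumulate_scans(["traversal_1.ply", "traversal_2.ply", "traversal_3.ply"])
```

Each repeated traversal adds observations of previously unseen surfaces, which is what drives the improvement in geometric completeness described above.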
Methods for Acquiring 3D Representations
Two methods for acquiring 3D representations at a single temporal point are LoopSplat and Adaptive Real-Time 3D Semantic Understanding (4m41s).
LoopSplat uses the 3D Gaussian Splatting representation to reconstruct the scene with high accuracy and can perform loop detection and loop closure by registering 3D Gaussian splats together (4m47s).
LoopSplat's key contribution is its ability to minimize drift from the ground-truth trajectory; in the visualization, the trajectory changes color when a loop closure is detected (5m2s).
Adaptive Real-Time 3D Semantic Understanding creates a single map with adaptive quality, meaning certain areas can be in high fidelity while others are in low resolution, depending on user-defined semantics or geometric complexity (5m31s).
This adaptive approach to 3D semantic understanding prioritizes sustainability by collecting only necessary data and reducing resolution in areas of less importance (5m38s).
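A minimal sketch of the adaptive-quality idea, assuming per-point semantic labels are available: user-prioritized classes are kept at fine resolution while everything else is downsampled aggressively. The class handling and voxel sizes are hypothetical, not the talk's implementation.

```python
import numpy as np
import open3d as o3d

def adaptive_downsample(points, labels, fine_classes, fine=0.01, coarse=0.10):
    """points: (N, 3) array; labels: (N,) semantic class ids."""
    keep_fine = np.isin(labels, list(fine_classes))

    def to_cloud(mask, voxel):
        pcd = o3d.geometry.PointCloud()
        pcd.points = o3d.utility.Vector3dVector(points[mask])
        return pcd.voxel_down_sample(voxel)

    # High fidelity where the user cares, low resolution elsewhere,
    # which keeps storage and data collection to what is actually needed.
    result = to_cloud(keep_fine, fine)
    result += to_cloud(~keep_fine, coarse)
    return result
```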
Relocalizing and Reconstructing Objects in Evolving Environments
For evolving environments, the goal is to relocalize objects within the scene and reconstruct them on an individual object level given sparse 3D observations (6m48s).
A method for achieving this involves instance matching, registering different point clouds, and relocalizing objects within the scene (7m16s).
This method assumes instance segmentation and is robust to noise, as it has been tested with both ground truth and predicted instances (7m29s).
The method takes as input the scene at two temporal points where changes have occurred and aims to reconstruct the scene and relocalize objects (7m20s).
A 3D representation of evolving indoor scenes is created by accumulating and reconstructing point clouds of objects over time, yielding more complete and accurate geometry for each object instance (7m52s).
The representation is able to solve three tasks - matching, relocalization, and reconstruction - by utilizing different embedding spaces, specifically equivariant and invariant feature spaces (8m6s).
The model is trained only on synthetic data from the ShapeNet database and evaluated zero-shot on real-world, noisy data, demonstrating its ability to reconstruct unseen parts of objects by leveraging learned geometric priors (8m19s).
The model uses a vector neuron encoder to provide equivariant and invariant feature spaces, and a DeepSDF-style decoder for shape completion, making it category-agnostic despite being trained on only seven categories from ShapeNet (8m42s).
The equivariant embedding space provides information about the pose of objects in the scene, while the invariant embedding space provides information about the actual geometry and shape of the objects (9m8s).
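A hedged sketch of how the two embedding spaces could be used downstream (interfaces assumed for illustration, not the released Living Scenes code): invariant features drive instance matching across temporal points, while the geometry of the matched instances yields a relative pose via a closed-form Kabsch/SVD fit, standing in for how the equivariant space carries pose information.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(invariant_t0, invariant_t1):
    """Match object instances across time via cosine similarity of invariant embeddings."""
    a = invariant_t0 / np.linalg.norm(invariant_t0, axis=1, keepdims=True)
    b = invariant_t1 / np.linalg.norm(invariant_t1, axis=1, keepdims=True)
    cost = -a @ b.T                      # higher similarity = lower assignment cost
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

def kabsch(src, dst):
    """Rigid transform (R, t) aligning corresponding 3D points src -> dst."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, c_dst - R @ c_src
```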
Qualitative results on synthetic data sets show the model's ability to track objects and reconstruct their geometry and shape over time (9m28s).
The model is also evaluated on a real-world 3RScan data set, which includes multiple scans of the same scene over time, demonstrating its ability to handle temporal changes and reconstruct the geometry and completion of objects (10m1s).
Experiments show that accumulating point clouds from different viewpoints and temporal points improves the model's ability to reconstruct the geometry and completeness of objects and reduces registration error (10m51s).
The model's performance is compared to a baseline, demonstrating its ability to outperform it in terms of geometry and completion accuracy (10m27s).
Scene Graph-Based Representations and Alignment
A second approach to evolving scene representation uses scene graphs, in work led by a student named Sayan that appeared at ICCV 2023 (11m27s).
A map can be created in a low-level manner using representations such as occupancy maps, voxel grids, octomaps, hash grids, and point clouds, but these methods have limitations, including that decision-making takes place in metric space, which can limit higher-level understanding and generalization (11m46s).
Building a map at a higher level can be achieved with 3D scene graphs, which carry both high-level and low-level information, enable decision-making in a more abstract space, and are lightweight and privacy-preserving (12m28s).
3D scene graphs are used by agents to build maps on the fly, perform robotic navigation, or complete tasks, and are a representation that many robotic agents already benefit from and use (12m51s).
The goal is to leverage the information that agents are already building in the background to create 3D maps of environments, which can be static or changed scenes with overlaps between zero to partial to full (13m22s).
SGAligner is a method that takes scene graphs as input and performs node matching to identify how two graphs align, providing a strong initialization for tasks such as point cloud registration or point cloud mosaicking (13m45s).
Existing methods for point cloud registration have limitations, including focusing on local feature descriptors, which can lead to issues with changes in the scene, low overlap, point cloud density, and large scenes (14m29s).
SGAligner formulates the alignment of scene graphs as the alignment of multimodal knowledge graphs, which contain three types of information (15m7s).
Scene graphs represent semantic entities in a scene, including object instances with attributes such as category, size, and material, as well as relationships between entities like relative position or attribute similarity (15m19s).
Entity alignment methods from the multimodal knowledge graph domain can be redesigned to align spatial maps together, but existing works assume overlapping and accurate information, which is not the case with 3D scene graphs built by agents (15m51s).
SGAligner takes 3D scene graphs as input and uses unimodal encoders, including point cloud, structure, and meta (attribute) encoders, to encode each modality separately before they interact in a joint embedding space (16m39s).
The goal of SGAligner is to embed the same object instances closer together and different object instances further apart, enabling tasks such as point cloud registration (17m12s).
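The sketch below illustrates the node-alignment step under simplifying assumptions (this is not the SGAligner release): per-modality node embeddings are fused into a joint vector, and nodes of the two graphs are matched by mutual nearest neighbour in that joint space.

```python
import numpy as np

def joint_embedding(point_feat, struct_feat, meta_feat, weights=(1.0, 1.0, 1.0)):
    """Fuse unimodal node embeddings into one joint vector (simple weighted concatenation)."""
    parts = [w * f / np.linalg.norm(f)
             for w, f in zip(weights, (point_feat, struct_feat, meta_feat))]
    return np.concatenate(parts)

def mutual_nn_matches(emb_a, emb_b):
    """emb_a: (Na, D), emb_b: (Nb, D) joint node embeddings of two scene graphs."""
    sim = emb_a @ emb_b.T
    best_b = sim.argmax(axis=1)          # best match in graph B for each node of A
    best_a = sim.argmax(axis=0)          # best match in graph A for each node of B
    return [(i, j) for i, j in enumerate(best_b) if best_a[j] == i]
```

In practice such a joint embedding would be trained contrastively so that the same instance in both graphs lands close together, which is the objective described above.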
The method was evaluated on the 3RScan dataset and its extension, 3DSSG, and achieved robust node-matching performance, even with predicted scene graphs and low-overlap conditions (17m42s).
SGAligner also performed well in aligning 3D scene graphs with temporal changes and varying overlap conditions, outperforming standard 3D point cloud registration methods (18m11s).
The method can match nodes together and figure out whether they align together, even with a large number of nodes, and can handle scenarios with 50% or all nodes matched (18m47s).
The performance of the system is improved when at least two nodes are matched for the particular graphs, and this information is used to perform 3D point cloud registration (18m56s).
In contrast to previous approaches, the system registers object instances instead of entire scenes by calculating the registration between object instances that are matched with each other (19m34s).
This approach allows for a more robust and faster alignment of the entire scene, resulting in a 49% improvement in chamfer distance and a 40% improvement in relative translation error (20m4s).
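A sketch of this instance-level registration idea, assuming Open3D and pre-extracted per-instance point clouds (the voxel size and distance threshold are illustrative): correspondences are only sought inside matched instance pairs, which keeps the estimate robust when the rest of the scene has changed.

```python
import open3d as o3d

def register_via_instances(instances_a, instances_b, matches, voxel=0.05):
    """instances_*: dict node_id -> o3d.geometry.PointCloud; matches: [(id_a, id_b), ...]."""
    src, dst = o3d.geometry.PointCloud(), o3d.geometry.PointCloud()
    for ia, ib in matches:
        src += instances_a[ia]           # only matched instances contribute
        dst += instances_b[ib]
    src, dst = src.voxel_down_sample(voxel), dst.voxel_down_sample(voxel)
    result = o3d.pipelines.registration.registration_icp(
        src, dst, max_correspondence_distance=2 * voxel,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return result.transformation         # 4x4 rigid transform aligning scene A to scene B
```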
The system also performs well with noisy point cloud predictions and can handle cases with 10 to 30% low overlap (20m17s).
The system can identify overlapping pairs of point clouds more correctly and three times faster than prior art, which is useful for robotics platforms (21m21s).
The system can also handle scenes with zero overlap, where previous methods are not able to perform robustly (21m2s).
The system's geometry-based and semantic-based alignments are designed for scenes that evolve with minimum changes in geometry, such as furniture being relocated or added/removed (21m42s).
Spatiotemporal 3D Point Cloud Registration for Large-Scale Changes
However, the system can also handle more drastic changes in the scene, such as those that occur over time, through spatiotemporal 3D point cloud registration (22m9s).
This approach involves finding pairwise correspondences from the static parts of the scene and excluding temporal changes (22m13s).
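One simple way to realize this, sketched below under assumptions (not necessarily the exact method from the talk): after a coarse alignment, points whose nearest neighbour in the other epoch is far away are treated as changed and excluded before fine registration.

```python
import numpy as np
import open3d as o3d

def static_mask(pcd_a, pcd_b, change_threshold=0.10):
    """Boolean mask over pcd_a marking points assumed unchanged between the two epochs."""
    # Distance from every point in pcd_a to its nearest neighbour in pcd_b.
    dists = np.asarray(pcd_a.compute_point_cloud_distance(pcd_b))
    return dists < change_threshold      # small distance -> likely static geometry
```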
The system is tested on datasets with small-scale scenes, such as rooms, and captures standard daily human interaction activities (22m28s).
A student, Tao, has worked on spatiotemporal 3D point cloud registration, which extends the original system (22m3s).
Existing methods for 3D Point Cloud temporal registration can only handle small changes in the geometry of a scene, such as those found in self-driving car scenarios, but struggle with large changes, like those found in construction sites (22m52s).
Construction sites are a particular scenario where large, drastic changes in the world occur in a small amount of time, making them a challenging environment for existing methods (23m46s).
The "Nothing Stand Still" benchmark was created to evaluate the performance of existing special temporal Point Cloud registration methods on scenes with large changes, such as construction sites (24m7s).
The benchmark dataset was collected from different construction sites over time, using a tripod-based device to capture the scene at multiple temporal points (24m26s).
The dataset includes interior layout construction scenes, with a focus on the slabs, ceilings, walls, and empty spaces, but excludes exterior elements like foundations and excavators (25m36s).
The dataset is challenging due to the large changes in the scene, inconsistent capture over time, and inaccessible areas, making it difficult to align the Point Clouds temporally (25m5s).
The benchmark evaluates both pairwise and multi-way registration of the scene, allowing for the assessment of existing methods' performance on very large scenes (25m21s).
The dataset includes snapshots of the interior layout construction scenes, showcasing the drastic changes that occur over time, from empty spaces to the addition of walls, insulation, pipes, air ducts, and materials (25m54s).
Indoor scenes have repetitive elements, such as studs in walls, which can make registration algorithms struggle to match corresponding elements due to their similar appearance (26m19s).
The environment is also very uniform, with most elements being gray or brown, making it difficult to perform tasks in a robust manner (26m46s).
The data set shows the interior layout being constructed, with changes over time, including the addition of static furniture (27m2s).
Explorations of the meshes in virtual reality demonstrate how spaces change over time and the complexity of the scenes (27m20s).
Pairwise registration methods are typically used, taking two small point clouds as input, performing correspondence estimation, and then using RANSAC and ICP to determine the final transformation (27m36s).
Multi-way registration is also used, connecting point clouds over space and time with edges and minimizing the weighted RMSE of the poses in the pose graph (28m1s).
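A hedged sketch of the standard pairwise pipeline described here, using Open3D (the voxel size, thresholds, and iteration counts are illustrative): FPFH feature correspondences, RANSAC for a coarse global alignment, and ICP for refinement. The multi-way stage would then feed such pairwise estimates into a pose graph and optimize it.

```python
import open3d as o3d

def pairwise_register(src, dst, voxel=0.05):
    """Feature-based coarse alignment (RANSAC) followed by ICP refinement."""
    def preprocess(pcd):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=2 * voxel, max_nn=30))
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=5 * voxel, max_nn=100))
        return down, fpfh

    src_d, src_f = preprocess(src)
    dst_d, dst_f = preprocess(dst)
    # Coarse global alignment from feature correspondences + RANSAC.
    coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        src_d, dst_d, src_f, dst_f, mutual_filter=True,
        max_correspondence_distance=3 * voxel,
        estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(False),
        ransac_n=3, checkers=[],
        criteria=o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))
    # Local refinement with ICP starting from the coarse estimate.
    fine = o3d.pipelines.registration.registration_icp(
        src_d, dst_d, max_correspondence_distance=voxel,
        init=coarse.transformation)
    return fine.transformation
```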
Most existing methods struggle to handle multi-way registration, but a recently developed method has shown better performance (28m23s).
Before and after multi-way registration, the best-performing algorithm at the time shows improvement, but still with some failures (28m36s).
The colors in the visualization represent different temporal points, with each color indicating a different point in time (28m52s).
Better algorithms are needed to perform tasks involving large, drastic changes in the environment, and solving this problem could lead to solutions for other related tasks (29m1s).
The assumption is that if this hard problem can be solved, other related problems can also be solved, and more people should work on this using the provided data set (29m10s).
Applications in the Building Industry and Circular Economy
The goal is to create and update representations of evolving indoor scenes, and this technology has potential applications in various fields, including civil and environmental engineering (29m19s).
The building industry is a significant area of focus, with renovation being a major scope in architecture, engineering, and construction (29m46s).
The construction industry aims to increase sustainability and create a circular built environment by reusing materials from demolished buildings in new designs, extending the life of existing buildings, and utilizing existing resources without depleting new ones (29m51s).
Most buildings lack digital information, as computer-aided design software became widespread only after the 1980s, leaving billions of buildings on Earth without digital representations (30m21s).
The construction industry is extremely expensive, with rework driving construction costs up by as much as 50%, and costs are poorly estimated due to a lack of knowledge about the building process (30m55s).
Construction workers accounted for 20% of all occupational fatalities in the US in 2020, with around a thousand people killed on construction sites due to errors (31m28s).
A large share of non-hazardous construction and demolition material is either reusable or recyclable, but it often ends up in landfills (31m44s).
Understanding spatial and temporal information can have a significant impact on human life and planet sustainability, motivating researchers to work on these problems (32m1s).
The model can be applied to improve the circular economy and reduce construction costs by capturing information about existing materials in buildings, allowing for better planning and harvesting of materials during demolition or new building design (32m50s).
Potential application scenarios include taking down old buildings, capturing information about existing materials, and planning ahead to harvest materials from demolished buildings for use in new designs (32m39s).
The goal of circularity is to plan ahead and know what materials will be available when creating a new building, allowing for the harvesting of materials from demolished buildings and designing with those materials in mind (33m7s).
To understand evolving indoor scenes, it's essential to consider not only new construction but also the deterioration of materials and their current condition, especially in areas with no new construction, and to build and update a map of the space (33m56s).
For these operations, both geometric and semantic information are necessary, which involves understanding where things are and what they are, enabling better planning for sustainable building construction (34m27s).
Knowledge Graphs and Scene Comparison
Knowledge graphs are used to represent relationships between entities, with nodes representing objects categorized using object categories, and relationships including semantic and geometric connections (35m5s).
These relationships can be relative, such as "in front of" or "to the left of," and are typically taken from a particular viewpoint, including object instances, attributes like size and material, and relative relationships within the space (35m39s).
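A minimal sketch of the kind of structure being described, with hypothetical field names (not the 3DSSG schema): nodes are object instances with attributes, edges are relationships, and the whole graph serializes to JSON, which fits the lightweight, privacy-preserving character mentioned earlier.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class SceneObject:
    instance_id: int
    category: str                                     # e.g. "chair"
    attributes: dict = field(default_factory=dict)    # e.g. size, material

@dataclass
class Relationship:
    subject_id: int
    object_id: int
    predicate: str                                    # e.g. "in front of", "same material as"

@dataclass
class SceneGraph:
    objects: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

graph = SceneGraph(
    objects=[SceneObject(1, "table", {"material": "wood"}),
             SceneObject(2, "chair", {"material": "wood"})],
    relationships=[Relationship(2, 1, "in front of")])
print(graph.to_json())
```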
Comparing point clouds or 3D scene graphs to existing CAD or BIM models is challenging due to issues with completeness, scale, and level of detail, making geometric alignment difficult and often inaccurate (36m32s).
While 3D scene graphs might be more robust for alignment, current methods are insufficient, and researchers are working on addressing these challenges (36m14s).
The method can be used to compare a CAD model to a point cloud, allowing for the comparison of a built environment to its designed state, but this is more challenging in construction settings where detailed building information models or CAD models are not always available (37m15s).
The method does not infer any physics from the scene, such as whether an object is brittle or will break if it falls, and assumes that any changes to the scene have already occurred and will not evolve further (37m54s).
The scene graphs created by the method are in JSON format and could potentially be loaded into Unity or other simulation software, but this has not been attempted (38m40s).
The method has not been used for simulations or virtual reality applications, but it could potentially be used for these purposes (39m14s).
Data Sets and Future Work
The data sets used to test the method are existing data sets that were not produced by the researchers, and they are available on GitHub (39m36s).
The researchers are working on tools to automate the annotation of 3D geometry in construction sites, but this work is still in its early stages (40m6s).
The method could potentially be integrated with Internet of Things (IoT) devices, such as sensors that store information about the materials and condition of objects in the environment, but this is not currently being explored (40m25s).
The deconstruction process can be enhanced by using living scenes to understand the materials present in a building and what can be harvested, such as the size of panels, windows, and ducts, providing an initial estimate and hypothesis for potential reuse (41m11s).
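As a rough illustration of how such first-pass reuse estimates could be pulled from a living-scene representation (an assumed workflow, not a tool shown in the talk): the oriented bounding box of each reconstructed instance gives an approximate size for a panel, window, or duct.

```python
import numpy as np
import open3d as o3d

def element_dimensions(instance_pcd):
    """Approximate sorted dimensions (in metres) of a reconstructed building element."""
    obb = instance_pcd.get_oriented_bounding_box()
    return np.sort(obb.extent)           # e.g. (thickness, width, length)
```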
Connections between materials, such as glues and nails, play a significant role in deconstruction, but this information is often not captured in the data, making it harder to have an accurate estimate of what can be reused (41m47s).
The deterioration of materials behind what is visible is also a challenge, as it is unknown what is happening behind the wall 50 or 100 years later, making assumptions necessary (42m32s).
Workers can use the information from the models to detect which parts can be reused and how to cut them, but an on-the-spot survey is still necessary for accurate information (41m32s).
The approach does not eliminate the need for on-site surveys, but it helps narrow down the best options for new designs by matching the existing building's characteristics (43m27s).
For construction progress monitoring and deconstruction, even a basic sensor would be sufficient, as there is currently a lack of data, but laser scanners are also useful, although they need to be fast and efficient to collect data on a large scale (44m29s).
Current methods for capturing 3D representations of indoor scenes, such as using a laser scanner, can take a long time to complete one rotation, making them not scalable for large areas (44m54s).
Backpack systems are faster and similarly accurate but may have issues with potential drift, and they capture a large number of points that can be hard to process (45m0s).
A method like Adaptive Reconstruction can be helpful for construction progress monitoring as it focuses on the newly installed elements and their correct position and time of installation (45m20s).
Imagery can also be used to capture the installation of elements, but it may not provide perfect point clouds, and sizes could be off; however, it can still provide valuable information on what has been installed and when (45m41s).
Laser scanners with adaptive reconstruction capabilities and the ability to perform fast iterative processing, combined with images, can provide the best solution for capturing 3D representations of indoor scenes (46m1s).
Images are necessary to understand materials, as point clouds cannot provide this information, and high-frequency information from images is needed to characterize materials (46m19s).
Material characterization can be done using visual information from images, but it may not always be accurate, and additional documentation or information on the planned installation of elements can help make this decision easier (47m15s).
The model can also work in non-confined spaces, such as outdoor areas like squares, roads, or parks, where there is no clear ending edge around the whole space (48m25s).
A model can be used to create a 3D representation of an urban area, such as a city, to determine potential locations for solar panels, for example on rooftops, rather than on the ground (48m43s).
Aerial scanning or imagery, such as that provided by Google Maps, which has 3D information in certain cities, can be helpful in this task (49m4s).
The Living Scenes model works by doing instance matching, and while it has not been tried outdoors, the SGAligner work, which operates in the semantic space, would likely generalize more easily to outdoor environments (49m20s).
The Living Scenes model is restricted to being trained on certain categories, which would need to be expanded to accommodate outdoor environments, and one potential workaround is to consider open scene understanding (49m32s).
Open scene understanding is a possible extension of the Living Scenes model, but it has not yet been explored (49m41s).