Stanford Seminar - Modeling Humans for Humanoid Robots
Introduction and Motivation
- The speaker's background is in computer vision, specifically video understanding and action recognition, but they found it becoming repetitive and wanted to explore how video could be useful in other areas, leading them to robotics (42s).
- The idea of learning from videos was a motivation to transition into robotics, as humans in videos can provide a good source of demonstrations for robots (55s).
- The speaker's personal life, including their daughter, has been a source of motivation for their research, particularly in creating robots that can interact with and assist humans in everyday environments (1m20s).
- The speaker's research involves using robots, such as the B1 robot with a Z1 arm, to perform tasks like pick and place, but notes that these tasks are still not very realistic and that real-world scenarios are much more complex (1m32s).
- The speaker believes that having a robot in the home can be beneficial, but current robots are often too large and not suitable for smaller living spaces, leading to the need for more compact and agile robots (2m7s).
- The speaker argues that bipedal walking is a promising area of research for robotics, as it allows for more efficient movement in tight spaces and is becoming increasingly feasible with advances in technology (3m15s).
- The speaker also notes that wheel-based robots can be useful, but often require a large base to maintain balance, which can make them less suitable for smaller spaces (4m1s).
- The speaker's research is focused on creating robots that can interact with and assist humans in everyday environments, and they believe that bipedal walking is a key area of research for achieving this goal (4m28s).
Modeling Humans for Humanoid Robots: Data Sources and Challenges
- The process of modeling humans for humanoid robots involves discussions on whether to use simulations, real data, or study human data, with no single approach being the best solution (4m30s).
- The use of simulation and real data has been explored in projects on point-cloud-based reinforcement learning, where a policy is trained with RL in simulation and the learned representation is transferred to real-world scenarios (5m14s).
- Combining simulated and real data can be effective, but it requires making the task and environment simple to align simulation and real-world data (5m40s).
- Simulation to real-world transfer can be challenging, and making the task simple is crucial for successful transfer (5m44s).
- Teleoperation is another area of interest, but collecting real data can be expensive, with some businesses spending millions of dollars on it (6m38s).
- Simulated data is less expensive but can be difficult to use for complex tasks (6m47s).
- Human video data can provide rich context and complex manipulation data, but learning from it can be challenging (7m6s).
- Vision-based 3D pose estimation in videos has limitations, but recent advancements in devices like AR glasses have improved 3D estimation accuracy (7m45s).
- The H3D dataset from AR glasses provides accurate 3D annotations, but it still relies on motion capture technology (7m57s).
- With advancements in vision and better data from human sources, it is reasonable to explore the idea of using human data for modeling humans in humanoid robots (8m17s).
Leveraging Vision-Language Models for Action Prediction
- Spatial Region GPT (SpatialRGPT) enables vision-language models to perform detailed measurement-level reasoning, allowing them to answer questions about spatial relationships and distances, such as whether a motorcycle can fit through a gap (8m34s).
- This capability suggests that these models can be used to predict actions by understanding distances and spatial reasoning, going beyond semantic-level understanding to detailed distance and measurement-level reasoning (9m30s).
Framework for Human-Robot Interaction
- A framework is proposed to train a vision-language-action model on human data to predict how humans plan and manipulate objects in a human action space, with mid-level action descriptions at the trajectory level (10m20s).
- The framework has a two-layer structure: high-level planning with a vision-language model, and low-level control realized either through teleoperation for manipulation or through whole-body control that imitates human movements, corresponding to different levels of modeling human behavior; a minimal sketch of this two-level loop follows below (11m13s).
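A minimal sketch of this two-level loop, with assumed interfaces (the `high_level_plan` and `low_level_step` functions below are hypothetical stand-ins, not the speaker's code): a high-level model is queried at a low rate and emits a mid-level target, while a low-level controller tracks it at a much higher rate.

```python
import numpy as np

HIGH_LEVEL_HZ = 1   # e.g., a vision-language-action model re-planning once per second
LOW_LEVEL_HZ = 50   # e.g., a whole-body or manipulation controller

def high_level_plan(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stand-in for the high-level model: returns a 3D hand waypoint."""
    return np.array([0.4, 0.0, 0.9])          # placeholder target in the robot frame

def low_level_step(current: np.ndarray, target: np.ndarray, gain: float = 0.1) -> np.ndarray:
    """Hypothetical low-level tracker: move a fraction of the way toward the target."""
    return current + gain * (target - current)

hand_pos = np.zeros(3)
image = np.zeros((224, 224, 3), dtype=np.uint8)    # placeholder camera frame
for t in range(LOW_LEVEL_HZ * 2):                  # simulate two seconds of control
    if t % (LOW_LEVEL_HZ // HIGH_LEVEL_HZ) == 0:   # high-level model runs at 1 Hz
        target = high_level_plan(image, "pick up the cup")
    hand_pos = low_level_step(hand_pos, target)    # low-level controller runs at 50 Hz
```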
Teleoperation and Real-time Control
- An example of a teleoperation system, Open Teleoperation, is mentioned, which uses a uniquely designed robot head that streams the robot's vision in real time to a VR headset, allowing for precise control (11m49s).
- The system allows a student to see what the robot sees and perform teleop control in real-time, with the potential for applications in robotics and human-robot interaction (12m31s).
- Researchers have developed a system that allows a human operator to control a robot hand remotely using a VR headset, with the operator's hand pose estimated from the headset, enabling tasks to be performed in real time with relatively small latency (see the retargeting sketch after this list) (12m40s).
- The system uses streaming to transmit vision and pose data across the country, allowing for remote control of the robot, and has been demonstrated with a human operator at MIT controlling a robot in San Diego (12m58s).
- The system collects data and trains a policy to perform tasks, but the paper notes that the imitation learning part is prone to overfitting, and the system does not generalize well to new containers or objects (13m31s).
- Despite this, the system is able to continuously perform tasks and is robust in particular cases, allowing for long-horizon control (13m51s).
- The system uses an active camera that predicts not only the arms and hands but also the head movement, enabling active vision and manipulation (14m6s).
- The system has been used to perform tasks such as folding, and has been compared to other teleoperation platforms, showing that the egocentric visual feedback is important for conducting dexterous tasks (15m18s).
- The system has also been used to perform tasks that require precise control, such as reaching for a small object, and has demonstrated the ability to perform tasks with a high degree of precision (16m13s).
- The researchers conclude that egocentric vision is good for dexterous tasks and remote control, and that the system enables these capabilities (16m39s).
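A minimal sketch of the hand-retargeting idea behind such teleoperation systems, under assumed conventions (4x4 homogeneous wrist poses, an identity calibration, and a hypothetical `solve_ik` placeholder; this is not the actual system's code): the operator's wrist motion relative to a calibration pose is re-applied around the robot's initial end-effector pose and handed to an IK solver.

```python
import numpy as np

def relative_pose(T_init: np.ndarray, T_now: np.ndarray) -> np.ndarray:
    """Current wrist pose expressed in the initial wrist frame (4x4 homogeneous)."""
    return np.linalg.inv(T_init) @ T_now

def retarget(T_human_init, T_human_now, T_robot_init) -> np.ndarray:
    """Apply the human's relative wrist motion around the robot's initial end-effector pose."""
    return T_robot_init @ relative_pose(T_human_init, T_human_now)

def solve_ik(T_target: np.ndarray) -> np.ndarray:
    """Hypothetical IK call; a real system would use the arm's kinematic model."""
    return np.zeros(7)                      # placeholder joint angles

# Identity calibration poses; the operator's wrist has moved 10 cm forward since calibration.
T_h0, T_r0 = np.eye(4), np.eye(4)
T_h1 = np.eye(4); T_h1[0, 3] = 0.10
q = solve_ik(retarget(T_h0, T_h1, T_r0))
```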
Whole-Body Control and Locomotion
- The goal is to enable a humanoid robot to perform whole-body control, allowing it to walk and manipulate objects simultaneously (16m49s).
- This work involves a combination of engineering and separate control of the upper and lower body, with the upper body using inverse kinematics (IK) and retargeting, and the lower body using reinforcement learning (17m10s).
- A new technique is used to achieve robust walking while moving the upper body: a Variational Autoencoder (VAE) is trained to encode upper-body motion, and its latent is used as a condition for the walking policy (see the conditioning sketch at the end of this section) (17m42s).
- This approach allows for temporal separation of robust walking and upper body control, enabling the robot to perform tasks that require both (17m51s).
- Teleoperation results are shown, where one person controls the upper body and another person controls the walking using a joystick, demonstrating the robot's capability to perform tasks such as pushing a wheelchair and pulling a cart (18m17s).
- The robot is also shown to be able to walk and grasp objects in a cafeteria setting, with the operator wearing a Vision Pro and controlling the robot's motion (18m52s).
- The robot's motion is aligned with the operator's hand motion, and another person uses a joystick to control the robot's direction (19m23s).
- The demonstration highlights the potential for humanoid robots to perform complex tasks that require both walking and manipulation (19m39s).
- The robot's behavior is not fully learned and sometimes exhibits hesitation, but it is able to perform tasks that require a combination of walking and manipulation (19m58s).
- The robot is also able to stand and perform teleoperation tasks, although training the standing behavior took a long time (20m15s).
- The robot is capable of standing and remote control, with the control system still being explored and different user interfaces (UI) being tested to find the most effective way to control the robot (21m4s).
- The UI allows the user to rotate and move the robot's arm, with the goal of finding an intuitive interface for the human operator to control the robot (21m28s).
- The robot is able to walk around the city, not just in a lab setting, using whole-body control to imitate human motion (22m4s).
- The input for the robot's motion is a reference motion, with no vision or motion capture (mo cap) data used in this particular case (22m34s).
- The robot uses a teleoperation system with a camera view, similar to a ZED or ZED Mini camera, to provide visual feedback to the operator (22m53s).
- The low-level locomotion of the robot does not use vision, but the upper body teleoperation has visual feedback (23m5s).
- The use of an egocentric camera view helps with fine-grained tasks, as it allows the operator to see the robot's environment from its perspective, making it more intuitive to operate (23m16s).
- The egocentric camera view helps to compensate for the lack of haptic feedback and the difference between the human hand and the robot's hand, making it easier for the operator to perform tasks (24m24s).
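A minimal sketch of the conditioning idea mentioned above (shapes and network sizes are assumptions, not the actual training code): a small encoder compresses a short window of upper-body joint targets into a latent z, and z is appended to the proprioceptive observation of the RL walking policy.

```python
import torch
import torch.nn as nn

UPPER_DOF, WINDOW, LATENT = 14, 10, 8   # assumed upper-body DoF, history length, latent size
PROPRIO = 45                            # assumed proprioceptive observation size

class UpperBodyEncoder(nn.Module):
    """VAE-style encoder over a window of upper-body motion."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(UPPER_DOF * WINDOW, 128), nn.ELU(),
                                 nn.Linear(128, 2 * LATENT))
    def forward(self, upper_motion):                       # (B, WINDOW, UPPER_DOF)
        mu, logvar = self.net(upper_motion.flatten(1)).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick

class WalkingPolicy(nn.Module):
    """Lower-body policy conditioned on proprioception plus the upper-body latent."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(PROPRIO + LATENT, 256), nn.ELU(),
                                 nn.Linear(256, 12))        # 12 leg joint targets
    def forward(self, proprio, z):
        return self.net(torch.cat([proprio, z], dim=-1))

enc, policy = UpperBodyEncoder(), WalkingPolicy()
leg_targets = policy(torch.randn(1, PROPRIO), enc(torch.randn(1, WINDOW, UPPER_DOF)))
```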
Hardware Limitations and Data Scaling
- Studies have been conducted on the use of haptic feedback in robot control, including the use of tactile sensors to provide feedback to the operator (24m43s).
- Humanoid robots have limitations in their hardware, particularly in their fingers, which have only one degree of freedom, making tasks like picking up small things challenging (25m21s).
- Scaling up data to improve robot performance is possible, but it depends on the approach used to learn the representation, and a reasonable representation space is necessary (25m51s).
Whole-Body Control with Smaller Robots
- A newer approach does not separate upper-body and lower-body control, making it possible to track whole-body motions, which is easier to do with smaller robots (26m12s).
- The X2 robot is a smaller robot that allows whole-body tracking: human motions are retargeted to it, and a policy trained in simulation transfers to the real robot (26m35s).
- The gap between simulated and real motions is not huge with the X2 robot, and it has more degrees of freedom and is lighter, making it easier to control (27m8s).
- Despite better hardware, there are still limitations to the robot's capabilities, such as not being able to jump too high or run too fast (27m41s).
Motion Imitation and Data Curation
- The framework used for motion imitation involves a teacher-student policy, where the input is the kinematic reference motion and the policy outputs joint actions to track it (28m23s).
- The importance of curating better data sets for human and robot motion is emphasized, and the need to filter and study what kind of motions can be used (28m1s).
- Privileged information, such as link velocities, is important for whole-body control, as seen in a Disney paper on agile whole-body control (28m40s).
- A motion-capture system was initially used to obtain the linear velocity of each joint as an extra input for the whole-body controller, but it was later discarded in favor of a more realistic approach using a teacher-student policy with partial observations (28m53s).
- The teacher policy is distilled into a student that only receives partial observations, namely proprioception and the reference motion, and the student performs whole-body control (see the distillation sketch after this list) (29m9s).
- A significant amount of time was spent on creating a good dataset for this task, with the goal of having the policy track all motions as input instead of overfitting to one particular motion (29m29s).
- The dataset was created by manually selecting motions from the CMU dataset, with a focus on having diverse upper body motion and relatively stable walking (30m1s).
- Three different datasets were used: D50, which had simple motions; D250, which had more diverse upper body motion and stable walking; and D500, which had a lot of different motions (30m41s).
- The CMU Mocap dataset was also used for comparison, and it was found that training on D250 and evaluating on the whole CMU dataset resulted in better generalization (31m4s).
- Evaluating on other datasets not from CMU showed that the mid-sized dataset (D250) worked best, while larger or smaller datasets did not generalize as well (31m34s).
- The process of selecting and filtering the dataset was manual, but efforts are being made to automate this by using the policy itself to filter out data (32m9s).
- The dataset still contains infeasible trajectories, i.e., motions that are difficult for the robot to follow, such as exaggerated jumps and other kinematically extreme motions (32m34s).
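A minimal sketch of the teacher-student distillation described above, with assumed observation and action sizes (not the actual pipeline): the teacher sees privileged state such as link velocities, the student sees only proprioception plus the reference motion, and the student is trained to match the teacher's actions on the states it visits.

```python
import torch
import torch.nn as nn

PROPRIO, REF, PRIV, ACT = 45, 30, 24, 19   # assumed dimensions

teacher = nn.Sequential(nn.Linear(PROPRIO + REF + PRIV, 256), nn.ELU(), nn.Linear(256, ACT))
student = nn.Sequential(nn.Linear(PROPRIO + REF, 256), nn.ELU(), nn.Linear(256, ACT))
opt = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_step(proprio, ref_motion, privileged):
    with torch.no_grad():                   # teacher is frozen (already trained with RL)
        target_action = teacher(torch.cat([proprio, ref_motion, privileged], dim=-1))
    action = student(torch.cat([proprio, ref_motion], dim=-1))
    loss = nn.functional.mse_loss(action, target_action)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# One toy batch of states, as if collected by rolling out the student (DAgger-style).
loss = distill_step(torch.randn(64, PROPRIO), torch.randn(64, REF), torch.randn(64, PRIV))
```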
Design Differences and Real-world Applications
- The approach involves a difference in design: matching the velocity of the root joint and the relative positions between human and robot joints, rather than matching absolute positions (see the reward sketch after this list) (34m2s).
- This design difference allows for more flexibility and style in movements, such as walking with a specific style, and enables the robot to walk outdoors (34m39s).
- The approach also involves using a relative coordinate system, where the human and robot key points are centralized to the same coordinates, allowing for more accurate matching of movements (34m16s).
- The robot is able to perform various motions, such as walking and punching, within its available degrees of freedom (35m26s).
- Stress testing has been conducted on the G1 robot, which was able to stand for two hours straight without overheating and with 60% battery consumption (35m47s).
- The approach is also being applied to navigation tasks, such as vision-language navigation, where the robot is trained to follow instructions and navigate through a space (36m37s).
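A minimal sketch of the tracking-objective difference described above (weights and shapes are illustrative, not the exact reward terms): key points are expressed relative to the root before comparison, and the root is tracked by velocity rather than by absolute position.

```python
import numpy as np

def relative_keypoints(keypoints: np.ndarray, root_pos: np.ndarray) -> np.ndarray:
    """Centralize key points (N, 3) to the root so human and robot share the same coordinates."""
    return keypoints - root_pos

def tracking_reward(robot_kp, robot_root, robot_root_vel,
                    ref_kp, ref_root, ref_root_vel, w_kp=1.0, w_vel=0.5):
    kp_err = np.linalg.norm(relative_keypoints(robot_kp, robot_root)
                            - relative_keypoints(ref_kp, ref_root), axis=-1).mean()
    vel_err = np.linalg.norm(robot_root_vel - ref_root_vel)   # root velocity, not position
    return w_kp * np.exp(-kp_err) + w_vel * np.exp(-vel_err)

r = tracking_reward(np.random.rand(12, 3), np.zeros(3), np.array([0.5, 0.0, 0.0]),
                    np.random.rand(12, 3), np.ones(3), np.array([0.5, 0.0, 0.0]))
```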
Vision-Language Navigation
- The navigation task involves training a vision-language-action model, which predicts control outputs in the form of language, such as moving forward 75 cm (37m22s).
- The approach falls into a two-level framework, where the high-level model predicts language outputs, which are then used to control the robot's movements (37m12s).
- The model training involves low-level instructions that specify how much to move forward and how much to turn, which helps prevent overfitting to the navigation task itself (37m33s).
- The training process includes co-training with other QA pairs and general vision-language-model data, allowing the model to learn from various sources (37m51s).
- The low-level policy is separated from the high-level policy, enabling them to run at different frequencies, such as 1 Hz for the high-level policy and higher frequencies for the low-level policy (see the parsing sketch after this list) (38m8s).
- The model can be trained with human videos, general question-answering tasks, and robot data, using an NVIDIA vision-language model and co-training with various data sources (38m47s).
- The model is trained with the RealEstate10K video dataset, which involves labeling captions for videos of a camera walking around a room, enabling the model to learn from general video data (39m10s).
- The model can be used to train robots to follow instructions, such as walking down a hallway or up stairs, and can be applied to various platforms with a single VLA model (39m24s).
- The policy rollout process involves sending an image from the robot to an off-board machine, computing the text output, and sending it back to the robot, although the low-level policy runs onboard (39m56s).
- The model is not yet fully onboard due to limitations in running the VLM on the robot's onboard chip, but the team is working with the S. Song group to improve this (40m23s).
- The model has been tested in various environments, including a home setting, and has demonstrated its ability to follow instructions and navigate spaces (40m37s).
- The instruction space is in metric space, allowing for precise instructions such as moving forward 25 units, and the model can handle descriptive action spaces (40m53s).
- The model has been used to find objects, such as a football, and navigate through environments, demonstrating its potential for real-world applications (41m16s).
- The same model used for navigation can also be applied to question answering, demonstrating its generalization ability and potential to be used as a general model for knowledge about things and outputting actions (41m32s).
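A minimal sketch of how language-form actions like "move forward 75 cm" could be turned into velocity commands for a separate locomotion policy running at a higher rate (the command format and speeds below are assumptions, not the actual parser):

```python
import math
import re

FORWARD_SPEED = 0.5   # m/s, assumed walking speed commanded to the low-level policy
TURN_SPEED = 0.5      # rad/s, assumed turning rate

def parse_action(text: str):
    """Return (vx, wz, duration_s) from a metric-space instruction string."""
    if m := re.search(r"move forward (\d+)\s*cm", text):
        dist_m = int(m.group(1)) / 100.0
        return FORWARD_SPEED, 0.0, dist_m / FORWARD_SPEED
    if m := re.search(r"turn (left|right) (\d+)\s*degrees", text):
        sign = 1.0 if m.group(1) == "left" else -1.0
        angle = math.radians(float(m.group(2)))
        return 0.0, sign * TURN_SPEED, angle / TURN_SPEED
    return 0.0, 0.0, 0.0  # unrecognized output -> stop

vx, wz, duration = parse_action("move forward 75 cm")
# The low-level policy would then receive (vx, wz) at every control step for `duration` seconds.
```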
Extending the Framework to Manipulation
- The vision-language-action model is expected to have reasoning ability, generalization, and the capacity to extend to more complex scenarios, rather than just performing a fixed set of tasks (42m16s).
- The framework aims to apply the same model to manipulation, ideally working with high-level policies for hand movement and low-level policies for handling objects (42m37s).
- The goal is to train a vision-language-action model on human data and transform the human action outputs to robot outputs, as presented in the Eco V paper (43m24s).
- The model was trained on human data from egocentric videos with reliable 3D annotations on hands, and the results showed that the difference between human action and robot action can be addressed through a transformation (43m32s).
- The model takes human videos, a few frames, and language instructions as input and predicts a parameterized human hand, which can be converted to robot action (44m31s).
- A new dataset was created to iterate faster in manipulation, using a simulated environment, and the goal is to find a mapping or conversion between human action and robot action (45m5s).
- The conversion between human action and robot action is a simple 3D transformation, from the robot wrist pose to the human wrist pose (45m32s).
- The process involves converting the 3D locations of the robot's hands into the human hand's coordinate system, aligning the robot's fingertips with the human hand's fingertips, and then using an optimization to recover the human hand parameters (see the alignment sketch at the end of this section) (45m55s).
- This process creates a mapping between human actions and robot actions, allowing a human vision-language-action (VLA) model to be fine-tuned on the human actions obtained from robot demonstrations (46m40s).
- A separate robot VLA is not trained; instead, robot actions are converted into human actions, and the human VLA outputs the wrist pose and hand pose, which can then be mapped back into robot actions (47m6s).
- The human hand action is converted back to a robot hand action by mapping it to the robot's wrist, running inverse kinematics to control the arm joints, and retargeting to obtain the robot hand pose (47m19s).
- The human VLA's outputs are visualized, predicting all finger joints and the wrist; the model predicts 10 steps ahead and performs action chunking (47m51s).
- The model predicts the hand gesture and the trajectory of the hand by predicting a parameterized hand pose and shape, using 15 parameters in the PCA space of the hand model (48m47s).
- The project does not require collecting paired data in simulation and real-world human demonstrations, instead using fine-tuning with relatively accurate 3D pose data from the wild (49m21s).
- The initial attempt at simulating human and robot embodiment for learning purposes is being explored, with the goal of developing methods and fast evaluations for cross-embodiment learning (49m55s).
- The simulation is not intended for practical use, but rather for studying algorithms and developing methods (50m7s).
- The team also tried to make the simulation work with real robot data and real human data, but did not have enough time to complete the task before the CVPR deadline (50m32s).
- The team is exploring a similar framework using real human data and real robot data, with the goal of leveraging human data to train a more generalizable agent (51m11s).
- The team is also exploring the use of teleoperation and dexterous grasping in simulation, with the goal of developing a more robust and generalizable system (51m29s).
- The final milestone for contact and grasping is being explored, with the possibility of using simulation to learn low-level contact and interaction instead of relying on more complex training pipelines (51m50s).
- The team believes that having a controller, teleoperation, and simulation is not enough to solve complex problems, and that leveraging human data is necessary to train a more generalizable agent (52m3s).
- The team emphasizes the importance of training human data and connecting it to robot data, rather than just training a specific robot or agent (52m50s).
- The training data is currently too small to enable intelligent actions, but the team believes that with enough data, the system can emerge more complex behaviors (53m27s).
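A minimal sketch of the fingertip-alignment step described above (the fixed wrist-to-wrist transform and the toy `hand_fingertips` model are stand-ins, not the actual pipeline): robot fingertip positions are expressed in the human wrist frame, and low-dimensional hand parameters are optimized so the parametric hand's fingertips match them.

```python
import numpy as np
from scipy.optimize import minimize

T_ROBOT_TO_HUMAN_WRIST = np.eye(4)   # assumed fixed transform between the two wrist frames

def to_human_wrist_frame(p_robot: np.ndarray) -> np.ndarray:
    """Map fingertip positions (N, 3) from the robot wrist frame into the human wrist frame."""
    homog = np.hstack([p_robot, np.ones((len(p_robot), 1))])
    return (T_ROBOT_TO_HUMAN_WRIST @ homog.T).T[:, :3]

def hand_fingertips(params: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a parametric hand model mapping pose parameters to fingertips."""
    rest = np.array([[0.09, 0.02 * i - 0.04, 0.0] for i in range(5)])
    return rest + 0.01 * params.reshape(5, 3)   # toy linear model for illustration

def fit_hand(robot_fingertips: np.ndarray) -> np.ndarray:
    """Optimize 15 hand parameters so the hand model's fingertips align with the robot's."""
    target = to_human_wrist_frame(robot_fingertips)
    cost = lambda p: np.sum((hand_fingertips(p) - target) ** 2)
    return minimize(cost, np.zeros(15), method="L-BFGS-B").x

hand_params = fit_hand(np.random.rand(5, 3) * 0.05)   # five fingertip positions from the robot
```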
Improving Robot Design and Stability
- The team is working to improve the design of the leg and stabilize the vision to enable more robust and generalizable performance (54m5s).
- The camera pose could be useful as part of the action space, but it has not been explored yet, and the data from human videos is noisy and lacks accurate camera models (54m33s).
- The action model operates in the camera coordinate space, following the camera, rather than in the robot frame or world coordinates (see the frame-conversion sketch after this list) (55m2s).
- When collecting new egocentric videos, it would be beneficial to also get accurate camera data with the robot (55m25s).
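A minimal sketch of moving between the camera-frame action space and the robot base frame (the extrinsics below are assumptions for illustration): with an accurate camera-to-base transform, a displacement predicted in camera coordinates can be re-expressed in the robot's own frame, which is what collecting accurate camera data alongside the videos would enable.

```python
import numpy as np

# Assumed camera pose in the robot base frame: rotation R and translation t.
R_base_cam = np.eye(3)
t_base_cam = np.array([0.05, 0.0, 1.4])   # t would matter for absolute positions, not displacements

def camera_action_to_base(delta_cam: np.ndarray) -> np.ndarray:
    """Re-express a 3D displacement predicted in camera coordinates in the base frame."""
    return R_base_cam @ delta_cam          # a pure displacement only needs the rotation

delta_base = camera_action_to_base(np.array([0.0, 0.0, 0.3]))  # 30 cm along the camera's z-axis
```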
Comparison with Other Methods and Future Directions
- A comparison between X2 and H2O was made, highlighting differences in their methods, such as global tracking versus velocity tracking (55m46s).
- H2O uses global tracking of the absolute positions of key points, whereas the discussed method uses velocity tracking and local pose tracking (56m9s).
- The discussed method also studies the data side, exploring how different data affects the results, and uses a curriculum learning approach (56m26s).
- Manually filtering data is still effective, but there is a need to automate the data curation process to enlarge the dataset (56m52s).
- The VLMs used are around 1.5 billion parameters and can run on desktop hardware, but larger models such as LongVILA with 128 input frames are not feasible due to long inference times (57m45s).
- The current model uses 8 frames of history as input and, at best, runs at about one inference every half second, with some additional latency due to image transfer (57m51s).
- There is ongoing work to improve the efficiency and real-time capabilities of these models, including collaboration with NVIDIA (58m39s).
- Fine-tuning smaller models can result in losing some capabilities, but the current task at hand is not that difficult, and the difference is not significant, especially when trying to show that a short history works well in the current setting (58m49s).
- The vision language navigation problem is not well-defined, and a short history can be sufficient, making it unnecessary to have a bigger model or longer context model at this point (59m12s).
- There may be future problems that require long context, but currently, it is not necessary to have a larger model or longer context model to address the task at hand (59m23s).
- The current task does not demonstrate a significant advantage of using a long context or a larger model, contrary to the initial expectation that a long context model would be more helpful (59m5s).