Stanford Seminar - Robot Learning in the Era of Large Pretrained Models

14 Mar 2024

Pre-training Foundation Models for Robotics

  • Foundation models are trained on vast amounts of diverse data and can be adapted to many downstream robotics tasks, such as imitation, control, grounding, affordance learning, and intent understanding.
  • One approach is to pre-train a representation on internet-scale data sources such as human videos, robot interaction data, natural language, and simulation data.
  • Another approach is to leverage existing foundation models, such as large language models or vision-language models, using them in creative ways to enhance robot capabilities (a sketch of this idea follows the list).
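
One way to picture the second approach is prompting an off-the-shelf language model to decompose a command into primitive robot skills. The sketch below is illustrative only: `query_llm` is a hypothetical placeholder for whatever LLM API is available, and the skill names are assumptions, not from the talk.

```python
# Hypothetical sketch: an LLM as a high-level task planner.
# `query_llm` is a placeholder callable (prompt -> text), not a real API.
PRIMITIVE_SKILLS = ["pick(object)", "place(object, location)", "open(container)"]

def plan_with_llm(command: str, query_llm) -> list[str]:
    """Ask a language model to rewrite a command as a skill sequence."""
    prompt = (
        "Decompose the instruction into a sequence of primitive skills.\n"
        f"Available skills: {', '.join(PRIMITIVE_SKILLS)}\n"
        f"Instruction: {command}\n"
        "Plan (one skill per line):"
    )
    response = query_llm(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]
```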

Voltron: A Pre-trained Visual Representation Model for Robotics

  • Voltron is a pre-trained visual representation model that uses language and multi-frame conditioning to learn grounded visual representations for robotic tasks.
  • Voltron combines masked autoencoding with language-driven representation learning, using a captioning loss to condition the encoder on language and incorporating multiple frames to capture context and dynamics (see the sketch after this list).
  • It also includes a language generation component to describe changes in the image.
  • Voltron outperforms other visual representation models on language-conditioned imitation learning tasks and can also be used for zero-shot intent inference.
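
The two objectives in the second bullet can be written as one combined loss. This is a minimal PyTorch sketch under assumed shapes, with hypothetical `encoder` / `pixel_decoder` / `caption_head` modules; it shows the structure of masked reconstruction plus captioning, not Voltron's actual implementation.

```python
import torch.nn.functional as F

def voltron_style_loss(encoder, pixel_decoder, caption_head,
                       patches, caption_tokens, mask):
    """Illustrative dual objective: masked reconstruction + captioning.
    patches:        (B, N, D) patchified multi-frame clip
    caption_tokens: (B, L)    tokenized caption (also the generation target)
    mask:           (B, N)    boolean mask of hidden patches
    """
    # Language- and multi-frame-conditioned encoding of the visible patches.
    latents = encoder(patches, caption_tokens, mask)
    # Objective 1: masked autoencoding -- reconstruct the hidden patches.
    recon = pixel_decoder(latents)                      # (B, N, D)
    recon_loss = F.mse_loss(recon[mask], patches[mask])
    # Objective 2: language generation -- describe the clip and its changes.
    logits = caption_head(latents, caption_tokens)      # (B, L, vocab)
    caption_loss = F.cross_entropy(logits.flatten(0, 1),
                                   caption_tokens.flatten())
    return recon_loss + caption_loss
```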

Collecting Large-Scale Robot Datasets

  • There is a growing effort to collect large-scale robot datasets to train and evaluate pre-trained visual representation models for robotics.
  • Open X-Embodiment is a project that brings together roboticists from different institutions to create a single, large-scale, cross-embodiment robot dataset.
  • DROID is another effort that focuses on collecting large-scale in-the-wild robot data on a standardized platform with the same robot, gripper, and cameras.
  • The goal is to train large models on this data to improve robot performance on a variety of tasks (a sketch of a unified episode schema follows).
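
Cross-embodiment training only works if episodes from different robots are mapped into a shared format. The sketch below is an assumed common schema, not the actual Open X-Embodiment or DROID format; the field names and normalization convention are illustrative.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EpisodeStep:
    """One timestep in an assumed cross-embodiment schema."""
    image: np.ndarray   # (H, W, 3) camera observation
    instruction: str    # natural-language task description
    action: np.ndarray  # embodiment-specific action, normalized below
    embodiment: str     # e.g. "franka", "widowx", "ur5" (illustrative names)

def normalize_action(action: np.ndarray, low: np.ndarray,
                     high: np.ndarray) -> np.ndarray:
    """Map each robot's action range to [-1, 1] so a single policy can
    consume actions from every embodiment."""
    return 2.0 * (action - low) / (high - low) - 1.0
```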

Challenges in Data Collection and Pre-training

  • Determining what type of data is useful for training is a challenge, as diversity and data quality are important factors.
  • A recent study found that adding more data to a dataset did not always improve performance, suggesting that data quality is more important than quantity.
  • Multimodality in demonstrations, where several different actions are valid in the same state, can confuse a policy trained by regression, making it hard to determine the appropriate action (see the toy example after this list).
  • Pre-training large models for robotics requires substantial robot data, but collection should prioritize quality, such as action consistency and diverse states, over passively amassing volume.
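
A toy illustration of the multimodality problem: if demonstrators avoid an obstacle by steering either left or right, a policy fit with mean-squared error predicts the average of the two modes and drives straight into the obstacle. The numbers below are invented purely for illustration.

```python
import numpy as np

# Two equally valid demonstrated modes: steer left (-1) or right (+1).
demo_actions = np.array([-1.0] * 50 + [+1.0] * 50)

# The MSE-optimal constant prediction is the mean of the data...
mse_optimal_action = demo_actions.mean()
print(mse_optimal_action)  # 0.0 -- neither mode; straight into the obstacle
```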

Adaptation and Generalization of Pre-trained Models

  • Adaptation is crucial for improving the efficiency and task success rate of pre-trained models, as a single unified model may not generalize well to all scenarios (a minimal fine-tuning sketch follows this list).
  • Densely narrated training data for Voltron, with detailed, fine-grained language describing motions, improves generalization: the model learns specific motions that transfer to other tasks requiring similar motions.
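
One common adaptation recipe is to freeze the pre-trained visual encoder and train only a small policy head on task data. This is a minimal PyTorch sketch under that assumption; the function name and layer sizes are illustrative, not a recipe from the talk.

```python
import torch.nn as nn

def build_adapted_policy(pretrained_encoder: nn.Module,
                         feature_dim: int, action_dim: int) -> nn.Module:
    """Freeze the pre-trained encoder; train only the lightweight head."""
    for p in pretrained_encoder.parameters():
        p.requires_grad = False  # keep pre-trained weights fixed
    head = nn.Sequential(
        nn.Linear(feature_dim, 256),
        nn.ReLU(),
        nn.Linear(256, action_dim),  # predict the robot action
    )
    return nn.Sequential(pretrained_encoder, head)
```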
