Stanford Seminar - Leveraging Physics-Based Models To Learn Generalizable Robotic Manipulation

26 Nov 2024

Introduction

  • The presentation aims to answer three questions: what is missing in robotic manipulation, what causes this deficiency, and how physics-based models can help, with a focus on learning generalizable robotic manipulation (11s).
  • Despite impressive robotic manipulation demos, the field is far from solved, and policies learned with current methods are neither generalizable nor robust (1m20s).
  • State-of-the-art algorithms, such as diffusion policy and reinforcement learning, have limitations, including poor generalizability, sensitivity to demonstration data set size, and the need for expensive and tedious tuning processes (2m5s).
  • The key to understanding the difficulty of manipulation lies in the rich physical constraints that govern it, including collision and robot reachability constraints, contact modes, force closure, and friction cone (3m2s).

Constraints in Robotic Manipulation

  • Collision and robot reachability constraints are kinematic and geometric, depending on the object's shape and the scene, and are highly non-convex, making it difficult to obtain a global solution (3m45s).
  • Contact mode describes the contact configuration between the robot, the object, and the scene; because the dynamics are hybrid, the contact mode directly determines the system's dynamics (4m12s).
  • In robotic manipulation, there are three types of constraints: environment contact, force closure, and friction cone, which are crucial for tasks like grasping and manipulation but are generally difficult to work with due to their non-convex nature (4m28s).
  • Environment contact constraints depend on the shape of the object and the environment, and can be detected and enforced using tactile force sensing and classical control tools like compliant or force control (4m30s).
  • Force closure and friction cone constraints mostly concern robotic grasping: force closure requires that the contact forces can balance the external forces and torques on the object (net wrench of zero), and the friction cone is typically modeled with the Coulomb friction model in the literature (5m8s).
  • These constraints are conditioned on the contact points, and rich computational tools such as quadratic optimization make it possible to leverage them, as sketched in the example after this list (5m32s).
  • The common attributes of these constraints are that they are non-convex, mostly differentiable, and computationally expensive to work with, making it difficult to obtain a global solution when solving an optimization problem with these constraints (6m0s).
  • These constraints are extensively covered in prior literature and classical control theory, and they stem from physics, in particular classical mechanics, kinematics, and dynamics (6m26s).
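
The friction-cone and force-closure reasoning above can be posed as a small convex feasibility problem. Below is a minimal, illustrative sketch (not the speaker's implementation): it linearizes the Coulomb friction cone at each contact into a pyramid and uses a linear program to test whether contact forces inside those cones can balance a given external wrench. The function names, the pyramid discretization, and the toy two-contact example are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def friction_pyramid(normal, mu, n_edges=4):
    """Linearize the Coulomb friction cone at one contact into n_edges rays."""
    n = np.asarray(normal, float)
    n /= np.linalg.norm(n)
    t = np.cross(n, [1.0, 0.0, 0.0])
    if np.linalg.norm(t) < 1e-6:          # normal was (anti)parallel to the x-axis
        t = np.cross(n, [0.0, 1.0, 0.0])
    t1 = t / np.linalg.norm(t)
    t2 = np.cross(n, t1)
    angles = 2 * np.pi * np.arange(n_edges) / n_edges
    # Each ray lies on an edge of the friction pyramid around the normal.
    return [n + mu * (np.cos(a) * t1 + np.sin(a) * t2) for a in angles]

def can_balance_wrench(points, normals, mu, w_ext, com=np.zeros(3)):
    """Test whether contact forces inside the linearized friction cones can
    balance the external wrench w_ext = (force, torque) taken about com."""
    columns = []
    for p, n in zip(points, normals):
        for f in friction_pyramid(n, mu):
            tau = np.cross(np.asarray(p, float) - com, f)
            columns.append(np.concatenate([f, tau]))    # 6D contact wrench
    G = np.array(columns).T                             # 6 x (contacts * edges)
    k = G.shape[1]
    # Feasibility LP: nonnegative cone coefficients x with G x + w_ext = 0.
    res = linprog(c=np.ones(k), A_eq=G, b_eq=-np.asarray(w_ext, float),
                  bounds=[(0, None)] * k, method="highs")
    return res.success

# Toy example: two antipodal contacts on a small box resisting gravity.
points  = [[ 0.5, 0.0, 0.0], [-0.5, 0.0, 0.0]]
normals = [[-1.0, 0.0, 0.0], [ 1.0, 0.0, 0.0]]
print(can_balance_wrench(points, normals, mu=0.5,
                         w_ext=[0.0, 0.0, -9.8, 0.0, 0.0, 0.0]))
```

Strict force closure requires resisting arbitrary external wrenches; the same feasibility test can simply be repeated over a spanning set of wrench directions.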

Leveraging Physics-Based Models

  • To overcome the challenges in robotic manipulation, physics-based models can be leveraged by moving expensive analytical computations offline, generating datasets or creating heuristics, and solving problems with learning (6m53s).
  • A learned network output can then be refined against these constraints with optimization, essentially solving only a local problem; this approach can be applied to challenging manipulation tasks like dexterous grasping, dexterous pre-grasping, and extrinsic manipulation (a sketch of the offline/online pattern follows this list) (7m38s).
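
A minimal sketch of this offline/online pattern, using toy placeholders rather than the speaker's actual analytical solver, generative model, or constrained optimizer: an expensive physics-based routine labels data offline, a simple predictor is fit to it, and only a cheap local refinement runs online.

```python
import numpy as np

def expensive_analytical_solver(object_shape, rng):
    """Offline stand-in: a slow physics-based computation yielding one labeled sample."""
    grasp = rng.normal(size=3)                        # placeholder grasp parameters
    quality = -np.linalg.norm(grasp - object_shape)   # placeholder quality label
    return grasp, quality

def build_dataset(shapes, samples_per_shape=100, seed=0):
    """Offline: sweep the expensive solver over many objects to build a dataset."""
    rng = np.random.default_rng(seed)
    return [(shape, *expensive_analytical_solver(shape, rng))
            for shape in shapes for _ in range(samples_per_shape)]

def train_predictor(dataset):
    """Offline: fit a model mapping an object to a promising initial grasp.
    Here it just memorizes the best sample per object (a stand-in for a network)."""
    best = {}
    for shape, grasp, quality in dataset:
        key = tuple(shape)
        if key not in best or quality > best[key][1]:
            best[key] = (grasp, quality)
    return lambda shape: best[tuple(shape)][0]

def refine_online(initial_grasp, constraint_grad, steps=50, lr=0.1):
    """Online: cheap local optimization pushing the prediction toward constraint
    satisfaction (a plain gradient-descent sketch)."""
    g = np.array(initial_grasp, float)
    for _ in range(steps):
        g -= lr * constraint_grad(g)
    return g

shapes = [np.array([0.3, 0.1, 0.2])]
predict = train_predictor(build_dataset(shapes))
refined = refine_online(predict(shapes[0]), lambda g: 2.0 * (g - shapes[0]))
print(refined)
```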

Dexterous Grasping

  • Dexterous grasping refers to grasping an object using a multi-fingered hand, which is difficult due to the high degrees of freedom compared to a parallel gripper (8m22s).
  • A fully actuated four-finger hand has over 20 degrees of freedom, compared to approximately seven for a parallel gripper, and is subject to more constraints, including collision, reachability, force closure, and friction cone (8m36s).
  • Dexterous grasping is a powerful tool, allowing a wide range of objects to be grabbed and providing diverse grasping strategies, such as grasping from the top, the side, and other configurations (9m9s).
  • A pipeline is proposed to generate dexterous grasps by combining learning and optimization: build a grasp dataset with an analytical model, train a generative grasp point predictor, and refine the prediction with local optimization (9m31s).
  • The generative grasp point predictor is a conditional variational autoencoder that predicts where to place the fingers on an object, and a local optimization problem is then solved so the prediction satisfies the physical constraints, including collision, robot reachability, force closure, and friction cone (a sketch of this refinement step follows this list) (9m49s).
  • The pipeline achieves nearly 90% success rate over 20 objects and 120 trials, with objects ranging from those seen during training to those with similar shapes or never seen before (11m8s).
  • The pipeline allows for grasping in different configurations, even for the same objects, and enables the robot to use multimodal strategies to pick up objects (11m31s).
  • Dexterous grasping can be used to pick up objects that cannot be picked up with a parallel gripper, demonstrating its power (11m54s).
  • The takeaway from the dexterous grasping task is that a grasp dataset can be generated with physics, a grasp predictor can be learned from it, and the prediction can be refined with an optimization that accounts for the friction cone and force closure (12m2s).
  • The pipeline also solves for hand collision, a 22-degree-of-freedom problem, using kinematics (12m21s).
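
A hedged sketch of the refinement step in this pipeline, with a toy spherical object in place of a real shape model: fingertip positions proposed by a learned generator (here just a noisy placeholder) are locally optimized to stay on the object surface while roughly surrounding it, a crude surrogate for the force-closure, friction-cone, and collision constraints the actual pipeline enforces. All functions and numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def sphere_sdf(p, radius=0.05):
    """Signed distance to a sphere at the origin (toy stand-in for the object model)."""
    return np.linalg.norm(p) - radius

def refine_fingertips(x0, radius=0.05):
    """Locally adjust predicted fingertip positions to satisfy toy constraints."""
    n_fingers = len(x0) // 3

    def objective(x):
        pts = x.reshape(n_fingers, 3)
        # Encourage contacts to surround the object: penalize a nonzero mean
        # contact direction (a very rough force-closure surrogate).
        dirs = pts / np.linalg.norm(pts, axis=1, keepdims=True)
        return float(np.linalg.norm(dirs.mean(axis=0)) ** 2)

    def on_surface(x):
        pts = x.reshape(n_fingers, 3)
        return np.array([sphere_sdf(p, radius) for p in pts])  # want all zeros

    res = minimize(objective, x0, method="SLSQP",
                   constraints=[{"type": "eq", "fun": on_surface}])
    return res.x.reshape(n_fingers, 3)

# Placeholder "network output": three noisy fingertip guesses near the sphere.
rng = np.random.default_rng(0)
guess = (0.05 * rng.normal(size=(3, 3)) + np.array([0.06, 0.0, 0.0])).ravel()
print(refine_fingertips(guess))
```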

Dexterous Pre-grasping

  • In real-world scenarios, objects to be picked up may be ungraspable and require movement before a graspable pose is reached; a pre-grasp is needed in order to establish a grasp, i.e., a rigid linkage to the object (12m44s).
  • Identifying a good pre-grasp is challenging because it is defined by the quality of the grasp it can lead to, and computing a pre-grasp is computationally expensive due to factors like finger movement, object contact, and potential collisions (13m14s).
  • A key insight is that only one environment contact is needed during a pre-grasp, and only two fingers are required to establish a grasp, allowing for a reduction in search space and the use of model-based methods (13m44s).
  • A proposed pipeline involves offline training, constructing a contact state graph, and using model-based methods for optimization and synthesizing hand motion (13m57s).
  • Offline training includes learning a grasp generator and a score function to evaluate pre-grasps, which offloads expensive computations to offline processing (14m12s).
  • A contact state graph is built based on finger placement on the object surface, and edges represent transitions between contact states (14m49s).
  • A scoring function is trained to evaluate contact configurations, and trajectory optimization finds the best path on the graph (see the planning sketch after this list) (15m11s).
  • The pipeline is used to plan a contact transition and synthesize full hand motion with kinematics, resulting in physics-realistic trajectories (15m41s).
  • The approach is tested in various environments where direct grasping is not possible, and the pipeline achieves efficient pre-grasp planning (16m9s).
  • The takeaway from the project is the use of a scoring function to rank contact states, a grasp predictor to complete grasps, and a contact mode to guide pre-grasp planning (16m29s).
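
A small illustrative sketch of planning over a contact state graph, with a hand-written stand-in for the offline-trained scoring function: nodes are contact configurations, edges are feasible transitions, and a shortest-path search prefers high-scoring states on the way to a graspable one. The state names, graph, and costs are invented for illustration.

```python
import heapq

def learned_score(state):
    """Placeholder for the offline-trained scoring function (higher = better)."""
    return {"flat_on_table": 0.1, "pushed_to_edge": 0.6,
            "pivoted_up": 0.8, "two_finger_grasp": 1.0}[state]

contact_graph = {                       # edges = feasible contact transitions
    "flat_on_table": ["pushed_to_edge"],
    "pushed_to_edge": ["pivoted_up", "flat_on_table"],
    "pivoted_up": ["two_finger_grasp"],
    "two_finger_grasp": [],
}

def plan_contact_sequence(start, goal):
    """Dijkstra over the contact state graph; edge costs penalize low-scoring states."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, cost
        if state in visited:
            continue
        visited.add(state)
        for nxt in contact_graph[state]:
            step_cost = 1.0 - learned_score(nxt)    # prefer high-scoring states
            heapq.heappush(frontier, (cost + step_cost, nxt, path + [nxt]))
    return None, float("inf")

path, cost = plan_contact_sequence("flat_on_table", "two_finger_grasp")
print(path, cost)
```

In the actual pipeline, each planned transition would additionally be realized as full hand motion via kinematics rather than executed directly.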

Extrinsic Manipulation

  • Extrinsic manipulation is a versatile mode of interaction that manipulates an object using contact with the environment; it is challenging due to the many possible contact configurations and unknown factors like friction coefficients (17m3s).
  • A divide and conquer approach can be taken by breaking down extrinsic manipulation into primitives based on contact configurations, allowing for the training of robust policies within the same contact configuration using reinforcement learning (17m35s).
  • The challenge lies in switching between primitives with different contact requirements, and a framework using physical models can be built to enforce these constraints and stitch the primitives together (18m5s).
  • The framework uses a mixture of classical control tools and learning-based tools, and it can retarget a task to different scenes and objects by taking a demonstration and remapping its contact configurations (a sketch of this retargeting pattern follows this list) (18m33s).
  • A primitive library has been built with four primitives: pushing, pulling, pivoting, and grasping, which describe the contact configurations and allow the robot to move freely between contact transitions (18m44s).
  • The robot can execute long-horizon tasks, including up to four different contact configurations, using the same primitives by mapping the contact configuration (20m22s).
  • The framework has been demonstrated to achieve the same task on a variety of different objects and environments using a single demonstration and remapping the contact configuration (20m46s).
  • A divide and conquer framework can be built using physical models and contact constraints, allowing for the learning of goal-conditioned motion primitives with standard learning tools, and achieving long-horizon tasks that are otherwise impossible (21m1s).
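
A minimal sketch of the primitive-library-plus-retargeting idea, with invented class and function names: a demonstration is recorded as a sequence of (primitive, contact configuration) pairs, and the contact configurations are remapped onto a new object and scene before the same primitives are executed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class ContactConfig:
    object_face: str        # which object face is in contact
    env_feature: str        # which environment feature it touches (e.g. "table")

def push(cfg):  print(f"push  while {cfg.object_face} rests on {cfg.env_feature}")
def pull(cfg):  print(f"pull  while {cfg.object_face} rests on {cfg.env_feature}")
def pivot(cfg): print(f"pivot about the {cfg.object_face}/{cfg.env_feature} edge")
def grasp(cfg): print(f"grasp at the {cfg.object_face} face")

# The four-primitive library described in the talk (implementations are stubs here).
PRIMITIVES: Dict[str, Callable[[ContactConfig], None]] = {
    "push": push, "pull": pull, "pivot": pivot, "grasp": grasp,
}

# A single demonstration on the original object and scene.
demo: List[Tuple[str, ContactConfig]] = [
    ("push",  ContactConfig("bottom", "table")),
    ("pivot", ContactConfig("bottom", "wall")),
    ("grasp", ContactConfig("side",   "free")),
]

def retarget(demo, face_map: Dict[str, str], env_map: Dict[str, str]):
    """Remap each demonstrated contact configuration onto a new object and scene."""
    return [(name, ContactConfig(face_map.get(c.object_face, c.object_face),
                                 env_map.get(c.env_feature, c.env_feature)))
            for name, c in demo]

# Execute the same task on a different object, against a shelf instead of a wall.
for name, cfg in retarget(demo, face_map={"side": "handle"}, env_map={"wall": "shelf"}):
    PRIMITIVES[name](cfg)
```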

Summary and Discussion

  • Three tasks were covered: dexterous grasping, dexterous pre-grasping, and extrinsic manipulation, all of which used analytical methods to make computation more efficient by moving expensive computation offline and ensuring that the learned components satisfy the physical constraints at all times (21m22s).
  • The main issues with robotic manipulation are generalizability and robustness, with current policies being mostly fragile due to complex physical constraints (21m56s).
  • Physical models can be used to solve these issues through synthetic data set generation, robust constraint satisfaction, composability, and multitask generalization, as shown in the three examples provided (22m13s).
  • The grasping project only considered finger contact, which is the minimum needed, and the fourth finger was not used, but the same method could potentially work with more fingers (22m54s).
  • The pre-grasp project required two fingers, but in the book case only one finger was used, because two fingers plus the environment form a rigid cage, whereas a patch contact with the environment requires only one finger (23m22s).
  • The grasp generation pipeline starts with an offline-learned grasp generator, followed by a model-based optimization procedure to refine the grasp, and it requires a shape or model of the object to run the optimization (24m2s).
  • The method can be compared to a purely learning-free approach, which can also generate grasp points using a traditional grasp point generator, but the proposed method is more efficient due to the difficulty of global optimization (25m7s).
  • Learning provides a good local initial guess, allowing diverse grasping configurations even for the same observation, and the initial guess can also be seeded differently to achieve this diversity (25m27s).
  • The hard part of robotic manipulation is not choosing the initial contact state, but rather the motion planning required to get the object into a state where it can be grasped, which is an accessibility problem as much as a contact problem (26m3s).
  • The motion planning part is the most important and computationally expensive aspect, requiring evaluation of the entire motion planning process to decide if a pre-grasp is good or not (26m26s).
  • To address this, a score function is learned to estimate how good a pre-grasp will be without solving the full motion planning problem, so the expensive motion planning evaluation can be handled offline (a sketch of this idea follows this list) (26m47s).
  • The current success rate of 87% is not enough for real-world applications, and failures are often due to engineering challenges such as the hand overheating or colliding with the table (27m15s).
  • Improving the success rate from 87% to 95% or higher requires significant engineering work to address these challenges, often referred to as the "last mile problem" (27m58s).
  • While there are ways to improve the success rate, they may be less principled and require more engineering effort (28m14s).
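
A toy sketch of the learned score-function idea from the discussion, with placeholders for the real motion planner, features, and model: candidates are labeled offline by the expensive planner, a cheap regressor is fit to those labels, and new candidates are then ranked online without any planning.

```python
import numpy as np

def expensive_motion_planner_quality(candidate, rng):
    """Offline oracle: pretend to run full motion planning and return a quality in [0, 1]."""
    return float(np.clip(1.0 - np.linalg.norm(candidate - 0.5) + 0.05 * rng.normal(), 0.0, 1.0))

def features(candidate):
    """Hand-picked features of a pre-grasp candidate (illustrative only)."""
    return np.array([1.0, *candidate, np.linalg.norm(candidate)])

rng = np.random.default_rng(0)
candidates = rng.uniform(0.0, 1.0, size=(200, 3))           # offline candidate set
labels = np.array([expensive_motion_planner_quality(c, rng) for c in candidates])
X = np.stack([features(c) for c in candidates])
weights, *_ = np.linalg.lstsq(X, labels, rcond=None)         # fit a linear score function

def score(candidate):
    """Online: cheap estimate of pre-grasp quality, no planning required."""
    return float(features(candidate) @ weights)

new_candidates = rng.uniform(0.0, 1.0, size=(5, 3))
best = max(new_candidates, key=score)                        # rank without planning
print(best, score(best))
```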
