VIHE: Transformer-Based 3D Object
Manipulation Using Virtual In-Hand View


VIHE is a transformer-based imitation learning agent that leverages rendered virtual in-hand views for accurate 6-DoF action predictions.


Video

Abstract

In this work, we introduce the Virtual In-Hand Eye Transformer (VIHE), a novel method designed to enhance 3D manipulation capabilities through action-aware view rendering. For each action step, VIHE autoregressively refines its prediction of the hand keypoint over multiple stages, conditioning each stage on virtual in-hand views rendered from the hand pose predicted in the previous stage. These virtual in-hand views provide a strong inductive bias for recognizing the correct pose at which the hand should be placed, which is especially valuable for challenging high-precision tasks.

On 18 manipulation tasks in the simulated RLBench environment, VIHE achieves a new state of the art, improving the average success rate from 65% to 77% (a 12-point absolute gain) over the existing SOTA model when trained with 100 demonstrations per task. Furthermore, our average success rate with only 10 demonstrations per task matches that of current SOTA methods trained with 100 demonstrations per task, making our approach 10 times more sample-efficient.

In real-world scenarios, VIHE can learn manipulation tasks with a handful of demonstrations, highlighting its practical utility.


VIHE sets a new SOTA in multi-task evaluation across 18 RLBench environments.

VIHE


Starting with RGB-D images from multi-view cameras, we first construct a point cloud of the scene. Global views are then rendered using fixed cameras positioned around the workspace. From these global views, the network outputs the initial action predictions a^0_pose, a^0_open, a^0_col. At each refinement stage i, we autoregressively render virtual in-hand views from cameras attached to the previously predicted gripper pose a^{i-1}_pose, and refine the action predictions based on the rendered views. The network architecture employs masked self-attention so that tokens from later stages can attend to tokens from earlier stages. Language instruction tokens are merged with the stage-0 image tokens when input into the transformer; this is omitted from the figure for conciseness.
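As a rough illustration of this pipeline, the sketch below outlines the multi-stage prediction loop. The helper functions (`build_point_cloud`, `render_global_views`, `render_inhand_views`) and the `model.tokenize` / `model.decode` interface are hypothetical placeholders for the components described above, not the released VIHE implementation.

```python
def predict_action(rgbd_obs, lang_tokens, model, num_stages=3):
    """Sketch of the multi-stage action prediction loop described above.

    NOTE: build_point_cloud, render_global_views, render_inhand_views,
    and the `model` interface are hypothetical stand-ins, not the
    actual VIHE code.
    """
    # Fuse the multi-view RGB-D observations into one scene point cloud.
    point_cloud = build_point_cloud(rgbd_obs)

    # Stage 0: render fixed global views around the workspace and
    # predict the initial action a^0 = (pose, open, collision).
    views = render_global_views(point_cloud)
    tokens = [model.tokenize(views, lang_tokens, stage=0)]
    action = model.decode(tokens)

    for stage in range(1, num_stages):
        # Render virtual in-hand views from cameras attached to the
        # gripper pose a^{stage-1}_pose predicted at the previous stage.
        inhand_views = render_inhand_views(point_cloud, action["pose"])
        tokens.append(model.tokenize(inhand_views, stage=stage))

        # The transformer applies masked self-attention so that the
        # current stage's tokens attend to all earlier-stage tokens,
        # then outputs the refined action a^stage.
        action = model.decode(tokens)

    return action
```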




Our method scales to real robots. VIHE iteratively refines its 3D action prediction by rendering 2D in-hand views based on the predictions from the previous stage. Gray, green, and blue indicate the three action prediction stages, respectively. (See below.)
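To make the stage-wise masked attention concrete, here is a minimal, self-contained sketch that builds an attention mask letting each stage attend only to itself and to earlier stages. The per-stage token counts are illustrative assumptions, not VIHE's actual configuration.

```python
import torch

def stagewise_attention_mask(tokens_per_stage):
    """Build a mask where tokens of stage i may attend to stages 0..i,
    while earlier stages never attend to later ones.

    tokens_per_stage: list like [n0, n1, n2] giving the (assumed)
    number of image tokens in each prediction stage.
    """
    stage_id = torch.cat([
        torch.full((n,), i, dtype=torch.long)
        for i, n in enumerate(tokens_per_stage)
    ])
    # mask[q, k] is True where query token q may attend to key token k.
    mask = stage_id.unsqueeze(1) >= stage_id.unsqueeze(0)
    return mask

# Example: three stages (gray, green, blue) with 4, 2, and 2 tokens each.
print(stagewise_attention_mask([4, 2, 2]).int())
```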



Results

As shown in the table, VIHE surpasses all baselines in success rate averaged across all tasks. With 100 demonstrations, it outperforms the existing SOTA method RVT by 17 percentage points (a 28% relative improvement) and Act3D by 12 percentage points (an 18% relative improvement). Our method's advantage grows when only 10 demonstrations are provided, outperforming Act3D by 14 percentage points (a 29% relative improvement). Remarkably, our performance with 10 demonstrations per task is on par with that of RVT and Act3D trained with 100 demonstrations per task.

These results demonstrate that our method is both more accurate and more sample-efficient compared to existing state-of-the-art methods. The improvements mainly come from large gains in challenging high-precision tasks.


