RT-1

Model · active

RT-1 (Robotics Transformer 1) is a foundational model for real-world robot control developed by Google researchers (Anthony Brohan, Noah Brown, et al.), now part of Google DeepMind, and published in December 2022. It demonstrated that Transformers trained on large-scale, diverse robotic data can produce generalizable robot policies, paving the way for the VLA (Vision-Language-Action) paradigm.

The key contribution of RT-1 is its evidence of scaling behavior: as training data size, model capacity, and data diversity increase, the model's ability to generalize to new tasks, objects, and environments improves. This behavior, previously observed in computer vision and NLP, was presented as early evidence that the same trend holds for real-world robotic control.

RT-1 uses a Transformer architecture that takes a history of image observations and a natural-language task instruction as input and outputs discrete action tokens representing robot arm and gripper commands. It was trained on a large dataset of real-world robot demonstrations collected across a fleet of robots over 17 months, comprising over 130,000 episodes covering 700+ tasks.

The architecture was designed for open-ended, task-agnostic training: the model learns from a wide variety of tasks simultaneously rather than being specialized for a single one. This enables zero-shot generalization to task combinations and environmental conditions not seen during training.

Google released the RT-1 code publicly, and its architecture and findings directly informed subsequent open-source efforts including OpenVLA and Octo. RT-2, its successor, extended the paradigm by incorporating web-scale vision-language pretraining, demonstrating further scaling benefits.
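To make "discrete action tokens" concrete, the sketch below shows one common way to turn a continuous command into integer tokens and back: each action dimension is clipped to a range and mapped to one of a fixed number of uniform bins. The bin count of 256 per dimension follows the RT-1 paper; the function names, the example dimensions, and the value ranges are illustrative assumptions, not taken from the RT-1 codebase.

```python
from typing import List, Sequence

NUM_BINS = 256  # bins per action dimension, as reported in the RT-1 paper


def tokenize_action(
    action: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = NUM_BINS,
) -> List[int]:
    """Map each continuous action dimension to a bin index in [0, num_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)        # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)    # normalise to [0, 1]
        # The upper bound itself would land on num_bins, so cap at the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def detokenize_action(
    tokens: Sequence[int],
    low: Sequence[float],
    high: Sequence[float],
    num_bins: int = NUM_BINS,
) -> List[float]:
    """Map bin indices back to the centre value of each bin."""
    return [
        lo + (token + 0.5) / num_bins * (hi - lo)
        for token, lo, hi in zip(tokens, low, high)
    ]


# Hypothetical 4-dimensional command: x, y, z displacement and gripper opening.
# These bounds are assumptions chosen for illustration only.
LOW = [-1.0, -1.0, -1.0, 0.0]
HIGH = [1.0, 1.0, 1.0, 1.0]

tokens = tokenize_action([-1.0, 0.0, 1.0, 0.25], LOW, HIGH)
print(tokens)  # [0, 128, 255, 64]
print(detokenize_action(tokens, LOW, HIGH))
```

Framing control this way lets the policy predict actions as a classification over bins, the same output form a Transformer uses for text tokens. The cost is a quantization error of at most half a bin width per dimension when the token is decoded.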