Vision-Language-Action Models
An overview of Vision-Language-Action (VLA) models that enable robots to understand language instructions and perform manipulation tasks.
Ecosystem Snapshot
Leading Models
Helix is Figure AI's proprietary VLA model for generalist humanoid control with zero-shot manipulation and multi-robot collaboration.
π0 (pi-zero, also written pi0) is Physical Intelligence's generalist VLA robot foundation model for zero-shot dexterous manipulation across 8 robot types.
OpenVLA is an open-source 7B-parameter VLA model that fine-tunes a pretrained VLM to emit discretized action tokens, which are de-tokenized into continuous commands for zero-shot robot manipulation.
RT-2 is a Google DeepMind VLA model transferring web-scale knowledge to robotic control via VLM fine-tuning.
RT-1 (Robotics Transformer) is Google DeepMind's scalable Transformer model for real-world robot control. It demonstrated that task-agnostic training and high-capacity architectures enable generalizable robotic policies.
RT-2-X is a VLA model trained on the Open X-Embodiment dataset. It is part of the RT-X model family and builds on RT-2 with cross-embodiment capabilities.
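The action de-tokenization mentioned for OpenVLA can be illustrated with a minimal sketch. It assumes each action dimension is normalized to [-1, 1] and split into 256 uniform bins; the function names `tokenize` and `detokenize` are hypothetical, and real models differ in how they choose bin ranges and how bin indices map onto the language model's vocabulary.

```python
N_BINS = 256  # assumed number of bins per action dimension


def tokenize(action, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map each continuous action value to an integer bin index in [0, n_bins - 1]."""
    width = (high - low) / n_bins
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)   # clip to the supported range
        index = int((clipped - low) / width)
        tokens.append(min(index, n_bins - 1))  # value == high falls in the last bin
    return tokens


def detokenize(tokens, low=-1.0, high=1.0, n_bins=N_BINS):
    """Map bin indices back to continuous values at the bin centers."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 4-dimensional action, e.g. part of an end-effector delta.
action = [0.0, -1.0, 1.0, 0.5]
tokens = tokenize(action)
recovered = detokenize(tokens)
```

Under these assumptions the round trip is lossy but bounded: each recovered value differs from the original by at most half a bin width, which is the precision the policy gives up in exchange for predicting actions as discrete tokens.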
Industry Insights
This page aggregates Vision-Language-Action (VLA) models, which combine internet-scale vision-language pretraining with robot control outputs. By mapping camera images and language instructions directly to robot actions, VLA models mark a shift in robotics toward zero-shot generalization, cross-embodiment transfer, and natural-language-driven task execution.
The collection includes both proprietary industry models (Helix, RT-2) and open-source alternatives (OpenVLA, π0), covering a range of architectures, training datasets, and deployment scenarios.