What Are VLA Models?
A vision-language-action (VLA) model is a single AI model that takes in what a robot sees (vision) and an instruction in plain language, and outputs the actions — motor or controller commands — to carry it out. In other words, it maps perception plus a command directly to motion, acting as the "brain" of a modern robot.
Key takeaways
- VLA = vision + language → action, learned end-to-end in one model.
- VLA models were behind roughly 40% of new robot deployments in 2026.
- They let a robot follow natural-language instructions instead of fixed programs.
- They run on the edge, on the robot itself, which is why NVIDIA Jetson Thor-class compute is in demand.
- They are trained on simulation, synthetic data and teleoperation.
- Examples: NVIDIA Isaac GR00T, Google DeepMind RT-2 and Microsoft Rho-alpha, plus NVIDIA Cosmos Reason, a reasoning vision-language model for physical AI.
How a VLA model differs from earlier robot AI
Traditional robot software is a pipeline of separate stages: a perception model detects objects, a planner decides a path, and hand-written control code drives the motors. A VLA model collapses much of that into one learned model: pixels and a text instruction go in, an action comes out. That makes it far more flexible — the same model can handle tasks and objects it was not explicitly programmed for, and it can be told what to do in everyday language.
How a VLA model works, step by step
In one control cycle: (1) the model takes camera frames and a task instruction such as "put the red part in the tray"; (2) it encodes both into a shared representation that links the words to what it sees; (3) it predicts the next action (a target pose, gripper state or motor command) rather than a sentence; (4) the controller executes that action; and (5) the next frame closes the loop. The cycle repeats many times a second, which is why the model needs real-time compute on the robot. Crucially, the mapping from "what I see + what I was told" to "what I do" is learned, not hand-coded, so the same model can generalize to objects and phrasings that no fixed program anticipated.
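The five steps above can be sketched as a fixed-rate loop. This is a minimal illustration, not a real VLA stack: `toy_policy`, `ToyRobot`, `Observation`, `Action` and `run_control_loop` are hypothetical names, the "camera frame" is a single number, and the "model" is a hand-written rule standing in for a learned network.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    frame: List[float]   # stand-in for camera pixels (here: the 1-D position)
    instruction: str     # natural-language task


@dataclass
class Action:
    target_position: float  # 1-D "target pose", for illustration only
    gripper_closed: bool


def toy_policy(obs: Observation) -> Action:
    """Hypothetical stand-in for a learned VLA model (steps 2 and 3)."""
    current = obs.frame[0]
    goal = 1.0 if "tray" in obs.instruction else 0.0
    remaining = goal - current
    # Move halfway to the goal; keep gripping until within 0.05 of it.
    return Action(target_position=current + 0.5 * remaining,
                  gripper_closed=abs(remaining) > 0.05)


class ToyRobot:
    """Hypothetical 1-D robot: the frame is just its current position."""

    def __init__(self) -> None:
        self.position = 0.0

    def capture_frame(self) -> List[float]:
        return [self.position]

    def execute(self, action: Action) -> None:
        self.position = action.target_position


def run_control_loop(robot: ToyRobot,
                     policy: Callable[[Observation], Action],
                     instruction: str,
                     hz: float = 10.0,
                     max_cycles: int = 20,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Run observe -> predict -> execute at a fixed rate; return cycles used."""
    period = 1.0 / hz
    cycles = 0
    for _ in range(max_cycles):
        start = time.monotonic()
        obs = Observation(robot.capture_frame(), instruction)  # step 1
        action = policy(obs)                                   # steps 2-3
        robot.execute(action)                                  # step 4
        cycles += 1
        if not action.gripper_closed:                          # part released
            break
        # Wait out the rest of the period; the next frame closes the loop (5).
        remaining = period - (time.monotonic() - start)
        if remaining > 0:
            sleep(remaining)
    return cycles


if __name__ == "__main__":
    robot = ToyRobot()
    used = run_control_loop(robot, toy_policy, "put the red part in the tray")
    print(used, robot.position)
```

The loop budget (`period`) is the point of the sketch: if the policy call takes longer than one period, the robot falls behind, which is the real-time constraint the paragraph describes.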
How VLA differs from an LLM
A large language model (LLM) outputs text. A VLA model is built on the same transformer idea but adds vision as input and actions as output, grounded in a physical body. Where an LLM predicts the next word, a VLA predicts the next movement, and it has to do so within a real-time control budget, not whenever it finishes thinking. Some VLA systems also add a reasoning step (sometimes labeled "VLA+", or handled by a reasoning vision-language model (VLM) such as Cosmos Reason) that thinks briefly about the scene before acting.
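One way to make "predict the next movement" look like "predict the next word" is to discretize each continuous action value into a small vocabulary of bins, so the model emits action tokens. The sketch below assumes 256 uniform bins over the range [-1, 1]. Both numbers and the function names are illustrative assumptions, and individual VLA models may represent actions differently.

```python
def action_to_token(value: float, low: float = -1.0, high: float = 1.0,
                    bins: int = 256) -> int:
    """Map a continuous action value to a discrete token id in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # fraction == 1.0 would land in bin `bins`, so clamp to the last bin.
    return min(int(fraction * bins), bins - 1)


def token_to_action(token: int, low: float = -1.0, high: float = 1.0,
                    bins: int = 256) -> float:
    """Map a token id back to the centre of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


if __name__ == "__main__":
    for value in (-1.0, -0.3, 0.0, 0.42, 1.0):
        token = action_to_token(value)
        print(value, token, token_to_action(token))
```

With these assumed settings the round trip loses at most half a bin width (1/256 of the range unit), which is the precision traded away for treating actions like words.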
What VLA models are good at — and their limits
VLA models shine where tasks vary and objects are not identical every time: mixed-case pick-and-place, handling deformable or novel items, and following spoken or written instructions on the floor. Their limits matter too: they can fail unpredictably on out-of-distribution scenes (inputs unlike anything in their training data), they are data-hungry to train, and they demand more on-board compute than a fixed convolutional neural network (CNN). For a rigid, unchanging motion repeated a million times, a classic control policy is still cheaper and more predictable. A VLA model earns its cost where flexibility pays off.
Why VLA models matter now
Three things converged in 2025–2026: capable models small enough to run on-device, robot training data that fell sharply in price (high-quality teleoperation dropped from about $340/hour in 2024 to roughly $118/hour in 2026), and edge compute powerful enough to run them. The result is that VLA models went from research demos to backing around 40% of new robot deployments, and they are the highest-growth, highest-margin layer of the physical-AI market.
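To put the teleoperation price change in perspective, the arithmetic can be worked through directly. The two hourly rates come from the paragraph above; the $10,000 budget and the function names are hypothetical, chosen only to make the comparison concrete.

```python
def percent_drop(old_rate: float, new_rate: float) -> float:
    """Relative price decrease, in percent."""
    return (old_rate - new_rate) / old_rate * 100.0


def hours_for_budget(budget: float, hourly_rate: float) -> float:
    """Hours of teleoperation data a given budget buys."""
    return budget / hourly_rate


if __name__ == "__main__":
    rate_2024, rate_2026 = 340.0, 118.0   # USD per hour, from the text
    budget = 10_000.0                     # hypothetical budget in USD
    print(f"price drop: {percent_drop(rate_2024, rate_2026):.1f}%")
    print(f"2024 hours: {hours_for_budget(budget, rate_2024):.1f}")
    print(f"2026 hours: {hours_for_budget(budget, rate_2026):.1f}")
```

The drop works out to about 65%, so the same budget buys roughly 2.9 times as many hours of demonstrations.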
Leading VLA and robot foundation models
| Model | From | Notes |
|---|---|---|
| Isaac GR00T | NVIDIA | Generalist humanoid foundation model, trained with Isaac Sim |
| Cosmos Reason | NVIDIA | Reasoning vision-language model for physical AI, runs on edge |
| RT-2 | Google DeepMind | Early influential vision-language-action model |
| Rho-alpha | Microsoft | VLA+ model adding tactile sensing (announced Jan 2026) |
What hardware VLA models need
Running a VLA model on a robot takes more on-device compute and memory than classic perception models. That is the main reason demand shifted toward NVIDIA Jetson AGX Thor (Blackwell GPU, 128 GB of memory), and toward IGX Thor for safety-critical builds, with AGX Orin still viable for smaller models. See edge AI inference, a Jetson Orin vs Thor comparison, and robotics & physical-AI hardware.
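A rough first check on whether a model fits a device is parameter count times bytes per parameter. The sketch below covers weights only and ignores activations, caches and the rest of the robot software; the 7-billion-parameter model, the 8 GB device and the 50% headroom rule are hypothetical values for illustration, while the 128 GB figure is the one quoted above.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits_on_device(num_params: float, precision: str,
                   device_memory_gb: float, headroom: float = 0.5) -> bool:
    """True if weights use at most `headroom` of device memory.

    The headroom fraction is an assumed rule of thumb that leaves room
    for activations, camera buffers and other processes.
    """
    return weight_memory_gb(num_params, precision) <= device_memory_gb * headroom


if __name__ == "__main__":
    params = 7e9  # hypothetical model size
    for precision in ("fp16", "int8", "int4"):
        print(precision,
              weight_memory_gb(params, precision),
              fits_on_device(params, precision, 128.0),  # 128 GB from the text
              fits_on_device(params, precision, 8.0))    # hypothetical small device
```

This is only a sizing heuristic: real deployments also have to meet the control-loop latency budget, which memory alone does not predict.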
How VLA models are trained
VLA models learn from demonstrations and trials. The data comes from teleoperation (a human driving the robot), simulation and synthetic data, and limited real-world fine-tuning — see teleoperation and synthetic data and sim-to-real training. Haink builds and deploys these models on supplied hardware as part of Physical AI solutions.
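Learning from teleoperation demonstrations is often framed as behavior cloning: supervised learning that maps recorded observations to the operator's actions. The text does not name a specific method, so treat this as an assumed framing. The sketch fits a one-parameter-pair linear policy to synthetic demonstrations; a real VLA model would be a large neural network trained on images and text, and every name here is hypothetical.

```python
import random
from typing import List, Tuple

Demo = Tuple[float, float]  # (observation, expert action)


def collect_demonstrations(n: int, seed: int = 0) -> List[Demo]:
    """Simulate a teleoperator who always commands half the offset to the goal."""
    rng = random.Random(seed)
    demos = []
    for _ in range(n):
        offset_to_goal = rng.uniform(-1.0, 1.0)   # what the robot "sees"
        expert_action = 0.5 * offset_to_goal      # what the human commands
        demos.append((offset_to_goal, expert_action))
    return demos


def fit_linear_policy(demos: List[Demo]) -> Tuple[float, float]:
    """Least-squares fit of action = w * observation + b (closed form)."""
    n = len(demos)
    mean_x = sum(x for x, _ in demos) / n
    mean_y = sum(y for _, y in demos) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in demos)
    var = sum((x - mean_x) ** 2 for x, _ in demos)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b


if __name__ == "__main__":
    w, b = fit_linear_policy(collect_demonstrations(200))
    print(w, b)  # the learned policy should recover the expert's rule
```

The same shape holds at scale: demonstrations in, a policy that imitates them out, with simulation and synthetic data used to widen the range of situations covered.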
Frequently asked questions
What is a VLA model?
A vision-language-action (VLA) model is an AI model that takes in vision (what a robot sees) and a natural-language instruction, and outputs actions such as motor or controller commands. It maps perception plus a command directly to motion.
How is a VLA model different from an LLM?
An LLM outputs text; a VLA model adds vision as input and physical actions as output, grounded in a robot body, and must run within a real-time control budget.
What are examples of VLA models?
Examples include NVIDIA Isaac GR00T and Cosmos Reason, Google DeepMind RT-2, and Microsoft Rho-alpha, which adds tactile sensing.
What hardware runs VLA models?
VLA models need substantial on-device compute, so they commonly run on NVIDIA Jetson AGX Thor or IGX Thor, with AGX Orin viable for smaller models.
How are VLA models trained?
They are trained on teleoperation data, simulation and synthetic data, and limited real-world fine-tuning, and are then optimized to run on the target edge platform.
