Fast-in-Slow: A Dual-System Foundation Model Unifying Fast Manipulation within Slow Reasoning


Hao Chen1, 2 *, Jiaming Liu2, 3 *, †, Chenyang Gu2 *, Zhuoyang Liu2 *, Renrui Zhang1 , Xiaoqi Li2,
Xiao He3, Yandong Guo3, Chi-Wing Fu1, Shanghang Zhang2, 4 , Pheng-Ann Heng1
1The Chinese University of Hong Kong
2State Key Laboratory of Multimedia Information Processing,
School of Computer Science, Peking University
3AI2Robotics; 4Beijing Academy of Artificial Intelligence (BAAI)
*Equal contribution, †Project lead, Corresponding author

Demo videos: Real-World (AlphaBot) · Real-World (Agilex Robot) · Generalization

Abstract

Generalized policy and execution efficiency constitute the two critical challenges in robotic manipulation. While recent foundation policies benefit from the common-sense reasoning capabilities of internet-scale pretrained vision-language models (VLMs), they often suffer from low execution frequency. To mitigate this dilemma, dual-system approaches, inspired by Kahneman's theory, have been proposed to leverage a VLM-based System 2 model handling high-level reasoning and a separate System 1 action model ensuring real-time control. However, existing designs maintain both systems as separate models, limiting System 1 from fully leveraging the rich pretrained knowledge of the VLM-based System 2. In this work, we propose Fast-in-Slow (FiS), a unified dual-system vision-language-action (VLA) model that embeds the System 1 execution module within the VLM-based System 2 by partially sharing parameters. This innovative paradigm not only enables high-frequency execution in System 1, but also facilitates coordination between the reasoning and execution components within a single foundation model of System 2. Given their fundamentally distinct roles within FiS-VLA, we design the two systems to incorporate heterogeneous modality inputs alongside asynchronous operating frequencies, enabling both fast and precise manipulation. To enable coordination between the two systems, we propose a dual-aware co-training strategy that equips System 1 with action-generation capabilities while preserving System 2's contextual reasoning representation. In evaluation, FiS-VLA outperforms previous state-of-the-art methods by 8% in simulation and 11% in real-world tasks in terms of average success rate, while achieving a 21.9 Hz control frequency without the action chunking mechanism.

Video

Overview


(a) Previous dual-system VLA methods attach a separate policy head as System 1. (b) In contrast, FiS-VLA repurposes the final transformer blocks of an intact VLM as System 1, while retaining the full model for System 2 reasoning. Under this paradigm, FiS-VLA achieves superior performance and high-frequency control, as shown in (c) and (d).
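To make the dual-frequency design above concrete, here is a minimal sketch (not the authors' code) of the asynchronous control loop: the slow System 2 refreshes a latent condition at a low rate, while the embedded System 1 produces an action at every control step. All function names and the update period are illustrative assumptions.

```python
# Hypothetical sketch of asynchronous dual-system control.
# system2_reason stands in for the full VLM forward pass (slow, runs rarely);
# system1_act stands in for the final transformer blocks acting as System 1 (fast).

def system2_reason(image, instruction):
    """Placeholder for slow System 2 reasoning; returns latent conditioning."""
    return ("latent", image, instruction)

def system1_act(latent, point_cloud, image, robot_state):
    """Placeholder for fast System 1 execution on high-frequency inputs."""
    return {"latent": latent, "state": robot_state}

def control_loop(steps, s2_period=9):
    """Run System 1 every step; refresh System 2's latent every s2_period steps."""
    actions, latent = [], None
    for t in range(steps):
        if t % s2_period == 0:  # System 2 fires at low frequency
            latent = system2_reason(image=f"img{t}", instruction="pick up the cup")
        # System 1 fires every step with fresh high-frequency observations
        actions.append(system1_act(latent, point_cloud=t, image=t, robot_state=t))
    return actions

acts = control_loop(18)
```

Between System 2 updates, System 1 keeps reusing the cached latent, which is what allows real-time control without waiting on the full VLM at every step.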

FiS-VLA Framework


FiS-VLA leverages an intact VLM for System 2 reasoning while repurposing the final transformer blocks of the LLM as the System 1 execution module. System 2 handles low-frequency inputs such as 2D images and language instructions and produces intermediate latent features that serve as conditioning information for System 1. Instead of being conditioned solely on these periodically updated high-level representations, System 1 also processes high-frequency inputs, including 3D point clouds, 2D images, and robot states, to produce stable and responsive actions. For joint optimization, we introduce a dual-aware co-training strategy that combines a diffusion denoising objective with an autoregressive objective, enabling FiS-VLA to support fast action generation while retaining System 2's multimodal reasoning capabilities.
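The co-training objective described above can be sketched as a weighted sum of the two losses: a diffusion denoising (MSE) loss for System 1's action generation and an autoregressive cross-entropy loss for System 2's reasoning outputs. This is a hedged illustration, not the paper's implementation; the loss weight, shapes, and function names are assumptions.

```python
import numpy as np

def diffusion_mse(noise_pred, noise_true):
    """System 1 objective (assumed form): predict the noise injected into actions."""
    return float(np.mean((noise_pred - noise_true) ** 2))

def autoregressive_ce(logits, targets):
    """System 2 objective (assumed form): next-token cross-entropy."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(targets)), targets]))

def dual_aware_loss(noise_pred, noise_true, logits, targets, lam=1.0):
    """Illustrative dual-aware co-training loss: denoising + lam * autoregressive."""
    return diffusion_mse(noise_pred, noise_true) + lam * autoregressive_ce(logits, targets)

rng = np.random.default_rng(0)
loss = dual_aware_loss(
    noise_pred=rng.normal(size=(4, 7)),          # predicted noise (4 actions, 7-DoF)
    noise_true=rng.normal(size=(4, 7)),          # ground-truth injected noise
    logits=rng.normal(size=(6, 32)),             # token logits (6 positions, vocab 32)
    targets=rng.integers(0, 32, size=6),         # target token ids
)
```

Summing the two objectives lets gradients from both roles flow through the shared transformer blocks, which is the mechanism that keeps System 1's action generation and System 2's reasoning aligned.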