Bi-VLA: Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation

The University of Osaka / Kobe University

Masato Kobayashi*, Thanpimon Buamanee*
* Co-first authors contributed equally to this work.
Overview

Abstract:
We propose Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation (Bi-VLA), a novel framework that extends bilateral control-based imitation learning to handle more than one task within a single model. Conventional bilateral control methods exploit joint angle, velocity, torque, and vision for precise manipulation but require task-specific models, limiting their generality. Bi-VLA overcomes this limitation by utilizing robot joint angle, velocity, and torque data from leader-follower bilateral control with visual features and natural language instructions through SigLIP and FiLM-based fusion. We validated Bi-VLA on two task types: one requiring supplementary language cues and another distinguishable solely by vision. Real-robot experiments showed that Bi-VLA successfully interprets vision-language combinations and improves task success rates compared to conventional bilateral control-based imitation learning. Our Bi-VLA addresses the single-task limitation of prior bilateral approaches and provides empirical evidence that combining vision and language significantly enhances versatility. Experimental results validate the effectiveness of Bi-VLA in real-world tasks.

Bi-VLA Model


The Bi-VLA learning model is built on a Transformer-based Conditional Variational Autoencoder (CVAE) architecture. The model receives the follower robot's joint angle, velocity, and torque data along with the corresponding visual inputs and natural language instructions. From this multimodal input, it outputs sequences of action chunks that specify the predicted joint angles, velocities, and torques of the leader robot. Natural language instructions are processed by a dedicated language encoder (LE), which converts the textual commands into fixed-length vector representations.
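
To make the fusion step concrete, below is a minimal, hypothetical PyTorch sketch of FiLM conditioning: a fixed-length language embedding (e.g., from a SigLIP text encoder) predicts per-channel scale and shift parameters that modulate the visual feature map before it enters the Transformer. The class name and dimensions are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the authors' code): FiLM conditioning of visual features
# on a language embedding. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Predicts per-channel scale (gamma) and shift (beta) from a language embedding."""
    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(lang_dim, 2 * num_channels)

    def forward(self, visual_feat: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) feature map from the image encoder
        # lang_emb:    (B, D) fixed-length sentence embedding from the language encoder
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        gamma = gamma[:, :, None, None]   # broadcast over spatial dimensions
        beta = beta[:, :, None, None]
        return gamma * visual_feat + beta

# Example: condition a 512-channel feature map on a 768-D text embedding.
film = FiLM(lang_dim=768, num_channels=512)
feat = torch.randn(1, 512, 12, 12)
lang = torch.randn(1, 768)
fused = film(feat, lang)   # same shape as feat, now language-conditioned
```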

Data Collection: Four-Channel Bilateral Control

Bi-VLA Data Collection
Bi-VLA employs the four-channel bilateral control method for data collection. The operator teleoperates the follower robot through the leader robot, while natural language instructions describing the movements are recorded alongside the robot joint data and camera images.
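
For reference, four-channel bilateral control exchanges position and force between the two arms so that the joint-angle difference is driven to zero (position tracking) while the sum of the measured torques is driven to zero (action-reaction force reflection). The snippet below is a simplified, hypothetical sketch of these two objectives only; the actual gains, signs, and disturbance-observer details of the paper's controller are not shown.

```python
# Simplified sketch (illustrative, not the paper's controller) of the standard
# four-channel bilateral control objectives:
#   theta_leader - theta_follower -> 0   (position tracking)
#   tau_leader   + tau_follower   -> 0   (force "action-reaction")
import numpy as np

def bilateral_torque_refs(th_l, dth_l, tau_l, th_f, dth_f, tau_f,
                          Kp=100.0, Kd=20.0, Kf=1.0):
    """Reference torques for leader and follower arms (per joint, vectorized).
    Gains Kp, Kd, Kf are placeholder values."""
    pos_err = th_f - th_l            # position tracking error
    vel_err = dth_f - dth_l
    force_sum = tau_l + tau_f        # should be driven to zero

    tau_ref_l = 0.5 * (Kp * pos_err + Kd * vel_err) - 0.5 * Kf * force_sum
    tau_ref_f = -0.5 * (Kp * pos_err + Kd * vel_err) - 0.5 * Kf * force_sum
    return tau_ref_l, tau_ref_f

# One-joint example: the follower lags the leader, so it receives a positive torque.
print(bilateral_torque_refs(np.array([0.5]), np.zeros(1), np.zeros(1),
                            np.array([0.4]), np.zeros(1), np.zeros(1)))
```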

Inference

Bi-VLA Inference
The Bi-VLA model receives the most recent follower joint states, synchronized camera images, and a natural language instruction specifying the intended action. The model predicts the next action chunk of leader robot joint trajectories, including angle, velocity, and torque. These outputs are converted into current commands by the bilateral control system and applied to the follower robot in real time. This closed-loop execution enables Bi-VLA to generate task-appropriate actions conditioned on multimodal inputs, supporting flexible, vision-language-driven manipulation in real environments.
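
As a rough illustration of this loop, the sketch below uses placeholder `policy`, `robot`, and `cameras` interfaces (all hypothetical names). It is meant only to show the order of operations described above, not the actual control stack or timing.

```python
# Hypothetical closed-loop execution sketch; interfaces are placeholders.
CHUNK = 20          # illustrative action-chunk length
CONTROL_HZ = 100    # illustrative command rate

def run_episode(policy, robot, cameras, instruction: str, steps: int = 500):
    """policy, robot, and cameras stand in for real hardware/model interfaces."""
    t = 0
    while t < steps:
        state = robot.get_follower_state()          # joint angle, velocity, torque
        images = [cam.read() for cam in cameras]    # overhead + gripper views
        # Predict a chunk of leader joint targets: shape (CHUNK, joints, 3)
        chunk = policy.predict(state, images, instruction)
        for action in chunk:                        # execute the chunk step by step
            # The bilateral controller turns leader targets into follower current commands.
            robot.apply_leader_targets(angle=action[:, 0],
                                       velocity=action[:, 1],
                                       torque=action[:, 2])
            robot.wait(1.0 / CONTROL_HZ)
        t += CHUNK
```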

Experiments

Overview

We evaluate Bi‑VLA on real‑world pick‑and‑place tasks that require either language‑based or vision‑based disambiguation. A human operator provides demonstrations via a leader robot under bilateral control; the policy trained from those demonstrations is then executed on the follower robot.

Experimental environments and cameras
Experimental Environments

Hardware Setup

  • Robot: ROBOTIS OpenManipulator‑X (4 revolute joints + gripper = 5 DoF)
  • Cameras: Two RGB cameras (overhead and gripper‑mounted), 640×360 @ 100 Hz
  • Bilateral control: Leader–follower, four‑channel (position/force exchange)

Task Setup

We design two pick‑and‑place tasks to isolate when language is required vs. when vision alone suffices.

(A) Two‑Target (Language‑Disambiguable): pick a ball from a fixed source and place it at the Up or Down target specified by a language command. The initial visual scenes are identical, so vision alone cannot disambiguate the goal.

Two‑Target task
Data Collection of Two‑Target Task

(B) Two‑Source (Vision‑Disambiguable): pick the ball from the Up or Down source (visually distinct) and place it at a fixed target. We additionally test an unlearned 3‑ball environment with a distractor ball to degrade visual saliency.

Two‑Source task
Data Collection of Two‑Source Task

Each setting uses n = 10 evaluation trials per condition. A trial is counted as successful only if the Pick → Move → Place sequence completes without an unintended drop outside the target area.

Training Setup

Data Collection (Real-Time, 1X)

Model variants

| Model | Training Scope | Language Encoder | Demonstrations |
|---|---|---|---|
| Bi‑ACT | Two‑Target or Two‑Source (separate per model) | None | 6 (3 Up, 3 Down) |
| Bi‑VLA (DistilBERT) | Two‑Target only | DistilBERT | 6 (3 Up, 3 Down) |
| Bi‑VLA (SigLIP) | Two‑Target or Two‑Source (separate per model) | SigLIP | 6 (3 Up, 3 Down) |
| Bi‑VLA (SigLIP‑Mix) | Multi‑task: Two‑Target + Two‑Source (mixed) | SigLIP | 4 raw (1 per condition) → 40 with DABI |

SigLIP‑Mix uses a deliberately reduced data budget to test cross‑task generalization under limited supervision.

  • Sensing/Logging: 1000 Hz joint angle/velocity/torque for both leader and follower (5 joints × 3 = 15 per arm; 30‑D combined). RGB images at 100 Hz.

  • Demonstrations: For each task, 6 demos (3×Up, 3×Down) with paired language:

    • Two‑Target: “put ball upward” / “put ball downward”
    • Two‑Source: “pick ball upward” / “pick ball downward”
  • Augmentation: DABI downsampling/augmentation to align 1000 Hz control with 100 Hz images; expands 6 demos → 60.
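
Since the 6 → 60 (and 4 → 40) expansion is a factor of ten, one plausible reading is that DABI forms ten 100 Hz sequences from each 1000 Hz joint log by downsampling at ten different phase offsets, aligning each with the image stream. The sketch below illustrates that reading; it is an assumption for intuition, not the published DABI procedure verbatim.

```python
# Hypothetical DABI-style rate alignment: split a 1000 Hz log into `ratio`
# phase-offset 100 Hz sequences (assumption, not the official algorithm).
import numpy as np

def align_and_expand(joint_log: np.ndarray, ratio: int = 10):
    """joint_log: (T, D) array sampled at 1000 Hz; returns `ratio` sequences at 100 Hz."""
    usable = (len(joint_log) // ratio) * ratio
    trimmed = joint_log[:usable].reshape(-1, ratio, joint_log.shape[1])
    # One downsampled sequence per phase offset 0..ratio-1.
    return [trimmed[:, k, :] for k in range(ratio)]

demo = np.random.randn(10_000, 30)   # 10 s of 30-D leader+follower state at 1000 Hz
sequences = align_and_expand(demo)   # 10 sequences of shape (1000, 30) at 100 Hz
print(len(sequences), sequences[0].shape)
```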

For Bi-VLA (SigLIP-Mix), we deliberately imposed a reduced data budget to test cross-task generalization. Only four raw demonstrations were collected in total, one for each condition (Target-Up, Target-Down, Source-Up, Source-Down). After DABI augmentation, this yielded 40 training episodes. This design provided a controlled setting to evaluate whether a single multimodal policy could accommodate heterogeneous tasks without task-specific retraining.

Two‑Target joint data
Two‑Target Joint Data (Red: Up, Blue: Down)
Two‑Source joint data
Two‑Source Joint Data (Red: Up, Blue: Down)

Results

Bi-VLA (Real-Time, 1X)

(A) Two‑Target (Language‑Disambiguable)

| Method | Target | Pick (%) | Move (%) | Place (%) | Overall (%) |
|---|---|---|---|---|---|
| Bi‑ACT | Up | 100 | 100 | 100 | 100 |
| | Down | 0 | 0 | 0 | 0 |
| | Average | | | | 50 |
| Bi‑VLA (DistilBERT) | Up | 100 | 100 | 100 | 100 |
| | Down | 100 | 100 | 20 | 20 |
| | Average | | | | 60 |
| Bi‑VLA (SigLIP) | Up | 80 | 80 | 80 | 80 |
| | Down | 100 | 100 | 100 | 100 |
| | Average | | | | 90 |
| Bi‑VLA (SigLIP‑Mix) | Up | 70 | 70 | 70 | 70 |
| | Down | 70 | 70 | 70 | 70 |
| | Average | | | | 70 |

Key takeaways: Without language, Bi‑ACT collapses to an Up‑only policy (50%). Language input is indispensable here; SigLIP provides the best grounding and execution (90%), outperforming DistilBERT (60%). SigLIP‑Mix remains balanced across Up/Down under low‑data multi‑task training (70%).

(B) Two‑Source (Vision‑Disambiguable)

| Method | Source | Pick (%) | Move (%) | Place (%) | Overall (%) |
|---|---|---|---|---|---|
| Bi‑ACT | Up | 90 | 90 | 90 | 90 |
| | Down | 100 | 100 | 100 | 100 |
| | Average | | | | 95 |
| Bi‑VLA (SigLIP) | Up | 100 | 100 | 100 | 100 |
| | Down | 80 | 80 | 80 | 80 |
| | Average | | | | 90 |
| Bi‑VLA (SigLIP‑Mix) | Up | 100 | 100 | 100 | 100 |
| | Down | 80 | 80 | 80 | 80 |
| | Average | | | | 90 |

Observation: When vision suffices, adding language offers no extra gain, but also does not hurt. Multi‑task training (SigLIP‑Mix) matches task‑specific SigLIP.

(C) Two‑Source with Unlearned 3‑Ball Distractor

| Method | Source | Pick (%) | Move (%) | Place (%) | Overall (%) |
|---|---|---|---|---|---|
| Bi‑ACT | Up | 100 | 100 | 100 | 100 |
| | Down | 0 | 0 | 0 | 0 |
| | Average | | | | 50 |
| Bi‑VLA (SigLIP) | Up | 100 | 100 | 100 | 100 |
| | Down | 50 | 50 | 50 | 50 |
| | Average | | | | 75 |
| Bi‑VLA (SigLIP‑Mix) | Up | 90 | 90 | 90 | 90 |
| | Down | 70 | 60 | 60 | 60 |
| | Average | | | | 75 |

Generalization: With a visual distractor, language grounding helps avoid biased collapse and sustains substantially higher overall success (75%) than vision‑only (50%).

Summary & Discussion

Across all settings, Bi‑VLA (SigLIP) reliably resolves language‑dependent ambiguity and preserves strong vision‑based control. Bi‑VLA (SigLIP‑Mix) demonstrates multi‑task, low‑data generalization without negative interference between tasks. Together, results show that fusing vision + language within bilateral control yields a single policy capable of flexible task switching and robust performance in real environments.

Citation

@misc{kobayashi2025bivlabilateralcontrolbasedimitation,
      title={Bi-VLA: Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation}, 
      author={Masato Kobayashi and Thanpimon Buamanee},
      year={2025},
      eprint={2509.18865},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.18865}, 
}

Contact

Masato Kobayashi (Assistant Professor, The University of Osaka, Kobe University, Japan)