Bi-AQUA: Bilateral Control-Based Imitation Learning for Underwater Robot Arms via Lighting-Aware Action Chunking with Transformers

The University of Osaka / Kobe University

Takeru Tsunoori*, Masato Kobayashi*, Yuki Uranishi
* Co-first authors contributed equally to this work.
  • Takeru Tsunoori (Master’s student, 1st year) is primarily responsible for 3D printed hardware development, data preprocessing software, and experimental execution.
  • Masato Kobayashi (Assistant Professor, project tech lead) is primarily responsible for robot learning model construction, computer vision and robot control software, website, and video editing.

6-minute video! Please enjoy our Bi-AQUA! We introduce Bi-AQUA, the first underwater bilateral control-based imitation learning framework that integrates lighting-aware visual processing for underwater robot arms.

Overview

Abstract:
Underwater robotic manipulation is fundamentally challenged by extreme lighting variations, color distortion, and reduced visibility. We introduce Bi-AQUA, the first underwater bilateral control-based imitation learning framework that integrates lighting-aware visual processing for underwater robot arms. Bi-AQUA employs a hierarchical three-level lighting adaptation mechanism: a Lighting Encoder that extracts lighting representations from RGB images without manual annotation and is implicitly supervised by the imitation objective, FiLM modulation of visual backbone features for adaptive, lighting-aware feature extraction, and an explicit lighting token added to the transformer encoder input for task-aware conditioning. Experiments on a real-world underwater pick-and-place task under diverse static and dynamic lighting conditions show that Bi-AQUA achieves robust performance and substantially outperforms a bilateral baseline without lighting modeling. Ablation studies further confirm that all three lighting-aware components are critical. This work bridges terrestrial bilateral control-based imitation learning and underwater manipulation, enabling force-sensitive autonomous operation in challenging marine environments.

Contributions

Our contributions are as follows:

  • Bi-AQUA is the first bilateral control–based imitation learning framework for underwater robot arms.
  • We propose a lighting-aware visuomotor policy that integrates a label-free, implicitly supervised Lighting Encoder, FiLM modulation, and a lighting token for adaptive control.
  • We demonstrate real-world gains over a baseline without lighting modeling and generalization to unseen lighting conditions, novel objects, and visual disturbances in underwater environments.

Short Video: Dynamic Changing and Dynamic Bubble Environments via Bi-AQUA.

Challenges in Underwater Manipulation

Underwater robotic manipulation is uniquely challenging because the visual appearance of the same scene can change drastically within seconds under shifts in the spectrum, intensity, and direction of underwater lighting. Such variations—caused by wavelength-dependent attenuation, scattering, turbidity, and bubbles—break the visual consistency required by standard visuomotor policies, often causing catastrophic action drift even on simple pick-and-place tasks. Existing underwater image enhancement methods improve image quality, but they do not address the fundamental control problem: closed-loop manipulation policies must adapt both perception and action generation to rapidly varying lighting.

Traditional underwater manipulation is dominated by teleoperation, motivating recent work on autonomous underwater vehicles (AUVs), underwater robot arms, and visuomotor learning. However, existing approaches either rely solely on vision or use unilateral control without force feedback, and none provide mechanisms to explicitly model lighting as a latent factor influencing visual observations.

Research Gaps and Our Approach

To clarify why Bi-AQUA uniquely addresses the challenges of underwater manipulation, we compare representative approaches across five key dimensions: RGB vision, underwater deployment, bilateral control, robot policy learning, and explicit lighting modeling. The following analysis reveals critical gaps that motivate our approach. Bi-AQUA is the first Bi-IL framework explicitly designed for real underwater manipulation under dynamic lighting, bridging the gap between bilateral control-based imitation learning and underwater manipulation.

Why Underwater?

Existing underwater manipulation approaches fall into three categories, each with critical limitations:

  1. Image Enhancement Methods (SelfLUID-Net, DeepSeeColor): These methods improve underwater image quality and model lighting, but they operate at the image level and do not address closed-loop control. They lack robot policy learning and cannot adapt action generation to lighting changes during manipulation.

  2. Underwater Teleoperation Systems (Underwater Bilateral, Ocean One): While these systems support bilateral control and underwater deployment, they remain human-operated and do not learn autonomous policies. They also lack explicit lighting modeling, relying on human operators to compensate for visual degradation.

  3. Underwater Imitation Learning (U0, UIVNAV, AquaBot): These methods learn robot policies for underwater tasks but use unilateral control without force feedback. This limits their robustness in contact-rich manipulation, where haptic cues are essential. Critically, none of these approaches explicitly model lighting as a latent factor, leaving them vulnerable to the rapid appearance shifts characteristic of underwater environments.

Gap: No existing method combines underwater deployment, autonomous policy learning, and explicit lighting modeling. Bi-AQUA is the first to unify these capabilities.

Why Bilateral Control-Based Imitation Learning?

In terrestrial environments, imitation learning (IL) through leader–follower teleoperation has proven highly effective for acquiring complex visuomotor skills. Recent systems based on ALOHA and Action Chunking with Transformers (ACT), Mobile ALOHA with ACT, and diffusion-based visuomotor policies achieve impressive real-world performance. However, these methods all rely on unilateral control, as shown below. Without force feedback, these methods struggle in contact-rich or visually ambiguous interactions.

Unilateral Control-based Imitation Learning

Bilateral control addresses this limitation by exchanging both position and force, enabling demonstrations rich in haptic and visual cues and yielding more robust generalization, as shown below. Among bilateral IL frameworks, Bi-ACT plays a central role by directly extending ACT to bilateral settings, offering a strong backbone for force-sensitive visuomotor learning. However, all such work assumes visually stable land environments and does not address the severe lighting variability characteristic of underwater settings.

Bilateral Control-based Imitation Learning

Bilateral control-based imitation learning (Bi-IL) offers significant advantages over unilateral approaches, but existing Bi-IL frameworks have critical limitations:

  1. Vision-Free Bi-IL (Bi-IL without vision): These methods leverage bilateral control for force-sensitive learning but operate without visual input, making them perceptually blind. This is particularly problematic in underwater environments, where visual feedback is essential for object localization and manipulation planning.
  2. Terrestrial Bi-IL (Bi-ACT): Bi-ACT successfully combines bilateral control with RGB vision for terrestrial manipulation, demonstrating the power of force-rich demonstrations with high-dimensional perception. However, it assumes visually stable land environments and has not been evaluated underwater, where lighting variability fundamentally breaks visual consistency.

Gap: Existing Bi-IL frameworks either lack vision (limiting underwater applicability) or assume stable lighting (incompatible with underwater conditions). Bi-AQUA extends Bi-IL to underwater settings while explicitly modeling lighting variability.

Why Lighting Modeling?

Lighting variability is a dominant factor degrading visuomotor policy performance, especially in underwater environments. Existing approaches to lighting robustness have fundamental limitations:

  1. Domain Randomization and Data Augmentation: Prior work improves robustness through domain randomization and physically based relighting for data augmentation. While effective for training, these strategies do not provide the policy with an explicit, internal representation of lighting. The visuomotor pipeline must implicitly cope with color and illumination changes, which is insufficient for rapid, intra-horizon lighting shifts in underwater settings.

  2. Implicit Lighting Conditioning: Some methods (e.g., RoLight) model lighting for terrestrial manipulation, but they treat lighting only implicitly through global conditioning or augmentation, without identifying lighting as a distinct latent factor that affects both perception and downstream control.

  3. Image Enhancement Without Control Integration: Underwater image enhancement methods (SelfLUID-Net, DeepSeeColor) explicitly model lighting to improve image quality, but they operate independently of the control loop. They do not enable the policy to adapt action generation to current lighting conditions.

Gap: No existing method provides explicit, hierarchical lighting modeling within a visuomotor policy for underwater manipulation. Bi-AQUA introduces a label-free Lighting Encoder that learns lighting representations implicitly through the imitation objective, modulates visual features via FiLM, and conditions action generation through a lighting token—enabling adaptation to rapidly changing underwater illumination.

Summary: Bi-AQUA is the first framework to simultaneously address all three gaps: (i) underwater deployment with autonomous policy learning and explicit lighting modeling, (ii) bilateral control-based imitation learning with vision for underwater manipulation, and (iii) hierarchical lighting modeling that adapts both perception and control to dynamic underwater illumination.

Bi-AQUA Model

Bi-AQUA Model
We introduce Bi-AQUA, the first underwater bilateral control-based imitation learning framework that integrates lighting-aware visual processing for underwater robot arms. Bi-AQUA extends the Bi-ACT architecture by incorporating lighting-aware visual processing into a transformer-based action prediction model. The system consists of:

  1. a label-free lighting-aware visual encoder,
  2. a joint state encoder,
  3. a transformer-based action prediction module,
  4. bilateral control integration for closed-loop execution.

Bi-AQUA employs a hierarchical three-level lighting adaptation mechanism:

  • Lighting Encoder: Extracts compact lighting embeddings from RGB images without manual annotation, implicitly supervised by the imitation objective
  • FiLM-based feature-wise modulation: Modulates backbone features conditioned on the lighting embedding for adaptive, lighting-aware feature extraction. Following the formulation of FiLM, we modulate each convolutional feature map via a feature-wise affine transformation whose parameters are generated from the lighting embedding. In our implementation, FiLM is applied to the final ResNet layer.
  • Lighting token: Added to the transformer encoder input alongside latent and proprioceptive tokens. The encoder processes these tokens together, and the resulting memory (which includes the lighting information) is then accessed by the transformer decoder via cross-attention, enabling the decoder to adapt action generation to the current underwater lighting (see the sketch below).
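A PyTorch-style sketch of the second and third levels (FiLM modulation and the lighting token). Module names, dimensions, and token ordering are illustrative assumptions, not the released implementation:

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        """Feature-wise affine modulation of a conv feature map, conditioned on a lighting embedding."""
        def __init__(self, light_dim: int, num_channels: int):
            super().__init__()
            self.to_gamma_beta = nn.Linear(light_dim, 2 * num_channels)

        def forward(self, feat: torch.Tensor, light_emb: torch.Tensor) -> torch.Tensor:
            # feat: (B, C, H, W) backbone features (e.g., the final ResNet layer);
            # light_emb: (B, light_dim) from the Lighting Encoder.
            gamma, beta = self.to_gamma_beta(light_emb).chunk(2, dim=-1)
            return gamma[:, :, None, None] * feat + beta[:, :, None, None]

    def build_encoder_tokens(latent_tok, proprio_tok, light_emb, image_tokens, light_proj):
        # Project the lighting embedding to model width and append it as one extra
        # token next to the latent and proprioceptive tokens; the transformer
        # decoder then sees it through cross-attention over the encoder memory.
        light_tok = light_proj(light_emb).unsqueeze(1)            # (B, 1, d_model)
        return torch.cat([latent_tok, proprio_tok, light_tok, image_tokens], dim=1)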

Lighting Encoder

Lighting Encoder
Given multi-view observations, our goal is to extract manipulation-relevant features while explicitly modeling lighting. We design a Lighting Encoder that produces a compact lighting representation from RGB images without lighting or color annotations, which are difficult to define in underwater settings.

The encoder adopts a dual-path architecture. A convolutional path processes the input through convolutional layers with channels 3 → C₁ → C₂ → C₃, followed by ReLU activations and global average pooling to yield a compact appearance feature. A histogram path computes a 2D histogram over the saturation and value (SV) channels and passes it through a two-layer MLP. The final lighting embedding is obtained by concatenating and linearly projecting the features from both paths. For multiple cameras, we use a shared encoder and average the embeddings across views.
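A minimal PyTorch sketch of this dual-path design. Channel widths, histogram bin counts, the embedding size, and the HSV approximation are assumptions for illustration; only the overall structure (conv path, SV-histogram path, concatenate-and-project, shared across views) follows the description above:

    import torch
    import torch.nn as nn

    class LightingEncoder(nn.Module):
        """Dual-path lighting encoder sketch: conv features + SV-histogram features."""
        def __init__(self, c1=16, c2=32, c3=64, bins=16, embed_dim=32):
            super().__init__()
            self.bins = bins
            self.conv = nn.Sequential(
                nn.Conv2d(3, c1, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(c1, c2, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(c2, c3, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),                    # global average pooling
            )
            self.hist_mlp = nn.Sequential(                  # two-layer MLP on the SV histogram
                nn.Linear(bins * bins, 64), nn.ReLU(),
                nn.Linear(64, 64),
            )
            self.proj = nn.Linear(c3 + 64, embed_dim)       # concatenate and project

        def sv_histogram(self, rgb):
            # Approximate HSV saturation/value from RGB in [0, 1] and bin them
            # into a joint 2D histogram, flattened to bins*bins values per image.
            v = rgb.amax(dim=1)                                       # value
            s = (v - rgb.amin(dim=1)) / (v + 1e-6)                    # saturation
            idx = ((s.clamp(0, 1) * (self.bins - 1)).long() * self.bins
                   + (v.clamp(0, 1) * (self.bins - 1)).long()).flatten(1)
            hist = torch.zeros(rgb.shape[0], self.bins * self.bins, device=rgb.device)
            hist.scatter_add_(1, idx, torch.ones_like(idx, dtype=hist.dtype))
            return hist / idx.shape[1]

        def forward(self, rgb):                             # rgb: (B, 3, H, W)
            conv_feat = self.conv(rgb).flatten(1)           # (B, C3)
            hist_feat = self.hist_mlp(self.sv_histogram(rgb))
            return self.proj(torch.cat([conv_feat, hist_feat], dim=1))

    # With multiple cameras, the same encoder is shared and per-view embeddings
    # are averaged, e.g.:
    #   light_emb = torch.stack([encoder(view) for view in views]).mean(dim=0)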

Data Collection: Bilateral Control for Underwater Manipulation

Bi-AQUA Data Collection
Data collection follows the bilateral control paradigm of Bi-ACT, adapted to an underwater follower. A human operator manipulates a leader robot in air, and a follower robot executes mirrored motions underwater. Bilateral control enforces simultaneous position tracking and force feedback between leader and follower, enabling demonstrations rich in haptic and visual cues. To ensure robustness, demonstrations are collected under diverse lighting spectra and intensities.
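The controller itself follows Bi-ACT. As a rough, hedged sketch, a typical 4-channel bilateral controller drives the leader-follower position difference and the torque sum toward zero; gains, signs, and the acceleration-reference form below are assumptions, not the paper's values:

    import numpy as np

    def bilateral_references(th_l, th_f, dth_l, dth_f, tau_l, tau_f,
                             kp=100.0, kd=20.0, kf=1.0):
        """Joint-space acceleration references for a 4-channel leader-follower pair.

        Control objectives (position tracking and force symmetry):
            th_l - th_f   -> 0
            tau_l + tau_f -> 0
        """
        pos_term = kp * (th_l - th_f) + kd * (dth_l - dth_f)
        force_term = kf * (tau_l + tau_f)
        a_leader = -0.5 * pos_term - 0.5 * force_term    # leader feels the follower's reaction
        a_follower = +0.5 * pos_term - 0.5 * force_term  # follower tracks the leader
        return a_leader, a_follower

    # th_*, dth_*, tau_* are np.ndarray joint vectors (J1-J3 + gripper).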

Inference

Bi-AQUA Inference
At test time, Bi-AQUA runs in closed loop with the bilateral controller. At each inference step, the model receives current multi-view images and follower joint state, computes lighting embeddings with the Lighting Encoder, applies FiLM modulation to the backbone, samples a latent action code from the prior, and uses the transformer decoder to generate an action chunk for the leader robot. The predicted leader actions are converted into follower commands by the bilateral controller and executed at control frequency.
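A schematic Python sketch of this closed loop; policy, get_observation, and bilateral_step are hypothetical stand-ins for the trained Bi-AQUA model, the camera/joint-state interface, and the bilateral controller, and the chunk handling is simplified:

    import torch

    def run_episode(policy, get_observation, bilateral_step, num_chunks: int):
        """Closed-loop rollout sketch for Bi-AQUA at test time."""
        for _ in range(num_chunks):
            images, follower_state = get_observation()        # multi-view RGB + joint state
            with torch.no_grad():
                # Inside the policy: Lighting Encoder -> FiLM on the ResNet backbone
                # -> lighting token in the transformer encoder -> decoder outputs a
                # chunk of leader actions (latent code taken from the prior).
                leader_actions = policy(images, follower_state)   # (chunk_size, action_dim)
            for action in leader_actions:
                # The bilateral controller turns each predicted leader action into
                # follower commands and executes them at control frequency.
                bilateral_step(action)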

Experiments

Overview

We evaluate Bi-AQUA on a real-world underwater pick-and-place task that requires robust performance under diverse static and dynamic lighting conditions. A human operator provides demonstrations via a leader robot, and the policy trained from those demonstrations is executed on the follower robot under bilateral control.

Experimental Environments

Hardware Setup

  • Robot: Dynamixel XW540-T260 (IP68) servo motors, 3 joints (J1-J3) + gripper (J4)
  • Water Tank: 900 × 450 × 450 mm, follower arm mounted on 200 mm pedestal
  • Lighting: Multi-color RGBW LED aquarium light (FEDOUR RD10-400RGBW) providing eight reproducible lighting conditions (red, blue, green, white, rgbw, cyan, purple, changing)
  • Cameras: Two RGB cameras (TIER IV C1 120); a gripper-mounted camera provides close-up views and a top camera captures the global workspace
  • Bilateral control: Leader–follower, four-channel (position/force exchange)

Task Setup

The task is a pick-and-place designed to stress lighting robustness. The follower must grasp a block on the right side of the tank and place it into a basket on the opposite side, passing through four stages: (1) initial pose, (2) pick up, (3) transport, (4) place.

For teleoperated data collection, we execute the underwater task twice under each of five lighting conditions (red, blue, green, white, rgbw), yielding 10 demonstrations. The primary training object is a polyurethane block.

For autonomous evaluation, we test all eight lighting conditions, including the dynamic changing mode and two unseen colors (cyan, purple). To assess appearance generalization, we additionally evaluate: (a) a black rubber block, (b) a blue sponge, and (c) bubbles around the polyurethane block. Each configuration is evaluated over 5 autonomous rollouts; a rollout is counted as successful if the object ends up inside the basket without being dropped.

Teleoperation Data Collection
Generalization to Novel Objects and Disturbances

Training Setup

  • Sensing/Logging: 1000 Hz joint angle/velocity/torque for both leader and follower (4 joints × 3 = 12 per arm; 24-D combined). RGB images at 30 Hz (1920 × 1280), resized to 480 × 320.
  • Demonstrations: 10 bilateral teleoperation episodes (two per training lighting condition). All demonstrations execute the same pick-and-place task; the operator follows an identical intended motion pattern across episodes so that the primary variation comes from lighting rather than from changes in the manipulation strategy.
  • Preprocessing: Joint data downsampled to 100 Hz, images temporally aligned to 100 Hz by selecting the frame closest to each 10 ms interval (see the alignment sketch after this list). This results in 15,402 time steps, corresponding to 30,804 RGB frames in total (two views per time step) and 15,402 synchronized joint-state vectors; each contains 24 values (leader+follower, 4 joints × position/velocity/torque), totaling 369,648 values.
  • Note: Although each episode is collected under a known lighting mode (e.g., red, blue, green), we do not provide any lighting IDs or manual labels to the policy. The Lighting Encoder is trained end-to-end from the control objective and learns a latent representation of lighting purely from visual input.
  • Training: CVAE-style action-chunk prediction with KL regularization (λ_KL = 1.0), using the AdamW optimizer for both the ResNet-18 backbone and the remaining modules.
  • Deployment: Trained policies run in real time on the physical robot using an NVIDIA RTX 3060 Ti GPU (8 GB VRAM).
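A small NumPy sketch of the temporal alignment in the Preprocessing step above; array names, the uniform 1 kHz assumption, and the nearest-frame search are illustrative:

    import numpy as np

    def align_to_100hz(joint_1khz, frame_times, frames, dt=0.01):
        """Downsample 1 kHz joint logs to 100 Hz and pick, for each 10 ms step,
        the RGB frame whose timestamp is closest."""
        joint_100hz = joint_1khz[::10]                           # 1000 Hz -> 100 Hz
        t_grid = np.arange(len(joint_100hz)) * dt                # 10 ms time grid (s)
        # Index of the temporally closest 30 Hz frame for every grid time.
        frame_idx = np.abs(frame_times[None, :] - t_grid[:, None]).argmin(axis=1)
        return joint_100hz, frames[frame_idx]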
Overview of Collected Datasets

Results

Autonomous Execution of Bi-AQUA
Results of Bi-AQUA

Key Findings:

  • Bi-AQUA achieves consistently high robustness, attaining 100% success in seven of the eight lighting conditions and maintaining 80% success even under blue light, which is the most challenging condition due to severe wavelength-dependent attenuation.
  • Crucially, Bi-AQUA achieves 100% success and clearly outperforms all baselines in the dynamic changing mode, where the lighting cycles through all spectra every 2 sec. This setting induces rapid, intra-horizon appearance shifts that invalidate the stationarity assumptions of standard visuomotor policies.
  • The lighting-agnostic Bi-ACT baseline succeeds only under white (100%) and partially under red (20%), and fails completely in the remaining six conditions despite using the same demonstrations. This sharp performance gap demonstrates that underwater visuomotor control fundamentally requires explicit lighting modeling rather than relying on passive robustness of the visual backbone.

Ablation Studies:

Ablation Studies

The ablations isolate the role of each lighting-aware component:

  • Bi-ACT+LE-Token, which uses a Lighting Encoder to extract lighting embeddings and adds a lighting token to the transformer encoder input (conditioning the decoder via memory) without modifying the visual features via FiLM, fails in all lighting conditions. This indicates that high-level conditioning alone is insufficient for correcting color shifts and contrast changes introduced by underwater propagation.
  • Bi-ACT+LE-FiLM, which uses a Lighting Encoder to extract lighting embeddings and applies FiLM modulation to the final ResNet layer (layer4) but omits the lighting token, performs substantially better, achieving up to 100% success in five static or unseen lighting conditions. However, its success drops sharply to 20% in the dynamic changing mode, suggesting that FiLM alone cannot track rapid temporal lighting transitions.
  • Only the full Bi-AQUA model—combining Lighting Encoder, FiLM modulation, and a sequence-level lighting token—maintains uniformly high performance across all lighting conditions. This confirms that hierarchical lighting reasoning, spanning both feature modulation and sequence-level conditioning, is essential for underwater manipulation under severe lighting variability.
Execution Time Profiles

Averaged over all lighting conditions, Bi-ACT takes 20.17 sec per successful rollout, Bi-ACT+LE-FiLM 19.61 sec, Bi-AQUA 15.73 sec, and human teleoperation 15.39 sec. Bi-AQUA closely matches the teleoperation reference in both mean time and variance, suggesting that explicit lighting modeling not only increases success rates but also preserves efficient, human-like trajectories.

Generalization Analysis:

  • For the black block, Bi-AQUA achieves success rates between 60% and 100%, with perfect performance under white, rgbw, cyan, and purple lighting. This indicates robustness to drastic changes in albedo and object contrast.
  • For the blue sponge, which introduces significant geometric, textural, and chromatic variation, performance varies more widely (0–100%). Even under these severe appearance shifts, Bi-AQUA maintains non-trivial success rates across multiple lighting modes, achieving 80% under cyan and 40% in the dynamic changing mode.
  • The bubble condition is particularly challenging from a perception standpoint. Because the top camera observes the scene through the water surface from above, air injection generates surface waves that induce strong refraction and temporal warping, severely degrading the global view of the tank. At the same time, bubbles rise around the gripper and the object, intermittently occluding the local workspace and creating high-frequency specular highlights and false edges in the gripper view. Despite these compounded distortions, Bi-AQUA still achieves up to 100% success in several lighting conditions, indicating substantial resilience to simultaneous global (water-surface) and local (near-gripper) disturbances.

Overall, these results show that Bi-AQUA not only provides strong lighting robustness but also extends to unseen objects and disturbance conditions without any additional data or fine-tuning.

Limitations

Our evaluation is restricted to a single task, robot platform, and tank environment with a finite set of lighting modes, and performance still degrades under extreme combinations of lighting and object appearance. We also do not yet consider variation in water quality, background clutter, or large-scale field deployments. Extending Bi-AQUA to richer skill repertoires and more realistic subsea settings, and integrating it with other adaptation mechanisms (e.g., domain randomization or online adaptation), are important directions for future work.

Summary & Discussion

Bi-AQUA is the first bilateral control-based imitation learning framework for underwater robot arms that explicitly models lighting at multiple levels. Through a label-free Lighting Encoder implicitly supervised by the imitation objective, FiLM-based conditioning of visual features, and a sequence-level lighting token added to the transformer encoder input, Bi-AQUA adapts its visuomotor behavior to changing lighting while preserving the force-sensitive advantages of bilateral teleoperation. Ablation and generalization experiments show that the lighting-aware components provide complementary gains and that the learned lighting representation transfers to novel objects and visual disturbances. This work bridges terrestrial bilateral control-based imitation learning and underwater manipulation, enabling force-sensitive autonomous operation in challenging marine environments.

6-minute video! Please enjoy!

Citation

@misc{tsunoori2025biaquabilateralcontrolbasedimitation,
      title={Bi-AQUA: Bilateral Control-Based Imitation Learning for Underwater Robot Arms via Lighting-Aware Action Chunking with Transformers}, 
      author={Takeru Tsunoori and Masato Kobayashi and Yuki Uranishi},
      year={2025},
      eprint={2511.16050},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2511.16050}, 
}

Contact

Masato Kobayashi (Assistant Professor, The University of Osaka, Kobe University, Japan)

Acknowledgments

This work was supported by Research Grant (B) from the Tateisi Science and Technology Foundation, Japan, with M. Kobayashi as the project leader.

We used the underwater robot arm hardware developed in MR-UBi: Mixed Reality-Based Underwater Robot Arm Teleoperation System with Reaction Torque Indicator via Bilateral Control. We thank Kohei Nishi for his contributions to the hardware development.