Learning High-Frequency Continuous Action Chunks in Latent Space

Kunyun Wang (wkykaixin@sjtu.edu.cn)^1,2,* Yuhang Zheng (zhengyh021@gmail.com)^2,3 Yupeng Zheng (zhengyupeng2022@ia.ac.cn)^2,4 Jieru Zhao (zhao-jieru@sjtu.edu.cn)^1,† Wenchao Ding (dingwenchao@fudan.edu.cn)^2,5,†

¹School of Computer Science, Shanghai Jiao Tong University ²TARS Robotics ³National University of Singapore
⁴Institute of Automation, Chinese Academy of Sciences ⁵Fudan University

^*Work done during an internship at TARS Robotics. ^†Corresponding authors.

ICML 2026


Lay Summary

Imitation learning policies control robots by predicting short sequences of future actions, known as action chunks, and then executing these actions on the robot. Increasing the action frequency can make robot motion smoother by reducing the stop-and-go behavior often seen in low-frequency execution, allowing the robot to move with more stable velocities. However, high-frequency actions are also harder for policies to learn, because they contain denser temporal information and finer spatial variations.

In this work, we propose learning high-frequency action chunks in a latent space, which provides a more compact and structured representation of motion. This helps the policy generate smoother and more consistent action sequences. To further reduce execution stalls caused by model inference latency, we use asynchronous inference, where the robot continues executing current actions while the policy predicts the next action chunk. However, asynchronous inference can introduce discontinuities when switching between chunks. To address this, we propose Reuse-then-Refine, a method that reduces boundary gaps between consecutive chunks and improves execution smoothness.

In real-robot experiments, our method reduces jerky motions and execution stalls across several contact-rich manipulation tasks. Overall, this work helps make learned robot policies more reliable for continuous physical interaction.


Part 1 · Synchronous Inference

Latent space enables smooth high-frequency control

High-frequency actions are required for continuous robot motion with stable velocities, but learning them directly in action space is unstable. We show that learning in a continuous latent space recovers smooth, stable execution.

Motivation & Method

Before diving into the comparisons, three key ideas:

Motivation I: action frequency shapes execution. Low-frequency actions create distant targets and stop-and-go motion; high-frequency actions enable continuous, servo-like control.

Motivation II: high-frequency actions are harder to learn. Learning high-frequency actions directly in action space amplifies quantization errors and jitter; naive upsampling from low-frequency predictions is smoother but imprecise.

Method I: learning high-frequency actions in latent space. A VAE compresses high-frequency action chunks into a temporally downsampled continuous latent space. The policy predicts latents; the decoder reconstructs precise, smooth high-frequency action chunks.
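The shape bookkeeping behind Method I can be sketched as follows. This is only an illustration under stated assumptions, not the architecture used in this work: the chunk length (64 steps), action dimension (7), downsampling factor (4), and the functions `encode`, `sample_latent`, and `decode` are all hypothetical, and untrained stand-ins (average pooling and linear interpolation) take the place of the learned encoder and decoder. The point is that the policy would predict 16 latent steps instead of 64 raw action steps.

```python
import math
import random

def encode(chunk, stride):
    """Stand-in encoder (assumption): average every `stride` consecutive steps.
    A trained VAE would use a learned network to produce these means."""
    assert len(chunk) % stride == 0
    dims = len(chunk[0])
    means = []
    for start in range(0, len(chunk), stride):
        window = chunk[start:start + stride]
        means.append([sum(step[d] for step in window) / stride for d in range(dims)])
    return means

def sample_latent(means, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    sigma = math.exp(0.5 * log_var)
    return [[m + sigma * rng.gauss(0.0, 1.0) for m in row] for row in means]

def decode(latent, stride):
    """Stand-in decoder (assumption): linear interpolation back to the full rate."""
    out = []
    for i, row in enumerate(latent):
        nxt = latent[min(i + 1, len(latent) - 1)]
        for k in range(stride):
            w = k / stride
            out.append([(1 - w) * a + w * b for a, b in zip(row, nxt)])
    return out

rng = random.Random(0)
# Hypothetical high-frequency action chunk: 64 time steps x 7 action dimensions.
chunk = [[math.sin(0.1 * t + d) for d in range(7)] for t in range(64)]
latent = sample_latent(encode(chunk, stride=4), log_var=-20.0, rng=rng)
recon = decode(latent, stride=4)
print(len(chunk), "action steps ->", len(latent), "latent steps ->", len(recon), "action steps")
```

In a real system the policy would be trained to output `latent` directly, and only the decoder would run at execution time.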

1.1 Wipe Vase (synchronous)

Continuous-contact wiping motion — smoothness directly visible in trajectory and contact quality.

DP — 4-way comparison: frequency & representation

Same task, same policy (Diffusion Policy), four action representations.

DP 15 Hz

low frequency, stop-and-go

DP interpolated 60 Hz

upsampled from 15 Hz, jittery

DP original 60 Hz

direct high-freq learning, jittery

DP latent 60 Hz OURS

high-freq in latent space, smooth & stable

Takeaway: 15 Hz → too slow; interpolation → still jittery; direct 60 Hz → still jittery; latent 60 Hz → truly smooth.
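To make the 15 Hz versus 60 Hz contrast concrete, the following minimal sketch counts how often the commanded setpoint fails to change between control ticks. It assumes a 60 Hz control loop, a zero-order hold between incoming actions, and a hypothetical smooth 1-D reference trajectory; none of these details come from the experiments above, and the real robot's low-level controller is not modelled.

```python
import math

def reference(t):
    """Hypothetical smooth 1-D reference position (metres) at time t (seconds)."""
    return 0.1 * math.sin(2.0 * math.pi * 0.2 * t)

def setpoints(action_hz, control_hz=60, duration_s=1.0):
    """Setpoint seen at each control tick when a new action arrives every
    control_hz // action_hz ticks and is held until the next one arrives."""
    hold = control_hz // action_hz
    n_ticks = int(duration_s * control_hz)
    return [reference((k // hold) * hold / control_hz) for k in range(n_ticks)]

def step_sizes(points):
    """Absolute change in setpoint between consecutive control ticks."""
    return [abs(b - a) for a, b in zip(points, points[1:])]

low = step_sizes(setpoints(action_hz=15))
high = step_sizes(setpoints(action_hz=60))

# Fraction of ticks on which the setpoint does not move at all.
stalled_low = sum(s == 0.0 for s in low) / len(low)
stalled_high = sum(s == 0.0 for s in high) / len(high)

print(f"15 Hz: {stalled_low:.0%} of ticks unchanged, largest jump {max(low):.4f} m")
print(f"60 Hz: {stalled_high:.0%} of ticks unchanged, largest jump {max(high):.4f} m")
```

Under these assumptions the 15 Hz stream alternates between idle ticks and larger jumps, which is the stop-and-go pattern, while the 60 Hz stream moves by a small amount on every tick.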

OFT — Latent vs Original

OFT original

high jerk, less smooth

OFT latent OURS

smoother motion

PI0.5 — Latent vs Original

PI0.5 original

high jerk

PI0.5 latent OURS

smoother motion

1.2 Write Board (synchronous)

Fine-motor writing task — trajectory precision and continuity are visually obvious in the handwriting output.

OFT — Latent vs Original

OFT original

large jerk, stalls

OFT latent OURS

smoother writing motion

PI0.5 — Latent vs Original

PI0.5 original

less smooth

PI0.5 latent OURS

smoother writing motion

1.3 Peel Cucumber (synchronous)

Contact-rich, force-sensitive task — stop-and-go produces visible force disturbances.

DP — 3-way high-frequency comparison

DP interpolated 60 Hz

upsampled, jittery

DP original 60 Hz

direct high-freq, jittery

DP latent OURS

smooth & stable

PI0.5 — Latent vs Original

PI0.5 original

high jerk

PI0.5 latent OURS

smoother peeling motion

Part 2 · Asynchronous Inference

Reuse-then-Refine removes chunk-boundary gaps

Learning in latent space addresses smoothness within a chunk. Asynchronous inference hides inference latency but introduces a new problem: discontinuities at chunk boundaries. Reuse-then-Refine (RTR) addresses this with a training-free reuse-and-refine strategy. Combining the latent representation with RTR, our method achieves smooth real-time control.
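The chunk-boundary problem itself can be illustrated with a small simulation. The sketch below only measures the gap and does not implement RTR, whose details are not given in this section. The policy `predict_chunk`, the per-chunk random offset that models disagreement between independently sampled chunks, and the values of `horizon`, `delay`, and `noise` are all assumptions made for illustration.

```python
import math
import random

def predict_chunk(start_tick, horizon, rng, noise):
    """Hypothetical policy: a smooth 1-D trajectory plus a per-chunk offset
    that models disagreement between independently sampled chunks."""
    offset = rng.gauss(0.0, noise)
    return [math.sin(0.05 * (start_tick + k)) + offset for k in range(horizon)]

def run_async(n_chunks, horizon=32, delay=4, noise=0.2, seed=0):
    """Each new chunk is requested `delay` ticks before the current one ends.
    The robot keeps executing in the meantime, so the first `delay` actions of
    the new chunk are stale when it arrives and are skipped."""
    rng = random.Random(seed)
    executed = predict_chunk(0, horizon, rng, noise)
    boundaries = []
    for _ in range(n_chunks - 1):
        request_tick = len(executed) - delay
        new_chunk = predict_chunk(request_tick, horizon, rng, noise)
        boundaries.append(len(executed))   # index of first action from new chunk
        executed.extend(new_chunk[delay:])
    return executed, boundaries

def steps(executed, indices):
    """Absolute action change when moving into each listed tick."""
    return [abs(executed[i] - executed[i - 1]) for i in indices]

executed, boundaries = run_async(n_chunks=6)
interior = [i for i in range(1, len(executed)) if i not in boundaries]
print("largest step inside a chunk:", max(steps(executed, interior)))
print("steps at chunk boundaries:  ", steps(executed, boundaries))
```

By construction, steps inside a chunk are bounded by 0.05 because the offset cancels, whereas each boundary step also includes the offset difference between the two chunks. Setting `noise=0.0` makes consecutive chunks agree and removes the extra gap.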