Learning High-Frequency Continuous Action Chunks in Latent Space

Kunyun Wang (wkykaixin@sjtu.edu.cn) 1,2,*; Yuhang Zheng (zhengyh021@gmail.com) 2,3; Yupeng Zheng (zhengyupeng2022@ia.ac.cn) 2,4; Jieru Zhao (zhao-jieru@sjtu.edu.cn) 1,†; Wenchao Ding (dingwenchao@fudan.edu.cn) 2,5,†

1School of Computer Science, Shanghai Jiao Tong University  2TARS Robotics  3National University of Singapore
4Institute of Automation, Chinese Academy of Sciences  5Fudan University

*Work done during an internship at TARS Robotics. †Corresponding authors.

ICML 2026

Lay Summary

Imitation learning policies control robots by predicting short sequences of future actions, known as action chunks, and then executing these actions on the robot. Increasing the action frequency can make robot motion smoother by reducing the stop-and-go behavior often seen in low-frequency execution, allowing the robot to move with more stable velocities. However, high-frequency actions are also harder for policies to learn, because they contain denser temporal information and finer spatial variations.

In this work, we propose learning high-frequency action chunks in a latent space, which provides a more compact and structured representation of motion. This helps the policy generate smoother and more consistent action sequences. To further reduce execution stalls caused by model inference latency, we use asynchronous inference, where the robot continues executing current actions while the policy predicts the next action chunk. However, asynchronous inference can introduce discontinuities when switching between chunks. To address this, we propose Reuse-then-Refine, a method that reduces boundary gaps between consecutive chunks and improves execution smoothness.

In real-robot experiments, our method reduces jerky motions and execution stalls across several contact-rich manipulation tasks. Overall, this work helps make learned robot policies more reliable for continuous physical interaction.


Part 1 · Synchronous Inference

Latent space enables smooth high-frequency control

High-frequency actions are required for continuous robot motion with stable velocities, but learning them directly in action space is unstable. We show that learning in a continuous latent space recovers smooth, stable execution.

Motivation & Method

Before diving into the comparisons, three key ideas:

Motivation I: action frequency shapes execution
Motivation I — Low-frequency actions create distant targets and stop-and-go motion; high-frequency actions enable continuous, servo-like control.
Motivation II: high-frequency actions are harder to learn
Motivation II — Learning high-frequency actions directly in action space amplifies quantization errors and jitter; naive upsampling from low-frequency predictions is smoother but imprecise.
Method I: learning high-frequency actions in latent space
Method I — A VAE compresses high-frequency action chunks into a temporally downsampled continuous latent space. The policy predicts latents; the decoder reconstructs precise, smooth high-frequency action chunks.
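
To make the data flow of Method I concrete, here is a minimal, untrained sketch in plain Python. The chunk length (32 steps), action dimension (7), temporal downsampling factor (4), latent width (3), and the purely linear encoder and decoder are all illustrative assumptions, not the architecture used in the paper. Only the shape bookkeeping and the reparameterized sampling are the point.

```python
import math
import random

T, D = 32, 7   # assumed: 32 action steps per chunk, 7-dimensional actions
R, C = 4, 3    # assumed: temporal downsampling factor and latent width


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def random_layer(rng, n_out, n_in, scale=0.1):
    """Random (untrained) weights; a real model would learn these."""
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def encode(chunk, enc_mu, enc_logvar):
    """Map each window of R consecutive actions to one latent mean and log-variance."""
    mus, logvars = [], []
    for start in range(0, len(chunk), R):
        window = [v for step in chunk[start:start + R] for v in step]  # flatten R x D
        mus.append(linear(*enc_mu, window))
        logvars.append(linear(*enc_logvar, window))
    return mus, logvars


def sample(rng, mus, logvars):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [[m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
            for mu, logvar in zip(mus, logvars)]


def decode(latents, dec):
    """Expand each latent back into R action steps of dimension D."""
    chunk = []
    for z in latents:
        flat = linear(*dec, z)
        chunk.extend(flat[i * D:(i + 1) * D] for i in range(R))
    return chunk


def kl_to_standard_normal(mus, logvars):
    """KL(q(z|a) || N(0, I)), summed over all latent entries."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for mu, logvar in zip(mus, logvars) for m, lv in zip(mu, logvar))


rng = random.Random(0)
enc_mu = random_layer(rng, C, R * D)
enc_logvar = random_layer(rng, C, R * D)
dec = random_layer(rng, R * D, C)

# A smooth toy chunk standing in for recorded 60 Hz actions.
actions = [[math.sin(0.1 * t + d) for d in range(D)] for t in range(T)]
mus, logvars = encode(actions, enc_mu, enc_logvar)
latents = sample(rng, mus, logvars)
recon = decode(latents, dec)
kl = kl_to_standard_normal(mus, logvars)
print(len(actions), "steps ->", len(latents), "latents ->", len(recon), "steps")
```

With these assumed sizes, the policy would predict 8 latent vectors per chunk instead of 32 action vectors, and the decoder restores the full-rate chunk.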

1.1 Wipe Vase (synchronous)

Continuous-contact wiping motion — smoothness directly visible in trajectory and contact quality.

DP — 4-way comparison: frequency & representation

Same task, same policy (Diffusion Policy), four action representations.

DP 15 Hz

low frequency, stop-and-go

DP interpolated 60 Hz

upsampled from 15 Hz, jittery

DP original 60 Hz

direct high-freq learning, jittery

DP latent 60 Hz OURS

high-freq in latent space, smooth & stable

Takeaway: 15 Hz → stop-and-go; interpolated 60 Hz → still jittery; direct 60 Hz → still jittery; latent 60 Hz → smooth and stable.
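
For reference, the interpolated 60 Hz baseline can be pictured as upsampling a 15 Hz chunk by a factor of 4. The sketch below assumes linear interpolation and toy 1-D positions; the page does not state which interpolation scheme was used.

```python
def upsample_linear(waypoints, factor):
    """Insert factor - 1 evenly spaced points between consecutive waypoints."""
    out = []
    for a, b in zip(waypoints, waypoints[1:]):
        for k in range(factor):
            alpha = k / factor
            out.append([(1 - alpha) * x + alpha * y for x, y in zip(a, b)])
    out.append(list(waypoints[-1]))  # keep the final waypoint
    return out


# Assumed toy data: a 15 Hz chunk of 1-D positions in metres.
low_rate = [[0.00], [0.04], [0.08], [0.06]]
high_rate = upsample_linear(low_rate, factor=4)  # 15 Hz -> 60 Hz

print(len(low_rate), "->", len(high_rate))
```

Interpolation of this kind passes through every original waypoint but cannot add motion detail that the 15 Hz prediction never contained.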

OFT — Latent vs Original

OFT original

high jerk, less smooth

OFT latent OURS

smoother motion

PI0.5 — Latent vs Original

PI0.5 original

high jerk

PI0.5 latent OURS

smoother motion
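
The captions use jerk as shorthand for how abrupt a motion is. Jerk is the third time derivative of position, and a finite-difference estimate on a sampled trajectory can be sketched as below. The 60 Hz rate matches the page; the toy trajectories and the mean-squared summary are assumptions made for illustration, not necessarily the metric reported in the paper.

```python
import math


def finite_difference(values, dt):
    """Forward difference: approximate derivative of a sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def mean_squared_jerk(positions, rate_hz):
    """Differentiate position three times and average the squared result."""
    dt = 1.0 / rate_hz
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return sum(j * j for j in jerk) / len(jerk)


RATE_HZ = 60
# One second of a smooth 1-D motion.
smooth = [0.1 * math.sin(2 * math.pi * t / RATE_HZ) for t in range(RATE_HZ)]
# Same path with a 1 mm alternating offset, mimicking step-to-step jitter.
jittery = [p + (0.001 if t % 2 else -0.001) for t, p in enumerate(smooth)]

print(mean_squared_jerk(smooth, RATE_HZ) < mean_squared_jerk(jittery, RATE_HZ))
```

Because the third derivative scales with the cube of the sample rate, even millimetre-level jitter at 60 Hz dominates the jerk of the underlying smooth motion.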

1.2 Write Board (synchronous)

Fine-motor writing task — trajectory precision and continuity are visually obvious in the handwriting output.

OFT — Latent vs Original

OFT original

large jerk, stalls

OFT latent OURS

smoother writing motion

PI0.5 — Latent vs Original

PI0.5 original

less smooth

PI0.5 latent OURS

smoother writing motion

1.3 Peel Cucumber (synchronous)

Contact-rich, force-sensitive task — stop-and-go produces visible force disturbances.

DP — 3-way high-frequency comparison

DP interpolated 60 Hz

upsampled, jittery

DP original 60 Hz

direct high-freq, jittery

DP latent OURS

smooth & stable

PI0.5 — Latent vs Original

PI0.5 original

high jerk

PI0.5 latent OURS

smoother peeling motion

Part 2 · Asynchronous Inference

Reuse-then-Refine removes chunk-boundary gaps

Learning in latent space addresses smoothness within a chunk. Asynchronous inference, which hides inference latency, introduces a new problem: discontinuities at chunk boundaries. Reuse-then-Refine (RTR) addresses this with a training-free strategy that reuses recent actions and refines the stitched sequence. Combining the latent representation with RTR yields smooth real-time control.
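
The difference between synchronous and asynchronous execution can be illustrated with a small tick-based simulation. All numbers here (a 16-step chunk, 6 control ticks of inference latency, 44 simulated ticks) are assumptions chosen for illustration; the function only counts ticks on which the robot has no action to execute.

```python
def count_stalls(chunk_len, latency_ticks, total_ticks, asynchronous):
    """Count control ticks on which the robot has no action to execute."""
    if latency_ticks < 1:
        raise ValueError("latency must be at least one tick")
    if asynchronous and latency_ticks >= chunk_len:
        raise ValueError("async sketch assumes latency shorter than one chunk")
    buffered = chunk_len   # actions ready to execute; the first chunk is given
    pending = None         # ticks until the in-flight inference returns
    stalls = 0
    for _ in range(total_ticks):
        # Sync waits for an empty buffer; async starts one latency early.
        trigger = latency_ticks if asynchronous else 0
        if pending is None and buffered <= trigger:
            pending = latency_ticks
        if buffered > 0:
            buffered -= 1  # execute one action
        else:
            stalls += 1    # nothing to execute: the robot waits
        if pending is not None:
            pending -= 1
            if pending == 0:
                # Async: the first latency_ticks steps of the new chunk are
                # already in the past, so only the remainder is still valid.
                buffered = chunk_len - latency_ticks if asynchronous else chunk_len
                pending = None
    return stalls


print("sync stalls: ", count_stalls(16, 6, 44, asynchronous=False))
print("async stalls:", count_stalls(16, 6, 44, asynchronous=True))
```

Under these assumed numbers the synchronous loop waits 6 ticks after every chunk, while the asynchronous loop never waits. The cost is that each new chunk is spliced in mid-motion, which is where boundary discontinuities arise.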

Method — Reuse-then-Refine (RTR)

Method II: Reuse-then-Refine for chunk-level continuity
Method II — RTR reuses recently executed actions, concatenates them with the still-valid portion of the new chunk, and refines the result through the VAE (~2 ms). Latent space ensures chunk-internal smoothness; RTR ensures chunk-to-chunk continuity.
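
The three RTR steps described above (reuse, concatenate, refine) can be sketched as follows. This is a hedged illustration, not the authors' implementation: `moving_average` is a hypothetical stand-in for the VAE encode/decode pass, the actions are 1-D toy values, and returning only the future part of the refined sequence is an assumption.

```python
def moving_average(sequence, radius=2):
    """Stand-in for the VAE refinement (assumption, not the real refiner)."""
    smoothed = []
    for i in range(len(sequence)):
        window = sequence[max(0, i - radius):i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def reuse_then_refine(executed, new_chunk, elapsed_steps, reuse_len, refine):
    """Stitch recent actions to the still-valid part of a new chunk, then refine."""
    if not 0 < reuse_len <= len(executed):
        raise ValueError("reuse_len must be between 1 and len(executed)")
    reused = executed[-reuse_len:]           # recently executed actions
    still_valid = new_chunk[elapsed_steps:]  # drop steps that are already in the past
    stitched = reused + still_valid
    refined = refine(stitched)
    return refined[reuse_len:]               # only future actions go to the robot


# Toy data: the robot followed a steady ramp, but the new chunk is offset.
executed = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05]
new_chunk = [0.07, 0.08, 0.09, 0.10, 0.11, 0.12, 0.13, 0.14]
elapsed = 2  # assumed: two control steps passed while inference was running

future = reuse_then_refine(executed, new_chunk, elapsed, reuse_len=4,
                           refine=moving_average)
gap_before = abs(new_chunk[elapsed] - executed[-1])
gap_after = abs(future[0] - executed[-1])
print(f"boundary gap before: {gap_before:.3f}, after: {gap_after:.3f}")
```

Even this crude smoother shrinks the jump at the chunk boundary because the refinement sees the executed actions as context; the VAE plays that role in the actual method.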

2.1 Wipe Vase (asynchronous)

DP — Original vs Latent vs Latent+RTR

DP original 60 Hz (async)

chunk-boundary gaps, stalls

DP latent (async)

smoother but boundary jumps

DP latent + RTR (async) OURS

continuous, gap-free

OFT — Original vs Latent vs Latent+RTR

OFT original (async)

large chunk-boundary gaps

OFT latent (async)

smoother but boundary jumps

OFT latent + RTR (async) OURS

continuous, gap-free

PI0.5 — 4-way comparison (with RTC baseline)

RTC is a real-time control baseline; latent + RTR still produces smoother transitions than RTC.

PI0.5 original (async)

chunk-boundary gaps

PI0.5 RTC (async)

real-time control baseline

PI0.5 latent (async)

smoother but boundary jumps

PI0.5 latent + RTR (async) OURS

continuous, gap-free

2.2 Write Board (asynchronous)

DP — 5-way comprehensive comparison (frequency + RTR)

The most comprehensive group: it shows both contributions of the paper together, high-frequency learning in latent space and RTR for chunk-boundary continuity.

DP 15 Hz (async)

low frequency, stop-and-go

DP interpolated 60 Hz (async)

upsampled, jittery

DP original 60 Hz (async)

direct high-freq, jittery

DP latent 60 Hz (async)

latent space, smoother

DP latent + RTR 60 Hz (async) OURS

latent + RTR, seamless

Takeaway: low-freq → interpolation → direct high-freq → latent → latent + RTR. The full evolution from stop-and-go to seamless real-time control.

OFT — Original vs Latent vs Latent+RTR

OFT original (async)

large gaps at chunk boundaries

OFT latent (async)

smoother but boundary jumps

OFT latent + RTR (async) OURS

continuous, gap-free writing

PI0.5 — 4-way comparison (with RTC baseline)

PI0.5 original (async)

chunk-boundary gaps

PI0.5 RTC (async)

real-time control baseline

PI0.5 latent (async)

smoother but boundary jumps

PI0.5 latent + RTR (async) OURS

continuous, gap-free

2.3 Peel Cucumber (asynchronous)

DP — 4-way comparison

DP interpolated 60 Hz (async)

upsampled, jittery

DP original 60 Hz (async)

direct high-freq, jittery

DP latent (async)

smoother but boundary jumps

DP latent + RTR (async) OURS

continuous, gap-free

PI0.5 — 4-way comparison (with RTC baseline)

PI0.5 original (async)

chunk-boundary gaps

PI0.5 RTC (async)

real-time control baseline

PI0.5 latent (async)

smoother but boundary jumps

PI0.5 latent + RTR (async) OURS

continuous, gap-free