AV1 for robotics AI streaming, training and storage.

Community Article Published August 27, 2025

In this post, we'll explain why storing single AV1 frames (.avif) instead of .mp4 videos is a great way to unify:

  • Performant image storage for dataset collection
  • Fast batch image decoding for training
  • Fast image streaming for inference

tldr: AV1 Performance Gains

📡 Streaming: Streaming AV1 packets for RGB and depth is better than alternatives like .jpeg or .tiff

  • 60% smaller files/bitrate for rgb+depth (0.32MB vs 0.75MB @ 720p) compared to closest alternatives
  • 2x higher frame rates (25fps vs 12fps @ 720p)
  • 2x lower latency for RGB+Depth (250ms vs 525ms @ 720p)

🧠 Training: Batch-reading .avif files is much faster than loading .mp4 videos and retrieving random frames

  • Up to 4x faster than TorchCodec when reading single .avif images with a parallel image reader instead of videos
  • Reading a single .avif frame is much simpler than reading an mp4, as it removes the ffmpeg dependency.
  • Enables just-in-time downloading of dataset images, so training can start without waiting for whole videos to download.

💾 Storage: Replacing .mp4 videos with single-frame .avif files adds granularity at the cost of more storage

  • Enables a wider variety of datasets, such as continuous robotic data or streamed data with no clear start and end of episodes.
  • Enables per-frame metadata such as GPS data, camera orientation, FOV, ... for each camera and each frame.
  • Unfortunately, takes about 2x more storage than lerobot .mp4 videos holding the same data.

📡 Streaming: Breaking the Bandwidth Barrier

Image streaming using AV1 provides state-of-the-art bitrate for both RGB and depth data, making it possible to reduce bandwidth, reduce latency, and increase fps for robotics. This makes it practical to deploy robots in outdoor environments where a 5G network can only provide limited bandwidth.

For the sake of simplicity, I compared AV1 encoding to simpler formats: .jpeg for RGB and .tiff for depth (as JPEG does not support the higher bit depth that depth data needs). The results show that AV1 provides significant streaming gains, much of it from encoding depth data lossily.

RGB Image

File Size Reduction: 36% Smaller

Resolution JPEG AV1 Savings
720p 0.5 MB 0.32 MB 36%
480p 0.32 MB 0.22 MB 31%

Frame Rate Boost @ 10 Mbps

Resolution JPEG AV1 Improvement
720p 20 fps 32 fps +60%
480p 32 fps 45 fps +41%

Latency Reduction @ 20fps

Resolution JPEG AV1 Faster By
720p 50ms 32ms 36%
480p 32ms 22ms 31%
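The frame-rate and latency tables above follow from simple arithmetic on frame size and link throughput. The sketch below reproduces them to within rounding under one assumption: the figures line up with a link of about 10 megabytes per second (the heading reads 10 Mbps, so treat the unit as uncertain). The function names and the `LINK_MB_PER_S` constant are illustrative and not taken from the benchmark code.

```python
# Illustrative arithmetic only. LINK_MB_PER_S is an assumption chosen because
# it reproduces the table values; it is not a measured figure.
LINK_MB_PER_S = 10.0


def max_fps(frame_mb: float, link_mb_per_s: float = LINK_MB_PER_S) -> float:
    """Frames per second the link can carry for a given frame size."""
    return link_mb_per_s / frame_mb


def transfer_ms(frame_mb: float, link_mb_per_s: float = LINK_MB_PER_S) -> float:
    """Time to push one frame through the link, in milliseconds."""
    return frame_mb / link_mb_per_s * 1000.0


# Frame sizes in MB copied from the RGB file-size table.
FRAME_SIZES_MB = {
    ("720p", "JPEG"): 0.50,
    ("720p", "AV1"): 0.32,
    ("480p", "JPEG"): 0.32,
    ("480p", "AV1"): 0.22,
}

for (resolution, codec), size_mb in FRAME_SIZES_MB.items():
    print(
        f"{resolution} {codec}: "
        f"{max_fps(size_mb):.0f} fps, {transfer_ms(size_mb):.0f} ms per frame"
    )
```

Swapping in the bandwidth of a real link gives a quick estimate of whether a target frame rate is reachable before running any encoder.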

Depth Image (12bit)

Format 720p 480p
AV1 (12bit lossy) 13 KB 6 KB
TIFF (lossless) 266 KB 85 KB
RAW 14 MB 4 MB

12 bits give 4096 distinct values, so at 1 mm per step a 12-bit image covers about 4 m of depth at millimetre resolution, which is more than enough for most depth cameras and most robotic manipulation use cases.

AV1 is the only format that delivers on 12-bit monochrome, as the previous-generation VP9 was not able to do so.
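To make the 12-bit range concrete, here is a minimal sketch of how a depth reading in millimetres could be mapped to a 12-bit sample before encoding. The article does not spell out the mapping actually used, so the 1 mm step, the clamping behaviour, and the function names are assumptions.

```python
MAX_12BIT = (1 << 12) - 1  # 4095, the largest value a 12-bit sample can hold


def depth_mm_to_12bit(depth_mm: float, mm_per_step: float = 1.0) -> int:
    """Quantize a depth reading in millimetres to a 12-bit sample.

    With mm_per_step=1.0 (an assumed default) the representable range is
    0..4095 mm. Readings outside the range are clamped, not wrapped.
    """
    step = round(depth_mm / mm_per_step)
    return max(0, min(MAX_12BIT, step))


def sample_to_depth_mm(sample: int, mm_per_step: float = 1.0) -> float:
    """Inverse mapping: recover the depth in millimetres from a sample."""
    return sample * mm_per_step


# A 1.5 m reading fits; a 6 m reading saturates at the top of the range.
print(depth_mm_to_12bit(1500.0), depth_mm_to_12bit(6000.0))
# A coarser 2 mm step doubles the range at the cost of resolution.
print(depth_mm_to_12bit(6000.0, mm_per_step=2.0))
```

Lossy encoding adds its own error on top of this quantization, so the step size is only a lower bound on the depth error after decoding.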

RGB + Depth Combined

Resolution | Metric | AV1 | JPEG | AV1 Advantage
720p | Size | 333 KB | 750 KB | 60% smaller
720p | Max FPS | 25 fps | 12 fps | 2.1x faster
720p | Data Transfer Latency | 42ms | 250ms | 83% lower
480p | Size | 226 KB | 400 KB | 45% smaller
480p | Max FPS | 38 fps | 18 fps | 1.9x faster
480p | Data Transfer Latency | 26ms | 100ms | 74% lower

End-to-End Latency (with encoding/decoding dependent on hardware)

Resolution AV1 Total JPEG Total Improvement
720p 250ms 525ms 2.1x faster
480p 120ms 225ms 1.9x faster
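One way to read the end-to-end numbers is to subtract the data transfer latency from the total, which leaves the share spent on encoding and decoding. The sketch below does that subtraction on the RGB + Depth figures above. It assumes the total is a plain sum of transfer time and codec time, which the article does not state, so the result is a rough estimate only.

```python
# Milliseconds copied from the RGB + Depth and end-to-end tables.
# "JPEG" is the column label used there for the non-AV1 baseline.
MEASUREMENTS_MS = {
    ("720p", "AV1"): {"total": 250, "transfer": 42},
    ("720p", "JPEG"): {"total": 525, "transfer": 250},
    ("480p", "AV1"): {"total": 120, "transfer": 26},
    ("480p", "JPEG"): {"total": 225, "transfer": 100},
}


def codec_overhead_ms(total_ms: int, transfer_ms: int) -> int:
    """Encode + decode share, ASSUMING total = transfer + codec time."""
    return total_ms - transfer_ms


for (resolution, codec), ms in MEASUREMENTS_MS.items():
    overhead = codec_overhead_ms(ms["total"], ms["transfer"])
    print(f"{resolution} {codec}: about {overhead} ms outside data transfer")
```

Under that assumption, most of the AV1 total sits in encoding and decoding and not in the transfer itself, which is why the result depends so much on hardware.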

My Code:


🧠 Training: 4x Faster Performance

Decoding single-image .avif files instead of decoding videos speeds up training compared with state-of-the-art video decoders like TorchCodec.

In addition, .mp4 videos are not optimized for random reads, which makes it slow to locate a given frame within the .mp4 file.

But if we save each image as a single file and use the filesystem to read them, there is no penalty for random frame access.
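Part of the random-access penalty in video comes from inter-frame coding: a decoder has to start at the nearest preceding keyframe and decode forward to the requested frame. The sketch below counts that work in a simplified model with a fixed keyframe interval and no frame reordering. The interval of 30 is a hypothetical value for illustration; the article does not state the encoder settings used for the .mp4 files.

```python
def frames_decoded_for_random_access(frame_index: int, keyframe_interval: int) -> int:
    """Frames a decoder must process to output `frame_index`.

    Simplified model: every `keyframe_interval`-th frame is a keyframe and
    all other frames depend on the frames before them. An interval of 1
    means every frame is standalone, as with one image file per frame.
    """
    if keyframe_interval < 1:
        raise ValueError("keyframe_interval must be >= 1")
    return frame_index % keyframe_interval + 1


# Hypothetical video with a keyframe every 30 frames.
for index in (0, 29, 45):
    cost = frames_decoded_for_random_access(index, keyframe_interval=30)
    print(f"frame {index}: decode {cost} frame(s)")

# Single-image storage behaves like an interval of 1: always one decode.
print(frames_decoded_for_random_access(45, keyframe_interval=1))
```

In this model a uniformly random frame costs about half the keyframe interval in decodes on average, while single-image storage always costs one.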

I wrote a Rust-based Python extension for parallel random read and decode, called images-rs, to showcase this speedup.

Batch Size Performance (720p @ 30fps)

Frames images-rs TorchCodec Speedup
10 46ms 108ms 2.3x
32 100ms 346ms 3.5x
64 163ms 652ms 4.0x

Note: TorchCodec searches for the frame inside the mp4, while images-rs looks up images in a folder that contains each frame as a single image.
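The access pattern being benchmarked (pick random frames from a folder, read them in parallel) can be sketched with the Python standard library alone. This is not the images-rs API: `load_frame` and `load_batch` are hypothetical names, the file extension is assumed, and the decode step is left as a placeholder that returns the raw encoded bytes.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List, Optional


def load_frame(path: Path) -> bytes:
    """Read one frame file.

    Placeholder: returns the encoded bytes. A real loader would decode the
    AVIF payload into a pixel array at this point.
    """
    return path.read_bytes()


def load_batch(
    frame_dir: Path,
    batch_size: int,
    workers: int = 8,
    seed: Optional[int] = None,
) -> List[bytes]:
    """Sample `batch_size` distinct frames at random and read them in parallel."""
    paths = sorted(frame_dir.glob("*.avif"))
    rng = random.Random(seed)
    chosen = rng.sample(paths, k=min(batch_size, len(paths)))
    # File reads release the GIL, so threads overlap the I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_frame, chosen))
```

Calling `load_batch(Path("frames"), batch_size=32)` on a folder of per-frame files returns 32 payloads without scanning any container for frame offsets. In pure Python the decode itself would still need a native library to run in parallel.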

[Figures: benchmark plots comparing images-rs (Rust parallel image reader) with TorchCodec. Higher FPS is better.]

My Code:


💾 Storage: The Trade-off

Storing single-file images takes more storage than mp4: about 2x more than the lerobot mp4 format.

That said, this can also be seen as a feature, as it makes it easier to:

  • Stream single images instead of having to download the mp4 videos. Mp4 streaming with tools like TorchCodec needs to partially download the mp4 until it reaches the keyframe for the requested random image, which makes it much heavier on ingress cost.
  • Support longer episodes or even 24/7 robotics data collection without depending on a single file to hold all the data.
  • Encode fine-grained metadata in each image, so that we can, for example, store GPS data, orientation, FOV, focus, lighting, ... for each image. I believe that those fine details at the camera and frame level can have a great impact on the performance of the model.
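As an illustration of the last two points, the sketch below names each frame by camera and capture timestamp, so no episode boundary is needed, and writes per-frame metadata as a JSON sidecar file. The sidecar is a stand-in for embedding metadata in the image itself; the directory layout, the `FrameMetadata` fields, and the function names are all assumptions and not part of the author's code.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class FrameMetadata:
    """Hypothetical per-frame record; the fields are examples only."""
    camera: str
    timestamp_ns: int
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    fov_deg: Optional[float] = None


def frame_path(root: Path, meta: FrameMetadata) -> Path:
    """One file per frame, keyed by camera and capture time.

    Zero-padding the timestamp keeps lexicographic order equal to time order.
    """
    return root / meta.camera / f"{meta.timestamp_ns:020d}.avif"


def write_sidecar(root: Path, meta: FrameMetadata) -> Path:
    """Write the metadata as JSON next to where the frame would be stored."""
    path = frame_path(root, meta).with_suffix(".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(meta)))
    return path
```

Because every frame carries its own timestamp and record, a continuous 24/7 capture can be sliced into training windows after the fact, without deciding on episode boundaries at recording time.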

The additional storage cost should be weighed against the reduced GPU data-loading time and the reduced ingress cost compared to mp4 streaming.

My Code:

Benchmarked on an RTX 5080 with 24 CPUs. Results vary by implementation.

Community

Good work Tao,

  1. What's the CPU time for AV1 if the GPU is not being used?
  2. Did you find ways to reduce the storage for AV1 when it is streamed and loaded onto a device with constrained storage and memory?
Article author

GPU is never used.

Decoding is actually very fast.

I mean, for streaming you might not need to store it?

But things you can do to improve storage are:

  • Batch frames within one .avif file.
  • Tweak .avif settings depending on what you need.
