arxiv:2503.03734

OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Published on Mar 5

Submitted by mlfu7 on Mar 12


Authors:

Letian Fu ,

Abstract

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
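To make "text-aware visual feature extraction" more concrete, here is a minimal NumPy sketch of one plausible reading of the idea: each instruction token attends over the image patch features from a frozen encoder, and only the resulting similarity-weighted summaries are handed to the policy. The function name `text_aware_visual_features`, the cosine-similarity softmax, the temperature value, and the random stand-in features are all assumptions made for illustration. This is not the authors' released implementation.

```python
import numpy as np


def text_aware_visual_features(text_tokens, patch_tokens, temperature=0.07):
    """Pool frozen patch features into one visual token per text token.

    Hypothetical helper, not taken from the OTTER codebase.
    text_tokens:  (T, D) array of per-token text features.
    patch_tokens: (P, D) array of per-patch visual features in the same space.
    Returns (pooled, weights) with shapes (T, D) and (T, P).
    """
    # L2-normalize so that dot products are cosine similarities.
    t = text_tokens / np.linalg.norm(text_tokens, axis=-1, keepdims=True)
    v = patch_tokens / np.linalg.norm(patch_tokens, axis=-1, keepdims=True)

    # Similarity of every text token to every patch, sharpened by the
    # temperature (0.07 is an assumed value, not one stated in the abstract).
    logits = (t @ v.T) / temperature                       # (T, P)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over patches

    # Each output row is a similarity-weighted average of patch features.
    pooled = weights @ patch_tokens                        # (T, D)
    return pooled, weights


rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))      # stand-in for 4 instruction tokens
patches = rng.normal(size=(49, 16))  # stand-in for a 7x7 grid of patches
pooled, weights = text_aware_visual_features(text, patches)
print(pooled.shape, weights.shape)
```

Under these assumptions the policy receives `T` visual tokens rather than all `P` patches, and the encoders themselves never receive gradients, which matches the abstract's claim that they can stay frozen.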


Community

mlfu7

Paper author Paper submitter about 8 hours ago


An easy way to get instruction-following capability for vision-language-action models without breaking the bank on training / data collection :) Code and datasets are fully open-sourced!

400M pretrained CLIP (frozen) + ~20/30M policy network, trainable on a single workstation within 12 hrs. Works on small-scale datasets (<1000 trajectories on a few different tasks).
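As a rough back-of-the-envelope check on the sizes quoted above, the snippet below computes what share of the total parameters would actually be trained. It assumes "~20/30M" refers to two approximate policy sizes and that the 400M CLIP encoder receives no gradient updates. The constant and function names are made up for illustration.

```python
FROZEN_CLIP_PARAMS = 400e6    # frozen encoder size quoted in the comment
POLICY_PARAMS = (20e6, 30e6)  # assumed reading of "~20/30M": two policy sizes


def trainable_fraction(trainable, frozen):
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


for policy in POLICY_PARAMS:
    share = trainable_fraction(policy, FROZEN_CLIP_PARAMS)
    print(f"{policy / 1e6:.0f}M policy: {share:.1%} of parameters are trainable")
```

On these numbers only about 5 to 7 percent of the parameters are updated, which fits the single-workstation training budget mentioned above.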

