We introduce EO-1, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M, web multimodal data, and robot control data (AgiBotWorld, Open X-Embodiment, RoboMIND, SO100-Community, etc.). EO-1 adopts a single unified decoder-only transformer that integrates discrete autoregressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model. This work highlights the following features:
- Unified Architecture: A single decoder-only transformer integrating text, image, video, and actions.
- EO-Data1.5M Dataset: 1.5M high-quality interleaved samples (Physical, Reasoning, Spatial, Control).
- Interleaved Pretraining: Seamless synergy between language and action via autoregressive decoding plus flow matching.
- Reasoning-Enhanced Generalization: Superior generalization through multimodal embodied reasoning and real-robot control.
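The unified decoding described above combines two generation modes in one backbone: discrete tokens are decoded autoregressively, while continuous action chunks are produced by integrating a learned flow-matching velocity field conditioned on the same multimodal prefix. Below is a minimal conceptual sketch of the action-denoising side only; the module names and signatures are stand-ins, not the actual EO-1 API.

# Conceptual sketch only (not the actual EO-1 implementation): one shared backbone
# conditions on the multimodal prefix and predicts a velocity field for the action chunk.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the shared decoder-only transformer."""
    def __init__(self, dim=64, action_dim=7):
        super().__init__()
        self.action_in = nn.Linear(action_dim + 1, dim)  # embed noisy actions + flow time t
        self.mix = nn.Linear(dim, dim)

    def forward(self, prefix, noisy_actions, t):
        # prefix: (1, L, dim) multimodal context; noisy_actions: (1, chunk, action_dim)
        a = torch.cat([noisy_actions, t.view(1, 1, 1).expand(1, noisy_actions.shape[1], 1)], dim=-1)
        h = self.action_in(a) + prefix.mean(dim=1, keepdim=True)
        return self.mix(torch.tanh(h))

@torch.no_grad()
def sample_action_chunk(backbone, action_head, prefix, chunk=16, action_dim=7, steps=10):
    """Flow-matching inference: integrate the predicted velocity field from pure
    noise (t = 0) toward a clean action chunk (t = 1) with simple Euler steps."""
    a = torch.randn(1, chunk, action_dim)        # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor(i * dt)                 # current flow time in [0, 1)
        v = action_head(backbone(prefix, a, t))  # predicted velocity field
        a = a + dt * v                           # Euler integration step
    return a                                     # denoised action chunk

backbone, head = TinyBackbone(), nn.Linear(64, 7)
actions = sample_action_chunk(backbone, head, prefix=torch.randn(1, 32, 64))
print(actions.shape)  # torch.Size([1, 16, 7])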
Clone the repository:
git clone https://github.com/EO-Robotics/EO-1.git
cd EO-1
Create a conda environment and install dependencies:
# create conda environment
conda create -n eo python=3.10
conda activate eo
pip install --upgrade setuptools
# install flash-attn 2
MAX_JOBS=4 pip install flash-attn==2.8.3 --no-build-isolation
# [recommended] install flash-attn 3 from source for best performance on H100 / H800 GPUs with CUDA 12.8
# git clone https://github.com/Dao-AILab/flash-attn.git -b v2.8.3 --recursive --depth 1
# cd flash-attn/hopper && python setup.py install
pip install -e .
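Optionally, verify that the core dependencies import cleanly and that a GPU is visible (a quick sanity check, assuming the steps above completed without errors):

python -c "import torch, flash_attn; print(torch.__version__, flash_attn.__version__, torch.cuda.is_available())"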
- Load Dataset and Customization - Learn how to load and customize datasets in LeRobot format
- Fine-tuning on Custom Data - Step-by-step guide for training EO-1 on your own data
- Evaluation and Deployment - Deploy trained models and run evaluations
- Advanced Pre-training - Large-scale pre-training workflows
- Demo Training - Quick start with demo data and debug mode
- Libero Benchmark - Tuning on Libero benchmark tasks
- SimplerEnv Benchmark - Tuning on SimplerEnv benchmark, including WidowX and Google Robot
- SO101 Tasks - SO100 collection manipulation tasks
- WidowX Platform - WidowX robot specific training and evaluation
- AgiBot Platform - AgiBot robot training and deployment
- Franka Platform - Franka robot manipulation tasks
- Vision-Language Evaluation - Multi-modal benchmark evaluation
- Large-scale Pre-training - Multi-stage pre-training with 128+ GPUs
EO-1 is built entirely on HuggingFace Transformers and LeRobot, making deployment straightforward and accessible. If your environment supports transformers and lerobot, you can load the model and run inference directly with just a few lines of code (requires ~6.5 GB of GPU memory). EO-1 unifies high-level embodied reasoning with low-level robot control, producing either natural language outputs or actionable robot commands.
import torch
from transformers import AutoModel, AutoProcessor

# load model and processor
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/EO-1-3B", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "IPEC-COMMUNITY/EO-1-3B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).eval().cuda()
# prepare the model input: camera frames, proprioceptive state, and a language instruction
batch = {
    "observation.images.image": [img],              # main camera frame
    "observation.images.wrist_image": [wrist_img],  # wrist camera frame
    "observation.state": [state],                   # proprioceptive state vector
    "task": ["Pick up a red piece and place it at (0, 2)."],
}
# 1. action sampling [robot control]
output = processor.select_action(model, batch)
print(output.action)
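In the snippet above, `img`, `wrist_img`, and `state` are user-provided observations; their exact formats and shapes depend on your robot and processor configuration. A minimal sketch, assuming PIL images and a NumPy proprioceptive vector (paths and the state dimension are illustrative only):

import numpy as np
from PIL import Image

# Illustrative observation preparation; replace paths, resolution, and state
# dimensionality with values matching your robot and dataset configuration.
img = Image.open("demo_data/obs_main.png").convert("RGB")         # main camera frame
wrist_img = Image.open("demo_data/obs_wrist.png").convert("RGB")  # wrist camera frame
state = np.zeros(7, dtype=np.float32)                             # e.g., joint positions + gripper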
# prepare conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "demo_data/example2.png"},
            {
                "type": "text",
                "text": "You are a helpful physical agent equipped with both reasoning and robotic control. "
                "You see the Tic-Tac-Toe board, think strategically, act logically, and block threats.",
            },
        ],
    },
]
# 2. text generation [multimodal reasoning]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")
input_length = inputs["input_ids"].shape[1]  # prompt length, used to strip the prompt from the output
outputs = model.generate(**inputs, max_new_tokens=1024, return_dict_in_generate=True)
generated_ids = outputs.sequences
text = processor.decode(generated_ids[0, input_length:])
print(text)
We use LeRobot as the primary source for robot control training and evaluation, with Any4LeRobot providing convenient data conversion and preprocessing utilities. For multimodal data (e.g., images, video, text, points, and bounding boxes), we follow the Qwen2.5-VL and Qwen2-VL-Finetune recipes. In interleaved pretraining, we integrate the EO-Data1.5M dataset, a large-scale, high-quality embodied dataset designed to unify reasoning and control. Data are organized in a standardized format, as shown below:
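As a rough illustration only (the field names besides `lerobot` and `view`, and all values, are hypothetical; see getting_started/1_load_dataset for the authoritative schema), an interleaved sample pairs a multimodal conversation with a pointer to a robot trajectory:

# Hypothetical interleaved record; only the `lerobot` and `view` fields are named
# in the text below, everything else is illustrative.
sample = {
    "conversations": [  # multimodal reasoning turns (text interleaved with images/videos)
        {"from": "human", "value": "<image> What should the robot do to clear the table?"},
        {"from": "gpt", "value": "Pick up the cup, place it on the tray, then wipe the surface. <action>"},
    ],
    "lerobot": "demo25/episode_000001",  # links the sample to a LeRobot episode (action chunks)
    "view": "observation.images.head",   # which camera stream the conversation refers to
}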
Here, the `lerobot` and `view` fields connect actions with multimodal conversations, enabling the model to capture the rich temporal dynamics and causal dependencies among vision, language, and action modalities, a core requirement for robust performance in open-world embodied interactions. For more details, please refer to [getting_started/1_load_dataset](getting_started/1_load_dataset.ipynb). To combine robot control data and multimodal data, we support a flexible YAML-based configuration, where each dataset can be assigned a weight and a sampling strategy. This makes it easy to balance embodied control trajectories with multimodal reasoning data for interleaved training. For example:
# @multimodal data config
mm_datasets:
  # classical multimodal data
  - json_path: demo_data/refcoco/refcoco.jsonl # jsonl file
    vision_base_path: demo_data/refcoco # base path for vision data files referenced in the JSONL
    sampling_strategy: random:100% # sampling strategy

  # interleaved data jsonl; relies on `lerobot_datasets` to load the robot control data
  - json_path: demo_data/interleaved_demo.jsonl

# @robot control config
lerobot_datasets:
  - repo_id: demo25
    root: ./demo_data

    # Optional fields:
    episodes: [1, 2, 3] # specific episodes to load (None = all)
    train_subtask: mix:0.9 # mix sub-task instructions and overall instructions with 90% sub-task
    delta_action: false # train with delta actions
    state_mode: "MEAN_STD" # state normalization mode
    select_video_keys: # which camera streams to load
      [
        observation.images.head,
        observation.images.hand_left,
        observation.images.hand_right,
      ]
    select_state_keys: # proprioceptive states
      [observation.states.joint.position, observation.states.effector.position]
    select_action_keys: # action targets
      [actions.joint.position, actions.effector.position]
    effector_indices: [14, 15] # indices of effector channels in the flattened action vector
    weight: 1.0 # dataset weight for sampling
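As a rough sketch of how per-dataset `weight` (and `sampling_strategy`) values can drive the data mixture, consider normalizing the weights into sampling probabilities; this illustrates the idea only and is not the actual EO-1 dataloader:

import random

# Illustrative mixture: normalize dataset weights into sampling probabilities,
# then pick which source the next training sample comes from.
sources = {"demo25": 1.0, "refcoco": 0.5, "interleaved_demo": 0.5}  # weights are hypothetical
names, weights = zip(*sources.items())

def next_source() -> str:
    """Sample a dataset name proportionally to its weight."""
    return random.choices(names, weights=weights, k=1)[0]

counts = {n: 0 for n in names}
for _ in range(1000):
    counts[next_source()] += 1
print(counts)  # roughly a 2:1:1 split across the three sources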
Mastering diverse manipulations on multiple embodiments, EO-1 demonstrates robustness and adaptability by performing a wide range of dexterous manipulation tasks across heterogeneous robotic platforms. We evaluate its performance on both short-horizon and long-horizon tasks spanning the Franka Panda, WidowX 250 S, AgiBot G-1, and LeRobot SO100.
To fine-tune EO-1 on your own embodiment, you only need to adapt the configuration file. Specifically, convert your dataset into the LeRobot format, then define the fields that describe where your videos, states, and actions are located. The following YAML snippet shows a typical setup:
# @multimodal data config
# leave empty if only robot control data
mm_datasets:

lerobot_datasets:
  - repo_id: libero_spatial_no_noops_1.0.0_lerobot # replace with your dataset name
    root: ./demo_data/ # replace with your dataset root path
    select_video_keys: # replace with your feature keys
      [
        observation.images.image,
        observation.images.wrist_image,
      ]
    select_state_keys: [observation.state]
    select_action_keys: [action]

  - repo_id: libero_90_no_noops_lerobot
    root: HF_LEROBOT_HOME
    # If not specified, uses all keys by default
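Here, `HF_LEROBOT_HOME` typically resolves to the LeRobot dataset cache (by default `~/.cache/huggingface/lerobot`). If your converted datasets live elsewhere, export it before training; the path below is only an example:

export HF_LEROBOT_HOME=/data/lerobot_datasets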
Once your dataset is prepared and the configuration file (e.g., example.yaml) is set up, you can launch fine-tuning with the following command. We use accelerate launch (which wraps torchrun) for distributed and multi-GPU training, while the remaining arguments control the training mode, optimization, and which model components to freeze or update. Please refer to the launch scripts in experiments/1_demo and experiments/2_libero to start a demo training.
accelerate launch $ACCELERATE_ARGS scripts/train.py \
${model_name_or_path:+--model-name-or-path $model_name_or_path} \
${deepspeed:+--deepspeed configs/${deepspeed}.json} \
--vlm-name-or-path ../pretrained/Qwen2.5-VL-3B-Instruct \
--train-lerobot-only ${lerobot_only} \
--data-path ${dataset} \
--chunk-size ${chunk_size} \
--dataloader-num-workers ${data_num_workers} \
--freeze-vision-tower False \
--freeze-llm False \
--freeze-merger False \
--bf16 True \
--tf32 True \
--fp16 False \
--num-train-epochs ${epoch} \
--per-device-train-batch-size ${PER_DEVICE_BATCH_SIZE} \
--gradient-accumulation-steps 1 \
--learning-rate ${lr} \
--merger-lr ${mlr} \
--vision-lr ${vlr} \
--weight-decay 0.1 \
--warmup-ratio 0.03 \
--lr-scheduler-type cosine \
--logging-steps ${logging_steps} \
--gradient-checkpointing True \
--save-strategy steps \
--save-steps ${save_steps} \
--save-total-limit 3 \
--report-to ${report} \
--run-name ${run_name} \
--attn-implementation flash_attention_2
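The ${...} placeholders above are shell variables; a hypothetical single-node setup might define them as follows before running the command (all values are illustrative, not tuned recommendations; leave model_name_or_path and deepspeed unset to skip those optional flags):

# Illustrative placeholder values; adjust to your hardware, data, and budget.
ACCELERATE_ARGS="--num_processes 8"   # 8 GPUs on one node
dataset=examples/example.yaml         # the YAML data config described above (hypothetical path)
lerobot_only=True                     # no multimodal data in this example
chunk_size=16
data_num_workers=8
epoch=3
PER_DEVICE_BATCH_SIZE=8
lr=1e-4; mlr=1e-5; vlr=1e-5           # base / merger / vision-tower learning rates
logging_steps=10
save_steps=1000
report=wandb
run_name=eo1-finetune-demo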
Mastering Diverse Manipulations on Multiple Embodiments. More details can be found in experiments/2_libero, experiments/3_simpler, and experiments/8_vllmeval.
| Model | Franka Pick-and-Place (7 Tasks) | AgiBot Long-horizon Dexterity (4 Tasks) | WidowX Out-of-Box (13 Tasks) | Reasoning Control (4 Tasks) |
| --- | --- | --- | --- | --- |
| | 0.610 | 0.449 | 0.227 | – |
| | 0.831 | 0.672 | 0.693 | 0.525 |
| GR00T-N1.5 | 0.857 | 0.681 | 0.705 | 0.617 |
| EO-1 | 0.935 | 0.807 | 0.852 | 0.831 |
Multi-modal Benchmark Results
| Model | RoboVQA | ERQA | EO-Bench @ Spatial | EO-Bench @ Temporal | Overall |
| --- | --- | --- | --- | --- | --- |
| Claude 3.5 | 26.7 | 35.5 | 24.0 | 34.8 | 30.3 |
| GPT-4o (2024-11-20) | 47.2 | 40.0 | 35.6 | 39.3 | 40.5 |
| Qwen2.5 VL 3B | 55.9 | 35.3 | 20.0 | 22.6 | 33.5 |
| Magma 8B | 30.3 | 29.3 | 29.4 | 36.7 | 31.4 |
| EO-1 (3B) | 58.5 | 45.5 | 36.4 | 38.9 | 44.8 |
Robot Control Benchmark Results
| Model | LIBERO | Simpler @ Google VM | Simpler @ Google VA | Simpler @ WidowX VM |
| --- | --- | --- | --- | --- |
| | 0.942 | 0.714 | 0.714 | 0.692 |
| | 0.855 | 0.464 | 0.464 | 0.321 |
| GR00T-N1 | 0.939 | – | – | – |
| Magma | – | 0.488 | 0.488 | 0.448 |
| EO-1 | 0.982 | 0.765 | 0.765 | 0.727 |
- Release EO-1 pretraining and fine-tuning scripts, and documentation.
- Integrate into LeRobot. The PR has been merged into the main branch; EO-1 can now be used with LeRobot without any modifications.
- Release pre-trained models, the interleaved dataset EO-Data1.5M, and the benchmark EO-Bench.
- Efficient LLM inference over long sequences, efficient KV-cache, etc.
- Integrate human-feedback fine-tuning.
We welcome contributions! Please check out CONTRIBUTING.md. Join our community on Discord.
If you find this project useful, please consider citing:
@article{eo1,
  title={EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control},
  author={Delin Qu and Haoming Song and Qizhi Chen and Zhaoqing Chen and Xianqiang Gao and Xinyi Ye and Qi Lv and Modi Shi and Guanghui Ren and Cheng Ruan and Maoqing Yao and Haoran Yang and Jiacheng Bao and Bin Zhao and Dong Wang},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2508.21112}
}
EO-1 is built with reference to the code of the following projects:
Thanks for their awesome work!