Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence

WFM System Group
Beijing Innovation Center of Humanoid Robotics (X-Humanoid)
Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang Liu, Yue Zhao, Junbo Qi, Qinfan Zhang, Dengjie Li, Yidong Wang, Jiachen Luo, Yong Dai, Jian Tang, Xiaozhu Ju
{vito.dai, jason.ju}@x-humanoid.com
Arxiv | Code | Hugging Face | ModelScope

Abstract

Pelican-VL 1.0 Demo

This report presents Pelican-VL 1.0, a new family of open-source embodied brain models with parameter scales ranging from 7 billion to 72 billion. Our mission is to embed powerful intelligence into a wide range of embodiments. Pelican-VL 1.0 is currently the largest-scale open-source embodied multimodal brain model. Its core advantage lies in the tight integration of large-scale data with an adaptive learning mechanism: our metaloop distills a high-quality dataset from a raw corpus containing over 4 billion tokens. Pelican-VL 1.0 is trained on a cluster of more than 1,000 A800 GPUs, consuming over 50,000 A800 GPU-hours per checkpoint. This yields a 20.3% performance uplift over its base model; Pelican-VL 1.0 outperforms 100B-level open-source counterparts by 10.6%, placing it on par with leading proprietary systems on well-known embodied benchmarks.

Benchmark Performance

Pelican-VL 1.0 achieves strong benchmark performance, showing clear gains over ≤100B models and achieving SOTA results even against >100B models. Across our nine-dimension embodied-intelligence taxonomy, Pelican-VL 1.0 attains a well-balanced skill distribution and excels in key dimensions.

Comparison against models with ≤100B parameters

Comparison against models with >100B parameters

Pelican-VL 1.0 vs models >100B

Pelican-VL 1.0 vs models ≤100B

Model Performance Overview

Overall performance comparison on benchmarks. Bold and underlined numbers indicate the best and second-best results, respectively. A dagger (✝) marks results that differ from official reports or appear unusually low; this is likely because official evaluations used model-specific prompts and the affected models are prompt-sensitive, whereas our results are obtained under a unified protocol for fair comparison. An asterisk (*) denotes results reported from official sources. Yellow cells mark Pelican-VL 1.0.

Performance of Models with >100B Parameters
Performance of Models with ≤100B Parameters

Performance Evolution Trajectory Across Metaloop Training

The performance curves show that DPPO preserves general capabilities without catastrophic forgetting while progressively enhancing embodied intelligence. Pelican-VL steadily improves across both general and embodied benchmarks through alternating RL-driven weakness discovery and SFT-based consolidation.
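To make the alternating loop concrete, here is a minimal, hypothetical sketch of the metaloop control flow. The helper callables (`rl_probe_weaknesses`, `distill_dataset`, `sft_consolidate`, `evaluate`) are illustrative placeholders, not the released training code.

```python
"""Minimal, hypothetical sketch of the alternating metaloop (DPPO) described
above: RL-driven weakness discovery followed by SFT-based consolidation."""

from typing import Any, Callable, Iterable, List, Tuple


def metaloop(
    model: Any,
    data_pool: Iterable,
    rl_probe_weaknesses: Callable,  # rolls out the model, returns failure cases
    distill_dataset: Callable,      # filters/re-annotates failures into SFT data
    sft_consolidate: Callable,      # fine-tunes on SFT data plus a general replay mix
    evaluate: Callable,             # scores general and embodied benchmarks
    num_rounds: int = 4,
) -> Tuple[Any, List]:
    history = []
    for _ in range(num_rounds):
        # 1) RL phase: discover where the current checkpoint is weak.
        weak_cases = rl_probe_weaknesses(model, data_pool)
        # 2) Distill a targeted, high-quality SFT set from those weaknesses.
        sft_set = distill_dataset(weak_cases)
        # 3) SFT phase: consolidate new skills while replaying general data
        #    to avoid catastrophic forgetting of base capabilities.
        model = sft_consolidate(model, sft_set)
        # 4) Track both general and embodied scores across rounds.
        history.append(evaluate(model))
    return model, history
```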

Experimental Evolution

Downstream Applications

Generated Step-wise Affordances

Pelican-VL 1.0 can generate consistent affordances for multi-view inputs, so that 2D affordances can be triangulated into 3D target points as direct robot action commands.

(a) Raw multi-view inputs

(b) Visual grounding

(c) 2D-to-3D affordance triangulation

(d.1) Pick

(d.2) Move

(d.3) Place
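As an illustration of step (c), the sketch below shows a standard linear (DLT) triangulation that lifts matched 2D affordance points into a single 3D target, assuming calibrated cameras with known 3x4 projection matrices. This is a minimal sketch under those assumptions, not the released pipeline code.

```python
import numpy as np


def triangulate_affordance(points_2d, proj_mats):
    """Triangulate one 3D target point from matched 2D affordance points.

    points_2d : list of (u, v) pixel coordinates, one per camera view.
    proj_mats : list of 3x4 camera projection matrices (intrinsics @ extrinsics).
    Returns the 3D point in the common world/robot frame.

    Standard linear (DLT) triangulation; assumes calibrated, synchronized views.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        P = np.asarray(P, dtype=float)
        # Each view contributes two linear constraints on the homogeneous 3D point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous least-squares solution: right singular vector of A
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

Given the grounded pixels from (b) in each view and the corresponding projection matrices, the returned 3D point can be passed to the manipulation stack as a grasp or placement target.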

Affordance manipulation generalizes effectively to novel objects and scenes

Pelican-VL 1.0 generates accurate multi-view visual grounding and affordance-based task decompositions. Grasp points are highlighted in green, avoidance regions in yellow, and placement targets in pink. A sketch of how such annotations can be rendered is given after the example tasks below.

Put the red apple into the blue plate

Put the snack package into the blue plate

Put the orange into the bamboo basket

Put the red apple into the bamboo basket
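The sketch below shows how a structured affordance prediction could be rendered with this color convention. The output schema and the pixel coordinates are hypothetical, introduced only for illustration; the model's actual output format may differ.

```python
import cv2  # OpenCV, used here only to draw the annotations

# Hypothetical structured output for "Put the red apple into the blue plate".
affordance = {
    "grasp_point":  (412, 305),            # on the red apple
    "avoid_region": (500, 250, 580, 330),  # x1, y1, x2, y2 around an obstacle
    "place_point":  (690, 340),            # center of the blue plate
}

# BGR colors matching the visualization convention:
# grasp in green, avoidance in yellow, placement in pink.
GREEN, YELLOW, PINK = (0, 255, 0), (0, 255, 255), (203, 192, 255)


def draw_affordance(image, aff):
    """Overlay grasp/avoid/place annotations on a single camera view."""
    img = image.copy()
    cv2.circle(img, aff["grasp_point"], 8, GREEN, thickness=-1)
    x1, y1, x2, y2 = aff["avoid_region"]
    cv2.rectangle(img, (x1, y1), (x2, y2), YELLOW, thickness=2)
    cv2.circle(img, aff["place_point"], 8, PINK, thickness=-1)
    return img
```

Applied independently to each camera view, these overlays correspond to the green, yellow, and pink marks shown in the figures above.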

Long-Horizon Task Reasoning and Planning

Pelican-VL 1.0 interprets a natural-language instruction and autonomously performs sequential actions: placing shoes on the rack, throwing the garbage into the trash can, and loading clothes into the washing machine. A sketch of this sub-task decomposition is given after the task breakdown below.

Pelican perceives the environment and receives the human instruction:

"Please put the shoes on the shoe rack, throw the garbage on the table into the trash can, and put the dirty clothes on the sofa into the washing machine."

Task 1 (Place shoes on rack): Grab the shoes → Move to shoe rack → Place on the shoe rack → Task 1 completed

Task 2 (Throw garbage into trash can): Go to the table → Grab the milk carton → Throw into trash can → Task 2 completed

Task 3 (Put clothes into washing machine): Grab the dirty clothes → Move to the washer → Put into washing machine → Close the door → Task 3 completed
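The sketch below illustrates one way such an instruction can be decomposed into sub-tasks of atomic actions and executed sequentially. The data structures and the executor interface are hypothetical, not the model's actual planning output.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubTask:
    """One sub-task of the long-horizon instruction, as a list of atomic actions."""
    name: str
    actions: List[str] = field(default_factory=list)


# Hypothetical decomposition of the instruction above; in practice the plan
# would be produced by Pelican-VL 1.0 from the image observations + instruction.
plan = [
    SubTask("Place shoes on rack",
            ["grab shoes", "move to shoe rack", "place on shoe rack"]),
    SubTask("Throw garbage into trash can",
            ["go to table", "grab milk carton", "throw into trash can"]),
    SubTask("Put clothes into washing machine",
            ["grab dirty clothes", "move to washer",
             "put into washing machine", "close the door"]),
]


def execute_plan(plan: List[SubTask], execute_action: Callable[[str], bool]) -> bool:
    """Run sub-tasks in order; stop and report if any atomic action fails."""
    for task in plan:
        for action in task.actions:
            if not execute_action(action):
                print(f"Failed: '{action}' in sub-task '{task.name}'")
                return False
        print(f"{task.name}: completed")
    return True
```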

Gentle Grasp for Delicate and Compliant Objects

The initial force prior provided by Pelican-VL 1.0 accelerates the convergence of the online adaptive controller.

Sponge wipe

Blue balloon

Chips

Soft labobo
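As a minimal sketch of this idea, the controller below starts from a model-provided force prior and adapts online to a slip signal. The control law, gains, and sensor interface are simplified assumptions, not the deployed controller.

```python
def adaptive_grasp(force_prior_n, read_slip, apply_force,
                   steps=200, gain=0.05, decay=0.01, max_force_n=15.0):
    """Simple online adaptation of grasp force around a prior.

    force_prior_n : initial force (N) suggested by the brain model for the
                    object (e.g. lower for a balloon than for a chip bag).
    read_slip     : callable returning a non-negative slip signal from tactile sensing.
    apply_force   : callable commanding the gripper force in newtons.

    The prior only sets the starting point: the loop increases force when slip
    is detected and relaxes it slowly otherwise, so a good prior mainly
    shortens the time to convergence.
    """
    force = force_prior_n
    for _ in range(steps):
        apply_force(force)
        slip = read_slip()
        if slip > 0.0:
            force = min(force + gain * slip, max_force_n)  # tighten on slip
        else:
            force = max(force - decay, 0.0)                # relax gently
    return force
```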

Citation

If you use any of the source code or datasets included in this toolkit in your work, please cite the following paper. The BibTeX entry is listed below:

@article{Pelican-VL-1.0,
  title={Pelican-VL 1.0: A Foundation Brain Model for Embodied Intelligence},
  author={Yi Zhang and Che Liu and Xiancong Ren and Hanchu Ni and Shuai Zhang and Zeyuan Ding and Jiayu Hu and Hanzhe Shan and Zhenwei Niu and Zhaoyang Liu and Yue Zhao and Junbo Qi and Qinfan Zhang and Dengjie Li and Yidong Wang and Jiachen Luo and Yong Dai and Jian Tang and Xiaozhu Ju},
  journal={arXiv preprint arXiv:2511.00108},
  year={2025}
}