Yuhan Wu1*  Zhao Jin2*  Tao Li2  Yuheng Zhang2  Zhichao Wang1  Shuo Wang1
Jun Jiang1  Xiaobo Li1†  Yanyong Zhang1†  Jian Tang2†  Zhengping Che2†  Yan Xia1†
1University of Science and Technology of China 2Beijing Innovation Center of Humanoid Robotics
* Equal contribution † Corresponding authors
yan.xia@ustc.edu.cn, xiaoboli@ustc.edu.cn, z.che@x-humanoid.com, jian.tang@x-humanoid.com, yanyongz@ustc.edu.cn

From Real Chemistry to Precision Benchmarking

The first benchmark for humanoid dexterous manipulation in precision-critical laboratory environments. Over 30 functionally faithful assets covering the core operations of routine organic chemistry experiments, with articulated instruments, particle-based powder physics, and closed-loop instrument readouts enabling a complete manipulation-to-measurement pipeline.

6 Atomic Operations + 1 Procedural Workflow

All tasks are derived from real organic chemistry standard operating procedures.

1. Door Open

Pinch-and-slide grasp on the balance windshield door handle, moving the door along its prismatic joint.

Discrete · Single-Arm · Instrument

2. Grasp & Place

Precision placement of weighing boat on balance pan center. Position error ≤ 15 mm.

Discrete · Single-Arm · Precision: ≤15 mm

3. Door Close

Close the balance windshield door, reversing the open operation along the prismatic joint.

Discrete · Single-Arm · Instrument

4. Tare Press

Single-finger extension to press the tare button, requiring dexterous hand control.

Discrete · Single-Arm · Instrument · Dexterous Required

5. Tool Pickup

Pick up the spatula, requiring sustained contact and controlled grasp force.

Sustained · Single-Arm

6. Scoop & Weigh

Bimanual powder scooping with fine-force control. Target: 0.850 ± 0.001 g.

Sustained · Bimanual · Instrument · Dexterous Required · Precision: ≤0.001 g

Solid Weighing SOP Workflow

7 steps composing 6 atomic operations · Precision target: 0.850 ± 0.001 g

Step 1: Door Open
Step 2: Grasp & Place
Step 3: Door Close
Step 4: Tare Press
Step 5: Door Open
Step 6: Tool Pickup
Step 7: Scoop & Weigh
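
As a minimal sketch of how the workflow composes the atomic operations, the step order above can be written as a plain list and checked against the stated counts. The variable and function names here are illustrative assumptions, not identifiers from the Labimus code.

```python
from collections import Counter

# Step order copied from the workflow listing above. The strings are plain
# labels chosen for illustration, not identifiers from the benchmark code.
SOLID_WEIGHING_WORKFLOW = [
    "Door Open",
    "Grasp & Place",
    "Door Close",
    "Tare Press",
    "Door Open",
    "Tool Pickup",
    "Scoop & Weigh",
]


def summarize(workflow):
    """Count steps, distinct atomic operations, and operations used more than once."""
    counts = Counter(workflow)
    return {
        "num_steps": len(workflow),
        "num_atomic_ops": len(counts),
        "repeated": sorted(op for op, n in counts.items() if n > 1),
    }


summary = summarize(SOLID_WEIGHING_WORKFLOW)
print(summary)
```

The seven steps cover six distinct operations because Door Open appears twice: once before placing the weighing boat and once before scooping.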
C1

First Benchmark for Humanoid Dexterous Manipulation in Precision-Critical Chemical Laboratories

Six atomic operations and a complete seven-step solid-weighing workflow derived from real laboratory SOPs.

C2

High-Fidelity Laboratory Simulation for Humanoid Manipulation

Over 30 functionally faithful assets with articulated instruments, particle-based powder physics, and closed-loop instrument readouts, providing realistic interaction and measurement.

C3

Precision-Aware Evaluation Protocol for Humanoid Laboratory Manipulation

Multi-level evaluation jointly measuring task completion, experimental precision, and long-horizon execution. Reveals the gap between completion and experimental validity.

30+ Functional Assets
6 Atomic Operations
3-Tier Evaluation Hierarchy
4 Perturbation Conditions

The First Benchmark for Humanoid Dexterous Laboratory Manipulation

No existing benchmark combines humanoid dexterous hands, precision-critical laboratory tasks, and quantitative evaluation.

Column groups: Benchmark · Embodiment · Protocol Grounding · Scientific Manip. · Evaluation
Columns: Hum. · Dex. · Lab · SOP · Hier. · Constr. · Bi. · Instr. · Tool · Mat. · Step · Prec.
Benchmarks compared: RLBench, RoboCasa, ManiSkill 3, LIBERO, RoboTwin 2, GenieSim 3, Factory, DexJoCo, LabUtopia, Chemistry3D, AutoBio, MATTERIX, Labimus (Ours)
Legend: Supported · Partial · Not supported

High-Fidelity Laboratory Simulation

Simulating laboratory tasks is more demanding than home or tabletop tasks. The powder must be transferred as measurable mass, the balance must update its reading in real time, and the task is scored against a protocol tolerance of ±0.001 g.

Functional Assets

Over 30 objects in three categories: containers, tools, and instruments. All are sourced from ArtVIP, with high-quality meshes, realistic textures, and physically accurate collision geometry.

Figure: Functional assets.

Powder Physics & Weighing

Each powder grain is a small rigid body resolved by PhysX. The closed-loop path from particle physics through mass computation to digital readout grounds the precision metric. Success: mass within ±0.001 g.

Figure: Powder physics and weighing.
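
The closed-loop path from grains to readout can be sketched as follows. This is only an illustration under stated assumptions: the `Grain` record, the 500 µg grain mass, the 1 mg display resolution, and the tare handling are hypothetical and are not taken from the Labimus implementation or from PhysX. Masses are kept as integers to avoid floating-point error near the tolerance boundary.

```python
from dataclasses import dataclass

TARGET_MG = 850     # 0.850 g target from the weighing task
TOLERANCE_MG = 1    # ±0.001 g protocol tolerance


@dataclass
class Grain:
    mass_ug: int    # grain mass in micrograms (hypothetical value)
    on_pan: bool    # True if the grain rests in the boat on the balance pan


def balance_readout_mg(grains, tare_ug=0):
    """Sum the mass resting on the pan, subtract the tare, round to 1 mg.

    The 1 mg display resolution is an assumption made for this sketch.
    """
    total_ug = sum(g.mass_ug for g in grains if g.on_pan) - tare_ug
    return (total_ug + 500) // 1000  # round half up, integer arithmetic


def within_tolerance(readout_mg):
    """Check the readout against the 0.850 ± 0.001 g target."""
    return abs(readout_mg - TARGET_MG) <= TOLERANCE_MG


# 1700 grains of 500 µg on the pan give exactly 850 mg; 10 spilled grains are ignored.
grains = [Grain(500, True)] * 1700 + [Grain(500, False)] * 10
readout = balance_readout_mg(grains)

# Five extra grains (2.5 mg) push the readout outside the tolerance.
overshoot = balance_readout_mg(grains + [Grain(500, True)] * 5)

print(readout, within_tolerance(readout))
print(overshoot, within_tolerance(overshoot))
```

With these illustrative numbers, a handful of surplus grains is enough to turn a completed scoop into a failed measurement, which is why the tolerance is checked on the readout rather than on the motion.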

Three-Tier Evaluation × Four Conditions

Task completion is a necessary but insufficient criterion for laboratory manipulation. Each tier adds a new evaluation dimension that reveals failure modes invisible at the previous tier.

Tier 1: Binary Success S

Task Completion

All 6 atomic operations evaluated with binary success only. Did the intended state change occur? Establishes a completion-rate baseline.

Tier 2: S + Precision P

Quantitative Precision

Operations with SOP tolerances re-evaluated with continuous precision metrics (e.g., ±0.001 g, ≤15 mm). Same physical actions, different criteria — directly quantifying the precision gap.

Tier 3: S + P + Step Progress SP

Long-Horizon Precision

Complete 7-step solid-weighing workflow with step-level progress tracking and stage-level diagnostics. Compounding errors degrade precision at later steps.
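
The three tiers can be read as three functions over the same episode record. The sketch below is an assumption-laden illustration: the `Episode` fields, the rule that P only counts completed episodes, and the definition of SP as the fraction of steps finished are choices made here for clarity, not the benchmark's published definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    completed: bool                     # did the intended state change occur?
    error: Optional[float] = None       # measured deviation (e.g. mm or g), if any
    tolerance: Optional[float] = None   # SOP tolerance in the same unit, if any
    steps_done: int = 0                 # workflow steps finished in order
    steps_total: int = 1                # workflow length


def binary_success(ep):
    """Tier 1: task completion only."""
    return ep.completed


def precision_success(ep):
    """Tier 2: completed AND within tolerance (assumed to require completion)."""
    if not ep.completed or ep.error is None or ep.tolerance is None:
        return False
    return abs(ep.error) <= ep.tolerance


def step_progress(ep):
    """Tier 3: fraction of workflow steps finished (assumed definition)."""
    return ep.steps_done / ep.steps_total


# A placement that lands 22 mm off centre: completed, but outside the 15 mm tolerance.
placement = Episode(completed=True, error=22.0, tolerance=15.0)

# A workflow run that stalls after the tare press (step 4 of 7).
workflow_run = Episode(completed=False, steps_done=4, steps_total=7)

print(binary_success(placement), precision_success(placement))
print(round(step_progress(workflow_run), 3))
```

The placement example counts as a success at Tier 1 and a failure at Tier 2, which is the gap the tiers are meant to expose.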

Perturbation Conditions

All tasks are evaluated on procedural layouts (Standard condition) with three perturbations layered on top, forming a 3 × 4 evaluation matrix.

📍

Standard

Procedural layouts with default lighting and textures.

💡

+Lighting

Randomized color temperature (3000–7000 K) and intensity (0.4–1.6×).

🎨

+Textures

Randomized benchtop (10 materials) and background (5 materials) via OmniPBR.

🔀

+Combined

All perturbation axes applied simultaneously, simulating realistic deployment.
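
A minimal sketch of how the four conditions and the 3 × 4 matrix might be enumerated is shown below. The ranges and material counts come from the descriptions above; the uniform sampling, the dictionary keys, and the use of `None` to mean "keep the scene default" are assumptions for illustration and do not reflect the actual OmniPBR or simulator calls.

```python
import random
from itertools import product

TIERS = ("S", "S+P", "S+P+SP")
CONDITIONS = ("Standard", "+Lighting", "+Textures", "+Combined")


def sample_perturbation(condition, rng):
    """Draw one perturbation config; None means the scene default is kept."""
    cfg = {
        "color_temp_k": None,
        "intensity_scale": None,
        "benchtop_material": None,
        "background_material": None,
    }
    if condition in ("+Lighting", "+Combined"):
        cfg["color_temp_k"] = rng.uniform(3000.0, 7000.0)   # 3000-7000 K
        cfg["intensity_scale"] = rng.uniform(0.4, 1.6)      # 0.4-1.6x
    if condition in ("+Textures", "+Combined"):
        cfg["benchtop_material"] = rng.randrange(10)        # 10 benchtop materials
        cfg["background_material"] = rng.randrange(5)       # 5 background materials
    return cfg


# Every tier is evaluated under every condition: 3 x 4 = 12 cells.
evaluation_matrix = list(product(TIERS, CONDITIONS))

rng = random.Random(0)
standard_cfg = sample_perturbation("Standard", rng)
combined_cfg = sample_perturbation("+Combined", rng)

print(len(evaluation_matrix))
print(standard_cfg)
print(combined_cfg)
```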


Completion ≠ Precision

Policies that successfully complete laboratory tasks can still fail to satisfy the quantitative tolerances required by experimental protocols.

Tier 1: Task Success Rate S (%)

Task | ACT | DP | π0
Door open | 56.7 ± 3.4 | 49.3 ± 2.5 | 47.3 ± 7.4
Door close | 24.7 ± 0.9 | 6.7 ± 2.5 | 40.7 ± 3.8
Tare press | 2.0 ± 0.0 | 0.0 ± 0.0 | 0.0 ± 0.0

Mean ± std across 3 seeds (50 episodes each).

Robustness: π0 Door Open (%)

Condition | Mean ± Std
Standard | 47.3 ± 7.4
+Lighting | 41.3 ± 5.0
+Textures | 46.7 ± 4.1
+Combined | 40.0 ± 5.7

The combined perturbation causes the largest drop relative to Standard (−7.3 pp).
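
The percentage-point drops can be recomputed from the reported means with a few lines. The variable names are illustrative; the numbers are copied from the robustness table above.

```python
# Mean success rates (%) for π0 on Door Open, copied from the robustness table.
rates = {
    "Standard": 47.3,
    "+Lighting": 41.3,
    "+Textures": 46.7,
    "+Combined": 40.0,
}

baseline = rates["Standard"]

# Change in percentage points relative to the Standard condition.
drops = {cond: rate - baseline for cond, rate in rates.items() if cond != "Standard"}

# The most negative change is the largest drop.
worst = min(drops, key=drops.get)

for cond, delta in drops.items():
    print(f"{cond}: {delta:+.1f} pp")
print("largest drop:", worst)
```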

Tier 2: Completion ≠ Precision (ACT, Grasp & Place, position error e_p ≤ 15 mm, %)

Metric | Seed 1 | Seed 2 | Seed 3 | Mean ± Std
Success S | 4.0 | 6.0 | 6.0 | 5.3 ± 0.9
Precision P | 2.0 | 6.0 | 2.0 | 3.3 ± 1.9
P_cond = 62.5%: only 5 of 8 completed episodes meet tolerance.
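
The summary statistics and the conditional precision can be reproduced from the per-seed rates in the table above. Two assumptions are made in this sketch: each seed has 50 episodes, as stated for the Tier 1 results, and the spread is the population standard deviation, which is the variant that matches the reported ± values.

```python
from statistics import mean, pstdev

EPISODES_PER_SEED = 50  # assumed to match the Tier 1 protocol

# Per-seed rates (%) copied from the Tier 2 table (ACT, Grasp & Place).
success_pct = [4.0, 6.0, 6.0]
precision_pct = [2.0, 6.0, 2.0]


def to_episode_count(rate_pct):
    """Convert a percentage over one seed into a whole number of episodes."""
    return round(rate_pct * EPISODES_PER_SEED / 100)


mean_s, std_s = mean(success_pct), pstdev(success_pct)
mean_p, std_p = mean(precision_pct), pstdev(precision_pct)

completed = sum(to_episode_count(r) for r in success_pct)    # episodes that succeeded
precise = sum(to_episode_count(r) for r in precision_pct)    # episodes within 15 mm

# Conditional precision: share of completed episodes that also meet tolerance.
p_cond = precise / completed

print(f"S = {mean_s:.1f} ± {std_s:.1f}")
print(f"P = {mean_p:.1f} ± {std_p:.1f}")
print(f"P_cond = {precise}/{completed} = {p_cond:.1%}")
```

Pooling episodes across seeds before dividing gives 5/8; averaging per-seed ratios instead would give a different number, so the pooled form is the one consistent with the reported 62.5%.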

Completion alone does not guarantee precision.
Binary success overestimates valid execution.


BibTeX

@article{wu2026labimus,
  title   = {Labimus: A Simulation and Benchmark for Humanoid
             Dexterous Manipulation in Chemical Laboratory},
  author  = {Wu, Yuhan and Jin, Zhao and Li, Tao and Zhang, Yuheng
             and Wang, Zhichao and Wang, Shuo and Jiang, Jun
             and Li, Xiaobo and Zhang, Yanyong and Tang, Jian
             and Che, Zhengping and Xia, Yan},
  journal = {arXiv preprint arXiv:2606.31037},
  year    = {2026}
}