Humanoid robot data · 14 June 2026

Data Modalities for Robot Training: What Each Signal Teaches

Robot training data is often sold as "multimodal," but more sensors do not automatically make a dataset better. A modality is useful when it explains a part of the episode the model needs to learn: what the robot saw, what state its body was in, what action was commanded, what contact occurred, what the task meant, and whether the attempt worked.

That is the difference between a nice demo clip and a training-ready robot dataset. Video can show that a drawer opened. Robot state and actions explain how the arm moved. Force or tactile data can show when the gripper caught, slipped, or overloaded. Language and outcome labels tell the model what the episode was trying to accomplish.

If you are new to the category, start with the broader guide to humanoid robot training data. This article goes one layer deeper: which data modalities matter, what each one teaches, and what buyers should ask before treating a modality list as evidence of quality.

The useful modality map

A practical robot training dataset usually combines several kinds of signal:

Vision: RGB video, depth, stereo, segmentation, point clouds, wrist views, head views, external views, or scene reconstructions.
Robot state: joint positions, velocities, torques, motor state, gripper opening, hand pose, end-effector pose, base pose, IMU, odometry, and coordinate frames.
Actions: the command sent to the robot or simulator, such as joint targets, Cartesian deltas, gripper commands, base velocity, whole-body commands, or low-level controls.
Contact: force-torque readings, tactile images, pressure maps, slip signals, collision events, and contact timing.
Task context: language instructions, object labels, scene metadata, subtask boundaries, success/failure labels, interventions, and recovery notes.
Provenance: capture method, robot embodiment, operator interface, simulator, timestamps, calibration, license, consent, and known limitations.

The strongest datasets keep these signals aligned on one episode timeline. A weak dataset may technically include many modalities, but still be hard to use because frames, actions, labels, and sensor logs cannot be replayed together.
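
As a rough illustration of what "aligned on one episode timeline" can mean in practice, the sketch below matches each action timestamp to the nearest camera frame and refuses matches that are too far apart. The function names, sample rates, and the 20 ms tolerance are assumptions made for the example, not part of any particular dataset's tooling.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the timestamp closest to t (timestamps must be sorted)."""
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_to_reference(reference_ts, stream_ts, max_gap_s):
    """For each reference timestamp, return the matching stream index,
    or None when the nearest sample is further away than max_gap_s."""
    aligned = []
    for t in reference_ts:
        j = nearest_index(stream_ts, t)
        aligned.append(j if abs(stream_ts[j] - t) <= max_gap_s else None)
    return aligned


# Hypothetical episode: a 10 Hz action log and a camera stream with a gap.
action_ts = [0.0, 0.1, 0.2, 0.3]
camera_ts = [0.000, 0.033, 0.066, 0.100, 0.133, 0.300]

matches = align_to_reference(action_ts, camera_ts, max_gap_s=0.02)
print(matches)  # [0, 3, None, 5]
```

The None entry is the useful part: it marks an action with no frame close enough to pair with, which is the kind of gap that makes an episode hard to replay.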

Vision is the robot's view, not just media

Vision is the most familiar modality, and for good reason. RGB video, depth, stereo, and multi-camera observations help a model understand objects, surfaces, human spaces, task progress, and visual feedback.

For robot learning, the camera viewpoint matters. An external camera may make a demonstration easy for a human to watch. A head, chest, or wrist camera shows what the robot actually observed while acting. Manipulation datasets often need wrist views because the hand can block external cameras at the exact moment contact happens.

Public datasets make the range clear. DROID focuses on in-the-wild manipulation with synchronized visual streams and robot data from a standardized Franka setup. Open X-Embodiment pools many robot datasets, but the visual modalities vary across contributors. AgiBot World uses richer humanoid and bimanual collection stacks with multiple cameras, depth, and tactile-capable hardware.

Buyer questions:

Which cameras were used, and where were they mounted?
Are RGB, depth, stereo, or point-cloud streams synchronized with actions and state?
Are camera intrinsics, extrinsics, frame rates, dropped frames, and coordinate transforms documented?
Does the camera perspective match the target robot's deployment view?
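
The dropped-frame question can often be checked directly from a sample file. The sketch below is one minimal way to do it, assuming the dataset exposes per-frame timestamps in seconds and a documented nominal frame rate; the function name and the 1.5x tolerance are illustrative choices.

```python
def find_frame_gaps(timestamps_s, nominal_fps, tolerance=1.5):
    """Return (index, gap_seconds, estimated_missing_frames) for every
    interval longer than tolerance times the nominal frame period."""
    period = 1.0 / nominal_fps
    gaps = []
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        if dt > tolerance * period:
            # A gap of roughly n periods implies about n - 1 missing frames.
            missing = round(dt / period) - 1
            gaps.append((i, dt, missing))
    return gaps


# Hypothetical 30 fps stream where frames 3 and 4 never arrived.
frame_ts = [k / 30.0 for k in (0, 1, 2, 5, 6)]

for index, gap_s, missing in find_frame_gaps(frame_ts, nominal_fps=30.0):
    print(f"gap before frame {index}: {gap_s:.3f} s, about {missing} frame(s) missing")
```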

Vision alone can be enough for perception work. For policy learning, it usually needs to be paired with state and action streams.

State and proprioception make behavior trainable

Robot state is the body's internal record. It can include joint positions, velocities, torques, gripper state, end-effector pose, base pose, IMU readings, battery or motor status, and controller-level metadata.

This modality is easy to underrate because it is less visual than video. It is also where many robot datasets become usable. If a clip shows a robot lifting a cup but the dataset does not include the robot's joint state, gripper state, control rate, and coordinate frames, the engineering team has to infer too much.

For humanoids, state is especially important because the body is not just an arm on a table. Balance, head pose, torso motion, foot placement, arm coordination, and hand configuration can all change whether a demonstration transfers to another robot. The broader guide to humanoid robot data collection equipment covers the sensors and logging stacks behind this layer.

Buyer questions:

Are state fields named, typed, timestamped, and documented?
Are positions, velocities, torques, and commanded targets separated?
Are coordinate frames, joint limits, robot model, hand model, and control frequency included?
Can one sample episode be replayed with state, video, and actions aligned?
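
A quick way to test the first two questions is to read the dataset's field documentation programmatically. The sketch below assumes a hypothetical schema dictionary in which each state field should carry a dtype, units, and a description; the field names and required keys are invented for illustration and will differ between datasets.

```python
# Keys that every documented field is assumed to need in this sketch.
REQUIRED_KEYS = ("dtype", "units", "description")


def undocumented_fields(schema):
    """Return {field_name: [missing keys]} for fields lacking documentation."""
    problems = {}
    for name, spec in schema.items():
        missing = [key for key in REQUIRED_KEYS if not spec.get(key)]
        if missing:
            problems[name] = missing
    return problems


# Hypothetical schema: measured values and commanded targets kept separate.
schema = {
    "joint_position": {
        "dtype": "float32[7]", "units": "rad",
        "description": "measured joint angles",
    },
    "joint_position_target": {
        "dtype": "float32[7]", "units": "rad",
        "description": "commanded joint angles",
    },
    "gripper_opening": {
        "dtype": "float32", "units": "",
        "description": "measured opening",
    },
    "ee_pose": {"dtype": "float32[7]"},
}

report = undocumented_fields(schema)
print(report)
# {'gripper_opening': ['units'], 'ee_pose': ['units', 'description']}
```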

State is not glamorous, but it is often the difference between a dataset that can be trained on and a dataset that can only be watched.

Actions are not the same as motion

Actions describe what command was sent, not only what movement was observed afterward. That distinction matters because imitation learning, diffusion policies, vision-language-action models, and many evaluation pipelines need a clear input-output relationship: observation in, action out.

Action data can be joint targets, Cartesian deltas, gripper commands, base controls, end-effector poses, or whole-body commands. In teleoperation data, it may also include operator hand poses, headset pose, leader-arm commands, controller inputs, haptic device state, or retargeted commands.
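
The difference between absolute and relative action spaces is easy to see in a small example. The sketch below converts absolute Cartesian position targets into per-step deltas and back; the helper names and the sample trajectory are assumptions for illustration, and orientation is left out to keep it short.

```python
def to_deltas(absolute_targets):
    """Convert absolute targets [p0, p1, ...] into per-step deltas [p1 - p0, ...]."""
    return [
        tuple(b - a for a, b in zip(prev, curr))
        for prev, curr in zip(absolute_targets, absolute_targets[1:])
    ]


def from_deltas(start, deltas):
    """Rebuild absolute targets by accumulating deltas from a known start."""
    out = [tuple(start)]
    for delta in deltas:
        out.append(tuple(p + dp for p, dp in zip(out[-1], delta)))
    return out


# Hypothetical end-effector position targets in metres (x, y, z).
targets = [(0.0, 0.0, 0.5), (0.25, 0.0, 0.5), (0.25, 0.5, 0.25)]

deltas = to_deltas(targets)
print(deltas)  # [(0.25, 0.0, 0.0), (0.0, 0.5, -0.25)]

rebuilt = from_deltas(targets[0], deltas)
print(rebuilt == targets)  # True
```

Note that the deltas alone cannot reproduce the trajectory: the starting pose is also needed, which is one reason the action space has to be documented rather than guessed.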

ALOHA and Mobile ALOHA are useful public references because they make the demonstration-to-action relationship explicit in bimanual and mobile manipulation settings. Humanoid datasets push the same issue further: a whole-body teleoperation episode should explain what the operator controlled, how those controls were mapped to the robot, and which signals are policy targets versus auxiliary context.

Buyer questions:

What is the action space, and is it absolute, relative, joint-space, Cartesian, low-level, or high-level?
Are actions the commands sent to the controller or values reconstructed after the fact?
Is there latency between operator input, robot command, and observed motion?
Were failures, corrections, and interventions preserved, or trimmed from the dataset?
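
The latency question can be roughly estimated from a sample episode if the commanded and measured values of the same quantity are both logged. The sketch below tries a range of sample shifts and keeps the one with the lowest mean squared error. It is a simplified illustration with made-up signals and an assumed 50 Hz control rate, not a substitute for the seller's own latency measurements.

```python
def estimate_lag(command, observed, max_lag):
    """Return the shift, in samples, of observed relative to command that
    minimises mean squared error over the overlapping region."""
    best_lag, best_err = 0, float("inf")
    for lag in range(max_lag + 1):
        pairs = list(zip(command, observed[lag:]))
        if not pairs:
            break
        err = sum((c - o) ** 2 for c, o in pairs) / len(pairs)
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


# Hypothetical single-joint signals: the measurement trails the command.
command = [0, 0, 1, 2, 3, 3, 2, 1, 0, 0, 0, 0]
observed = [0, 0, 0, 0, 0, 1, 2, 3, 3, 2, 1, 0]

control_rate_hz = 50.0  # assumed control rate for this example
lag = estimate_lag(command, observed, max_lag=5)
print(lag, "samples, about", lag / control_rate_hz, "s")
```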

Motion shows what happened. Actions explain what the learning system is expected to predict.

Force and tactile data reveal contact

Many manipulation failures are contact failures. The object slips. The gripper pinches too hard. A soft package deforms. A drawer sticks. A finger collides with an edge before the camera makes the problem obvious.

Force-torque and tactile modalities help with that missing layer. They can record contact timing, pressure distribution, slip, load, collision, and deformation. For gripper-heavy or dexterous-hand tasks, this can be more important than another camera angle.

RH20T is a useful broad example because its public materials describe RGB-D, robot state, action, force-torque, audio, human demonstration video, language descriptions, and a fingertip tactile subset. Newer tactile-language-action and visuotactile preprints show where the field is heading, but tactile data is still less standardized than camera, state, and action data.

Buyer questions:

What exactly is measured: force, torque, pressure, tactile image, slip, vibration, or contact event?
Where is the sensor mounted, and what surface or fingertip does it represent?
What are the units, sampling rate, calibration method, and saturation limits?
Does the dataset include negative contact examples, such as slips, failed grasps, collisions, or excessive force?
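
Two of these questions, contact timing and saturation, can be spot-checked on a force trace. The sketch below is one minimal approach, assuming force magnitudes in newtons and a documented sensor limit; the thresholds and the sample trace are invented for the example.

```python
def contact_onset(force_n, threshold_n, min_samples=3):
    """Return the index where |force| first stays above threshold_n for
    min_samples consecutive samples, or None if that never happens."""
    run = 0
    for i, f in enumerate(force_n):
        run = run + 1 if abs(f) > threshold_n else 0
        if run == min_samples:
            return i - min_samples + 1
    return None


def saturated_fraction(force_n, limit_n):
    """Return the fraction of samples at or beyond the sensor's stated limit."""
    if not force_n:
        return 0.0
    return sum(1 for f in force_n if abs(f) >= limit_n) / len(force_n)


# Hypothetical trace: one noise spike, then real contact that hits the limit.
force = [0.1, 0.2, 1.5, 0.1, 2.0, 4.0, 9.0, 20.0, 20.0, 6.0]

onset = contact_onset(force, threshold_n=1.0)
clipped = saturated_fraction(force, limit_n=20.0)
print(onset, clipped)  # 4 0.2
```

Requiring several consecutive samples above the threshold keeps the single spike at index 2 from being counted as contact.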

Tactile and force data should not be treated as universal premium features. They are most valuable when the task is contact-rich and the sensor is close enough to the interaction to explain something vision cannot.

Language turns episodes into tasks

Language can be a task instruction, a natural-language description, a subtask label, a correction note, or a success/failure explanation. It matters because many modern robot policies are conditioned on language, and because language makes datasets easier for humans to search, inspect, and reuse.

The quality bar is higher than a caption pasted onto a clip. A useful language layer should tell the reader what the robot was asked to do, what objects mattered, when subtasks started and ended, and whether the result matched the instruction.

BridgeData V2 and LIBERO helped make language-conditioned manipulation mainstream in open robot learning. Large multimodal policies such as OpenVLA, Octo, and NVIDIA's Isaac GR00T also show why image, language, state, and action are now discussed together rather than as separate dataset products.

Buyer questions:

Are labels written before the task, after the task, or generated automatically?
Are task names, natural-language instructions, substeps, and outcomes stored separately?
Are failures labelled as failures, or are only successful episodes included?
Are object names, scene context, and instruction wording consistent enough for training or retrieval?
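
To make "stored separately" concrete, the sketch below shows one possible shape for a language layer and a small consistency check on it. The class names, field names, and label-source values are assumptions made for this example rather than an established schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Subtask:
    label: str
    start_s: float
    end_s: float


@dataclass
class EpisodeAnnotation:
    task_name: str
    instruction: str
    label_source: str  # assumed values: "pre_task", "post_hoc", "auto_generated"
    success: Optional[bool]
    subtasks: List[Subtask] = field(default_factory=list)


def annotation_issues(ann, episode_length_s):
    """Return a list of human-readable problems found in one annotation."""
    issues = []
    if ann.success is None:
        issues.append("missing outcome label")
    prev_end = 0.0
    for st in ann.subtasks:
        if st.end_s <= st.start_s:
            issues.append(f"{st.label}: empty or reversed interval")
        if st.start_s < prev_end:
            issues.append(f"{st.label}: overlaps previous subtask")
        if st.end_s > episode_length_s:
            issues.append(f"{st.label}: ends after episode")
        prev_end = max(prev_end, st.end_s)
    return issues


# Hypothetical annotation for a 10-second drawer-opening episode.
ann = EpisodeAnnotation(
    task_name="open_drawer",
    instruction="Open the top drawer.",
    label_source="post_hoc",
    success=None,
    subtasks=[
        Subtask("reach", 0.0, 2.0),
        Subtask("grasp", 1.5, 3.0),
        Subtask("open drawer", 3.0, 12.0),
    ],
)

issues = annotation_issues(ann, episode_length_s=10.0)
for issue in issues:
    print(issue)
```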

Language makes a dataset more legible. It does not repair missing state, missing actions, poor synchronization, or unclear rights.

Simulation and synthetic data add coverage, with a transfer bill

Simulation can generate rare events, structured labels, clean segmentation, depth, object state, and large numbers of demonstrations without putting hardware at risk. Synthetic data can also help balance a dataset that lacks enough examples of a task, object, or failure mode.

The catch is transfer. A simulated episode can be perfectly labelled and still fail to match real robot friction, lighting, sensor noise, material deformation, contact dynamics, or human clutter.

MimicGen is a strong example of generating many manipulation demonstrations from a smaller set of human examples. ManiSkill shows how simulation environments can produce RGB-D, segmentation, state, and demonstrations at scale. NVIDIA's GR00T materials position synthetic generation as part of a broader humanoid training stack, not as a replacement for real teleoperation and robot logs.

Buyer questions:

Which simulator, assets, physics settings, and controllers generated the data?
Were trajectories seeded by human demonstrations, scripted policies, model rollouts, or random exploration?
Which labels are simulator-native, and which were added later?
Has the seller measured transfer to real hardware, or is the dataset only a pretraining or benchmark asset?

Synthetic data is useful when its provenance is clear and its role is honest. It is risky when it is sold as real-world coverage without evidence of transfer.

Provenance and packaging are modalities too

Provenance is not always listed as a modality, but it behaves like one in buyer diligence. It explains where the data came from, who captured it, what rights travel with it, and what assumptions are safe.

For commercial robot training data, this includes:

Capture method: teleoperation, autonomous rollout, scripted control, human video, motion capture, simulation, or mixed collection.
Embodiment: robot model, hand or gripper, degrees of freedom, sensors, controller, and task setup.
Timing: timestamps, clock sources, synchronization method, dropped frames, and latency notes.
Calibration: camera parameters, transforms, coordinate frames, sensor calibration, and retargeting method.
Rights: collector, owner, license, consent, privacy treatment, third-party environments, and redistribution limits.
Packaging: schema docs, sample files, loaders, versioning, checksums, and known limitations.
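
The packaging item is the easiest of these to verify mechanically. The sketch below builds a versioned manifest with SHA-256 checksums and then checks a delivery against it. The file paths and contents are placeholders held in memory so the example runs on its own; a real check would read files from disk, ideally in chunks.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict, version: str) -> dict:
    """Record the size and checksum of every file under a version label."""
    return {
        "version": version,
        "files": {
            path: {"bytes": len(data), "sha256": sha256_hex(data)}
            for path, data in sorted(files.items())
        },
    }


def changed_files(manifest: dict, files: dict) -> list:
    """Return paths that are missing or whose checksum differs from the manifest."""
    bad = []
    for path, entry in manifest["files"].items():
        data = files.get(path)
        if data is None or sha256_hex(data) != entry["sha256"]:
            bad.append(path)
    return bad


# Hypothetical delivery, held in memory for the sake of the example.
delivered = {
    "episode_0001/actions.bin": b"\x00\x01\x02",
    "episode_0001/meta.json": b'{"robot": "example"}',
}
manifest = build_manifest(delivered, version="0.1.0")

# Simulate a file that was altered after the manifest was written.
tampered = dict(delivered)
tampered["episode_0001/actions.bin"] = b"\x00\x01\x03"

print(changed_files(manifest, delivered))  # []
print(changed_files(manifest, tampered))   # ['episode_0001/actions.bin']
```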

The LeRobot dataset format is one sign of the market maturing. A common episode structure makes it easier to inspect video, state, action, timestamps, and metadata without reverse-engineering every capture stack from scratch. Format standardization does not guarantee quality, but it reduces the work needed to find out.

For a deeper diligence checklist after sample files arrive, use the guide to evaluating humanoid robot training data before you buy.

How to compare modality stacks

Do not score a dataset by counting modalities. Score it by asking what each modality explains.

Useful buyer prompts:

If the dataset has video, does it also have the state and action streams needed to train behavior?
If it has robot state, are the fields documented well enough to map onto the target embodiment?
If it has tactile or force data, does the task actually depend on contact?
If it has language, are instructions and outcomes structured, or only loosely captioned?
If it has simulation, what real-world gap is it meant to fill?
If it has many sensors, can one episode be replayed on a single aligned timeline?
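
These prompts can be turned into a simple checklist that reports open questions instead of a modality count. The sketch below is one way to encode them; the modality keys and evidence flags are names invented for this example, and the answers still have to come from inspecting sample files.

```python
# Each modality maps to (evidence flag, question to ask if the flag is not set).
CHECKS = {
    "video": (
        "state_and_actions_included",
        "Are the state and action streams needed to train behavior included?",
    ),
    "state": (
        "state_fields_documented",
        "Are state fields documented well enough to map onto the target embodiment?",
    ),
    "tactile_or_force": (
        "task_is_contact_rich",
        "Does the task actually depend on contact?",
    ),
    "language": (
        "instructions_and_outcomes_structured",
        "Are instructions and outcomes structured, or only loosely captioned?",
    ),
    "simulation": (
        "real_world_gap_stated",
        "What real-world gap is the simulated data meant to fill?",
    ),
}

TIMELINE_QUESTION = "Can one episode be replayed on a single aligned timeline?"


def open_questions(modalities, evidence):
    """Return the prompts that the available evidence has not yet answered."""
    questions = []
    for modality in modalities:
        if modality in CHECKS:
            flag, question = CHECKS[modality]
            if not evidence.get(flag, False):
                questions.append(question)
    if len(modalities) > 1 and not evidence.get("single_aligned_timeline", False):
        questions.append(TIMELINE_QUESTION)
    return questions


# Hypothetical listing: three modalities, partial evidence from sample files.
listing = ["video", "state", "tactile_or_force"]
evidence = {
    "state_and_actions_included": True,
    "state_fields_documented": False,
    "task_is_contact_rich": True,
    "single_aligned_timeline": False,
}

remaining = open_questions(listing, evidence)
for question in remaining:
    print(question)
```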

Sellers should prepare the same information before a buyer asks. A strong dataset summary should say: this is the robot, these are the tasks, these are the modalities, this is how they are synchronized, this is what each signal means, these are the rights, and these are the limitations.

That is the honest market signal. "Multimodal" is a starting point. Training-ready means the modalities are aligned, documented, legally usable, and relevant to the robot and task.

Humanoids Data helps buyers find robotics training datasets with the right modality mix, and helps sellers present those datasets in a way technical teams can evaluate. If you need robot training data, use the buyer request form. If you have robotics data to license, use the seller submission form.