Dataset buying · 24 June 2026
Humanoid and Egocentric Data Marketplaces: Where to Source Physical AI Training Data
The market for humanoid and egocentric training data is real, but it is not a clean Amazon-style catalog yet. It is a fragmented supply map: a few dedicated physical AI marketplaces, broader data brokers with robotics listings, custom collection vendors, open dataset hubs, and academic egocentric datasets with licenses that may or may not work for commercial model training.
That distinction matters. A humanoid robotics team does not simply need "more video." It may need egocentric hand-object video, teleoperation demonstrations, synchronized robot state and actions, depth, tactile or force signals, language instructions, failure labels, or custom collection in a warehouse, kitchen, lab, clinic, retail store, workshop, or home.
If you are new to the category, start with what humanoid robot training data is and the guide to data modalities for robot training. This article is the sourcing map: where to look, which sources are open versus commercial, and what to verify before a dataset enters a training run.
Physical AI data is shaped by the capture stack: operator devices, retargeting, robot interfaces, sensors, and data writers. Source: NVIDIA Isaac Teleop documentation.
The short answer
There are five practical places to source humanoid, egocentric, and physical AI training data today.
Use dedicated robotics marketplaces when you want a buyer workflow around robot datasets, licensing, enrichment, and custom collection. Examples include Humanoids Data, HumanoidLayer, the Robotics Center of Silicon Valley marketplace, and Truelabel. These are closest to the thing people mean when they say "humanoid data marketplace."
Use commercial dataset brokers when you want listed products with prices, samples, or provider discovery. Datarade's robotics page lists egocentric video products, including first-person video datasets marketed for physical AI and robotics. Defined.ai lists robotics-domain datasets, including a 1,000-hour multimodal household robotics dataset and a 100-hour egocentric household-activities dataset. Luel is another rights-cleared multimodal marketplace and collection engine with egocentric-video positioning.
Use custom data vendors when the dataset you need does not exist yet. Scale AI Physical AI, Appen Physical AI, Toloka Robotics, DataForce, Claru's egocentric video data service, iMerit, micro1 Robotics, TELUS Digital Physical AI, Encord Physical AI, Nexdata, and Unidata all position themselves around collection, annotation, enrichment, or delivery of physical-world training data.
Use open robotics hubs for research, pretraining, benchmarks, and sometimes commercial training, after license review. The main hub is Hugging Face LeRobot, with direct releases from Unitree, NVIDIA, DROID, BridgeData, Open X-Embodiment, RoboMIND, Humanoid Everyday, 1X, AgiBot, Fourier, RH20T, ALOHA, DECO-50, RoboSet, FurnitureBench, and others.
Use academic egocentric datasets for first-person perception, hand-object interaction, daily-life activity, and skill understanding, but treat licenses as a central issue. Ego4D, Ego-Exo4D, Project Aria datasets, EPIC-KITCHENS, HoloAssist, HOI4D, HOT3D, Nymeria, Assembly101, EgoVerse, EgoSchema, EgoObjects, EgoPAT3D, and EgoDex are useful references. They are not automatically commercial training data.
The source map
| Source type | Examples | Best for | Main caveat |
|---|---|---|---|
| Dedicated robotics marketplaces | Humanoids Data, HumanoidLayer, Robotics Center, Truelabel. | Task-specific sourcing, enrichment, licensing workflows, teleoperation or robot-log supply. | Category is early. Inventory depth and seller rights need direct verification. |
| Commercial dataset brokers | Datarade, Defined.ai, Luel, FileMarket-style provider listings. | Discovering purchasable egocentric, household, robotics, LiDAR, and multimodal datasets. | Broker listings are starting points. Ask for provenance, samples, license terms, and consent records. |
| Custom collection vendors | Scale, Appen, Toloka, DataForce, Claru, iMerit, micro1, TELUS, Encord, Nexdata, Unidata. | New egocentric video, sensor-fusion, robot demonstrations, annotations, and environment-specific data. | You are buying a program, not only files. Specify rights, exclusivity, QA, and delivery format up front. |
| Open robotics hubs | LeRobot, DROID, Unitree, NVIDIA PhysicalAI, BridgeData, RoboMIND, Humanoid Everyday, 1X, Open X. | Research, pretraining, benchmarks, format examples, and some commercially usable permissive datasets. | Licenses vary per dataset and subset. Open does not mean usable for commercial training. |
| Academic egocentric datasets | Ego4D, Ego-Exo4D, Project Aria, EPIC-KITCHENS, HOT3D, Nymeria, Assembly101, HoloAssist, HOI4D, EgoVerse. | First-person perception, hand-object interaction, activity understanding, skill benchmarks. | Many are research-only, request-gated, privacy-sensitive, or missing robot action labels. |
The first buyer lesson is not "choose one." A serious humanoid data strategy usually uses several. Open datasets establish baselines and tooling. Academic egocentric data teaches perception and task structure. Commercial vendors fill gaps. Marketplaces help route demand to sellers who have data or can collect it.
What counts as a marketplace in physical AI
For humanoids and egocentric data, a marketplace should do more than link to downloads. A useful market layer helps a buyer answer:
- What body, sensor stack, camera viewpoint, and action space does this data represent?
- Is the data human egocentric video, robot teleoperation, autonomous robot logs, simulation, or a mixture?
- Is it commercially usable for model training, evaluation, product development, and deployment?
- Can the seller prove consent, ownership, provenance, and distribution rights?
- Is the dataset delivered in a usable format such as LeRobot, RLDS, HDF5, WebDataset, Parquet, ROS bags, MCAP, MP4 plus JSON, or a documented custom schema?
- Can the buyer request additional labels, conversion, QA, or new collection?
That is why generic cloud data marketplaces are not the center of this article. They are excellent for analytic feeds. Humanoid and egocentric data needs a richer trust packet: embodiment, modalities, rights, collection process, annotation quality, privacy handling, and integration cost.
Dedicated robotics and physical AI marketplaces
This is the small but important category to start with.
Humanoids Data is built around buying and selling robotics training datasets for humanoid robot learning: motion, vision, teleoperation, and interaction data. The value is not only inventory. It is helping buyers express a missing-data brief and helping sellers package datasets in a way robotics teams can evaluate.
HumanoidLayer describes itself as a managed robotics data marketplace and refinery. Its public page is unusually aligned with what buyers need: search by modality, robot type, task, license, and environment; support for formats such as LeRobot, HDF5, RLDS, WebDataset, and Parquet; dataset enrichment; annotation QA; license review; and custom collection briefs. It also indexes open datasets like DROID, BridgeData, Open X-Embodiment, ALOHA, and Egocentric-100K, while warning that commercial use depends on the original license and subset restrictions.
Robotics Center of Silicon Valley presents a marketplace for robot demonstrations, manipulation trajectories, and navigation datasets. Its seller page describes HDF5, RLDS, and LeRobot uploads, metadata requirements for robot type, task category, camera configuration, environment details, and license options: research-only, commercial, or exclusive. It also emphasizes robot URDFs, joint configs, camera intrinsics, task descriptions, and no personally identifiable information.
Truelabel is another physical AI marketplace-style source. It lets buyers request egocentric, exocentric, teleoperation, or cinematic data by modality, scene, location, and license, then routes the request through off-the-shelf catalogs, its workforce, or vendors. Its public positioning is directly useful for robotics teams because it centers rights, metadata, samples, and delivery to S3, GCS, or Azure rather than only a download link.
Datarade is a broader data broker rather than a robotics-only market, but its robotics page is still useful because it shows which product categories are actually for sale. The page lists egocentric video products marketed for physical AI and robotics, including first-person video datasets with thousands of hours and starting prices around $30,000 to $100,000 depending on provider and product.
Defined.ai is a cleaner example of a general AI-data marketplace with robotics inventory. Its robotics filter showed two listed datasets during this research pass: a 1,000-hour multimodal household robotics dataset and a 100-hour egocentric household-activities dataset. That is the kind of catalog surface buyers expect, but the diligence still happens in the contract and sample review.
Luel is a rights-cleared multimodal data marketplace and collection engine. Its public catalog includes "General Egocentric Video" for first-person, head-mounted-style recordings across households, factories, shops, and other environments, with multi-sensor IMU data. Bellu also surfaced as a physical AI marketplace for egocentric, hand-object, robot perception, and manipulation data, but its public page was intermittently unavailable during this research pass, so I would treat it as a lead to verify rather than a confirmed procurement route.
The caveat: this category is young. Do not assume a marketplace badge means the data is ready for commercial humanoid training. It means there is a procurement path. The buyer still needs source documentation, sample episodes, license terms, consent position, data schema, and quality evidence.
Commercial egocentric and robotics data vendors
When the exact dataset does not exist, custom collection is usually the real market.
Scale AI Physical AI positions its product as a Physical AI Data Engine: a global collection network, robotics data factories, distributed collectors, operating businesses, teleoperated demonstrations, a robotless egocentric platform, bespoke collection platforms, multimodal grounding annotations, and internal policy fine-tuning and evaluation. Scale says more than 1,000 hours of demonstration data are collected and uploaded to its platform every day. Treat that as a vendor claim, but the product category is clear: enterprise robotics data programs, not a self-serve dataset shelf.
Appen Physical AI covers LiDAR annotation, 3D sensor fusion, in-cabin automotive intelligence, biometric human-centric data, and world model data collection. Its ready-to-use examples include robot-perspective Roomba View images, action videos, and hand gesture videos. This is useful for teams that need collection plus annotation, especially for sensor fusion, human behavior, and embodied scene capture.
Toloka Robotics focuses on human data for robotics models: crowdsourced home recordings, onsite collection, frame-by-frame labeling, temporal event marking, task success evaluation, sensor alignment, and delivery in custom schemas. Its page is honest about the key operational split: crowdsource when you need household diversity, go onsite when you need exact hardware or lighting control.
DataForce is broader, but relevant for custom video, image, audio, user-study, and computer-vision collection through a large contributor network and supervised sites. It is not robotics-specific in the way Scale or HumanoidLayer is, but it can support physical-world data programs when the capture protocol is clear.
Claru's egocentric video page is directly aimed at embodied AI. Claru claims commercially licensed egocentric video, 500,000+ clips, 10,000+ contributors, 100+ cities, and enrichment layers such as depth, segmentation, pose, optical flow, captions, and action labels. The page compares commercial egocentric supply against academic datasets like Ego4D and EPIC-KITCHENS. As with any vendor page, treat the numbers as claims to validate through samples, consent artifacts, contract terms, and delivery specs.
iMerit is more specific than a generic labeling vendor here. Its egocentric video page describes first-person wearable-camera capture, activity-specific scenarios, indoor and outdoor environments, optional downstream annotation, and a humanoid robotics case study around in-home task recording with Meta Quest 3 head-mounted cameras.
micro1 Robotics focuses on structured real-world data for physical policy models: POV and third-person capture, household and industrial demonstrations, action-level manipulation annotations, synchronized stereo video, IMU, multi-camera data, 2D/3D annotation, QC, and evaluation.
TELUS Digital Physical AI is relevant for larger programs. Its page explicitly calls out egocentric and wrist-mounted video, robot manipulation, humanoid interactions, RGB-D, LiDAR, IMU, force, torque, tactile sensors, teleoperation, digital twins, and globally scaled data operations.
Encord Physical AI is more infrastructure than marketplace, but it belongs in the sourcing map because it handles video-native annotation, curation, sensor-fusion visualization, VLA labels, and data flywheels for robotics and autonomous systems. If you already have raw robot or egocentric data, Encord is the kind of layer that can turn it into a trainable dataset.
Nexdata and Unidata are broader data vendors with explicit embodied-AI and robotics pages. Nexdata claims services around humanoid robots, robot arms, dexterous hands, ego-exo action video annotation, tactile and force feedback, and multi-robot collection. Unidata lists robotics training data, egocentric video, robot arm data, humanoid robot data, mobile/service robot data, simulation data, and LeRobot-oriented sample datasets. For both, ask for sample data, rights documentation, and schema details before treating catalog claims as usable inventory.
The buying pattern is different from downloading an open dataset. You should send these vendors a data brief: task, environment, viewpoint, hardware, capture protocol, annotation schema, privacy rules, target format, license scope, and acceptance tests. The more exact the brief, the less likely you are to buy attractive footage that your model cannot use.
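One way to keep a brief exact is to write it as a structured record and check for blanks before sending it. The sketch below is a minimal illustration in Python: the `DataBrief` class and the `missing_fields` helper are hypothetical names, and the fields simply mirror the list above rather than any vendor's intake form.

```python
from dataclasses import dataclass, fields


@dataclass
class DataBrief:
    """Hypothetical data brief; fields mirror the list in the text above."""
    task: str = ""
    environment: str = ""
    viewpoint: str = ""
    hardware: str = ""
    capture_protocol: str = ""
    annotation_schema: str = ""
    privacy_rules: str = ""
    target_format: str = ""
    license_scope: str = ""
    acceptance_tests: str = ""


def missing_fields(brief: DataBrief) -> list[str]:
    """Return the names of fields that are still empty or whitespace-only."""
    return [f.name for f in fields(brief) if not getattr(brief, f.name).strip()]


# A partially filled brief: four fields set, six still blank.
brief = DataBrief(
    task="bimanual warehouse tote handling",
    viewpoint="head and wrist cameras",
    target_format="LeRobot",
    license_scope="commercial model training",
)
print(missing_fields(brief))
```

Anything the helper reports as missing is a question the vendor will otherwise answer for you, usually in the way that is cheapest to collect.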
Open robot-data hubs and direct releases
Open robotics data is the research commons and the quality bar. Sometimes it is also commercially usable. Often it is not.
Hugging Face LeRobot is the most important discovery and format layer right now. LeRobot provides models, datasets, and tools for real-world robotics in PyTorch, with human-collected demonstrations and simulated environments. The broader Hugging Face dataset hub is where many recent robotics releases, including Unitree and NVIDIA datasets, are now discoverable.
Unitree UnifoLM-WBT is one of the more directly relevant open sources for humanoid whole-body teleoperation. The collection contains Unitree G1 whole-body teleoperation datasets in rolling task-specific releases. Some cards use permissive Apache 2.0 terms, but a buyer should check each sub-dataset, not the collection name.
NVIDIA PhysicalAI datasets include GR00T-related simulated robotics data. The GR00T X-Embodiment Sim card lists 9,000 cross-embodied bimanual trajectories, 240,000 humanoid tabletop manipulation trajectories, 72,000 robot-arm kitchen trajectories, and a Unitree G1 loco-manipulation subset. This is useful for synthetic physical AI and post-training workflows, with the usual synthetic-data questions about simulator provenance and transfer.
RoboMIND is a major source that belongs on this market map. It reports 107,000 real-world teleoperation trajectories across 479 tasks, 96 object classes, and four embodiments: Franka, UR5e, AgileX dual-arm, and a humanoid robot with dual dexterous hands. The project also includes failure demonstrations with causes, multi-view observations, proprioceptive state, and language descriptions. For humanoid and multi-embodiment manipulation, it is one of the most relevant open references to check.
Humanoid Everyday is directly humanoid-specific. It reports 10.3k trajectories, more than 3 million frames, 260 tasks, and seven task categories, including loco-manipulation, deformable manipulation, articulated manipulation, tool use, high-precision manipulation, and human-robot interaction. It records RGB, depth, LiDAR, tactile inputs, natural language annotations, and 30 Hz multimodal sensory data. License and access details should still be verified at the release source.
1X World Model Compression Challenge data is useful for world-model framing rather than direct manipulation policy learning. The Hugging Face card provides tokenized robot video and state data for the 1X World Model Compression Challenge under Apache 2.0. Because the raw video and other 1X materials can carry different terms, buyers should separate tokenized challenge data from raw video rights.
DROID is not humanoid whole-body data, but it is one of the strongest open manipulation references: 76,000 demonstration trajectories, 350 hours of interaction data, 564 scenes, 86 tasks, and 50 collectors on a standardized Franka platform. It is useful for manipulation pretraining, data packaging, and evaluating what "in the wild" robot collection can look like.
BridgeData V2 provides 60,096 robot manipulation trajectories, including 50,365 teleoperated demonstrations, collected on a WidowX arm. The project states that all data is under Creative Commons Attribution 4.0, which is commercially friendlier than many academic datasets, with attribution.
Open X-Embodiment is the big cross-embodiment aggregate: more than one million real robot trajectories across 22 robot embodiments, pooled from 60 datasets and 34 labs. The buyer caveat is just as large: license terms vary by source dataset. There is no single universal commercial license for the whole mixture.
AgiBot World and Fourier ActionNet are closer to humanoid manipulation. AgiBot World reports more than one million trajectories across 217 tasks and five deployment scenarios, collected with more than 100 robots. Fourier ActionNet reports 30,000+ humanoid bimanual dexterous teleoperation trajectories. Both are valuable quality bars, but their public releases use non-commercial share-alike style terms, so commercial training usually needs separate permission.
RH20T is worth checking for contact-rich manipulation and tactile-style diligence, but license terms vary across subsets. ALOHA and LeRobot-hosted ALOHA variants are useful for bimanual teleoperation patterns, but license checks still need to happen per dataset.
RoboSet, BC-Z, FurnitureBench, DECO-50, and RoboTwin 2.0 fill out the manipulation and bimanual side of the map. RoboSet reports 30,050 household tabletop trajectories, including 9,500 teleoperated demos. BC-Z is older but useful as a 100-task VR-teleoperated manipulation precedent. FurnitureBench reports 219.6 hours and 5,100 teleoperation demonstrations for long-horizon furniture assembly. DECO-50 is a 50-hour bimanual dexterous manipulation dataset with tactile sensing under Apache 2.0. RoboTwin 2.0 is a synthetic bimanual benchmark and data generator across 50 dual-arm tasks and five robot embodiments. None of these replaces humanoid whole-body data, but they are useful sources and quality bars for manipulation-heavy buyers.
AgiBot World is a useful quality bar for humanoid data packaging: robot embodiments, task coverage, annotations, and release structure all matter. Source: AgiBot World repository.
Open robot data should be used deliberately. Use permissive sets for commercial pretraining where licenses allow. Use non-commercial sets as benchmarks, research references, or quality bars. Use all of them to understand what a paid dataset should document.
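That rule of thumb can be turned into a first-pass triage step in a dataset inventory. The sketch below is only an illustration: the `triage` function, the bucket labels, and the example dataset names are hypothetical, and the license strings are SPDX-style identifiers for the license families mentioned in this article. A "commercial candidate" result still needs a per-dataset and per-subset review.

```python
# First-pass triage only: a permissive tag still needs a per-dataset,
# per-subset license review before commercial use.
PERMISSIVE = {"Apache-2.0", "CC-BY-4.0", "CDLA-Permissive-2.0"}
NON_COMMERCIAL = {"CC-BY-NC-4.0", "CC-BY-NC-SA-4.0", "CC-BY-NC-ND-4.0"}


def triage(license_id: str) -> str:
    """Map an SPDX-style license identifier to a usage bucket."""
    if license_id in PERMISSIVE:
        return "commercial candidate, pending review"
    if license_id in NON_COMMERCIAL:
        return "benchmark or quality bar only"
    return "unknown, ask the source"


# Hypothetical inventory entries, not real dataset names.
for name, license_id in [
    ("example-permissive-set", "Apache-2.0"),
    ("example-research-set", "CC-BY-NC-4.0"),
    ("example-gated-set", "custom research agreement"),
]:
    print(f"{name}: {triage(license_id)}")
```

Defaulting unrecognized licenses to "unknown" is deliberate: request-gated and custom research agreements are common in this category and should never fall through to a commercial bucket.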
Egocentric datasets for embodied AI
Egocentric data is not a side category. For humanoids and physical AI, it can be one of the most important sources of human task evidence: hands, objects, tools, kitchens, workshops, warehouses, navigation, repair, social context, and long-horizon procedures from the actor's point of view.
The limitation is also clear. Egocentric human video usually lacks robot state, robot action commands, force, tactile data, and embodiment-specific control labels. It can support perception, world models, task understanding, language grounding, hand-object interaction, and imitation priors. It is not automatically policy-ready robot data.
Ego4D is the core reference. It contains 3,670 hours of daily-life egocentric video from 923 participants across 74 locations in 9 countries, captured with seven types of head-mounted cameras. It includes benchmark areas such as episodic memory, hand-object interaction, audio-visual diarization, social interaction, and forecasting. Access requires signing license terms and waiting for approval. It is excellent for research and embodied perception, but a commercial buyer should have counsel review the license and permitted use.
Ego-Exo4D adds synchronized first-person and third-person views for skilled human activities. It reports 1,286.3 hours of video from 740 camera wearers across 13 cities, using Project Aria glasses plus external GoPro cameras. This is especially useful for skill understanding, because it connects what the actor sees with external views of the same action.
Project Aria datasets are important because they come from sensor-rich glasses rather than ordinary cameras. Aria Everyday Activities includes RGB, SLAM cameras, eye-tracking cameras, IMUs, barometer, magnetometer, microphones, trajectories, semi-dense point clouds, gaze, calibration, speech-to-text, and multi-device synchronization. That is the kind of sensor metadata robotics buyers should care about, even if the license or research context limits commercial use.
HOT3D is the Project Aria/Quest 3 hand-object dataset to know. It contains 833 minutes of egocentric multi-view recordings, 1.5 million multi-view frames, 19 subjects, 33 rigid objects, 3D hand and object poses, high-fidelity 3D object models, Aria SLAM point clouds, and gaze. It is strong for 3D hand-object tracking, but its split license terms mean commercial buyers must review the exact data component.
Nymeria expands the Project Aria family into full-body motion. It reports 300 hours of daily activities, 3,600 hours of video data, 1,200 sequences, 264 participants, 50 locations, synchronized Project Aria glasses, wrist devices, inertial motion capture, and motion-language descriptions. Meta releases it under CC BY-NC 4.0, so it is a research reference unless separate commercial permission exists.
EPIC-KITCHENS-100 is a focused kitchen dataset: 45 kitchens, 4 cities, head-mounted camera, 100 hours of recording, 20 million frames, multi-language narrations, 90,000 action segments, 97 verb classes, and 300 noun classes. It is useful for household manipulation and food-prep understanding, but it is publicly available for research purposes, not a default commercial training source.
Assembly101 is a procedural manipulation benchmark: 4,321 videos of people assembling and disassembling 101 toy vehicles, captured with 8 static and 4 egocentric cameras, 100,000 coarse action segments, 1 million fine-grained action segments, 18 million 3D hand poses, and mistake annotations. It is CC BY-NC 4.0, so it is a quality bar rather than production training data without permission.
HoloAssist is useful because it includes interactive assistance, not only passive activity. It spans 166 hours from 350 instructor-performer pairs, with synchronized egocentric streams, action and conversation annotations, mistake detection, intervention prediction, hand forecasting, depth, gaze, hand pose, head pose, IMU, camera calibration, and labels. It is released under CDLA Permissive 2.0, which makes it worth a closer commercial review.
HOI4D focuses on 4D egocentric human-object interaction: 2.4 million RGB-D frames over 4,000 sequences, 9 participants, 800 object instances, 16 categories, and 610 indoor rooms, with annotations for segmentation, 3D hand pose, object pose, hand action, reconstructed meshes, and point clouds. This is a strong hand-object reference, though buyers need to review commercial permissions.
EgoVerse is newer and unusually aligned with robot learning. It describes a living dataset of human demonstrations for robot learning with 1,362 hours, about 80,000 episodes, 1,965 tasks, 240 scenes, 2,087 demonstrators, accurate camera poses, 3D head tracking, and dense language annotation. Because the market promise is human-to-robot transfer, license and access terms matter as much as the headline scale.
EgoSchema is not a capture source, but it is useful for evaluation: 5,000 human-curated multiple-choice QA pairs spanning more than 250 hours of Ego4D clips for long-form video understanding. EgoObjects and EgoPAT3D are narrower object and 3D action-target references. They are worth knowing when the buyer's missing data is not "more video" but better object persistence, hand-object labels, or 3D action-target supervision.
EgoDex is the Apple Vision Pro dexterous manipulation dataset. It is especially relevant to humanoid hands and tabletop manipulation, but it is not a clean commercial source: public materials point to restrictive non-commercial, no-derivatives style terms. Treat it as a research benchmark unless separate rights are secured.
The practical egocentric split is simple:
- Use academic egocentric datasets to understand tasks, benchmarks, and what annotations matter.
- Use permissive datasets only after license review.
- Commission commercial egocentric collection when you need production training rights, custom tasks, fresh environments, or consent documentation.
- Do not pretend human video has robot actions. If the model needs actions, pair egocentric data with teleoperation, retargeting, simulation, or robot demonstrations.
Which source should a humanoid team use?
If you need whole-body humanoid policy data, start with humanoid-specific marketplaces and direct robot releases: Humanoids Data, HumanoidLayer, Robotics Center, Truelabel, Unitree, RoboMIND, Humanoid Everyday, 1X, AgiBot, Fourier, NVIDIA PhysicalAI, and any seller with robot state/action logs. Egocentric human video can help, but it is not a substitute for robot commands.
If you need hand-object or dexterous manipulation priors, combine egocentric datasets with bimanual and robot-manipulation data. Good references include Ego4D, HOT3D, HOI4D, HoloAssist, Assembly101, EgoDex, DECO-50, ALOHA, DROID, BridgeData, RoboSet, FurnitureBench, AgiBot, Fourier, and commercial egocentric vendors.
If you need household, warehouse, retail, workshop, or inspection workflows, custom collection may beat open data. A vendor can capture the exact environment class, task list, camera placement, and metadata you need. This is where Claru, Scale, Appen, Toloka, iMerit, micro1, TELUS, Encord, Nexdata, Unidata, Defined.ai, Datarade-listed providers, or a Humanoids Data sourcing brief can be more useful than another benchmark.
If you need commercially usable pretraining data quickly, start with sources that clearly state permissive terms: selected Unitree datasets, DROID, BridgeData V2, NVIDIA PhysicalAI datasets, 1X tokenized challenge data, DECO-50, some LeRobot-hosted datasets, HoloAssist, and marketplace listings with explicit commercial rights. Still check every dataset card and contract.
If you need a research benchmark or quality bar, use the best academic releases even when you cannot train commercial models on them. AgiBot, Fourier, EgoDex, EPIC-KITCHENS, Ego4D, Ego-Exo4D, Project Aria, HOT3D, Nymeria, Assembly101, RoboTwin, and Open X-Embodiment can teach you what good task coverage, annotation, and packaging look like.
Buyer diligence checklist
Before buying or using humanoid, egocentric, or physical AI data, ask for a data room, not just a preview video.
Check the source:
- Who collected the data, and who owns distribution rights?
- Were participants, operators, or bystanders consented or redacted?
- Is the license valid for commercial model training, evaluation, deployment, and derivative weights?
- Are third-party environments, brands, screens, faces, voices, or copyrighted content present?
Check the embodiment:
- Is the data human egocentric video, robot teleoperation, robot logs, simulation, or mixed?
- If robot data, which robot, hands, sensors, controller, action space, control frequency, and coordinate frames are represented?
- If human video, what retargeting or pairing strategy will turn it into robot-relevant supervision?
Check the modalities:
- Are video, state, actions, depth, tactile, force, language, labels, and outcomes aligned on one timeline?
- Are camera intrinsics, extrinsics, timestamps, frame rates, dropped frames, and calibration included?
- Are failures, interventions, recoveries, unsafe attempts, and aborted episodes preserved or filtered?
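The timeline questions above can be spot-checked on a sample episode before any contract discussion. The following is a minimal sketch, assuming each stream ships a sorted list of timestamps in seconds; the function names, the 1.5x gap tolerance, and the synthetic 30 Hz and 100 Hz streams are illustrative assumptions, not part of any dataset format.

```python
from bisect import bisect_left


def dropped_frame_gaps(timestamps: list[float], fps: float, tolerance: float = 1.5) -> list[int]:
    """Return indices i where the gap from frame i to i+1 exceeds tolerance / fps."""
    limit = tolerance / fps
    return [i for i in range(len(timestamps) - 1)
            if timestamps[i + 1] - timestamps[i] > limit]


def max_sync_offset(reference: list[float], other: list[float]) -> float:
    """Largest distance from any reference timestamp to its nearest neighbor in other.

    Both lists must be sorted ascending and non-empty.
    """
    worst = 0.0
    for t in reference:
        j = bisect_left(other, t)
        # The nearest neighbor is either just before or at the insertion point.
        candidates = other[max(j - 1, 0):j + 1]
        worst = max(worst, min(abs(t - c) for c in candidates))
    return worst


# Synthetic example: 30 Hz video with frame 5 missing, 100 Hz robot state.
video = [i / 30 for i in range(10) if i != 5]
state = [i / 100 for i in range(34)]

print("gaps after frame indices:", dropped_frame_gaps(video, fps=30))
print("worst video-to-state offset (s):", max_sync_offset(video, state))
```

If the worst offset is much larger than half the period of the faster stream, the streams are probably not on a shared clock, which is worth raising with the seller before looking at anything else.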
Check delivery:
- Is the data in LeRobot, RLDS, HDF5, WebDataset, Parquet, ROS bag, MCAP, MP4 plus JSON, or another documented format?
- Are sample loaders, schemas, splits, checksums, and data cards included?
- Can your team load and inspect a representative sample in less than a day?
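If the seller includes checksums, verifying them is a quick first step when a sample arrives. The sketch below assumes a hypothetical manifest that maps relative file paths to SHA-256 hex digests; the manifest layout and the `verify_manifest` helper are illustrative, and a real delivery may use a different hash or file format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large episodes do not load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose digest does not match."""
    bad = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad


# Demo on a throwaway directory standing in for a delivered sample.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "episode_000.json").write_text(json.dumps({"frames": 3}))
    manifest = {
        "episode_000.json": sha256_of(root / "episode_000.json"),
        "episode_001.json": "0" * 64,  # listed in the manifest but never delivered
    }
    failures = verify_manifest(root, manifest)
    print(failures)
```

A clean run only proves the files arrived intact. It says nothing about whether the contents match the dataset card, so it complements the loading and inspection step rather than replacing it.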
For deeper review, use the humanoid robot data evaluation checklist. The core point is the same: fit beats volume.
Seller checklist for humanoid and egocentric data
Sellers should not describe valuable robotics data as "some videos." The market needs specificity.
Prepare:
- A dataset card with modalities, size, task families, environments, capture dates, and collection method.
- An embodiment card for robot data: robot model, hand or gripper, sensors, cameras, joint names, control frequency, action space, coordinate frames, and retargeting.
- A capture card for egocentric data: camera mount, resolution, frame rate, field of view, IMU or depth availability, environments, task protocol, participant handling, and redaction process.
- A rights packet: ownership, consent, privacy treatment, commercial-use scope, redistribution rules, exclusivity options, and known restrictions.
- A quality packet: synchronization, calibration, missing data rates, annotation schema, reviewer process, excluded episodes, and known limitations.
- A sample package: representative clips or episodes, schema docs, loader code, manifests, and checksums.
The humanoid robot data collection equipment guide is useful here because the rig is part of the product. A wrist camera, a head camera, a force sensor, a tactile hand, a motion tracker, or an XR teleoperation interface changes what the data can teach.
Where Humanoids Data fits
Humanoids Data should sit between buyer intent and fragmented supply.
A buyer might say: "I need commercially usable humanoid or human-egocentric data for bimanual warehouse tote handling, with wrist or head camera views, task phase labels, failures, and rights for model training." That is much more useful than "I need robot data."
A seller might say: "I have 2,000 teleoperated episodes on a bimanual humanoid torso with head RGB-D, wrist RGB, joint state, action commands, task labels, failures, and commercial licensing." That is much more useful than "I have demos."
The job of the marketplace is to make those two statements meet, then force the right diligence: embodiment, task, modality, rights, provenance, quality, and delivery. If you need humanoid, egocentric, teleoperation, manipulation, or physical AI training data, use the buyer request form. If you hold data that robotics teams could train on, prepare the trust packet and use the seller submission form.