Dataset buying · 10 June 2026

Open Humanoid Robot Datasets in 2026: What's Free and What Licenses Allow

There has never been more free robot training data. Open X-Embodiment pools over a million real robot trajectories, AgiBot World publishes industrial-scale humanoid demonstrations, Unitree keeps uploading whole-body teleoperation data, and Hugging Face now hosts tens of thousands of robotics datasets.

There are two catches. Most open robot data was not captured on a humanoid. And "open" very often does not mean "usable for commercial training": the licenses behind these datasets range from permissive Apache 2.0 to non-commercial, no-derivatives terms that quietly rule out the exact thing a robotics company wants to do with them.

This guide covers the open datasets a humanoid team should know in mid-2026, what each one contains, and what its license actually allows. If you are new to the category, the broader guide to humanoid robot training data explains the dataset types this article keeps referring to: motion, vision, teleoperation, and interaction data.

The short list worth knowing

  • Cross-embodiment robot data: Open X-Embodiment and DROID. Mostly robot arms, still useful for pretraining and perception.
  • Humanoid-specific data: AgiBot World and its 2026 themed releases, Unitree's UnifoLM-WBT teleoperation collection, and NVIDIA's open GR00T simulation data.
  • Human egocentric video: Ego4D and Apple's EgoDex. Cheap scale, no robot state, very different licenses.
  • Community data: the LeRobot ecosystem on Hugging Face, which has become the closest thing robotics has to a common exchange format.

The rest of the article goes through each group, then turns to licensing, which is where most teams get surprised.

Cross-embodiment foundations: Open X-Embodiment and DROID

The Open X-Embodiment (OXE) dataset is still the reference point for open robot data. It pools 60 existing robot datasets from 34 labs into a consistent RLDS format, covering more than one million real robot trajectories across 22 embodiments, from single arms to bimanual systems and quadrupeds.

DROID is a narrower but cleaner asset: roughly 76,000 teleoperated demonstration episodes, about 350 hours, collected across 564 scenes and 86 tasks by 50 data collectors on a standardized Franka arm platform. The team released the full dataset under CC-BY 4.0, which permits commercial use with attribution, and has kept improving it since release with extra language annotations and better camera calibrations.

Neither dataset is humanoid data. For a humanoid team, their value is mostly in pretraining, perception, task semantics, and benchmarking data pipelines, not in whole-body policy learning. They also illustrate the two licensing patterns to watch. DROID has one clear license. Open X-Embodiment does not: the collaboration publishes its own materials under CC-BY 4.0, but each pooled dataset keeps its original terms, and several subsets carry non-commercial restrictions. The project maintains a per-dataset spreadsheet for exactly this reason. If your legal position depends on "the OXE license," you do not have a legal position yet.

AgiBot World: industrial-scale humanoid data with non-commercial terms

AgiBot World is the most serious open humanoid manipulation corpus right now. The original release, framed by the AgiBot team as a step toward an "ImageNet moment" for embodied AI, reports more than one million trajectories collected on a fleet of 100 robots across 100+ replicated real-life scenarios, with hardware that includes tactile sensors and dexterous hands. The full Beta set alone is about 43.8 TB.

[Figure: AgiBot World large-scale robot learning platform overview. Source: AgiBot World repository.]

The 2026 program is what makes this section timely. AgiBot is releasing AgiBot World 2026 in themed phases, collected on its G2 robot platform in real commercial and home environments. Theme 1 focused on imitation learning data. Theme 2, announced on June 3, 2026 as "Rich Interaction," deliberately keeps the messy physics most datasets filter out: missed grasps, collisions, object drops, unstable contacts, and liquid splashes, captured through what the team calls an exploratory teleoperation paradigm.

The dataset card shows unusually deep annotation for an open release: long-horizon episodes segmented into subtask instructions, step-level skill labels, 2D bounding boxes on interacted objects, and explicit error and intervention frame types with flags for whether a failure was recoverable. That structure is close to what good commercial datasets charge for, which is why the license matters so much here.

AgiBot World is licensed CC BY-NC-SA 4.0. NC means no commercial use without a separate arrangement. SA means derivatives must carry the same license. For research groups and benchmarking, that is workable. For a company training a commercial humanoid policy, this dataset is a study resource and a quality bar, not training material you can simply ingest.

Unitree UnifoLM-WBT: whole-body teleoperation, permissively licensed

Whole-body humanoid teleoperation data has been the most requested and least available category in open robotics. Unitree started changing that on March 5, 2026, when it began publishing its UnifoLM-WBT dataset collection: real-world whole-body teleoperation on the Unitree G1, covering household and open-environment tasks where locomotion and manipulation are captured as one coordinated system.

[Video: Unitree's announcement of the UnifoLM-WBT dataset. Source: Unitree Robotics on YouTube.]

The collection on Hugging Face now spans more than 80 task-specific sub-datasets with rolling updates, from loading dishwashers and washing machines to collecting clothes, with variants across different hands and camera configurations. Episodes ship in LeRobot format, and the sub-dataset cards we checked are licensed Apache 2.0, which permits commercial use.

Two cautions keep this honest. First, licenses are set per sub-dataset, so a team should confirm the card on each one it pulls rather than assuming the collection is uniform. Second, permissive does not mean complete: these are demonstrations from one vendor's robot and capture stack, so embodiment fit, calibration details, and task coverage still need the same review you would give a paid dataset.

NVIDIA's open GR00T data: synthetic scale, clean license

NVIDIA's contribution to the open pile is mostly synthetic. The PhysicalAI GR00T X-Embodiment simulation dataset on Hugging Face is released under CC-BY 4.0 and includes simulated humanoid manipulation subsets, such as GR-1 tabletop tasks, generated through the Isaac GR00T pipeline.

A clean license on synthetic data solves one problem and leaves another. Commercial use is allowed, but the diligence questions shift to generation provenance: what seeded the trajectories, which simulator produced them, and how they were validated. Our analysis of NVIDIA Isaac GR00T covers how that data stack fits together, and the GRAIL pipeline shows what serious generation provenance looks like when synthetic humanoid data is meant to transfer to real robots.

Human video at scale: Ego4D and EgoDex

Human egocentric video is the cheapest way to scale physical-world coverage, and humanoid labs increasingly treat it as a first-class pretraining source. Two datasets dominate the open conversation, and their licenses could not be more different.

Ego4D offers more than 3,600 hours of daily-life egocentric video from over 900 participants across 74 locations in 9 countries. Access requires signing a license agreement, with credentials issued in about two days. The terms are bespoke rather than Creative Commons, and they explicitly permit using the data to research, develop, and train models for commercial as well as noncommercial product development within the defined purposes. For a dataset born in academia, that is unusually friendly to industry teams, as long as someone actually reads the agreement.

Apple's EgoDex is the more targeted asset: 829 hours of 30 Hz egocentric video, about 90 million frames and 338,000 demonstrations across 194 tabletop manipulation tasks, recorded on Vision Pro with paired 3D tracking of the head, upper body, and hands plus language annotations. It is the largest open corpus of dexterous human manipulation. It is also licensed CC BY-NC-ND: non-commercial, no derivatives. Strictly read, that excludes commercial training and even adaptation work. EgoDex is a research benchmark, not a commercial training source.
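
The EgoDex figures above can be cross-checked with simple arithmetic. The sketch below assumes a single video stream recorded continuously at a constant 30 Hz, which is a simplification made for illustration and not a statement about how the dataset is actually stored.

```python
# Figures quoted in this article for EgoDex.
HOURS = 829
FPS = 30
DEMONSTRATIONS = 338_000

# Assumption: one stream, constant frame rate, no gaps.
total_seconds = HOURS * 3600
total_frames = total_seconds * FPS
mean_demo_seconds = total_seconds / DEMONSTRATIONS

print(f"frames: {total_frames:,}")  # 89,532,000, i.e. about 90 million
print(f"mean demonstration length: {mean_demo_seconds:.1f} s")  # 8.8 s
```

Under those assumptions the quoted numbers agree with one another, and the average demonstration is short: under ten seconds of tabletop manipulation.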

The general lesson from both: human video gives you scenes, hands, and task semantics, but no robot state, no actions, and no contact labels. It is a complement to robot data, not a substitute, and the license decides which side of your stack it can touch.

LeRobot community data: huge, uneven, worth a look

Hugging Face's LeRobot has quietly become the exchange layer for open robot data. IEEE Spectrum reported in May 2026 that robotics datasets on the Hugging Face Hub grew from 1,145 at the end of 2024 to more than 58,000, making robotics the largest dataset category on the platform. Commercial releases increasingly target the LeRobot format directly, including the Unitree and AgiBot data above.

Treat the community pool the way you would treat any user-generated content: a long tail of small, single-setup datasets with uneven documentation, mixed quality, and per-dataset licenses. Only a small fraction is humanoid data. The practical value is the format convergence: standardized episode structure, Parquet plus MP4 storage, and shared tooling mean evaluating a candidate dataset, free or paid, takes hours instead of weeks.

The license reality check

Across this list, the pattern is clear, and it is the part most teams skip until procurement asks:

  • Apache 2.0 (Unitree sub-datasets) and CC-BY 4.0 (DROID, NVIDIA GR00T sim data): commercial training generally allowed, provided attribution and license notices are preserved.
  • Bespoke signed agreements (Ego4D): commercial uses can be allowed, but the permitted purposes are enumerated, so the agreement has to be read, not skimmed.
  • CC BY-NC-SA (AgiBot World): no commercial use, and share-alike obligations on derivatives.
  • CC BY-NC-ND (EgoDex): no commercial use and no derivatives, which is about as restrictive as open releases get.
  • Mixed aggregates (Open X-Embodiment): no single answer; the license audit happens per subset.
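
The list above can be turned into a first-pass triage, including the per-subset audit that mixed aggregates need. This is a minimal sketch and not legal advice: the `Status` categories, the normalized license keys, the `classify` and `audit` helpers, and the subset names are all hypothetical and were invented for this illustration.

```python
from enum import Enum


class Status(Enum):
    ALLOWED = "commercial training generally allowed, attribution required"
    READ_AGREEMENT = "signed agreement; permitted purposes are enumerated"
    EXCLUDED = "hard exclusion for commercial work without written permission"
    UNKNOWN = "not recognized; needs manual review"


# Hypothetical normalized keys for licenses named in this article.
KNOWN = {
    "apache-2.0": Status.ALLOWED,
    "cc-by-4.0": Status.ALLOWED,
    "ego4d-license-agreement": Status.READ_AGREEMENT,
}


def classify(license_text: str) -> Status:
    """Map a license string from a dataset card to a triage status."""
    key = license_text.strip().lower().replace(" ", "-")
    tokens = key.split("-")
    # Any NC or ND element is treated as a hard exclusion.
    if "nc" in tokens or "nd" in tokens:
        return Status.EXCLUDED
    # Anything unrecognized goes to manual review, never to ALLOWED.
    return KNOWN.get(key, Status.UNKNOWN)


def audit(subsets: dict) -> dict:
    """Group the subsets of an aggregate by triage status."""
    report = {}
    for name in sorted(subsets):
        report.setdefault(classify(subsets[name]), []).append(name)
    return report


# Hypothetical aggregate; these are not real Open X-Embodiment subsets.
example_aggregate = {
    "subset_a": "CC-BY 4.0",
    "subset_b": "CC BY-NC-SA 4.0",
    "subset_c": "custom terms",
    "subset_d": "Apache 2.0",
}
report = audit(example_aggregate)
for status, names in report.items():
    print(status.name, names)
```

The design choice worth keeping is the default: a license string the table does not recognize lands in manual review and is never silently treated as permissive.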

Three habits keep this manageable. Record the license and source of every external dataset that enters a training mix, including the date you checked it, since cards change. Treat NC and ND terms as hard exclusions for commercial work unless you have something in writing from the rights holder. And apply the same scrutiny to open data that you would to a purchase: the buyer evaluation checklist applies almost unchanged, because a dataset that costs nothing can still cost you a retraining run.
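
The first two habits fit in a small record type. The sketch below is one possible shape, not a standard: the `DatasetRecord` fields, the `hard_excluded` and `needs_recheck` helpers, the source strings, and the 90-day re-check window are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class DatasetRecord:
    name: str
    source: str                                # where the data was obtained
    license: str                               # as stated on the card on checked_on
    checked_on: date                           # cards change, so the date matters
    written_permission: Optional[str] = None   # reference to a signed document

    def hard_excluded(self) -> bool:
        """True for NC or ND terms with nothing in writing from the rights holder."""
        tokens = self.license.lower().replace(" ", "-").split("-")
        restricted = "nc" in tokens or "nd" in tokens
        return restricted and self.written_permission is None

    def needs_recheck(self, today: date, max_age_days: int = 90) -> bool:
        """True when the license check is older than the assumed window."""
        return today - self.checked_on > timedelta(days=max_age_days)


# Example entries; the source strings are placeholders.
agibot = DatasetRecord(
    "AgiBot World", "dataset card", "CC BY-NC-SA 4.0", date(2026, 6, 10)
)
droid = DatasetRecord("DROID", "project release", "CC-BY 4.0", date(2026, 6, 10))

print(agibot.hard_excluded())                   # True
print(droid.hard_excluded())                    # False
print(droid.needs_recheck(date(2026, 10, 1)))   # True, more than 90 days later
```

A record that is not hard-excluded is not automatically cleared. It only means the license carries no NC or ND element, so the remaining review still applies.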

What open data still does not give you

After all of that, a well-run humanoid team will still find gaps that no open dataset fills.

Open data is almost never captured on your embodiment, with your hands, cameras, control rates, and coordinate frames. It rarely covers your deployment environments or your exact task distribution. The strongest humanoid releases carry non-commercial terms precisely because the data is valuable. Failure and recovery coverage on your robot, fresh captures for new products, consent and provenance guarantees, exclusivity, and someone accountable for quality are all things the open pool does not provide.

That is the honest framing for the market: open datasets set the baseline and teach you what good packaging looks like. AgiBot's annotation layers and Unitree's whole-body episodes are now the reference points a commercial dataset gets compared against. Sellers who document provenance, rights, and embodiment details better than the free alternatives are the ones with a product.

How to use open data before you license more

A sensible sequence for a humanoid robotics team in 2026 looks like this.

Start with the permissively licensed sets that match your needs: DROID or Open X-Embodiment's commercially usable subsets for manipulation pretraining, NVIDIA's sim data for pipeline development, Unitree's WBT collection if you work near the G1 embodiment. Use the restrictive sets, AgiBot World and EgoDex, for research, benchmarking, and internal quality bars, and keep them out of commercial training runs. Then write down what is missing for your robot and your tasks. That gap list, not a generic appetite for more hours, is what you should buy against.
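
The gap list in that sequence is a set difference: required tasks minus what the cleared datasets already cover. The sketch below uses hypothetical dataset names and task labels, and it ignores embodiment fit and data quality, which a real review would also weigh.

```python
# Hypothetical task list for a product; not taken from any real dataset schema.
required_tasks = {"load dishwasher", "collect clothes", "pour liquid", "open drawer"}

# Hypothetical candidates. "cleared" means the license review permits
# commercial training; restrictive sets stay out of the coverage count.
candidates = {
    "permissive_wbt_subset": {
        "cleared": True,
        "tasks": {"load dishwasher", "collect clothes"},
    },
    "noncommercial_corpus": {
        "cleared": False,
        "tasks": {"pour liquid", "open drawer"},
    },
}

covered = set()
for info in candidates.values():
    if info["cleared"]:
        covered |= info["tasks"]

gap_list = sorted(required_tasks - covered)
print(gap_list)  # ['open drawer', 'pour liquid']
```

Tasks that appear only in a restricted corpus stay on the gap list, which matches the advice to keep those sets out of commercial training runs.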

If the gap list points to data you cannot collect yourself, describe it in the buyer request form and we will look for matching sellers. And if you hold humanoid or robotics datasets that cover what the open pool cannot, document the embodiment, provenance, and rights clearly and list them through the seller submission form. The open ecosystem just raised the bar for everyone; the commercial opportunity is in clearing it.