# OmniWorld

**Repository Path**: hf-datasets/OmniWorld

## Basic Information

- **Project Name**: OmniWorld
- **Description**: Mirror of https://huggingface.co/datasets/InternRobotics/OmniWorld
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-30
- **Last Updated**: 2025-10-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

---
license: cc-by-nc-sa-4.0
size_categories:
- n>1T
task_categories:
- text-to-video
- image-to-video
- image-to-3d
- robotics
- other
language:
- en
pretty_name: OmniWorld
---

OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

Yang Zhou¹ · Yifan Wang¹ · Jianjun Zhou¹,² · Wenzheng Chang¹ · Haoyu Guo¹ · Zizun Li¹ · Kaijing Ma¹ ·
Xinyue Li¹ · Yating Wang¹ · Haoyi Zhu¹ · Mingyu Liu¹,² · Dingning Liu¹ · Jiange Yang¹ · Zhoujie Fu¹ ·
Junyi Chen¹ · Chunhua Shen¹,² · Jiangmiao Pang¹ · Kaipeng Zhang¹ · Tong He¹†

¹Shanghai AI Lab  ²ZJU

# 🎉NEWS

- [2025.10.15] The **OmniWorld-Game Benchmark** is now live on Hugging Face!
- [2025.10.8] The **OmniWorld-HOI4D** and **OmniWorld-DROID** datasets are now live on Hugging Face!
- [2025.9.28] The **OmniWorld-CityWalk** dataset is now live on Hugging Face!
- [2025.9.21] 🔥 The **OmniWorld-Game** dataset now includes **5k splits** in total on Hugging Face!
- [2025.9.17] 🎉 Our dataset was ranked **#1 Paper of the Day** on 🤗 [Hugging Face Daily Papers!](https://huggingface.co/papers/2509.12201)
- [2025.9.16] 🔥 The first **1.2k splits** of **OmniWorld-Game** are now live on Hugging Face!

**We will continue to update the dataset; more data is coming soon. Stay tuned!**

# 🧭 Dataset Overview and Navigation

OmniWorld is a multi-domain and multi-modal dataset comprising several distinct sub-datasets. 🙂 indicates that the modality is newly (re-)annotated by us, ✅ denotes ground-truth data that already exists in the original dataset, and ❌ marks missing modalities.

| Dataset | Domain | # Seq. | FPS | Resolution | # Frames | Depth | Camera | Text | Opt. flow | Fg. masks | Detailed Guide |
| :-- | :-- | --: | --: | :--: | --: | :--: | :--: | :--: | :--: | :--: | :--: |
| OmniWorld-Game | Simulator | 96K | 24 | 1280 × 720 | 18,515K | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 | [→ See guide](#omniworld-game-detailed-guide) |
| AgiBot | Robot | 20K | 30 | 640 × 480 | 39,247K | 🙂 | ✅ | ✅ | ❌ | 🙂 | [TBD] |
| DROID | Robot | 35K | 60 | 1280 × 720 | 26,643K | 🙂 | ✅ | 🙂 | 🙂 | 🙂 | [→ See guide](#omniworld-droid-detailed-guide) |
| RH20T | Robot | 109K | 10 | 640 × 360 | 53,453K | ❌ | ✅ | 🙂 | 🙂 | 🙂 | [TBD] |
| RH20T-Human | Human | 73K | 10 | 640 × 360 | 8,875K | ❌ | ✅ | 🙂 | ❌ | ❌ | [TBD] |
| HOI4D | Human | 2K | 15 | 1920 × 1080 | 891K | 🙂 | 🙂 | 🙂 | 🙂 | ✅ | [→ See guide](#omniworld-hoi4d-detailed-guide) |
| Epic-Kitchens | Human | 15K | 30 | 1280 × 720 | 3,635K | ❌ | 🙂 | 🙂 | ❌ | ❌ | [TBD] |
| Ego-Exo4D | Human | 4K | 30 | 1024 × 1024 | 9,190K | ❌ | ✅ | 🙂 | 🙂 | ❌ | [TBD] |
| HoloAssist | Human | 1K | 30 | 896 × 504 | 13,037K | ❌ | 🙂 | 🙂 | 🙂 | ❌ | [TBD] |
| Assembly101 | Human | 4K | 60 | 1920 × 1080 | 110,831K | ❌ | ✅ | 🙂 | 🙂 | 🙂 | [TBD] |
| EgoDex | Human | 242K | 30 | 1920 × 1080 | 76,631K | ❌ | ✅ | 🙂 | ❌ | ❌ | [TBD] |
| CityWalk | Internet | 7K | 30 | 1280 × 720 | 13,096K | ❌ | 🙂 | ✅ | ❌ | ❌ | [→ See guide](#omniworld-citywalk-detailed-guide) |
| Game-Benchmark | Simulator | - | 24 | 1280 × 720 | - | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 | [→ See guide](#omniworld-game-benchmark-detailed-guide) |

---

# Directory Structure

This structure outlines the organization across all OmniWorld sub-datasets. Each sub-dataset (e.g., OmniWorld-Game, OmniWorld-CityWalk) maintains its own scene folders within the shared `annotations/`, `metadata/`, and `videos/` top-level directories.

```
DATA_PATH/
├─ annotations/
│  ├─ OmniWorld-Game/
│  │  ├─ b04f88d1f85a/
│  │  ├─ 52e80f590716/
│  │  └─ …                 # one folder per scene
│  ├─ OmniWorld-CityWalk/
│  └─ …
├─ metadata/
│  ├─ OmniWorld-Game_metadata.csv
│  ├─ OmniWorld-CityWalk_metadata.csv
│  └─ …
├─ videos/
│  ├─ OmniWorld-Game/
│  │  ├─ b04f88d1f85a/
│  │  ├─ 52e80f590716/
│  │  └─ …
│  ├─ OmniWorld-CityWalk/
│  └─ …
└─ README.md               # this guide
```

# Dataset Download

You can download the entire OmniWorld dataset using the following commands:

```bash
# 1. Install (if you haven't yet)
pip install --upgrade "huggingface_hub[cli]"

# 2. Full download
hf download InternRobotics/OmniWorld \
  --repo-type dataset \
  --local-dir /path/to/DATA_PATH
```

To download specific files (instead of the full dataset), please refer to the [dowanload_specific.py](https://github.com/yangzhou24/OmniWorld/blob/main/scripts/dowanload_specific.py) script provided in our GitHub repository.

# OmniWorld-Game Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-Game** dataset.

## OmniWorld-Game Organisation and File Structure

To keep the download manageable, each scene is split into multiple `.tar.gz` files:

- RGB / Depth / Flow: ≤ 2 000 images per `.tar.gz`. The naming convention follows the format: `…/__.tar.gz`
- Other Annotations: Additional data such as camera poses, masks, and text annotations are grouped together in a single file per scene: `…/_others.tar.gz`

**Metadata Explained** (`omniworld_game_metadata.csv`)

| Field Name | Description |
|---------------------|-----------------------------------------------------------------------------|
| `UID` | Scene ID (folder name). |
| `Video Path` | Relative path to the RGB frames. |
| `Annotation Path` | Relative path to all multimodal annotations. |
| `Split Img Num` | Frame count across all splits of the scene. |
| `Split Num` | Number of splits the scene was cut into. |
| `Total Img Num` | Raw frame count before splitting. |
| `Test Split Index` | Zero-based indices of splits used for the test set (comma-separated). Blank = no test split. Example: "0,5" marks `split_0` and `split_5` as test data. |
| `FPS` | Frames per second. |
| `Resolution` | `width×height` in pixels. |

## OmniWorld-Game Usage Guide

### 1. Quick-Start: Extracting One Scene

Below we extract RGB frames and all annotations for one scene (`b04f88d1f85a` in this example) to a local folder of the same name.

```bash
scene_id=b04f88d1f85a
root=/path/to/DATA_PATH          # where you store OmniWorld

mkdir -p "${scene_id}"

# --- RGB (may span several parts) ------------------------------------------
for rgb_tar in "${root}"/videos/OmniWorld-Game/${scene_id}/${scene_id}_rgb_*.tar.gz
do
  echo "Extracting $(basename "$rgb_tar")…"
  tar -xzf "$rgb_tar" -C "${scene_id}"
done

# --- Depth -----------------------------------------------------------------
for d_tar in "${root}"/annotations/OmniWorld-Game/${scene_id}/${scene_id}_depth_*.tar.gz
do
  echo "Extracting $(basename "$d_tar")…"
  tar -xzf "$d_tar" -C "${scene_id}"
done

# --- Flow ------------------------------------------------------------------
for f_tar in "${root}"/annotations/OmniWorld-Game/${scene_id}/${scene_id}_flow_*.tar.gz
do
  echo "Extracting $(basename "$f_tar")…"
  tar -xzf "$f_tar" -C "${scene_id}"
done

# --- All other annotations --------------------------------------------------
tar -xzf "${root}"/annotations/OmniWorld-Game/${scene_id}/${scene_id}_others.tar.gz -C "${scene_id}"
```

Resulting Scene Folder:

```
b04f88d1f85a/
├─ color/           # RGB frames (.png)
├─ depth/           # 16-bit depth maps
├─ flow/            # flow_u_16.png / flow_v_16.png / flow_vis.png
├─ camera/          # split_*.json (intrinsics + extrinsics)
├─ subject_masks/   # foreground masks (per split)
├─ gdino_mask/      # dynamic-object masks (per frame)
├─ text/            # structured captions (81-frame segments)
├─ droidclib/       # coarse camera poses (if you need them)
├─ fps.txt          # source video framerate
└─ split_info.json  # how frames are grouped into splits
```

### 2. Modality Details

#### 2.1. Split Information (`split_info.json`)

Each scene is divided into several high-quality "splits". `split_info.json` tells you how the original video frame indices are grouped.

```
{
  "scene_name": "b04f88d1f85a",
  "split_num": 6,
  "split": [
    [0, 1, 2, ...],      // split_0
    [316, 317, ...],     // split_1
    ...
  ]
}
```

Meaning:

- `split_num` – total number of splits in this scene.
- `split[i]` – an array with the original frame indices belonging to split `i`.

#### 2.2. Camera Poses (`camera/split_.json`)

For every split you will find a file

```
/camera/split_.json   (e.g. split_0.json)
```

containing:

- `focals` – focal length in pixels (same for x and y).
- `cx, cy` – principal point.
- `quats` – per-frame rotation as quaternions (w, x, y, z).
- `trans` – per-frame translation (x, y, z).

**Minimal Reader**

```python
import json
from pathlib import Path

import numpy as np
from scipy.spatial.transform import Rotation as R


def load_split_info(scene_dir: Path):
    """Return the split json dict."""
    with open(scene_dir / "split_info.json", "r", encoding="utf-8") as f:
        return json.load(f)


def load_camera_poses(scene_dir: Path, split_idx: int):
    """
    Returns
    -------
    intrinsics : (S, 3, 3) array, pixel-space K matrices
    extrinsics : (S, 4, 4) array, OpenCV world-to-camera matrices
    """
    # ----- read metadata -----------------------------------------------------
    split_info = load_split_info(scene_dir)
    frame_count = len(split_info["split"][split_idx])

    cam_file = scene_dir / "camera" / f"split_{split_idx}.json"
    with open(cam_file, "r", encoding="utf-8") as f:
        cam = json.load(f)

    # ----- intrinsics --------------------------------------------------------
    intrinsics = np.repeat(np.eye(3)[None, ...], frame_count, axis=0)
    intrinsics[:, 0, 0] = cam["focals"]   # fx
    intrinsics[:, 1, 1] = cam["focals"]   # fy
    intrinsics[:, 0, 2] = cam["cx"]       # cx
    intrinsics[:, 1, 2] = cam["cy"]       # cy

    # ----- extrinsics --------------------------------------------------------
    extrinsics = np.repeat(np.eye(4)[None, ...], frame_count, axis=0)

    # SciPy expects quaternions as (x, y, z, w) → convert
    quat_wxyz = np.array(cam["quats"])                      # (S, 4)  (w,x,y,z)
    quat_xyzw = np.concatenate([quat_wxyz[:, 1:], quat_wxyz[:, :1]], axis=1)

    rotations = R.from_quat(quat_xyzw).as_matrix()          # (S, 3, 3)
    translations = np.array(cam["trans"])                   # (S, 3)

    extrinsics[:, :3, :3] = rotations
    extrinsics[:, :3, 3] = translations

    return intrinsics.astype(np.float32), extrinsics.astype(np.float32)


# --------------------------- example usage -----------------------------------
if __name__ == "__main__":
    scene = Path("b04f88d1f85a")                    # adjust to your path
    K, w2c = load_camera_poses(scene, split_idx=0)  # world-to-camera transform in OpenCV format
    print("Intrinsics shape:", K.shape)
    print("Extrinsics shape:", w2c.shape)
```

#### 2.3. Depth (`depth/.png`)

- 16-bit PNG, one file per RGB frame.
- Values are stored as unsigned integers in [0, 65535].
  - `0 … 100` ≈ invalid / too close
  - `65500 … 65535` ≈ sky / too far

**Minimal Reader**

```python
import imageio.v2 as iio
import numpy as np


def load_depth(depthpath):
    """
    Returns
    -------
    depthmap : (H, W) float32
    valid    : (H, W) bool      True for reliable pixels
    """
    # Normalise the raw 16-bit values to [0, 1]
    depthmap = iio.imread(depthpath).astype(np.float32) / 65535.0
    near_mask = depthmap < 0.0015                  # 1. too close
    far_mask = depthmap > (65500.0 / 65535.0)      # 2. filter sky
    # far_mask = depthmap > np.percentile(depthmap[~far_mask], 95)  # 3. filter far area (optional)

    near, far = 1., 1000.
    depthmap = depthmap / (far - depthmap * (far - near)) / 0.004

    valid = ~(near_mask | far_mask)
    depthmap[~valid] = -1                          # mark unreliable pixels

    return depthmap, valid


# ---------------------------- example ---------------------------------------
if __name__ == "__main__":
    d, mask = load_depth("b04f88d1f85a/depth/000000.png")
    print("Depth shape:", d.shape, "valid pixels:", mask.mean() * 100, "%")
```

Feel free to tighten the `far_mask` with `np.percentile(depthmap[~far_mask], 95)` if you need a stricter “too-far” criterion.
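To illustrate how the depth maps and camera poses above fit together, here is a minimal back-projection sketch. It is not the official point-cloud script: the function name `backproject_depth` is hypothetical, and it assumes that the decoded depth is z-depth along the optical axis, that depth and translation share the same scale, and that the extrinsics are OpenCV world-to-camera matrices as stated for `load_camera_poses`. Pass it a single frame, e.g. `K[i]` and `w2c[i]`.

```python
import numpy as np


def backproject_depth(depth, valid, K, w2c):
    """Lift valid depth pixels of one frame to world-space points.

    depth : (H, W) float array, assumed to be z-depth (assumption, not confirmed by the source)
    valid : (H, W) bool mask of reliable pixels
    K     : (3, 3) pixel-space intrinsics of this frame
    w2c   : (4, 4) world-to-camera matrix of this frame
    Returns an (N, 3) array of world points, in row-major pixel order.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))          # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)             # homogeneous pixels

    rays = pix @ np.linalg.inv(K).T                         # K^-1 [u, v, 1]^T per pixel
    cam_pts = rays * depth.reshape(-1, 1)                   # scale each ray by its depth

    rot, trans = w2c[:3, :3], w2c[:3, 3]
    world_pts = (cam_pts - trans) @ rot                     # row form of R^T (x_cam - t)
    return world_pts[valid.reshape(-1)]
```

Points from several frames can then be concatenated into one cloud; whether that matches the fused result of the official script depends on the assumptions listed above.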

> We provide a script to generate a fused point cloud from camera poses and depth maps. Instructions can be found in the [Point Cloud Visualization](https://github.com/yangzhou24/OmniWorld?tab=readme-ov-file#-visualize-as-point-cloud) section of our GitHub repository.

#### 2.4. Structured Caption (`text/_.json`)

From every split we sample `81` frames and attach rich, structured captions. The general naming format of the text file is `_.json`, which means that the text describes the global video from the `start_idx` frame to the `end_idx` frame.

Each text file contains the following fields:

- `Short_Caption`: A brief summary (1–2 sentences).
- `PC_Caption`: Actions and status of the player-character.
- `Background_Caption`: Fine-grained spatial description of the scene.
- `Camera_Caption`: How the camera moves, such as zooms and rotations.
- `Video_Caption`: ≈200-word dense paragraph combining all of the above.
- `Key_Tags`: A string of tags that combines key features.

#### 2.5. Foreground Masks (`subject_masks/split_.json`)

Binary masks (white = subject, black = background) for every frame in a split. The main masked object is:

- `Human/Robotics` scenes: the active arm / robot.
- `Game` scenes: the playable character or vehicle.

**Minimal Reader**

```python
import json
from pathlib import Path

import numpy as np
from pycocotools import mask as mask_utils


def load_subject_masks(scene_dir: Path, split_idx: int):
    """
    Returns
    -------
    masks : list[np.ndarray]   (H, W) bool
    """
    seg_mask_list = []
    segmask_path = scene_dir / "subject_masks" / f"split_{split_idx}.json"
    with open(segmask_path, "r", encoding="utf-8") as f:
        seg_masks = json.load(f)

    for key in seg_masks.keys():
        seg_mask = seg_masks[key]
        # Decode the COCO run-length encoding into a dense (H, W) mask
        seg_mask = mask_utils.decode(seg_mask["mask_rle"]).astype(bool)
        seg_mask_list.append(seg_mask)

    return seg_mask_list


# ---------------------------- example ---------------------------------------
if __name__ == "__main__":
    masks = load_subject_masks(Path("b04f88d1f85a"), split_idx=0)
    print("Loaded", len(masks), "masks of shape", masks[0].shape)
```

We also release per-frame Dynamic Masks (`gdino_mask/.png`). Each RGB image in the original video is labeled with dynamic objects (such as cars, people, and animals). White represents dynamic objects, and black represents the static background. These masks can be used in conjunction with the Foreground Masks as needed.

#### 2.6. Optical Flow (`flow//...`)

For every RGB frame `t` we provide dense forward optical flow that points to frame `t + 1`.

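As a small illustration of what "forward flow pointing to frame `t + 1`" means in practice, the sketch below adds a flow field to the pixel grid to obtain where each pixel of frame `t` lands in frame `t + 1`. It assumes the flow has already been decoded to a float array of shape `(H, W, 2)` in pixel units with channel 0 = Δx and channel 1 = Δy; the function name `forward_correspondences` is hypothetical and not part of the OmniWorld code.

```python
import numpy as np


def forward_correspondences(flow):
    """Map every pixel of frame t to its position in frame t + 1.

    flow : (H, W, 2) float array, (dx, dy) in pixels (assumed layout)
    Returns
    -------
    coords    : (H, W, 2) float array of (x, y) positions in frame t + 1
    in_bounds : (H, W) bool, True where the target stays inside the image
    """
    h, w = flow.shape[:2]
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    x_next = xs + flow[..., 0]          # horizontal displacement
    y_next = ys + flow[..., 1]          # vertical displacement
    in_bounds = (x_next >= 0) & (x_next <= w - 1) & (y_next >= 0) & (y_next <= h - 1)
    return np.stack([x_next, y_next], axis=-1), in_bounds
```

The `in_bounds` mask is a simple way to discard pixels that leave the frame before using the correspondences, for example for warping or consistency checks.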
Directory layout (example for frame 0 of scene `b04f88d1f85a`):

```
b04f88d1f85a/
└─ flow/
   └─ 00000/
      ├─ flow_u_16.png   # horizontal component (u, Δx)
      ├─ flow_v_16.png   # vertical component   (v, Δy)
      └─ flow_vis.png    # ready-made RGB visualisation (for inspection only)
```

**Minimal Reader**

```python
import os

import cv2
import numpy as np

FLOW_MIN, FLOW_MAX = -300.0, 300.0   # change if you override the range


def flow_decompress(u, v, flow_min=FLOW_MIN, flow_max=FLOW_MAX):
    """
    Read uint16 image and convert back to optical flow data
    Args:
        u: np.array (np.uint16) - Optical flow horizontal component
        v: np.array (np.uint16) - Optical flow vertical component
        flow_min: float - Assumed minimum value of optical flow
        flow_max: float - Assumed maximum value of optical flow
    Returns:
        np.array (np.float32) - Optical flow data with shape (H,W,2)
    """
    u = u.astype(np.uint16)
    v = v.astype(np.uint16)

    # Map [0, 65535] back to [0, 1]
    u = u / 65535.0
    v = v / 65535.0

    # Map [0, 1] back to [flow_min, flow_max]
    u = u * (flow_max - flow_min) + flow_min
    v = v * (flow_max - flow_min) + flow_min

    res = np.stack((u, v), axis=-1)
    return res.astype(np.float32)


def load_flow(flowpath):
    of_u_path = os.path.join(flowpath, "flow_u_16.png")
    of_v_path = os.path.join(flowpath, "flow_v_16.png")

    # IMREAD_UNCHANGED keeps the 16-bit single-channel data intact
    u = cv2.imread(str(of_u_path), cv2.IMREAD_UNCHANGED).astype(np.uint16)
    v = cv2.imread(str(of_v_path), cv2.IMREAD_UNCHANGED).astype(np.uint16)
    flow = flow_decompress(u, v)

    return flow


# ---------------------------- example ---------------------------------------
if __name__ == "__main__":
    flow = load_flow("b04f88d1f85a/flow/00000")
    print("Flow shape: ", flow.shape)
```

# OmniWorld-Game Benchmark Detailed Guide

The OmniWorld-Game Benchmark is a curated subset of test splits, specifically selected from the OmniWorld-Game dataset to serve as a challenging evaluation platform, as detailed in our [paper](https://arxiv.org/abs/2509.12201).

| Task | Sequence Length | Duration | Key Modalities |
| :-- | :-- | --: | --: |
| Geometric Prediction | 384 frames | 16 seconds | RGB, Depth, Camera Poses |
| Video Generation | 81 frames | 3.4 seconds | RGB, Depth, Camera Poses, Text |

Each sequence in the benchmark is challenging, featuring rich dynamics that accurately reflect real-world complexity. The sequences are accompanied by high-fidelity ground-truth annotations for camera poses and depth.

## Data Access and Organization

The benchmark annotation data is packaged into `.tar.gz` files located under the `OmniWorld/benchmark` directory. Each archive is named in the format `_.tar.gz`.

## Extracted Directory Structure

```
_/
├─ depth/
│  ├─ 000000.npy       # (H, W) Depth map. Already processed and stored using the OmniWorld-Game Depth reading method.
│  ├─ 000001.npy
│  └─ ...
├─ image/              # High-resolution RGB frames (720×1280 pixels)
│  ├─ 000000.png
│  ├─ 000001.png
│  └─ ...
├─ camera_poses.npy    # (num_frames, 4, 4) Camera-to-World (C2W) transformation matrices.
├─ intrinsics.npy      # (num_frames, 3, 3) Intrinsic camera matrices in pixel space.
├─ text_caption.json   # The structured text caption associated with the sequence.
└─ video.mp4           # MP4 video file corresponding to the PNG frames in the 'image/' directory.
```

The depth maps are already processed and stored using the OmniWorld-Game Depth reading method.

# OmniWorld-CityWalk Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-CityWalk** dataset.

## OmniWorld-CityWalk Organisation and File Structure

The **OmniWorld-CityWalk** dataset is a collection of re-annotated data derived from a subset of the [Sekai-Real-Walking-HQ](https://github.com/Lixsp11/sekai-codebase) dataset.
You need to [download the original videos](https://github.com/Lixsp11/sekai-codebase/tree/main/dataset_downloading) and [extract the video clips](https://github.com/Lixsp11/sekai-codebase/tree/main/clip_extracting) yourself.

> **Important Note:** In this repository, we **only provide the annotated data** (e.g., camera poses, dynamic masks), and **do not include the raw RGB image files** due to licensing and size constraints. Please refer to the original project for instructions on downloading and splitting the raw video data. Our annotations are designed to align with the original video frames.

### Annotation Files

The camera annotation data is packaged in `.tar.gz` files located under `OmniWorld/annotations/OmniWorld-CityWalk/`.

* **Naming Convention**: `omniworld_citywalk__.tar.gz`, where the indices correspond to the scene index range within the metadata file.

### Scene and Split Specifications

* **Video Length**: Each source video scene is 60 seconds long.
* **Frame Rate**: 30 FPS.
* **Total Frames**: 1800 frames per scene.
* **Split Strategy**: Each scene is divided into **6 splits of 300 frames each** for detailed annotation.

**Metadata Explained** (`omniworld_citywalk_metadata.csv`)

| Field Name | Description |
| :--- | :--- |
| `index` | The sequential index number of the scene. |
| `videoFile` | The video file name, formatted as `__`. The corresponding source video on YouTube can be accessed via `https://www.youtube.com/watch?v=`. |
| `cameraFile` | The directory name for the camera annotation data, which is named after the video file. |
| `caption` | The dense text description/caption for the video segment. |
| `location` | The geographical location where the video was filmed. |
| `crowdDensity` | An assessment of the crowd/people density within the video. |
| `weather` | The general weather condition (e.g., sunny, overcast). |
| `timeOfDay` | The time of day when the video was recorded (e.g., morning, afternoon). |

## OmniWorld-CityWalk Usage Guide

### 1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding `.tar.gz` archive. After extracting one `omniworld_citywalk__.tar.gz` file, the resulting folder structure for each individual scene within the archive is as follows:

```
xpPEhccDNak_0023550_0025350/  # Example scene name (videoFile)
├─ gdino_mask/          # Per-frame dynamic-object masks (.png)
├─ recon/               # Camera and 3D reconstruction data per split
│  ├─ split_0/
│  │  ├─ extrinsics.npz # Per-frame camera extrinsics: (frame_num, 3, 4) in OpenCV world-to-camera format
│  │  ├─ intrinsics.npz # Per-frame camera intrinsics: (frame_num, 3, 3) in pixel units
│  │  └─ points3D_ba.ply # Sparse and accurate point cloud data after Bundle Adjustment (BA) for this split
│  ├─ split_1/
│  │  └─ ...
│  └─ ...
├─ image_list.json      # Defines the frame naming convention (e.g., 000000.png to 001799.png)
└─ split_info.json      # Records how frames are grouped into 300-frame splits
```

### 2. Modality Details

#### 2.1. Split Information (`split_info.json`)

Scene frames are segmented into 300-frame splits for annotation. The mapping and division information is stored in `split_info.json`.

#### 2.2. Camera Poses (`recon/split_/...`)

Camera poses are provided as NumPy compressed files (`.npz`) containing the extrinsics (world-to-camera rotation and translation) and intrinsics (focal length and principal point).

**Minimal Reader**

```python
import numpy as np

# Load Extrinsics (World-to-Camera Transform in OpenCV format)
extrinsics = np.load("recon/split_0/extrinsics.npz")['extrinsics']  # Shape: (frame_num, 3, 4)

# Load Intrinsics (in Pixel Units)
intrinsics = np.load("recon/split_0/intrinsics.npz")['intrinsics']  # Shape: (frame_num, 3, 3)

print("Extrinsics shape:", extrinsics.shape)
print("Intrinsics shape:", intrinsics.shape)
```

# OmniWorld-HOI4D Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-HOI4D** dataset.

## OmniWorld-HOI4D Organisation and File Structure

The **OmniWorld-HOI4D** dataset is a collection of re-annotated data derived from the [HOI4D](https://hoi4d.github.io/) dataset. **You need to download the original videos yourself.**

> **Important Note:** In this repository, we **only provide the annotated data** (e.g., camera poses, flow, depth, text), and **do not include the raw RGB image files** due to licensing and size constraints. Please refer to the original project for instructions on downloading the raw video data. Our annotations are designed to align with the original video frames.

### Annotation Files

The annotation data is packaged in `.tar.gz` files located under `OmniWorld/annotations/OmniWorld-HOI4D/`.

* **Naming Convention**: `omniworld_hoi4d__.tar.gz`, where the indices correspond to the scene index range within the metadata file.

### Scene and Split Specifications

* **Total Frames**: 300 frames per scene.
* **Split Strategy**: Each scene consists of **a single split of 300 frames** for detailed annotation.

**Metadata Explained** (`omniworld_hoi4d_metadata.csv`)

| Field Name | Description |
| :--- | :--- |
| `Index` | The sequential index number of the scene. |
| `Video Path` | The relative path of the scene in the original HOI4D dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: `ZY20210800001/H1/C1/N19/S100/s02/T1` |
| `Annotation Path` | The directory name for this scene's annotations inside the extracted `.tar.gz` archive. This is generated by replacing all `/` in the Video Path with `_`. Example: `ZY20210800001_H1_C1_N19_S100_s02_T1` |

## OmniWorld-HOI4D Usage Guide

### 1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding `.tar.gz` archive. After extracting one `omniworld_hoi4d__.tar.gz` file, the resulting folder structure for each individual scene within the archive is as follows:

```
# e.g., ZY20210800001_H1_C1_N19_S100_s02_T1
|
├── camera/
│   ├── recon/
│   │   └── split_0/
│   │       └── info.json     # Camera intrinsics and extrinsics for all 300 frames.
│   ├── image_list.json       # Ordered list of corresponding image filenames.
│   └── split_info.json       # Defines the frame segmentation (HOI4D is one 300-frame split).
|
├── flow/                     # Just like OmniWorld-Game.
│   ├── 00000/
│   │   ├── flow_u_16.png     # Optical flow (horizontal component).
│   │   ├── flow_v_16.png     # Optical flow (vertical component).
│   │   └── flow_vis.png      # Visualization of the optical flow.
│   ├── 00001/
│   ... (up to frame 299)
|
├── prior_depth/
│   ├── 00000.png             # Monocular depth map for frame 0.
│   ├── 00001.png             # Monocular depth map for frame 1.
│   ... (up to frame 299)
|
└── text/                     # Just like OmniWorld-Game.
    ├── 0_80.txt              # Text description for frames 0-80.
    ├── 120_200.txt           # Text description for frames 120-200.
    ...
```

### 2. Modality Details

#### 2.1. Split Information (`split_info.json`)

Scene frames are segmented into 300-frame splits for annotation. The mapping and division information is stored in `split_info.json`. Each HOI4D scene consists of a single 300-frame split.

#### 2.2. Camera Poses (`info.json`)

**Minimal Reader**

```python
import json

import torch


def load_camera_info(info_json_path: str):
    """
    Parses an info.json file to extract camera intrinsics and extrinsics.
    """
    with open(info_json_path, 'r') as f:
        info_data = json.load(f)

    # Extrinsics are provided as a list of 4x4 world-to-camera matrices (OpenCV convention)
    extrinsics = torch.tensor(info_data['extrinsics'])  # Shape: (num_frames, 4, 4)
    num_frames = extrinsics.shape[0]

    # Relies on the values being stored in the order fx, fy, cx, cy
    fx, fy, cx, cy = info_data['crop_intrinsic'].values()

    intrinsic = torch.eye(3)
    intrinsic[0, 0] = fx
    intrinsic[0, 2] = cx
    intrinsic[1, 1] = fy
    intrinsic[1, 2] = cy

    # Repeat the intrinsic matrix for each frame
    intrinsics = intrinsic.unsqueeze(0).repeat(num_frames, 1, 1)  # Shape: (num_frames, 3, 3)

    return intrinsics, extrinsics


# Example usage:
annotation_path = "ZY20210800001_H1_C1_N19_S100_s02_T1"
info_path = f"{annotation_path}/camera/recon/split_0/info.json"

intrinsics, extrinsics = load_camera_info(info_path)
print("Intrinsics shape:", intrinsics.shape)
print("Extrinsics shape:", extrinsics.shape)
```

# OmniWorld-DROID Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-DROID** dataset.

## OmniWorld-DROID Organisation and File Structure

The **OmniWorld-DROID** dataset is a collection of re-annotated data derived from the [DROID](https://droid-dataset.github.io/) dataset. **You need to download the original videos yourself.**

> **Important Note:** In this repository, we **only provide the annotated data** (e.g., flow, depth, text, mask), and **do not include the raw RGB image files** due to licensing and size constraints. Please refer to the original project for instructions on downloading the raw video data. Our annotations are designed to align with the original video frames.

### Annotation Files

The annotation data is packaged in `.tar.gz` files located under `OmniWorld/annotations/OmniWorld-DROID/`.

* **Naming Convention**: `omniworld_droid__.tar.gz`, where the indices correspond to the scene index range within the metadata file.

**Metadata Explained** (`omniworld_droid_metadata.csv`)

| Field Name | Description |
| :--- | :--- |
| `Index` | The sequential index number of the scene. |
| `Video Path` | The relative path of the scene in the original DROID dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: `droid_raw/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/` |
| `Annotation Path` | The directory name for this scene's annotations inside the extracted `.tar.gz` archive. Example: `droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/` |
| `Img Num` | The total number of image frames from one camera perspective in the scene. |

## OmniWorld-DROID Usage Guide

### 1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding `.tar.gz` archive. After extracting one `omniworld_droid__.tar.gz` file, the resulting folder structure for each individual scene within the archive is as follows:

```
/ # e.g., droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/
|
├── flow/                         # Just like OmniWorld-Game
│   └── /                         # e.g., 18026681, 22008760, etc.
│       ├── 0/
│       │   ├── flow_u_16.png     # Optical flow (horizontal component) for frame 0
│       │   ├── flow_v_16.png     # Optical flow (vertical component) for frame 0
│       │   └── flow_vis.png      # Visualization of the optical flow for frame 0
│       ├── 1/
│       ... (up to Img Num - 1)
|
├── foundation_stereo/
│   └── /
│       ├── 0.png                 # Monocular depth map for frame 0
│       ├── 1.png                 # Monocular depth map for frame 1
│       ... (up to Img Num - 1)
|
├── robot_masks/                  # Just like OmniWorld
│   └── /
│       ├── mask_prompt.json
│       └── tracked_masks_coco.json
|
├── text/
│   └── /                         # e.g., ext1_cam_serial, wrist_cam_serial
│       ├── 0-161.txt             # Short caption for frames 0-161
│       └── 40-201.txt            # Short caption for frames 40-201
|
├── _totalcaption.txt             # Long-form, summary caption for the entire scene from one camera's perspective
├── meta_info.json                # General metadata for the scene
...
```

# License

The OmniWorld dataset is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**. By accessing or using this dataset, you agree to be bound by the terms and conditions outlined in this license, as well as the specific provisions detailed below.

- **Special Note on Third-Party Content**: A portion of this dataset is derived from third-party game content. All intellectual property rights pertaining to these original game assets (including, but not limited to, RGB and depth images) remain with their respective original game developers and publishers.
- **Permitted Uses**: You are hereby granted permission, free of charge, to use, reproduce, and share the OmniWorld dataset and any adaptations thereof, solely for non-commercial research and educational purposes. This includes, but is not limited to, academic publications, algorithm benchmarking, and reproduction of scientific results.

Under this license, you are expressly **forbidden** from:

- Using the dataset, in whole or in part, for any commercial purpose, including but not limited to its incorporation into commercial products, services, or monetized applications.
- Redistributing the original third-party game assets contained within the dataset outside the scope of legitimate research sharing.
- Removing or altering any copyright, license, or attribution notices.

The authors of the OmniWorld dataset provide this dataset "as is" and make no representations or warranties regarding the legality of the underlying data for any specific purpose. Users are solely responsible for ensuring that their use of the dataset complies with all applicable laws and the terms of service or license agreements of the original game publishers (sources of third-party content).

For the full legal text of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, please visit: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

# Citation

If you found this dataset useful, please cite our paper:

```bibtex
@misc{zhou2025omniworld,
      title={OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling},
      author={Yang Zhou and Yifan Wang and Jianjun Zhou and Wenzheng Chang and Haoyu Guo and Zizun Li and Kaijing Ma and Xinyue Li and Yating Wang and Haoyi Zhu and Mingyu Liu and Dingning Liu and Jiange Yang and Zhoujie Fu and Junyi Chen and Chunhua Shen and Jiangmiao Pang and Kaipeng Zhang and Tong He},
      year={2025},
      eprint={2509.12201},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.12201},
}
```