OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling

# OmniWorld

**Repository Path**: hf-datasets/OmniWorld

## Basic Information

- **Project Name**: OmniWorld
- **Description**: Mirror of https://huggingface.co/datasets/InternRobotics/OmniWorld
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-09-30
- **Last Updated**: 2025-10-16

## README

---
license: cc-by-nc-sa-4.0
size_categories:
- n>1T
task_categories:
- text-to-video
- image-to-video
- image-to-3d
- robotics
- other
language:
- en
pretty_name: OmniWorld
---

<h1 align='center'>OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling</h1>
<div align='center'>
    <a href='https://github.com/yangzhou24' target='_blank'>Yang Zhou</a><sup>1</sup> 
    <a href='https://github.com/yyfz' target='_blank'>Yifan Wang</a><sup>1</sup> 
    <a href='https://zhoutimemachine.github.io' target='_blank'>Jianjun Zhou</a><sup>1,2</sup> 
    <a href='https://github.com/AmberHeart' target='_blank'>Wenzheng Chang</a><sup>1</sup> 
    <a href='https://github.com/ghy0324' target='_blank'>Haoyu Guo</a><sup>1</sup> 
    <a href='https://github.com/LiZizun' target='_blank'>Zizun Li</a><sup>1</sup> 
    <a href='https://kaijing.space/' target='_blank'>Kaijing Ma</a><sup>1</sup> 
    
</div>
<div align='center'>
<a href='https://scholar.google.com/citations?user=VuTRUg8AAAAJ' target='_blank'>Xinyue Li</a><sup>1</sup> 
    <a href='https://scholar.google.com/citations?user=5SuBWh0AAAAJ&hl=en' target='_blank'>Yating Wang</a><sup>1</sup> 
    <a href='https://www.haoyizhu.site/' target='_blank'>Haoyi Zhu</a><sup>1</sup> 
    <a href='https://mingyulau.github.io/' target='_blank'>Mingyu Liu</a><sup>1,2</sup> 
    <a href='https://scholar.google.com/citations?user=FbSpETgAAAAJ' target='_blank'>Dingning Liu</a><sup>1</sup> 
    <a href='https://yangjiangeyjg.github.io/' target='_blank'>Jiange Yang</a><sup>1</sup> 
    <a href='https://github.com/Kr1sJFU' target='_blank'>Zhoujie Fu</a><sup>1</sup> 
    
</div>
<div align='center'>
    <a href='https://sotamak1r.github.io/' target='_blank'>Junyi Chen</a><sup>1</sup> 
    <a href='https://cshen.github.io' target='_blank'>Chunhua Shen</a><sup>1,2</sup> 
    <a href='https://oceanpang.github.io' target='_blank'>Jiangmiao Pang</a><sup>1</sup> 
    <a href='https://kpzhang93.github.io/' target='_blank'>Kaipeng Zhang</a><sup>1</sup>
    <a href='https://tonghe90.github.io/' target='_blank'>Tong He</a><sup>1†</sup>
</div>
<div align='center'>
    <sup>1</sup>Shanghai AI Lab  <sup>2</sup>ZJU 
</div>
<br>
<div align="center">
  <a href="https://yangzhou24.github.io/OmniWorld/"><img src="https://img.shields.io/badge/Project Page-5745BB?logo=google-chrome&logoColor=white"></a>  
  <a href="https://arxiv.org/abs/2509.12201"><img src="https://img.shields.io/static/v1?label=Paper&message=Arxiv&color=red&logo=arxiv"></a>  
  <a href="https://github.com/yangzhou24/OmniWorld"><img src="https://img.shields.io/static/v1?label=Code&message=Github&color=blue&logo=github"></a>  
  <a href="https://huggingface.co/datasets/InternRobotics/OmniWorld"><img src="https://img.shields.io/static/v1?label=Dataset&message=HuggingFace&color=yellow&logo=huggingface"></a>  
</div>


# 🎉NEWS
- [2025.10.15] The **OmniWorld-Game Benchmark** is now live on Hugging Face!
- [2025.10.8] The **OmniWorld-HOI4D** and **OmniWorld-DROID** datasets are now live on Hugging Face!
- [2025.9.28] The **OmniWorld-CityWalk** dataset is now live on Hugging Face!
- [2025.9.21] 🔥 The **OmniWorld-Game** dataset now includes **5k splits** in total on Hugging Face!
- [2025.9.17] 🎉 Our dataset was ranked **#1 Paper of the Day** on 🤗 [Hugging Face Daily Papers!](https://huggingface.co/papers/2509.12201)
- [2025.9.16] 🔥 The first **1.2k splits** of **OmniWorld-Game** are now live on Hugging Face! **We will keep updating the dataset; more data is coming soon. Stay tuned!**

# 🧭 Dataset Overview and Navigation

OmniWorld is a multi-domain and multi-modal dataset comprising several distinct sub-datasets. 🙂 indicates the modality is newly (re-)annotated by us, ✅ denotes ground-truth data that already exists in the original dataset, ❌ marks missing modalities.

| Dataset | Domain | # Seq. | FPS | Resolution | # Frames | Depth | Camera | Text | Opt. flow | Fg. masks | Detailed Guide |
| :-- | :-- | --: | --: | :--: | --: | :--: | :--: | :--: | :--: | :--: | :--: |
| OmniWorld-Game | Simulator | 96K | 24 | 1280 × 720 | 18,515K | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 | [→ See guide](#omniworld-game-detailed-guide) |
| AgiBot | Robot | 20K | 30 | 640 × 480 | 39,247K | 🙂 | ✅ | ✅ | ❌ | 🙂 | [TBD] |
| DROID | Robot | 35K | 60 | 1280 × 720 | 26,643K | 🙂 | ✅ | 🙂 | 🙂 | 🙂 | [→ See guide](#omniworld-droid-detailed-guide) |
| RH20T | Robot | 109K | 10 | 640 × 360 | 53,453K | ❌ | ✅ | 🙂 | 🙂 | 🙂 | [TBD] |
| RH20T-Human | Human | 73K | 10 | 640 × 360 | 8,875K | ❌ | ✅ | 🙂 | ❌ | ❌ | [TBD] |
| HOI4D | Human | 2K | 15 | 1920 × 1080 | 891K | 🙂 | 🙂 | 🙂 | 🙂 | ✅ | [→ See guide](#omniworld-hoi4d-detailed-guide) |
| Epic-Kitchens | Human | 15K | 30 | 1280 × 720 | 3,635K | ❌ | 🙂 | 🙂 | ❌ | ❌ | [TBD] |
| Ego-Exo4D | Human | 4K | 30 | 1024 × 1024 | 9,190K | ❌ | ✅ | 🙂 | 🙂 | ❌ | [TBD] |
| HoloAssist | Human | 1K | 30 | 896 × 504 | 13,037K | ❌ | 🙂 | 🙂 | 🙂 | ❌ | [TBD] |
| Assembly101 | Human | 4K | 60 | 1920 × 1080 | 110,831K | ❌ | ✅ | 🙂 | 🙂 | 🙂 | [TBD] |
| EgoDex | Human | 242K | 30 | 1920 × 1080 | 76,631K | ❌ | ✅ | 🙂 | ❌ | ❌ | [TBD] |
| CityWalk | Internet | 7K | 30 | 1280 × 720 | 13,096K | ❌ | 🙂 | ✅ | ❌ | ❌ | [→ See guide](#omniworld-citywalk-detailed-guide) |
| Game-Benchmark | Simulator | - | 24 | 1280 × 720 | - | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 | [→ See guide](#omniworld-game-benchmark-detailed-guide) |

---

# Directory Structure
This structure outlines the organization across all OmniWorld sub-datasets. Each sub-dataset (e.g., OmniWorld-Game, OmniWorld-CityWalk) maintains its unique scene folders within the shared `annotations/`, `metadata/`, and `videos/` top-level directories.

```
DATA_PATH/
├─ annotations/
│  ├─ OmniWorld-Game/
│  │  ├─ b04f88d1f85a/
│  │  ├─ 52e80f590716/
│  │  └─ …                   # one folder per scene
│  ├─ OmniWorld-CityWalk/
│  └─ …
├─ metadata/
│  ├─ OmniWorld-Game_metadata.csv
│  ├─ OmniWorld-CityWalk_metadata.csv
│  └─ …
├─ videos/
│  ├─ OmniWorld-Game/
│  │  ├─ b04f88d1f85a/
│  │  ├─ 52e80f590716/
│  │  └─ …
│  ├─ OmniWorld-CityWalk/
│  └─ …
└─ README.md                # this guide
```


# Dataset Download
You can download the entire OmniWorld dataset using the following command:
```bash
# 1. Install (if you haven't yet)
pip install --upgrade "huggingface_hub[cli]"

# 2. Full download
hf download InternRobotics/OmniWorld \
           --repo-type dataset \
           --local-dir /path/to/DATA_PATH
```
To download specific files instead of the full dataset, see the [dowanload_specific.py](https://github.com/yangzhou24/OmniWorld/blob/main/scripts/dowanload_specific.py) script in our GitHub repository.
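
As a complementary sketch (not the official script), a single scene can be fetched by passing `allow_patterns` to `huggingface_hub.snapshot_download`. The helper names `scene_patterns` and `download_scene` are hypothetical, and the patterns assume that the repository follows the directory structure shown above.

```python
def scene_patterns(scene_id: str, subset: str = "OmniWorld-Game") -> list[str]:
    """Glob patterns covering one scene (assumes the layout shown above)."""
    return [
        f"videos/{subset}/{scene_id}/*",
        f"annotations/{subset}/{scene_id}/*",
        "metadata/*",  # also fetch the metadata CSVs
    ]


def download_scene(scene_id: str, local_dir: str) -> str:
    """Download one scene only and return the local snapshot path."""
    # Imported lazily so that scene_patterns() works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="InternRobotics/OmniWorld",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=scene_patterns(scene_id),
    )


if __name__ == "__main__":
    # Only prints the patterns; call download_scene(...) to start a download.
    print(scene_patterns("b04f88d1f85a"))
```

Calling `download_scene("b04f88d1f85a", "/path/to/DATA_PATH")` would then place the matching archives under the same relative paths as a full download.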

# OmniWorld-Game Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-Game** dataset.

## OmniWorld-Game Organisation and File Structure

To keep the download manageable, each scene is split into multiple `.tar.gz` files:

- RGB / Depth / Flow : ≤ 2 000 images per `.tar.gz`. The naming convention follows the format: `…/<scene_id>_<modality>_<part_idx>.tar.gz`

- Other Annotations: Additional data such as camera poses, masks, and text annotations are grouped together in a single file per scene: `…/<scene_id>_others.tar.gz`
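
The naming convention above can be parsed with a small regular expression, which helps when grouping the parts of one modality. This is a sketch under assumptions: `parse_archive_name` is a hypothetical helper, the modality tokens `rgb`, `depth`, and `flow` are taken from the extraction commands in the usage guide, and scene IDs are assumed to contain no underscores.

```python
import re
from typing import NamedTuple, Optional

# <scene_id>_<modality>_<part_idx>.tar.gz  or  <scene_id>_others.tar.gz
_ARCHIVE_RE = re.compile(
    r"^(?P<scene>[^_]+)_(?P<modality>rgb|depth|flow|others)(?:_(?P<part>\d+))?\.tar\.gz$"
)


class ArchiveName(NamedTuple):
    scene_id: str
    modality: str
    part_idx: Optional[int]  # None for the single "others" archive


def parse_archive_name(filename: str) -> ArchiveName:
    """Split an archive file name into scene ID, modality, and part index."""
    match = _ARCHIVE_RE.match(filename)
    if match is None:
        raise ValueError(f"unrecognised archive name: {filename}")
    part = match.group("part")
    return ArchiveName(
        match.group("scene"),
        match.group("modality"),
        int(part) if part is not None else None,
    )


print(parse_archive_name("b04f88d1f85a_depth_3.tar.gz"))
print(parse_archive_name("b04f88d1f85a_others.tar.gz"))
```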

**Metadata Explained** (`omniworld_game_metadata.csv`)
| Field Name          | Description                                                                 |
|---------------------|-----------------------------------------------------------------------------|
| `UID`               | Scene ID (folder name). |
| `Video Path`        | Relative path to the RGB frames.      |
| `Annotation Path`   | Relative path to all multimodal annotations.|
| `Split Img Num`     | Frame count across all splits of the scene.                                   |
| `Split Num`         | Number of splits the scene was cut into.                                 |
| `Total Img Num`        | Raw frame count before splitting.                           |
| `Test Split Index`   | Zero-based indices of the splits used for the test set (comma-separated). Blank = no test split. Example: "0,5" marks `split_0` and `split_5` as test data.          |
| `FPS`   | Frames per second.                      |
| `Resolution`      | `width×height` in pixels.              |

## OmniWorld-Game Usage Guide

### 1. Quick-Start: Extracting One Scene
Below we extract RGB frames and all annotations for scene `<scene_id>` to a local folder of the same name.
```bash
scene_id=b04f88d1f85a
root=/path/to/DATA_PATH        # where you store OmniWorld

mkdir -p ${scene_id}

# --- RGB (may span several parts) ------------------------------------------
for rgb_tar in ${root}/videos/OmniWorld-Game/${scene_id}/${scene_id}_rgb_*.tar.gz
do
    echo "Extracting $(basename $rgb_tar)…"
    tar -xzf "$rgb_tar" -C ${scene_id}
done

# --- Depth -----------------------------------------------------------------
for d_tar in ${root}/annotations/OmniWorld-Game/${scene_id}/${scene_id}_depth_*.tar.gz
do
    echo "Extracting $(basename $d_tar)…"
    tar -xzf "$d_tar" -C ${scene_id}
done

# --- Flow ------------------------------------------------------------------
for f_tar in ${root}/annotations/OmniWorld-Game/${scene_id}/${scene_id}_flow_*.tar.gz
do
    echo "Extracting $(basename $f_tar)…"
    tar -xzf "$f_tar" -C ${scene_id}
done

# --- All other annotations --------------------------------------
tar -xzf ${root}/annotations/OmniWorld-Game/${scene_id}/${scene_id}_others.tar.gz -C ${scene_id}
```
Resulting Scene Folder: 
```
b04f88d1f85a/
├─ color/              # RGB frames (.png)
├─ depth/              # 16-bit depth maps
├─ flow/               # flow_u_16.png / flow_v_16.png / flow_vis.png
├─ camera/             # split_*.json (intrinsics + extrinsics)
├─ subject_masks/      # foreground masks (per split)
├─ gdino_mask/         # dynamic-object masks (per frame)
├─ text/               # structured captions (81-frame segments)
├─ droidclib/          # coarse camera poses (if you need them)
├─ fps.txt             # source video framerate
└─ split_info.json     # how frames are grouped into splits
```

### 2. Modality Details

#### 2.1. Split Information (`split_info.json`)

Each scene is divided into several high-quality "splits". `split_info.json` tells you how the original video indices are grouped.

```
{
  "scene_name": "b04f88d1f85a",
  "split_num": 6,
  "split": [
    [0, 1, 2, ...],          // split_0
    [316, 317, ...],         // split_1
    ...
  ]
}
```
Meaning:

- `split_num` – total number of splits in this scene.
- `split[i]` – an array with the original frame indices belonging to `split i`.
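
Since per-split annotations are indexed by position inside a split, it can help to invert this mapping. The following sketch builds a lookup from an original frame index to its split and local position; `build_frame_lookup` is a hypothetical helper and the toy dictionary only imitates the structure of `split_info.json`.

```python
def build_frame_lookup(split_info: dict) -> dict[int, tuple[int, int]]:
    """Map original frame index -> (split index, position inside that split)."""
    lookup = {}
    for split_idx, frames in enumerate(split_info["split"]):
        for local_idx, frame_idx in enumerate(frames):
            lookup[frame_idx] = (split_idx, local_idx)
    return lookup


# Toy stand-in for a real split_info.json (indices are made up).
toy = {"scene_name": "toy", "split_num": 2, "split": [[0, 1, 2], [316, 317]]}
lookup = build_frame_lookup(toy)
print(lookup[317])        # (1, 1)
print(lookup.get(100))    # None: frame 100 belongs to no split
```

Frames that appear in no split return `None` from `lookup.get`, so they are easy to skip.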

#### 2.2. Camera Poses (`camera/split_<idx>.json`)

For every split you will find a file
```
<scene_name>/camera/split_<idx>.json   (e.g. split_0.json)
```
containing:
- `focals` – focal length in pixels (same for x and y).
- `cx, cy` – principal point.
- `quats` – per-frame rotation as quaternions (w, x, y, z).
- `trans` – per-frame translation (x, y, z).

**Minimal Reader**

```python
import json
from pathlib import Path

import numpy as np
from scipy.spatial.transform import Rotation as R


def load_split_info(scene_dir: Path):
    """Return the split json dict."""
    with open(scene_dir / "split_info.json", "r", encoding="utf-8") as f:
        return json.load(f)


def load_camera_poses(scene_dir: Path, split_idx: int):
    """
    Returns
    -------
    intrinsics : (S, 3, 3) array, pixel-space K matrices
    extrinsics : (S, 4, 4) array, OpenCV world-to-camera matrices
    """
    # ----- read metadata -----------------------------------------------------
    split_info = load_split_info(scene_dir)
    frame_count = len(split_info["split"][split_idx])

    cam_file = scene_dir / "camera" / f"split_{split_idx}.json"
    with open(cam_file, "r", encoding="utf-8") as f:
        cam = json.load(f)

    # ----- intrinsics --------------------------------------------------------
    intrinsics = np.repeat(np.eye(3)[None, ...], frame_count, axis=0)
    intrinsics[:, 0, 0] = cam["focals"]          # fx
    intrinsics[:, 1, 1] = cam["focals"]          # fy
    intrinsics[:, 0, 2] = cam["cx"]              # cx
    intrinsics[:, 1, 2] = cam["cy"]              # cy

    # ----- extrinsics --------------------------------------------------------
    extrinsics = np.repeat(np.eye(4)[None, ...], frame_count, axis=0)

    # SciPy expects quaternions as (x, y, z, w) → convert
    quat_wxyz = np.array(cam["quats"])           # (S, 4)  (w,x,y,z)
    quat_xyzw = np.concatenate([quat_wxyz[:, 1:], quat_wxyz[:, :1]], axis=1)

    rotations = R.from_quat(quat_xyzw).as_matrix()      # (S, 3, 3)
    translations = np.array(cam["trans"])               # (S, 3)

    extrinsics[:, :3, :3] = rotations
    extrinsics[:, :3, 3] = translations

    return intrinsics.astype(np.float32), extrinsics.astype(np.float32)


# --------------------------- example usage -----------------------------------
if __name__ == "__main__":
    scene = Path("b04f88d1f85a")   # adjust to your path
    K, w2c = load_camera_poses(scene, split_idx=0)      # world-to-camera transform in OpenCV format
    print("Intrinsics shape:", K.shape)
    print("Extrinsics shape:", w2c.shape)
```
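
The `(w, x, y, z)` ordering matters because the reader has to reorder the components before handing them to SciPy. As a sanity check that does not depend on SciPy, the standard unit-quaternion formula can be written in plain Python. `quat_wxyz_to_matrix` is a hypothetical helper for illustration, not part of the dataset tooling.

```python
import math


def quat_wxyz_to_matrix(q):
    """Rotation matrix (3x3 nested lists) from a (w, x, y, z) quaternion."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalise defensively
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


# A 90-degree rotation about the z-axis, written scalar-first.
half = math.radians(90) / 2
rot = quat_wxyz_to_matrix((math.cos(half), 0.0, 0.0, math.sin(half)))
for row in rot:
    print([round(v, 6) for v in row])
```

If a quaternion from `quats` gives a different matrix here than in `load_camera_poses`, the component order is the first thing to check.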

#### 2.3. Depth (`depth/<frame_idx>.png`)

- 16-bit PNG, one file per RGB frame.
- Values are stored as unsigned integers in [0, 65535].

   &ensp;&ensp;&ensp;`0 … 100`  ≈ invalid / too close

   &ensp;&ensp;&ensp;`65500 … 65535` ≈ sky / too far

**Minimal Reader**

```python
import imageio.v2 as iio
import numpy as np
from pathlib import Path


def load_depth(depthpath):
    """
    Returns
    -------
    depthmap : (H, W) float32
    valid   : (H, W) bool      True for reliable pixels
    """

    # Read the 16-bit PNG and normalise the raw values to [0, 1].
    depthmap = iio.imread(depthpath).astype(np.float32) / 65535.0
    near_mask = depthmap < 0.0015   # 1. too close
    far_mask = depthmap > (65500.0 / 65535.0) # 2. filter sky
    # far_mask = depthmap > np.percentile(depthmap[~far_mask], 95) # 3. filter far area (optional)
    # Convert the normalised value to depth using the near / far planes.
    near, far = 1., 1000.
    depthmap = depthmap / (far - depthmap * (far - near)) / 0.004

    valid = ~(near_mask | far_mask)
    depthmap[~valid] = -1  # mark unreliable pixels

    return depthmap, valid

# ---------------------------- example ---------------------------------------
if __name__ == "__main__":
    d, mask = load_depth("b04f88d1f85a/depth/000000.png")
    print("Depth shape:", d.shape, "valid pixels:", mask.mean() * 100, "%")

```
Feel free to tighten the `far_mask` with `np.percentile(depthmap[~far_mask], 95)` if you need a stricter “too-far” criterion.
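
To see what the conversion does to individual values, here is a scalar sketch of the same formula and thresholds in plain Python. `decode_depth_value` is a hypothetical helper that mirrors the reader above for a single pixel; it makes no claim about the physical unit of the result.

```python
NEAR, FAR = 1.0, 1000.0  # same constants as the reader above


def decode_depth_value(raw: int) -> float:
    """Depth for one raw 16-bit value, or -1.0 if the value is flagged invalid."""
    d = raw / 65535.0
    if d < 0.0015 or d > 65500.0 / 65535.0:
        return -1.0  # too close, or sky / too far
    return d / (FAR - d * (FAR - NEAR)) / 0.004


for raw in (50, 1000, 32768, 65000, 65535):
    print(raw, decode_depth_value(raw))
```

The mapping is strongly non-linear: most of the raw range corresponds to small depths, and depth grows quickly as the raw value approaches the far threshold.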

> We provide a script to generate a fused point cloud from camera poses and depth maps. Instructions can be found in the [Point Cloud Visualization](https://github.com/yangzhou24/OmniWorld?tab=readme-ov-file#-visualize-as-point-cloud) section of our GitHub repository.

#### 2.4. Structured Caption (`text/<start_idx>_<end_idx>.json`)

From every split we sample `81` frames and attach rich, structured captions.

Each text file is named `<start_idx>_<end_idx>.json` and describes frames `start_idx` through `end_idx` of the original (unsplit) video.

Each text file contains the following fields:
- `Short_Caption`: A brief summary (1–2 sentences).
- `PC_Caption`: Actions and status of the player-character.
- `Background_Caption`: Fine-grained spatial description of the scene.
- `Camera_Caption`: How the camera moves, such as zooms and rotations.
- `Video_Caption`: ≈200-word dense paragraph combining all of the above.
- `Key_Tags`: string of tags that combines key features.
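
A minimal sketch for reading these files with the standard library follows. The helper names are hypothetical, the demo file is made up, and the frame range is assumed to be inclusive (so that an 81-frame segment is named like `0_80.json`).

```python
import json
import tempfile
from pathlib import Path


def parse_caption_range(path: Path) -> tuple[int, int]:
    """`<start_idx>_<end_idx>.json` -> (start_idx, end_idx), assumed inclusive."""
    start, end = path.stem.split("_")
    return int(start), int(end)


def load_captions(text_dir: Path) -> list[tuple[int, int, dict]]:
    """All caption files in a `text/` folder, sorted by start frame."""
    items = []
    for path in text_dir.glob("*.json"):
        start, end = parse_caption_range(path)
        with open(path, "r", encoding="utf-8") as f:
            items.append((start, end, json.load(f)))
    return sorted(items, key=lambda item: item[0])


# Self-contained demo with a made-up caption file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "0_80.json"
    demo.write_text(json.dumps({"Short_Caption": "placeholder"}), encoding="utf-8")
    for start, end, caption in load_captions(Path(tmp)):
        print(start, end, end - start + 1, caption["Short_Caption"])
```

On real data, pass the scene's `text/` folder, for example `load_captions(Path("b04f88d1f85a") / "text")`.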

#### 2.5. Foreground Masks (`subject_masks/split_<idx>.json`)
Binary masks (white = subject, black = background) for every frame in a split. The main masked object depends on the scene type:

- `Human/Robotics` scenes: the active arm / robot.
- `Game` scenes: the playable character or vehicle.

**Minimal Reader**
```python
import json
from pathlib import Path
from pycocotools import mask as mask_utils
import numpy as np

def load_subject_masks(scene_dir: Path, split_idx: int):
    """
    Returns
    -------
    masks : list[np.ndarray]  (H, W) bool
    """
    seg_mask_list = []
    segmask_path = scene_dir / "subject_masks" / f"split_{split_idx}.json"
    with open(segmask_path, "r", encoding="utf-8") as f:
        seg_masks = json.load(f)
    # Decode each COCO run-length-encoded mask into a boolean (H, W) array.
    for key in seg_masks.keys():
        seg_mask = seg_masks[key]
        seg_mask = mask_utils.decode(seg_mask["mask_rle"]).astype(bool)
        seg_mask_list.append(seg_mask)

    return seg_mask_list

# ---------------------------- example ---------------------------------------
if __name__ == "__main__":
    masks = load_subject_masks(Path("b04f88d1f85a"), split_idx=0)
    print("Loaded", len(masks), "masks of shape", masks[0].shape)
```
We also release per-frame Dynamic Masks (`gdino_mask/<frame_idx>.png`). Each RGB image in the original video is labeled with dynamic objects (such as cars, people, and animals). White represents dynamic objects, and black represents static backgrounds. This can be used in conjunction with Foreground Masks as needed.


#### 2.6. Optical Flow (`flow/<frame_idx>/...`)

For every RGB frame `t` we provide dense forward optical flow that points to frame `t + 1`.

Directory layout (example for frame 0 of scene `b04f88d1f85a`)
```
b04f88d1f85a/
└─ flow/
   └─ 00000/
      ├─ flow_u_16.png   # horizontal component  (u, Δx)
      ├─ flow_v_16.png   # vertical component    (v, Δy)
      └─ flow_vis.png    # ready-made RGB visualisation (for inspection only)
```
**Minimal Reader**
```python
import numpy as np
import imageio.v2 as iio
from pathlib import Path

FLOW_MIN, FLOW_MAX = -300.0, 300.0           # change if you override the range

def flow_decompress(u, v, flow_min=FLOW_MIN, flow_max=FLOW_MAX):
    """
    Read uint16 image and convert back to optical flow data

    Args:
        u: np.array (np.uint16) - Optical flow horizontal component
        v: np.array (np.uint16) - Optical flow vertical component
        flow_min: float - Assumed minimum value of optical flow
        flow_max: float - Assumed maximum value of optical flow

    Returns:
        np.array (np.float32) - Optical flow data with shape (H,W,2)
    """
    u = u.astype(np.uint16)
    v = v.astype(np.uint16)

    u = u / 65535.0
    v = v / 65535.0

    u = u * (flow_max - flow_min) + flow_min
    v = v * (flow_max - flow_min) + flow_min

    res = np.stack((u, v), axis=-1)

    return res.astype(np.float32)

def load_flow(flowpath):
    """Load the (H, W, 2) float32 flow stored in one frame folder."""
    flowpath = Path(flowpath)

    # imageio keeps the 16-bit values of the PNGs; an 8-bit read would corrupt the flow.
    u = iio.imread(flowpath / "flow_u_16.png").astype(np.uint16)
    v = iio.imread(flowpath / "flow_v_16.png").astype(np.uint16)
    flow = flow_decompress(u, v)

    return flow

# ---------------------------- example ---------------------------------------
if __name__ == "__main__":
    flow = load_flow("b04f88d1f85a/flow/00000")
    print("Flow shape: ", flow.shape)
```
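
For intuition about the encoding, the same decoding can be applied to a single value in plain Python. This is a sketch with hypothetical helper names; it assumes that the decoded values are displacements in pixels, with `u` along x and `v` along y as in the directory layout above.

```python
FLOW_MIN, FLOW_MAX = -300.0, 300.0  # same range as the reader above


def decode_flow_value(raw: int) -> float:
    """Displacement for one raw 16-bit flow value."""
    return raw / 65535.0 * (FLOW_MAX - FLOW_MIN) + FLOW_MIN


def warp_pixel(x: float, y: float, raw_u: int, raw_v: int) -> tuple[float, float]:
    """Where pixel (x, y) of frame t lands in frame t + 1."""
    return x + decode_flow_value(raw_u), y + decode_flow_value(raw_v)


step = (FLOW_MAX - FLOW_MIN) / 65535.0
print("quantisation step:", step)
print("range:", decode_flow_value(0), decode_flow_value(65535))
print("mid-range value:", decode_flow_value(32768))
```

With this range, the 16-bit encoding resolves displacements to better than a hundredth of a pixel, and exact zero motion falls between two raw codes.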


# OmniWorld-Game Benchmark Detailed Guide

The OmniWorld-Game Benchmark is a curated subset of test splits, specifically selected from the OmniWorld-Game dataset to serve as a challenging evaluation platform, as detailed in our [paper](https://arxiv.org/abs/2509.12201).


| Task | Sequence Length | Duration | Key Modalities |
| :-- | :-- | --: | --: |
| Geometric Prediction | 384 frames | 16 seconds| RGB, Depth, Camera Poses |
| Video Generation | 81 frames | 3.4 seconds| RGB, Depth, Camera Poses, Text |

Each sequence in the benchmark is challenging, featuring rich dynamics that accurately reflect real-world complexity. They are accompanied by high-fidelity ground-truth annotations for camera poses and depth.

## Data Access and Organization

The benchmark annotation data is packaged into `.tar.gz` files located under the `OmniWorld/benchmark` directory. Each archive is named in the format `<UID>_<split_index>.tar.gz`. 

## Extracted Directory Structure
```
<UID>_<split_index>/
├─ depth/
│  ├─ 000000.npy       # (H, W) Depth map. Already processed and stored using the OmniWorld-Game Depth reading method.
│  ├─ 000001.npy
│  └─ ...
├─ image/              # High-resolution RGB frames (720×1280 pixels)
│  ├─ 000000.png
│  ├─ 000001.png
│  └─ ...
├─ camera_poses.npy    # (num_frames, 4, 4) Camera-to-World (C2W) transformation matrices.
├─ intrinsics.npy      # (num_frames, 3, 3) Intrinsic camera matrices in pixel space.
├─ text_caption.json   # The structured text caption associated with the sequence.
└─ video.mp4           # MP4 video file corresponding to the PNG frames in the 'image/' directory.
```

The depth maps are already processed and stored using the OmniWorld-Game Depth reading method.
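
Because `camera_poses.npy` stores camera-to-world matrices while the OmniWorld-Game reader returns world-to-camera matrices, a conversion is needed when mixing the two. The sketch below inverts rigid transforms with NumPy; `c2w_to_w2c` is a hypothetical helper, and it assumes that the upper-left 3×3 block is a pure rotation.

```python
import numpy as np


def c2w_to_w2c(c2w: np.ndarray) -> np.ndarray:
    """Invert (N, 4, 4) rigid camera-to-world matrices without a general inverse."""
    rot = c2w[:, :3, :3]
    trans = c2w[:, :3, 3]
    w2c = np.tile(np.eye(4, dtype=c2w.dtype), (c2w.shape[0], 1, 1))
    rot_t = np.transpose(rot, (0, 2, 1))                     # R^T
    w2c[:, :3, :3] = rot_t
    w2c[:, :3, 3] = -np.einsum("nij,nj->ni", rot_t, trans)   # -R^T t
    return w2c


# Synthetic pose: 90-degree rotation about z plus a translation.
pose = np.eye(4)
pose[:3, :3] = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
pose[:3, 3] = [1.0, 2.0, 3.0]
w2c = c2w_to_w2c(pose[None])
print(np.allclose(w2c[0] @ pose, np.eye(4)))
```

On benchmark data, the input would come from `np.load("<UID>_<split_index>/camera_poses.npy")`.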

# OmniWorld-CityWalk Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-CityWalk** dataset.

## OmniWorld-CityWalk Organisation and File Structure

The **OmniWorld-CityWalk** dataset is a collection of re-annotated data derived from a subset of the [Sekai-Real-Walking-HQ](https://github.com/Lixsp11/sekai-codebase) dataset. You need to [download the original videos](https://github.com/Lixsp11/sekai-codebase/tree/main/dataset_downloading) and [extract the video clips](https://github.com/Lixsp11/sekai-codebase/tree/main/clip_extracting) yourself.

> **Important Note:** In this repository, we **only provide the annotated data** (e.g., camera poses, dynamic masks), and **do not include the raw RGB image files** due to licensing and size constraints. Please refer to the original project for instructions on downloading and splitting the raw video data. Our annotations are designed to align with the original video frames. 

### Annotation Files

The camera annotation data is packaged in `.tar.gz` files located under `OmniWorld/annotations/OmniWorld-CityWalk/`.

* **Naming Convention**: `omniworld_citywalk_<start_scene_index>_<end_scene_index>.tar.gz`, where the indices correspond to the scene index range within the metadata file.

### Scene and Split Specifications

* **Video Length**: Each source video scene is 60 seconds long.
* **Frame Rate**: 30 FPS.
* **Total Frames**: 1800 frames per scene.
* **Split Strategy**: Each scene is divided into **6 splits of 300 frames each** for detailed annotation.

**Metadata Explained** (`omniworld_citywalk_metadata.csv`)
| Field Name | Description |
| :--- | :--- |
| `index` | The sequential index number of the scene. |
| `videoFile` | The video file name, formatted as `<scene_id>_<start_frame>_<end_frame>`. The corresponding source video on YouTube can be accessed via `https://www.youtube.com/watch?v=<scene_id>`. |
| `cameraFile` | The directory name for the camera annotation data, which is named after the video file. |
| `caption` | The dense text description/caption for the video segment. |
| `location` | The geographical location where the video was filmed. |
| `crowdDensity` | An assessment of the crowd/people density within the video. |
| `weather` | The general weather condition (e.g., sunny, overcast). |
| `timeOfDay` | The time of day when the video was recorded (e.g., morning, afternoon). |

## OmniWorld-CityWalk Usage Guide

### 1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding `.tar.gz` archive. After extracting one `omniworld_citywalk_<start_scene_index>_<end_scene_index>.tar.gz` file, the resulting folder structure for each individual scene within the archive is as follows:
```
xpPEhccDNak_0023550_0025350/  # Example Scene name (videoFile)
├─ gdino_mask/          # Per-frame dynamic-object masks (.png)
├─ recon/               # Camera and 3D reconstruction data per split
│  ├─ split_0/
│  │  ├─ extrinsics.npz # Per-frame camera extrinsics: (frame_num, 3, 4) in OpenCV world-to-camera format
│  │  ├─ intrinsics.npz # Per-frame camera intrinsics: (frame_num, 3, 3) in pixel units
│  │  └─ points3D_ba.ply # Sparse and accurate point cloud data after Bundle Adjustment (BA) for this split
│  ├─ split_1/
│  │  └─ ...
│  └─ ...
├─ image_list.json      # Defines the frame naming convention (e.g., 000000.png to 001799.png)
└─ split_info.json      # Records how frames are grouped into 300-frame splits
```

### 2. Modality Details

#### 2.1. Split Information (`split_info.json`)

Scene frames are segmented into 300-frame splits for annotation. The mapping and division information is stored in `split_info.json`.

#### 2.2. Camera Poses (`recon/split_<idx>/...`)

Camera poses are provided as NumPy compressed files (`.npz`) containing the extrinsics (world-to-camera rotation and translation) and intrinsics (focal length and principal point).

**Minimal Reader**

```python
import numpy as np

# Load Extrinsics (World-to-Camera Transform in OpenCV format)
extrinsics = np.load("recon/split_0/extrinsics.npz")['extrinsics']  # Shape: (frame_num, 3, 4)

# Load Intrinsics (in Pixel Units)
intrinsics = np.load("recon/split_0/intrinsics.npz")['intrinsics']  # Shape: (frame_num, 3, 3)

print("Extrinsics shape:", extrinsics.shape)
print("Intrinsics shape:", intrinsics.shape)
```

# OmniWorld-HOI4D Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-HOI4D** dataset.

## OmniWorld-HOI4D Organisation and File Structure

The **OmniWorld-HOI4D** dataset is a collection of re-annotated data derived from the [HOI4D](https://hoi4d.github.io/) dataset. **You need to download the original videos yourself**.

> **Important Note:** In this repository, we **only provide the annotated data** (e.g., camera poses, flow, depth, text), and **do not include the raw RGB image files** due to licensing and size constraints. Please refer to the original project for instructions on downloading the raw video data. Our annotations are designed to align with the original video frames.

### Annotation Files

The annotation data is packaged in `.tar.gz` files located under `OmniWorld/annotations/OmniWorld-HOI4D/`.

* **Naming Convention**: `omniworld_hoi4d_<start_scene_index>_<end_scene_index>.tar.gz`, where the indices correspond to the scene index range within the metadata file.

### Scene and Split Specifications

* **Total Frames**: 300 frames per scene.
* **Split Strategy**: Each scene forms **a single split of 300 frames** for detailed annotation.

**Metadata Explained** (`omniworld_hoi4d_metadata.csv`)
| Field Name | Description |
| :--- | :--- |
| `Index` | The sequential index number of the scene. |
| `Video Path` | The relative path of the scene in the original HOI4D dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: `ZY20210800001/H1/C1/N19/S100/s02/T1`|
| `Annotation Path` | The directory name for this scene's annotations inside the extracted `.tar.gz` archive. This is generated by replacing all `/` in the Video Path with `_`. Example: `ZY20210800001_H1_C1_N19_S100_s02_T1`|

## OmniWorld-HOI4D Usage Guide

### 1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding `.tar.gz` archive. After extracting one `omniworld_hoi4d_<start_scene_index>_<end_scene_index>.tar.gz` file, the resulting folder structure for each individual scene within the archive is as follows:
```
<Annotation Path>
# e.g., ZY20210800001_H1_C1_N19_S100_s02_T1
|
├── camera/
│   ├── recon/
│   │   └── split_0/
│   │       └── info.json        # Camera intrinsics and extrinsics for all 300 frames.
│   ├── image_list.json          # Ordered list of corresponding image filenames.
│   └── split_info.json          # Defines the frame segmentation (HOI4D is one 300-frame split).
|
├── flow/                        # Just like OmniWorld-Game.
│   ├── 00000/
│   │   ├── flow_u_16.png        # Optical flow (horizontal component). 
│   │   ├── flow_v_16.png        # Optical flow (vertical component).
│   │   └── flow_vis.png         # Visualization of the optical flow.
│   ├── 00001/
│   ... (up to frame 299)
|
├── prior_depth/
│   ├── 00000.png               # Monocular depth map for frame 0.
│   ├── 00001.png               # Monocular depth map for frame 1.
│   ... (up to frame 299)
|
└── text/                        # Just like OmniWorld-Game.
    ├── 0_80.txt                 # Text description for frames 0-80.
    ├── 120_200.txt              # Text description for frames 120-200.
    ...
```

### 2. Modality Details

#### 2.1. Split Information (`split_info.json`)

Scene frames are segmented into 300-frame splits for annotation. The mapping and division information is stored in `split_info.json`. Each HOI4D scene consists of a single 300-frame split.

#### 2.2. Camera Poses (`info.json`)

**Minimal Reader**

```python
import json
import torch

def load_camera_info(info_json_path: str):
    """
    Parses an info.json file to extract camera intrinsics and extrinsics.
    """
    with open(info_json_path, 'r') as f:
        info_data = json.load(f)

    # Extrinsics are provided as a list of 4x4 world-to-camera matrices (OpenCV convention)
    extrinsics = torch.tensor(info_data['extrinsics'])  # Shape: (num_frames, 4, 4)
    
    num_frames = extrinsics.shape[0]

    fx, fy, cx, cy = info_data['crop_intrinsic'].values()
    intrinsic = torch.eye(3)
    intrinsic[0, 0] = fx
    intrinsic[0, 2] = cx
    intrinsic[1, 1] = fy
    intrinsic[1, 2] = cy
    
    # Repeat the intrinsic matrix for each frame
    intrinsics = intrinsic.unsqueeze(0).repeat(num_frames, 1, 1)  # Shape: (num_frames, 3, 3)
    
    return intrinsics, extrinsics

# Example usage:
annotation_path = "ZY20210800001_H1_C1_N19_S100_s02_T1"
info_path = f"{annotation_path}/camera/recon/split_0/info.json"
intrinsics, extrinsics = load_camera_info(info_path)

print("Intrinsics shape:", intrinsics.shape)
print("Extrinsics shape:", extrinsics.shape)
```

# OmniWorld-DROID Detailed Guide

This section provides detailed organization, metadata, and usage instructions specific to the **OmniWorld-DROID** dataset.

## OmniWorld-DROID Organisation and File Structure

The **OmniWorld-DROID** dataset is a collection of re-annotated data derived from the [DROID](https://droid-dataset.github.io/) dataset. **You need to download the original videos yourself**.

> **Important Note:** In this repository, we **only provide the annotated data** (e.g., flow, depth, text, mask), and **do not include the raw RGB image files** due to licensing and size constraints. Please refer to the original project for instructions on downloading the raw video data. Our annotations are designed to align with the original video frames.

### Annotation Files

The annotation data is packaged in `.tar.gz` files located under `OmniWorld/annotations/OmniWorld-DROID/`.

* **Naming Convention**: `omniworld_droid_<start_scene_index>_<end_scene_index>.tar.gz`, where the indices correspond to the scene index range within the metadata file.

**Metadata Explained** (`omniworld_droid_metadata.csv`)
| Field Name | Description |
| :--- | :--- |
| `Index` | The sequential index number of the scene. |
| `Video Path` | The relative path of the scene in the original DROID dataset. Use this path to locate the corresponding source RGB video that you have downloaded. Example: `droid_raw/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/`|
| `Annotation Path` | The directory name for this scene's annotations inside the extracted `.tar.gz` archive. Example: `droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/`|
| `Img Num` | The total number of image frames from one camera perspective in the scene.|

## OmniWorld-DROID Usage Guide

### 1. Quick-Start: Extracting One Scene

To access the annotations for a scene, you first need to extract the corresponding `.tar.gz` archive. After extracting one `omniworld_droid_<start_scene_index>_<end_scene_index>.tar.gz` file, the resulting folder structure for each individual scene within the archive is as follows:
```
<Annotation Path>/
# e.g., droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023/
|
├── flow/                        # Just like OmniWorld-Game
│   └── <camera_serial_id>/      # e.g., 18026681, 22008760, etc.
│       ├── 0/
│       │   ├── flow_u_16.png    # Optical flow (horizontal component) for frame 0
│       │   ├── flow_v_16.png    # Optical flow (vertical component) for frame 0
│       │   └── flow_vis.png     # Visualization of the optical flow for frame 0
│       ├── 1/
│       ... (up to Img Num - 1)
|
├── foundation_stereo/
│   └── <camera_serial_id>/
│       ├── 0.png                # Depth map for frame 0
│       ├── 1.png                # Depth map for frame 1
│       ... (up to Img Num - 1)
|
├── robot_masks/                 # Just like OmniWorld
│   └── <camera_serial_id>/
│       ├── mask_prompt.json
│       └── tracked_masks_coco.json
|
├── text/
│   └── <camera_name>/           # e.g., ext1_cam_serial, wrist_cam_serial
│       ├── 0-161.txt            # Short caption for frames 0-161
│       └── 40-201.txt           # Short caption for frames 40-201
|
├── <camera_name>_totalcaption.txt   # Long-form, summary caption for the entire scene from one camera's perspective
├── meta_info.json                   # General metadata for the scene
...
```
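
The layout above can be navigated with a few small path helpers. The following is a minimal sketch, not part of the official tooling: the function names are hypothetical, and it assumes that caption file names such as `0-161.txt` denote inclusive frame ranges. It only builds paths and parses file names; decoding the flow and depth PNGs is outside its scope.

```python
import re
from pathlib import Path

# Caption files are named <first_frame>-<last_frame>.txt, e.g. 0-161.txt.
CAPTION_RE = re.compile(r"^(\d+)-(\d+)\.txt$")


def flow_paths(scene_dir, camera_serial, frame_idx):
    """Build the three optical-flow file paths for one frame of one camera."""
    frame_dir = Path(scene_dir) / "flow" / str(camera_serial) / str(frame_idx)
    return {
        "u": frame_dir / "flow_u_16.png",   # horizontal component
        "v": frame_dir / "flow_v_16.png",   # vertical component
        "vis": frame_dir / "flow_vis.png",  # visualization
    }


def depth_path(scene_dir, camera_serial, frame_idx):
    """Build the depth-map path for one frame of one camera."""
    return Path(scene_dir) / "foundation_stereo" / str(camera_serial) / f"{frame_idx}.png"


def parse_caption_range(file_name):
    """Return (first_frame, last_frame) parsed from a caption file name."""
    match = CAPTION_RE.match(file_name)
    if match is None:
        raise ValueError(f"unexpected caption file name: {file_name}")
    return int(match.group(1)), int(match.group(2))


def captions_covering(frame_idx, file_names):
    """Return the caption file names whose frame range contains frame_idx.

    ASSUMPTION: ranges are inclusive on both ends. Names that do not match
    the <first>-<last>.txt pattern are ignored. Results are sorted by range.
    """
    hits = []
    for name in file_names:
        if CAPTION_RE.match(name) is None:
            continue
        first, last = parse_caption_range(name)
        if first <= frame_idx <= last:
            hits.append((first, last, name))
    return [name for _, _, name in sorted(hits)]


# Example with the camera serial and caption files shown in the tree above.
scene = "droid_processed/1.0.1/TRI/success/2023-10-17/Tue_Oct_17_17:20:55_2023"
print(flow_paths(scene, 18026681, 0)["u"])
print(depth_path(scene, 18026681, 0))
print(captions_covering(100, ["0-161.txt", "40-201.txt"]))
```

In practice the `file_names` argument would come from listing `text/<camera_name>/` inside the extracted scene directory, for example with `os.listdir`.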

# License
The OmniWorld dataset is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)**. By accessing or using this dataset, you agree to be bound by the terms and conditions outlined in this license, as well as the specific provisions detailed below.

- **Special Note on Third-Party Content**:
A portion of this dataset is derived from third-party game content. All intellectual property rights pertaining to these original game assets (including, but not limited to, RGB and depth images) remain with their respective original game developers and publishers.

- **Permitted Uses**:
You are hereby granted permission, free of charge, to use, reproduce, and share the OmniWorld dataset and any adaptations thereof, solely for non-commercial research and educational purposes. This includes, but is not limited to, academic publications, algorithm benchmarking, and reproduction of scientific results.

Under this license, you are expressly **forbidden** from:

- Using the dataset, in whole or in part, for any commercial purpose, including but not limited to its incorporation into commercial products, services, or monetized applications.

- Redistributing the original third-party game assets contained within the dataset outside the scope of legitimate research sharing.

- Removing or altering any copyright, license, or attribution notices.

The authors of the OmniWorld dataset provide this dataset "as is" and make no representations or warranties regarding the legality of the underlying data for any specific purpose. Users are solely responsible for ensuring that their use of the dataset complies with all applicable laws and the terms of service or license agreements of the original game publishers (sources of third-party content).

For the full legal text of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, please visit: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

# Citation
If you find this dataset useful, please cite our paper:
```bibtex
@misc{zhou2025omniworld,
      title={OmniWorld: A Multi-Domain and Multi-Modal Dataset for 4D World Modeling}, 
      author={Yang Zhou and Yifan Wang and Jianjun Zhou and Wenzheng Chang and Haoyu Guo and Zizun Li and Kaijing Ma and Xinyue Li and Yating Wang and Haoyi Zhu and Mingyu Liu and Dingning Liu and Jiange Yang and Zhoujie Fu and Junyi Chen and Chunhua Shen and Jiangmiao Pang and Kaipeng Zhang and Tong He},
      year={2025},
      eprint={2509.12201},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.12201}, 
}
```