HoloAssist: An Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World (ICCV 2023)

The codebase provides guidelines for using the HoloAssist dataset and running the benchmarks.

[Project Website][paper][data]

Download the data and annotations

We release the dataset under the [CDLAv2] license, a permissive license. You can download the data and annotations via the links in the text files below, either directly through your web browser or with [AzCopy], which is faster.

Data links:

Annotation links:

Install AzCopy and download the data via AzCopy on Linux.

Please refer to the official AzCopy manual for instructions on other operating systems.

- wget -O azcopy.tar.gz https://aka.ms/downloadazcopy-v10-linux
- tar -xvf azcopy.tar.gz
- sudo mv azcopy_linux_amd64_*/azcopy /usr/bin
- azcopy --version

Downloading the data

- azcopy copy "<data_url>" "<local_directory>" --recursive
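
If you collect the URLs from the link files above into a single local text file (one URL per line; the file name data_links.txt below is hypothetical), a short script like the following can loop over them and call AzCopy for you. This is a minimal sketch, not part of the official tooling; adjust the paths to your setup.

    # Minimal sketch: batch-download every URL listed in a local text file with AzCopy.
    # Assumes `azcopy` is on PATH and that `data_links.txt` (a hypothetical name) holds one URL per line.
    import subprocess
    from pathlib import Path

    LINKS_FILE = "data_links.txt"          # hypothetical: paste the URLs from the link files here
    OUT_DIR = Path("holoassist_raw")       # hypothetical local target directory
    OUT_DIR.mkdir(parents=True, exist_ok=True)

    with open(LINKS_FILE) as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        # Equivalent to: azcopy copy "<data_url>" "<local_directory>" --recursive
        subprocess.run(["azcopy", "copy", url, str(OUT_DIR), "--recursive"], check=True)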

Dataset Structure

Once the dataset is downloaded and decompressed, you will see the following structure. Each folder contains the data for one recording session, and within each folder you will find the data for the different modalities. The text files with "_sync"/"_synced" in their names are synchronized to the RGB modality, since each modality has a different sensor rate; we use the synchronized modalities in the experiments.

We collected our dataset using PSI Studio. More detailed information regarding the data format can be found here.

  .
  ├── R007-7July-DSLR
  │   └── Export_py
  │       ├── AhatDepth
  │       │   ├── 000000.png
  │       │   ├── 000001.png
  │       │   ├── ...
  │       │   ├── AhatDepth_synced.txt
  │       │   ├── Instrinsics.txt
  │       │   ├── Pose_sync.txt
  │       │   └── Timing_sync.txt
  │       ├── Eyes
  │       │   └── Eyes_sync.txt 
  │       ├── Hands
  │       │   ├── Left_sync.txt
  │       │   └── Right_sync.txt 
  │       ├── Head
  │       │   └── Head_sync.txt 
  │       ├── IMU
  │       │   ├── Accelerometer_sync.txt
  │       │   ├── Gyroscope_sync.txt
  │       │   └── Magnetometer_sync.txt
  │       ├── Video
  │       │   ├── Pose_sync.txt
  │       │   ├── Instrinsincs.txt
  │       │   └── VideoMp4Timing.txt
  │       ├── Video_pitchshift.mp4
  │       └── Video_compress.mp4
  ├── R012-7July-Nespresso/
  ├── R013-7July-Nespresso/
  ├── R014-7July-DSLR/
  └── ...
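
As a quick sanity check after decompression, a script along these lines can enumerate the recording sessions and a few modality files inside each one. This is a minimal sketch assuming the layout shown above; the exact column layout of each synced text file is documented in the data-format page linked above, so the sketch only counts lines.

    # Minimal sketch: enumerate recording sessions and check a few modality files per session.
    # Assumes the local dataset root contains the per-session folders shown in the tree above.
    from pathlib import Path

    DATASET_ROOT = Path("holoassist_raw")  # hypothetical local path

    for session in sorted(DATASET_ROOT.iterdir()):
        export = session / "Export_py"
        if not export.is_dir():
            continue
        video = export / "Video_compress.mp4"
        hands = export / "Hands" / "Left_sync.txt"
        print(f"{session.name}: video present = {video.exists()}")
        if hands.exists():
            # Assuming one line per RGB-synchronized sample; see the data-format docs for the columns.
            with open(hands) as f:
                n_samples = sum(1 for _ in f)
            print(f"  left-hand samples synced to RGB: {n_samples}")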

Annotation Structure

We have released the annotations in both the raw format and the processed format. We also provide the train, validation, and test splits.

In the raw annotations, each entry follows this format:

{
    "id": int, original label id,
    "label": "Narration", "Conversation", "Fine grained action",  or "Coarse grained action", 
    "start": start time in seconds, 
    "end": end time in seconds, 
    "type":"range",
    "attributes":{
        Differs depending on the label type. See below.
    },
},

Attributes for Narration

    "id": int, original label id,
    "label": "Narration",  
    "start": start time in seconds, 
    "end": end time in seconds, 
    "type":"range",
    "attributes": {
        "Long-form description": Use multiple sentences and make this as long as is necessary to be exhaustive. There are a finite number of scenarios across all videos, so make sure to call out the distinctive changes between videos, in particular, mistakes that the task performer makes in the learning process that are either self-corrected or corrected by the instructor.
    }, 

Attributes for Conversation

    "id": int, original label id,
    "label": "Narration",  
    "start": start time in seconds, 
    "end": end time in seconds, 
    "type":"range",
    "attributes": {
        "Conversation Purpose":"instructor-start-conversation_other",
        "Transcription":"*unintelligible*",
        "Transcription Confidence":"low-confidence-transcription",
    }, 

Attributes for Fine grained action

    "id": int, original label id,
    "label": "Fine grained action",  
    "start": start time in seconds, 
    "end": end time in seconds, 
    "type":"range",
    "attributes": {
        "Action Correctness":"Correct Action",
        "Incorrect Action Explanation":"none",
        "Incorrect Action Corrected by":"none",
        "Verb":"approach",
        "Adjective":"none",
        "Noun":"gopro",
        "adverbial":"none"
    }, 

Attributes for Coarse grained action

    "id": int, original label id,
    "label": "Coarse grained action",  
    "start": start time in seconds, 
    "end": end time in seconds, 
    "type":"range",
    "attributes": {
        "Action sentence":"The student changes the battery for the GoPro.",
        "Verb":"exchange",
        "Adjective":"none",
        "Noun":"battery"
    }, 
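
As an example of how the raw annotations can be consumed, the sketch below loads one raw annotation file and groups its events by label type. It assumes the raw annotation for a session is a JSON list of event dictionaries shaped as above; the filename is hypothetical, so substitute the paths from the annotation links.

    # Minimal sketch: load one raw annotation file and group events by label type.
    # Assumes the raw annotation is a JSON list of event dicts shaped as shown above;
    # the filename below is hypothetical -- use the paths from the annotation links.
    import json
    from collections import defaultdict

    with open("R007-7July-DSLR.json") as f:
        events = json.load(f)

    by_label = defaultdict(list)
    for ev in events:
        by_label[ev["label"]].append(ev)

    for label, evs in by_label.items():
        print(f"{label}: {len(evs)} events")

    # Example: print the first few fine-grained actions in temporal order.
    fine = sorted(by_label["Fine grained action"], key=lambda ev: ev["start"])
    for ev in fine[:5]:
        attrs = ev["attributes"]
        print(ev["start"], ev["end"], attrs["Verb"], attrs["Noun"])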

To convert the raw annotations into the format we used in the benchmark experiments, you can either run the label processing script or use our processed labels from the links above.
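
Purely as an illustration (the label processing script is the reference for the actual processed format), flattening the fine-grained actions of one session into a simple table might look like this:

    # Illustration only: flatten fine-grained actions into a simple CSV table.
    # This is NOT the official processed format produced by the label processing script;
    # it just shows the kind of fields such a conversion needs to carry over.
    import csv
    import json

    with open("R007-7July-DSLR.json") as f:   # hypothetical raw-annotation path
        events = json.load(f)                 # assumed to be a list of event dicts

    rows = []
    for ev in events:
        if ev["label"] != "Fine grained action":
            continue
        a = ev["attributes"]
        rows.append([ev["start"], ev["end"], a["Verb"], a["Noun"], a["Action Correctness"]])

    with open("fine_grained_actions.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["start_s", "end_s", "verb", "noun", "correctness"])
        writer.writerows(rows)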

Citation

If you find the code or data useful, please consider citing the paper:

@inproceedings{wang2023holoassist,
  title={HoloAssist: An Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World},
  author={Wang, Xin and Kwon, Taein and Rad, Mahdi and Pan, Bowen and Chakraborty, Ishani and Andrist, Sean and Bohus, Dan and Feniello, Ashley and Tekin, Bugra and Frujeri, Felipe Vieira and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={20270--20281},
  year={2023}
}