This codebase provides guidelines for using the HoloAssist dataset and running the benchmarks.
We release the dataset under the [CDLAv2] license, a permissive license.
Once the dataset is downloaded and decompressed, you will see the dataset structure shown below. Each folder contains the data for one recording session, and within each folder you will find the data for the different modalities. The text files with "_synced" in their names are synchronized to the RGB modality, since each modality has a different sensor rate; we use the synced modalities in the experiments.
We collected our dataset using PSI Studio. More detailed information regarding the data format can be found here.
```
.
├── R007-7July-DSLR
│   └── Export_py
│       ├── AhatDepth
│       │   ├── 000000.png
│       │   ├── 000001.png
│       │   ├── ...
│       │   ├── AhatDepth_synced.txt
│       │   ├── Instrinsics.txt
│       │   ├── Pose_sync.txt
│       │   └── Timing_sync.txt
│       ├── Eyes
│       │   └── Eyes_sync.txt
│       ├── Hands
│       │   ├── Left_sync.txt
│       │   └── Right_sync.txt
│       ├── Head
│       │   └── Head_sync.txt
│       ├── IMU
│       │   ├── Accelerometer_sync.txt
│       │   ├── Gyroscope_sync.txt
│       │   └── Magnetometer_sync.txt
│       ├── Video
│       │   ├── Pose_sync.txt
│       │   ├── Instrinsincs.txt
│       │   └── VideoMp4Timing.txt
│       ├── Video_pitchshift.mp4
│       └── Video_compress.mp4
├── R012-7July-Nespresso/
├── R013-7July-Nespresso/
├── R014-7July-DSLR/
└── ...
```
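As a starting point, the synced text files can be read alongside the corresponding frames. The sketch below is not the official loader: it assumes the `*_sync.txt` / `*_synced.txt` files are whitespace-separated numeric rows (one row per synced RGB frame); check the PSI export documentation for the authoritative column layout.

```python
# Minimal sketch (assumption: sync files are whitespace-separated numeric rows,
# one row per synced RGB frame). Paths follow the layout shown above.
from pathlib import Path
import numpy as np

session = Path("R007-7July-DSLR/Export_py")

# Depth frames and their sync table.
depth_frames = sorted((session / "AhatDepth").glob("*.png"))
depth_sync = np.loadtxt(session / "AhatDepth" / "AhatDepth_synced.txt")

# Head pose rows, synced to the same RGB timeline.
head_sync = np.loadtxt(session / "Head" / "Head_sync.txt")

print(f"{len(depth_frames)} depth frames, {depth_sync.shape[0]} depth sync rows, "
      f"{head_sync.shape[0]} head sync rows")
```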
We release the annotations in both raw and processed formats. We also provide the train, validation, and test splits.
In the raw annotations, each entry follows this format:
{ "id": int, original label id, "label": "Narration", "Conversation", "Fine grained action", or "Coarse grained action", "start": start time in seconds, "end": end time in seconds, "type":"range", "attributes":{ Different from different label task. See below. }, },
Attributes for Narration
"id": int, original label id, "label": "Narration", "start": start time in seconds, "end": end time in seconds, "type":"range", "attributes": { "Long-form description": Use multiple sentences and make this as long as is necessary to be exhaustive. There are a finite number of scenarios across all videos, so make sure to call out the distinctive changes between videos, in particular, mistakes that the task performer makes in the learning process that are either self-corrected or corrected by the instructor. },
Attributes for Conversation
"id": int, original label id, "label": "Narration", "start": start time in seconds, "end": end time in seconds, "type":"range", "attributes": { "Conversation Purpose":"instructor-start-conversation_other", "Transcription":"*unintelligible*", "Transcription Confidence":"low-confidence-transcription", },
Conversation Purpose: Select an option that best describes the purpose of the speech. This is limited to the individual speaking and does not include any pause time waiting for a response.
Transcription: The speech transcribed into text.
Transcription Confidence: The human annotator's confidence in transcribing the speech to text.
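For example, the transcribed dialogue of a session can be collected while skipping unintelligible or low-confidence segments. This is a minimal sketch: the annotation file name is hypothetical, and the attribute keys and label strings follow the example above.

```python
# Sketch: gather usable utterances from one session's conversation annotations.
import json

with open("R007-7July-DSLR.json") as f:  # hypothetical file name
    events = json.load(f)

utterances = [
    (e["start"], e["end"],
     e["attributes"]["Conversation Purpose"],
     e["attributes"]["Transcription"])
    for e in events
    if e["label"] == "Conversation"
    and e["attributes"]["Transcription"] != "*unintelligible*"
    and e["attributes"]["Transcription Confidence"] != "low-confidence-transcription"
]
```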
Attributes for Fine grained action
"id": int, original label id, "label": "Fine grained action", "start": start time in seconds, "end": end time in seconds, "type":"range", "attributes": { "Action Correctness":"Correct Action", "Incorrect Action Explanation":"none", "Incorrect Action Corrected by":"none", "Verb":"approach", "Adjective":"none", "Noun":"gopro", "adverbial":"none" },
Action Correctness: Indicates whether the action is correct or a mistake with respect to completing the task. The options are
Incorrect Action Explanation: Provided by the human annotators to explain why they believe the action is wrong.
Incorrect Action Corrected by: Indicates whether the wrong action is later corrected by the instructor or by the task performer.
Verb, Adjective, Noun, adverbial: The verb, (optional) adjective, noun, and (optional) adverbial describing the fine-grained action.
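The fine-grained annotations can be turned into labelled temporal segments, e.g. for action recognition or mistake detection. Again, this is a sketch under the same assumptions as the snippets above (hypothetical file name; keys and label strings as in the example).

```python
# Sketch: build (start, end, verb, noun) segments and flag annotated mistakes.
import json

with open("R007-7July-DSLR.json") as f:  # hypothetical file name
    events = json.load(f)

segments, mistakes = [], []
for e in events:
    if e["label"] != "Fine grained action":
        continue
    a = e["attributes"]
    segments.append((e["start"], e["end"], a["Verb"], a["Noun"]))
    if a["Action Correctness"] != "Correct Action":
        mistakes.append((e["start"], e["end"], a["Incorrect Action Explanation"]))
```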
Attributes for Coarse grained action
"id": int, original label id, "label": "Coarse grained action", "start": start time in seconds, "end": end time in seconds, "type":"range", "attributes": { "Action sentence":"The student changes the battery for the GoPro.", "Verb":"exchange", "Adjective":"none", "Noun":"battery" },
Citation
If you find the code or data useful, please consider citing the paper:
```
@inproceedings{wang2023holoassist,
  title={Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world},
  author={Wang, Xin and Kwon, Taein and Rad, Mahdi and Pan, Bowen and Chakraborty, Ishani and Andrist, Sean and Bohus, Dan and Feniello, Ashley and Tekin, Bugra and Frujeri, Felipe Vieira and others},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={20270--20281},
  year={2023}
}
```