Our repository of egocentric activity datasets!
This page captures our effort on GTEA dataset series.
Our latest and largest version is EGTEA Gaze+ dataset.
We are working on further developing EGTEA Gaze+. Stay tuned!
Alireza Fathi, Xiaofeng Ren, James M. Rehg,
Learning to Recognize Objects in Egocentric Activities, CVPR, 2011
Yin Li, Zhefan Ye, James M. Rehg.
Delving into Egocentric Actions, CVPR, 2015
To record the sequences, we set a table with various kinds of food, dishes, and snacks. We asked each subject to wear the Tobii glasses and calibrated their gaze. We then asked the subject to take a seat and make whatever food they felt like having. The beginning and ending times of the actions are annotated. Each action consists of a verb and a set of nouns, for example, "pouring milk into cup". In our experiments we extract images from the video at 15 frames per second, and action annotations are based on frame numbers. Sequences 1, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 21, and 22 are used for training; sequences 2, 3, 5, and 20 are used for testing.
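Since annotations are given as frame numbers at a 15 fps extraction rate, they can be mapped to timestamps with a simple division. The sketch below illustrates this together with the published train/test split; the function names and annotation handling are our own illustration, not an official tool.

```python
# Sketch for working with GTEA frame-number annotations (illustrative only).
# Frames are extracted at 15 fps, so a frame number maps to t = frame / 15 seconds.

FPS = 15  # extraction rate stated for GTEA

# Train/test split as listed in the dataset description.
TRAIN_SEQS = {1, 6, 7, 8, 10, 12, 13, 14, 16, 17, 18, 21, 22}
TEST_SEQS = {2, 3, 5, 20}

def frame_to_seconds(frame: int, fps: int = FPS) -> float:
    """Convert an annotated frame number to a timestamp in seconds."""
    return frame / fps

def split_of(seq_id: int) -> str:
    """Return which split a sequence belongs to."""
    if seq_id in TRAIN_SEQS:
        return "train"
    if seq_id in TEST_SEQS:
        return "test"
    return "unused"
```

For example, an action starting at frame 30 begins two seconds into the clip, and sequence 20 falls in the test split.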
Alireza Fathi, Yin Li, James M. Rehg,
Learning to Recognize Daily Actions using Gaze, ECCV, 2012
We collected this dataset at Georgia Tech's AwareHome. This dataset consists of seven meal-preparation activities, performed by 26 subjects. Subjects perform the activities based on the given cooking recipes (get the recipes here).
The activities are: American Breakfast, Pizza, Snack, Greek Salad, Pasta Salad, Turkey Sandwich, and Cheese Burger. The SMI glasses record an HD video of the subject's activities at 24 frames per second, and also record the subject's gaze at 30 fps.
For each activity, we used ELAN to annotate its actions. An activity is a meal-preparation task such as making pizza, and an action is a short temporal segment such as putting sauce on the pizza crust, dicing the green peppers, washing the mushrooms, etc.
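Because the video runs at 24 fps while gaze is sampled at 30 Hz, the two streams need to be aligned by timestamp before gaze can be attached to individual frames. A minimal sketch, assuming both streams start at time zero and using nearest-sample matching (our assumption, not part of the released annotations):

```python
# Sketch: aligning 30 Hz gaze samples with 24 fps video frames by timestamp.
# The rates come from the dataset description; the nearest-sample alignment
# scheme is an assumption for illustration.

VIDEO_FPS = 24.0  # HD video rate
GAZE_HZ = 30.0    # gaze sampling rate

def gaze_index_for_frame(frame_idx: int) -> int:
    """Index of the gaze sample closest in time to a given video frame."""
    t = frame_idx / VIDEO_FPS       # frame timestamp in seconds
    return round(t * GAZE_HZ)       # nearest gaze sample at 30 Hz
```

For instance, video frame 24 (t = 1 s) pairs with gaze sample 30, and frame 12 (t = 0.5 s) with gaze sample 15.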
Specifically, EGTEA Gaze+ contains 28 hours of (de-identified) cooking activities from 86 unique sessions of 32 subjects. These videos come with audio and gaze tracking (30 Hz). We have further provided human annotations of actions (human-object interactions) and hand masks.
The action annotations include 10,325 instances of fine-grained actions, such as "Cut bell pepper" or "Pour condiment (from) condiment container into salad".
The hand annotations consist of 15,176 hand masks from 13,847 frames from the videos.
Yin Li, Miao Liu, James M. Rehg,
In the Eye of Beholder: Joint Learning of Gaze and Actions in First Person Video, ECCV, 2018