Learning to recognize egocentric activities using RGB-D data

Date

2015-12

Abstract

Two recent trends are changing the landscape of vision-based activity recognition. On one hand, wearable cameras have become widely used for recording daily-life activities; with a growing number of egocentric videos being generated, there is an increasing need for computer vision algorithms tailored to the egocentric paradigm. On the other hand, advances in sensing technology, especially the introduction of Kinect-style depth sensors, have greatly facilitated the measurement of distance information in the 3D world. The aim of my work is to develop algorithms for egocentric activity recognition using RGB-D data. Compared to conventional approaches to third-person activity recognition, which commonly represent activities with local space-time features, my approach to egocentric activity recognition is novel in three aspects.

First, my approach is context-aware and automatically discovers the scene attributes that characterize the context. Egocentric activities tend to co-occur with certain types of scene context, e.g., cooking in the kitchen or driving in the car. To model the scene context, I propose a novel latent topic model, Supervised Block Latent Dirichlet Allocation (sBlock-LDA), that discovers the semantic attributes of the scene context. The standard LDA model can be viewed as a special case of sBlock-LDA in which the correlation between latent topics is set to zero. To ensure that a scene is only a sparse mixture of latent topics, a Gini-impurity-based regularizer is used to limit the freedom of visual words to take on different latent topics. I further show that the proposed model can be easily extended to account for the global spatial layout of the latent topics by treating topic positions as hidden variables.

Second, my approach is object-centric and robust to variations in object appearance. Since egocentric activities heavily involve manipulating objects, object features are another important source of information for recognizing egocentric activities. To effectively exploit the varied appearance of an object across a video, I take a set-based recognition approach and represent the target object by the set of frames contained in the video. I propose a novel kernel function, the Sparse Affine Hull (SAH) kernel, which measures the similarity of two sets by the minimum distance between their sparse affine hulls. The proposed kernel also allows convenient integration of heterogeneous data modalities beyond RGB and depth.

Third, my approach is state-specific and automatically learns the importance of each state. An egocentric activity by its nature involves a series of maneuvers that result in changes to the object. Effectively encoding the state transitions of an egocentric activity, in terms of hand maneuvers and object changes, is key to successful activity recognition. While existing algorithms commonly train action classifiers on manually defined states, I present a novel model that automatically mines discriminative states for recognizing egocentric actions. To mine discriminative states, I draw on the Sparse Affine Hull kernel and formulate a Multiple Kernel Learning (MKL) based framework that learns adaptive weights for the different states.
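To make the sparsity mechanism in the first aspect concrete, the following is a minimal sketch of a Gini-impurity penalty, assuming it acts on the distribution of each visual word over latent topics; the notation and the exact placement of the penalty in the sBlock-LDA objective are my illustration, not necessarily the thesis's formulation.

```latex
% Gini impurity of the topic distribution of visual word v:
%   \phi_v = (\phi_{v1}, \dots, \phi_{vK}), \qquad \sum_k \phi_{vk} = 1
G(\phi_v) = 1 - \sum_{k=1}^{K} \phi_{vk}^2
% G(\phi_v) = 0 when the word commits to a single topic, and is maximal
% (1 - 1/K) for a uniform distribution, so adding it as a penalty pushes
% each word, and hence each scene, toward a sparse mixture of topics:
\mathcal{L} = \log p(\mathbf{w}, \mathbf{y} \mid \Theta)
            \;-\; \lambda \sum_{v=1}^{V} G(\phi_v), \qquad \lambda > 0
```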
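In code, the set-to-set distance behind the SAH kernel can be sketched as a small convex program: each set's affine hull is the span of affine combinations of its frames, and sparsity is imposed on the combination coefficients. The l1 penalty, the weight lam, and the RBF mapping from distance to similarity below are plausible choices on my part, not the thesis's exact formulation.

```python
import numpy as np
import cvxpy as cp

def sparse_affine_hull_distance(X, Y, lam=0.1):
    """Minimum distance between the sparse affine hulls of two frame sets.

    X: (d, m) array whose columns are d-dimensional features of m frames.
    Y: (d, n) array for the second set.
    lam: weight of the l1 sparsity penalty (illustrative choice).
    """
    a = cp.Variable(X.shape[1])
    b = cp.Variable(Y.shape[1])
    objective = cp.Minimize(cp.sum_squares(X @ a - Y @ b)
                            + lam * (cp.norm1(a) + cp.norm1(b)))
    # affine-hull constraint: each set's coefficients sum to one
    cp.Problem(objective, [cp.sum(a) == 1, cp.sum(b) == 1]).solve()
    return float(np.linalg.norm(X @ a.value - Y @ b.value))

def sah_kernel(X, Y, gamma=1.0, lam=0.1):
    """One plausible kernel: an RBF on the hull-to-hull distance."""
    return np.exp(-gamma * sparse_affine_hull_distance(X, Y, lam) ** 2)
```

One way to read the claim about heterogeneous modalities is that feature columns from RGB, depth, or other sensors can be concatenated per frame, or kernelized per modality and combined, without changing the optimization above.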
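The state-weighting step in the third aspect can be read as a standard MKL combination. A sketch, under the assumption that each mined state s contributes a base kernel k_s (an SAH kernel on the frames assigned to that state) with a learned weight beta_s; the notation is mine:

```latex
% Combined kernel over two videos X and Y with S mined states:
K(X, Y) = \sum_{s=1}^{S} \beta_s \, k_s(X_s, Y_s),
\qquad \beta_s \ge 0, \quad \sum_{s=1}^{S} \beta_s = 1
% X_s and Y_s are the frame sets assigned to state s. The weights \beta
% are learned jointly with the classifier, so states that discriminate
% between actions receive larger weights.
```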
In addition to these three aspects, I propose a novel algorithm for segmenting long activities into short, atomic sub-activities. Hidden Markov Models (HMMs) have long been the state-of-the-art technique for modeling human activities, despite their unrealistic first-order Markov assumption and the limited representational capacity of their hidden states. I propose two enhancements that significantly improve the performance of HMM-based activity segmentation and recognition: (1) Deep Neural Networks (DNNs) are used to model the observations in each state, motivated by the recent success of deep architectures in learning complex statistical correlations from high-dimensional data; (2) state-duration variables are incorporated to explicitly model the temporal span of each state, which improves contextual compatibility and eliminates incoherent activity segments.

In summary, I have developed a series of algorithms aimed at the automatic interpretation of egocentric activity videos. I demonstrate that depth data benefits egocentric activity recognition in terms of both target localization and feature representation, and that the proposed algorithms are significantly more robust than traditional algorithms when applied to the egocentric domain. This work contributes significantly to research on egocentric activity analysis.
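As a concrete illustration of enhancement (2) above, the following is a minimal segmental Viterbi decoder for an explicit-duration HMM (a hidden semi-Markov model), assuming per-frame state log-likelihoods come from a DNN as in hybrid DNN-HMM systems. The function and variable names and the tabular duration model are my illustration, not the thesis's implementation.

```python
import numpy as np

def hsmm_viterbi(log_emis, log_trans, log_dur, max_dur):
    """Segmental Viterbi decoding with explicit state durations.

    log_emis:  (T, S) per-frame state log-likelihoods, e.g. DNN outputs
               with state priors divided out (hybrid DNN-HMM style).
    log_trans: (S, S) state transition log-probabilities.
    log_dur:   (S, max_dur) log-probability that a state lasts d+1 frames.
    Returns the best segmentation as (state, start_frame, end_frame) triples.
    """
    T, S = log_emis.shape
    # prefix sums make each segment's emission score a single subtraction
    cum = np.vstack([np.zeros((1, S)), np.cumsum(log_emis, axis=0)])
    delta = np.full((T, S), -np.inf)       # best score: last segment ends at t in s
    back = np.zeros((T, S, 2), dtype=int)  # (previous state, segment start frame)
    for t in range(T):
        for d in range(min(max_dur, t + 1)):       # segment length d + 1
            start = t - d
            seg = cum[t + 1] - cum[start]          # per-state emission score
            if start == 0:
                score = seg + log_dur[:, d]
                prev = np.full(S, -1)
            else:
                cand = delta[start - 1][:, None] + log_trans  # (prev, current)
                prev = cand.argmax(axis=0)
                score = cand.max(axis=0) + seg + log_dur[:, d]
            better = score > delta[t]
            delta[t, better] = score[better]
            back[t, better] = np.stack([prev, np.full(S, start)], axis=1)[better]
    # backtrace the best segmentation
    segments, t, s = [], T - 1, int(delta[T - 1].argmax())
    while t >= 0:
        prev, start = back[t, s]
        segments.append((s, int(start), t))
        t, s = int(start) - 1, int(prev)
    return segments[::-1]
```

Each decoded (state, start, end) triple is a candidate atomic sub-activity; bounding segment lengths by max_dur is what rules out the incoherent, overly long or fragmentary segments that a plain first-order HMM can produce.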
