Browsing by Author "Wan, Shaohua"
Item: Learning to recognize egocentric activities using RGB-D data (2015-12)
Wan, Shaohua; Aggarwal, J. K. (Jagdishkumar Keshoram), 1936-; Swartzlander, Earl E., Jr., 1945-; Grauman, Kristen; Geisler, Wilson; de Veciana, Gustavo; Dhillon, Inderjit

There are two recent trends that are changing the landscape of vision-based activity recognition. On the one hand, wearable cameras have become widely used for recording daily-life activities; with a growing number of egocentric videos being generated, there is an increasing need for computer vision algorithms tailored to the egocentric paradigm. On the other hand, advances in sensing technology, especially the introduction of Kinect-style depth sensors, have greatly facilitated the measurement of distance information in the 3D world. The aim of my work is to develop algorithms for egocentric activity recognition using RGB-D data. Compared to conventional approaches to third-person activity recognition, which commonly use local space-time features to represent activities, my approach to egocentric activity recognition is novel in three respects. First, my approach is context-aware and automatically discovers the scene attributes that characterize the context. Egocentric activities tend to co-occur with certain types of scene context, e.g., cooking in the kitchen or driving in the car. To model the scene context, I propose a novel latent topic model, Supervised Block Latent Dirichlet Allocation (sBlock-LDA), that discovers the semantic attributes of the scene context. The standard LDA model can be viewed as the special case of sBlock-LDA in which the correlation between different latent topics is set to zero. To ensure that a scene is only a sparse mixture of latent topics, a Gini-impurity-based regularizer is used to restrict the freedom of visual words to take on different latent topics.
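As an illustrative sketch of the kind of Gini-impurity penalty the abstract describes (not the actual sBlock-LDA objective, which is not reproduced here), the following computes how evenly each visual word spreads its mass over the latent topics; a training objective could add this as a regularizer to favor sparse topic assignments:

```python
import numpy as np

def gini_impurity_penalty(theta):
    """Penalty encouraging each visual word to concentrate on few topics.

    theta: (V, K) array; row v is word v's distribution over K latent topics.
    The Gini impurity of a row p is 1 - sum(p**2): zero when all mass sits on
    a single topic, largest when uniform.  Summing over words gives a scalar
    penalty.  This is an illustrative stand-in for the regularizer described
    in the abstract, not the dissertation's actual formulation.
    """
    theta = theta / theta.sum(axis=1, keepdims=True)  # normalize rows
    return float(np.sum(1.0 - np.sum(theta ** 2, axis=1)))

# A peaked topic assignment incurs a smaller penalty than a uniform one.
peaked  = np.array([[0.97, 0.01, 0.01, 0.01]])
uniform = np.array([[0.25, 0.25, 0.25, 0.25]])
assert gini_impurity_penalty(peaked) < gini_impurity_penalty(uniform)
```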
I further show that the proposed model can be easily extended to account for the global spatial layout of the latent topics by treating latent topic positions as hidden variables. Second, my approach is object-centric and robust to variations in object appearance. Since egocentric activities heavily involve manipulating objects, object features are another important source of information for recognizing egocentric activities. To effectively exploit the varied object appearance in a video, I take a set-based recognition approach and represent the target object by the set of frames contained in the video. I propose a novel kernel function, the Sparse Affine Hull kernel, which measures the similarity of two sets by the minimum distance between their sparse affine hulls. The proposed kernel also allows convenient integration of heterogeneous data modalities beyond RGB and depth. Third, my approach is state-specific and automatically learns the importance of each state. An egocentric activity by its nature involves a series of maneuvers that result in changes to the object. Effectively encoding the state transitions of an egocentric activity in terms of hand maneuvers and object changes is key to successful activity recognition. Whereas existing algorithms commonly use manually defined states to train action classifiers, I present a novel model that automatically mines discriminative states for recognizing egocentric actions. To mine discriminative states, I draw on the Sparse Affine Hull kernel and formulate a Multiple Kernel Learning framework that learns adaptive weights for the different states. Last but not least, I propose a novel algorithm for segmenting long activities into short, atomic sub-activities. Hidden Markov Models (HMMs) have long been the state of the art for modeling human activities, despite their unrealistic first-order Markov assumption and the very limited representational capacity of their hidden states.
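The minimum distance between affine hulls that underlies the Sparse Affine Hull kernel can be sketched as follows. This is the plain (non-sparse) variant, which reduces to linear least squares; the dissertation's kernel additionally regularizes the combination coefficients for sparsity, and that term is omitted here:

```python
import numpy as np

def affine_hull_distance(X, Y):
    """Minimum Euclidean distance between the affine hulls of two frame sets.

    X, Y: (d, n) arrays whose columns are per-frame feature vectors.  The
    affine hull of X's columns is x0 + span(X - x0), so the closest pair of
    points across the two hulls solves a linear least-squares problem.
    Non-sparse illustrative variant only.
    """
    x0, y0 = X[:, :1], Y[:, :1]
    A = X[:, 1:] - x0          # directions spanning the affine hull of X
    B = Y[:, 1:] - y0          # directions spanning the affine hull of Y
    M = np.hstack([A, -B])
    b = (y0 - x0).ravel()
    z, *_ = np.linalg.lstsq(M, b, rcond=None)
    return float(np.linalg.norm(M @ z - b))

def set_kernel(X, Y, gamma=1.0):
    # One common way to turn a set distance into a kernel value
    # (an assumption for illustration, not necessarily the form used).
    return np.exp(-gamma * affine_hull_distance(X, Y) ** 2)
```

For example, two parallel lines of points one unit apart yield a distance of 1, while a set compared against itself yields 0 (and a kernel value of 1).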
I propose two enhancements that significantly improve the performance of HMM-based activity segmentation and recognition: (1) Deep Neural Nets (DNNs) are used to model the observations in each state, motivated by the recent success of deep architectures in learning complex statistical correlations from high-dimensional data; (2) state-duration variables are incorporated to explicitly model the temporal span of each state, which improves contextual compatibility and eliminates incoherent activity segments. In summary, I have developed a series of algorithms aimed at the automatic interpretation of egocentric activity videos. I demonstrate that depth data benefits egocentric activity recognition in terms of target localization and feature representation, and that the proposed algorithms are significantly more robust than traditional algorithms when applied to the egocentric domain. This work contributes significantly to research on egocentric activity analysis.

Item: A scalable metric learning based voting method for expression recognition (2013-05)
Wan, Shaohua; Aggarwal, J. K. (Jagdishkumar Keshoram), 1936-

In this work, we propose a facial expression classification method based on metric-learning k-nearest-neighbor voting. To classify a facial expression accurately from frontal face images, we first learn from training data a distance metric that characterizes the structure of the feature space, then use this metric to retrieve the nearest neighbors from the training dataset, and finally output the classification decision accordingly. An expression is represented as a fusion of face shape and texture: a face image is registered with a landmarking shape model, and Gabor features are extracted from local patches around the landmarks. This representation achieves robustness and effectiveness by applying an ensemble of local patch feature detectors at the global shape level.
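The voting step described above can be sketched as follows, assuming a linear Mahalanobis map L has already been learned (the metric-learning procedure itself, e.g. LMNN or ITML, is outside this sketch):

```python
import numpy as np
from collections import Counter

def ml_knn_vote(query, train_X, train_y, L, k=3):
    """k-NN voting under a learned Mahalanobis metric.

    Distances are ||L(x - query)||, i.e. Euclidean distance after the linear
    map L learned from training data.  The predicted label is the majority
    vote among the k nearest training examples.  Illustrative sketch; L is
    assumed given.
    """
    diffs = (train_X - query) @ L.T          # map differences through L
    dists = np.linalg.norm(diffs, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

With L set to the identity this reduces to ordinary Euclidean kNN voting; a learned L stretches the space so that same-expression examples fall closer together.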
A naive implementation of metric-learning-based k-nearest-neighbor voting incurs a time complexity proportional to the size of the training dataset, which precludes its use with very large datasets. To scale to potentially larger databases, an approach similar to that in [24] is used to obtain an approximate yet efficient ML-based kNN vote via Locality Sensitive Hashing (LSH): a query example is hashed directly to a bucket of a pre-computed hash table where candidate nearest neighbors are found, so there is no need to search the entire database. Experimental results on the Cohn-Kanade database and the Moving Faces and People database show that both ML-based kNN voting and its LSH approximation outperform the state of the art, demonstrating the superiority and scalability of our method.
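The bucket-lookup idea can be sketched with random-hyperplane LSH; this is a generic illustration, not necessarily the exact scheme of [24]:

```python
import numpy as np
from collections import defaultdict

class HyperplaneLSH:
    """Random-hyperplane LSH for approximate nearest-neighbor lookup.

    Each of n_bits random hyperplanes contributes one sign bit to a bucket
    key, so a query inspects only its colliding bucket instead of scanning
    the whole database.  Illustrative sketch of the LSH idea described in
    the abstract.
    """
    def __init__(self, dim, n_bits=8, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((n_bits, dim))
        self.table = defaultdict(list)

    def _key(self, x):
        # Sign pattern of the projections onto the random hyperplanes.
        return tuple(bool(s) for s in (self.planes @ x) > 0)

    def index(self, X):
        for i, x in enumerate(X):
            self.table[self._key(x)].append(i)

    def query(self, q):
        # Candidate neighbors share the query's bucket (possibly empty).
        return self.table.get(self._key(q), [])
```

Nearby vectors tend to fall on the same side of most hyperplanes and thus land in the same bucket, so the expensive exact distance computation runs only over the handful of candidates returned by `query`.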