A hierarchical graphical model for recognizing human actions and interactions in video

dc.contributor.advisor: Aggarwal, J. K.
dc.creator: Park, Sangho
dc.date.accessioned: 2008-08-28T22:35:37Z
dc.date.available: 2008-08-28T22:35:37Z
dc.date.issued: 2004
dc.description: text
dc.description.abstract: Understanding human behavior in video data is essential in numerous applications, including smart surveillance, video annotation/retrieval, and human-computer interaction. Recognizing human interactions is a challenging task due to ambiguity in body articulation, mutual occlusion, and shadows. Past research has focused either on coarse-level recognition of human interactions or on recognition of a specific gesture of a single body part. Our objective is to develop methods that recognize human actions and interactions at a detailed level. The focus of this research is a framework for recognizing human actions and interactions in color video. This dissertation presents a hierarchical graphical model that unifies multiple levels of processing in video computing. The video, a color image sequence, is processed at four levels: pixel level, blob level, object level, and event level. A mixture-of-Gaussians (MOG) model is used at the pixel level to train on and classify individual pixel colors. Relaxation labeling with an attribute relational graph (ARG) is used at the blob level to merge pixels into coherent blobs and to register inter-blob relations. At the object level, the poses of individual body parts, including the head, torso, arms, and legs, are recognized using individual Bayesian networks (BNs), which are then integrated to obtain an overall body pose. At the event level, the actions of a single person are modeled using a dynamic Bayesian network (DBN) with temporal links between identical nodes of the Bayesian network at times t and t+1. At the same level, the object-level descriptions for each person are juxtaposed along a common timeline to identify an interaction between two persons. The linguistic 'verb argument structure' represents each human action as an <agent-motion-target> triplet, and a decision tree with spatial and temporal constraints recognizes specific interactions, yielding a meaningful semantic description in terms of <subject-verb-object>. Our method provides a user-friendly natural-language description of various human actions and interactions using event semantics. Our system correctly recognizes various human actions involving motions of the torso, arms, and/or legs, and it achieves semantic descriptions of positive interactions (shaking hands, standing hand-in-hand, hugging), neutral interactions (approaching, departing, pointing), and negative interactions (pushing, punching, kicking) between two persons. (Minimal illustrative sketches of the four processing levels follow this record.)
dc.description.department: Electrical and Computer Engineering
dc.format.medium: electronic
dc.identifier: b60834882
dc.identifier.oclc: 68965697
dc.identifier.proqst: 3144668
dc.identifier.uri: http://hdl.handle.net/2152/2160
dc.language.iso: eng
dc.rights: Copyright is held by the author. Presentation of this material on the Libraries' web site by University Libraries, The University of Texas at Austin was made possible under a limited license grant from the author, who has retained all copyrights in the works.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Motion perception (Vision)--Data processing
dc.subject.lcsh: Electronic surveillance--Data processing
dc.subject.lcsh: Graphical modeling (Statistics)
dc.subject.lcsh: Bayesian statistical decision theory
dc.title: A hierarchical graphical model for recognizing human actions and interactions in video
dc.type.genre: Thesis
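
As a concrete illustration of the pixel-level step, the following is a minimal sketch of mixture-of-Gaussians (MOG) color classification. It is not the dissertation's implementation: the class labels, weights, means, and covariances below are hypothetical placeholders that a real system would estimate from training pixels (e.g., via EM).

import numpy as np

def gaussian_pdf(x, mean, cov):
    # Multivariate normal density for a single color vector x.
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def classify_pixel(x, components):
    # Assign the pixel color to the class whose weighted Gaussian
    # likelihood is highest; components maps label -> (weight, mean, cov).
    scores = {label: w * gaussian_pdf(x, mu, cov)
              for label, (w, mu, cov) in components.items()}
    return max(scores, key=scores.get)

# Two hypothetical color classes in RGB space.
components = {
    "skin":       (0.5, np.array([180.0, 120.0, 100.0]), np.eye(3) * 400.0),
    "background": (0.5, np.array([ 60.0,  60.0,  60.0]), np.eye(3) * 900.0),
}
print(classify_pixel(np.array([175.0, 118.0, 102.0]), components))  # -> skin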
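
For the blob level, the abstract describes relaxation labeling over an attribute relational graph. Below is a generic sketch of iterative relaxation labeling; the labels, compatibility coefficients, adjacency structure, and initial probabilities are invented for illustration and are not taken from the dissertation.

def relaxation_labeling(nodes, edges, labels, compat, init, iters=10):
    # Iteratively rescale each node's label probabilities by how compatible
    # each candidate label is with the neighbors' current label beliefs.
    prob = {n: dict(init[n]) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            support = {}
            for lab in labels:
                s = prob[n][lab]
                for m in edges.get(n, []):
                    s *= sum(compat[(lab, lab2)] * prob[m][lab2] for lab2 in labels)
                support[lab] = s
            z = sum(support.values()) or 1.0
            new[n] = {lab: s / z for lab, s in support.items()}
        prob = new
    return prob

# Two blobs whose labels are mutually more compatible when they differ.
labels = ["head", "torso"]
compat = {("head", "head"): 0.2, ("head", "torso"): 0.9,
          ("torso", "head"): 0.9, ("torso", "torso"): 0.3}
init = {"b1": {"head": 0.6, "torso": 0.4}, "b2": {"head": 0.4, "torso": 0.6}}
print(relaxation_labeling(["b1", "b2"], {"b1": ["b2"], "b2": ["b1"]},
                          labels, compat, init))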
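
For the object level, where per-part Bayesian networks are integrated into an overall body pose, one simple form such integration can take is naive-Bayes fusion of part-level evidence. This is a sketch under that assumption, with made-up poses and conditional probabilities, not the networks used in the dissertation.

OVERALL_POSES = ["standing", "stretching-arm", "leaning"]

# P(part observation | overall pose); the numbers are illustrative only.
LIKELIHOOD = {
    ("torso", "upright"):  {"standing": 0.8, "stretching-arm": 0.7, "leaning": 0.1},
    ("arm",   "extended"): {"standing": 0.1, "stretching-arm": 0.8, "leaning": 0.3},
    ("legs",  "together"): {"standing": 0.7, "stretching-arm": 0.6, "leaning": 0.4},
}

def integrate_pose(observations):
    # Multiply per-part likelihoods into an unnormalized posterior over the
    # overall pose, then normalize (a uniform prior is assumed).
    posterior = dict.fromkeys(OVERALL_POSES, 1.0)
    for obs in observations:
        for pose in OVERALL_POSES:
            posterior[pose] *= LIKELIHOOD[obs][pose]
    total = sum(posterior.values())
    return {pose: p / total for pose, p in posterior.items()}

print(integrate_pose([("torso", "upright"), ("arm", "extended")]))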
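
Finally, at the event level, actions are expressed as <agent-motion-target> triplets and interactions are recognized with a decision tree over spatial and temporal constraints. The sketch below stands in for that decision tree with two hand-written rules on the inter-person distance at times t and t+1; the verbs and the threshold are hypothetical.

def recognize_interaction(dist_t0, dist_t1, contact_thresh=0.5):
    # Map the change in inter-person distance between times t and t+1 to a
    # <subject-verb-object> description for two tracked persons.
    if dist_t1 < contact_thresh:
        verb = "touches"        # contact events (shaking hands, pushing, ...)
    elif dist_t1 < dist_t0:
        verb = "approaches"     # neutral: distance decreasing
    elif dist_t1 > dist_t0:
        verb = "departs-from"   # neutral: distance increasing
    else:
        verb = "stands-near"    # no change in distance
    return ("person-A", verb, "person-B")

print(recognize_interaction(dist_t0=3.0, dist_t1=2.0))
# -> ('person-A', 'approaches', 'person-B')

A full decision tree of the kind the abstract describes would also condition on the recovered body poses, for example to separate hand-shaking from punching among contact events.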
