A hierarchical graphical model for recognizing human actions and interactions in video

dc.contributor.advisor: Aggarwal, J. K.
dc.creator: Park, Sangho
dc.date.accessioned: 2008-08-28T22:35:37Z
dc.date.available: 2008-08-28T22:35:37Z
dc.date.issued: 2004
dc.description: text
dc.description.abstract: Understanding human behavior in video data is essential in numerous applications, including smart surveillance, video annotation/retrieval, and human-computer interaction. Recognizing human interactions is a challenging task due to ambiguity in body articulation, mutual occlusion, and shadows. Past research has focused either on coarse-level recognition of human interactions or on recognition of a specific gesture of a single body part. Our objective is to develop methods that recognize human actions and interactions at a detailed level. The focus of this research is a framework for recognizing human actions and interactions in color video. This dissertation presents a hierarchical graphical model that unifies multiple levels of processing in video computing. The video, a color image sequence, is processed at four levels: pixel level, blob level, object level, and event level. A mixture-of-Gaussians (MOG) model is used at the pixel level to train on and classify individual pixel colors. Relaxation labeling with an attribute relational graph (ARG) is used at the blob level to merge pixels into coherent blobs and to register inter-blob relations. At the object level, the poses of individual body parts, including the head, torso, arms, and legs, are recognized using individual Bayesian networks (BNs), which are then integrated to obtain an overall body pose. At the event level, the actions of a single person are modeled using a dynamic Bayesian network (DBN) with temporal links between identical nodes of the Bayesian network at times t and t+1. At the same level, the object-level descriptions for each person are juxtaposed along a common timeline to identify an interaction between two persons. The linguistic 'verb argument structure' represents each human action as an <agent-motion-target> triplet, and a decision tree with spatial and temporal constraints recognizes specific interactions, yielding a meaningful semantic description in terms of <subject-verb-object>. Our method provides a user-friendly natural-language description of various human actions and interactions using event semantics. Our system correctly recognizes various human actions involving motions of the torso, arms, and/or legs, and it achieves semantic descriptions of positive interactions (shaking hands, standing hand-in-hand, hugging), neutral interactions (approaching, departing, pointing), and negative interactions (pushing, punching, kicking) between two persons. (Minimal illustrative sketches of the four processing levels follow this record.)
dc.description.department: Electrical and Computer Engineering
dc.format.medium: electronic
dc.identifier: b60834882
dc.identifier.oclc: 68965697
dc.identifier.proqst: 3144668
dc.identifier.uri: http://hdl.handle.net/2152/2160
dc.language.iso: eng
dc.rights: Copyright is held by the author. Presentation of this material on the Libraries' web site by University Libraries, The University of Texas at Austin was made possible under a limited license grant from the author, who has retained all copyrights in the works.
dc.subject.lcsh: Computer vision
dc.subject.lcsh: Motion perception (Vision)--Data processing
dc.subject.lcsh: Electronic surveillance--Data processing
dc.subject.lcsh: Graphical modeling (Statistics)
dc.subject.lcsh: Bayesian statistical decision theory
dc.title: A hierarchical graphical model for recognizing human actions and interactions in video
dc.type.genre: Thesis
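
As a concrete illustration of the pixel-level step, the following is a minimal sketch of mixture-of-Gaussians (MOG) color classification. It is not the dissertation's implementation: the class labels, weights, means, and covariances below are hypothetical placeholders that a real system would estimate from training pixels (e.g., via EM).

import numpy as np

def gaussian_pdf(x, mean, cov):
    # Multivariate normal density for a single color vector x.
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm

def classify_pixel(x, components):
    # Assign the pixel color to the class whose weighted Gaussian
    # likelihood is highest; components maps label -> (weight, mean, cov).
    scores = {label: w * gaussian_pdf(x, mu, cov)
              for label, (w, mu, cov) in components.items()}
    return max(scores, key=scores.get)

# Two hypothetical color classes in RGB space.
components = {
    "skin":       (0.5, np.array([180.0, 120.0, 100.0]), np.eye(3) * 400.0),
    "background": (0.5, np.array([ 60.0,  60.0,  60.0]), np.eye(3) * 900.0),
}
print(classify_pixel(np.array([175.0, 118.0, 102.0]), components))  # -> skin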
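
For the blob level, the abstract describes relaxation labeling over an attribute relational graph. Below is a generic sketch of iterative relaxation labeling; the labels, compatibility coefficients, adjacency structure, and initial probabilities are invented for illustration and are not taken from the dissertation.

def relaxation_labeling(nodes, edges, labels, compat, init, iters=10):
    # Iteratively rescale each node's label probabilities by how compatible
    # each candidate label is with the neighbors' current label beliefs.
    prob = {n: dict(init[n]) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            support = {}
            for lab in labels:
                s = prob[n][lab]
                for m in edges.get(n, []):
                    s *= sum(compat[(lab, lab2)] * prob[m][lab2] for lab2 in labels)
                support[lab] = s
            z = sum(support.values()) or 1.0
            new[n] = {lab: s / z for lab, s in support.items()}
        prob = new
    return prob

# Two blobs whose labels are mutually more compatible when they differ.
labels = ["head", "torso"]
compat = {("head", "head"): 0.2, ("head", "torso"): 0.9,
          ("torso", "head"): 0.9, ("torso", "torso"): 0.3}
init = {"b1": {"head": 0.6, "torso": 0.4}, "b2": {"head": 0.4, "torso": 0.6}}
print(relaxation_labeling(["b1", "b2"], {"b1": ["b2"], "b2": ["b1"]},
                          labels, compat, init))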
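
For the object level, where per-part Bayesian networks are integrated into an overall body pose, one simple form such integration can take is naive-Bayes fusion of part-level evidence. This is a sketch under that assumption, with made-up poses and conditional probabilities, not the networks used in the dissertation.

OVERALL_POSES = ["standing", "stretching-arm", "leaning"]

# P(part observation | overall pose); the numbers are illustrative only.
LIKELIHOOD = {
    ("torso", "upright"):  {"standing": 0.8, "stretching-arm": 0.7, "leaning": 0.1},
    ("arm",   "extended"): {"standing": 0.1, "stretching-arm": 0.8, "leaning": 0.3},
    ("legs",  "together"): {"standing": 0.7, "stretching-arm": 0.6, "leaning": 0.4},
}

def integrate_pose(observations):
    # Multiply per-part likelihoods into an unnormalized posterior over the
    # overall pose, then normalize (a uniform prior is assumed).
    posterior = dict.fromkeys(OVERALL_POSES, 1.0)
    for obs in observations:
        for pose in OVERALL_POSES:
            posterior[pose] *= LIKELIHOOD[obs][pose]
    total = sum(posterior.values())
    return {pose: p / total for pose, p in posterior.items()}

print(integrate_pose([("torso", "upright"), ("arm", "extended")]))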
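
Finally, at the event level, actions are expressed as <agent-motion-target> triplets and interactions are recognized with a decision tree over spatial and temporal constraints. The sketch below stands in for that decision tree with two hand-written rules on the inter-person distance at times t and t+1; the verbs and the threshold are hypothetical.

def recognize_interaction(dist_t0, dist_t1, contact_thresh=0.5):
    # Map the change in inter-person distance between times t and t+1 to a
    # <subject-verb-object> description for two tracked persons.
    if dist_t1 < contact_thresh:
        verb = "touches"        # contact events (shaking hands, pushing, ...)
    elif dist_t1 < dist_t0:
        verb = "approaches"     # neutral: distance decreasing
    elif dist_t1 > dist_t0:
        verb = "departs-from"   # neutral: distance increasing
    else:
        verb = "stands-near"    # no change in distance
    return ("person-A", verb, "person-B")

print(recognize_interaction(dist_t0=3.0, dist_t1=2.0))
# -> ('person-A', 'approaches', 'person-B')

A full decision tree of the kind the abstract describes would also condition on the recovered body poses, for example to separate hand-shaking from punching among contact events.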
