Browsing by Subject "Computer vision"
Now showing 1 - 20 of 32

Item: 3-D modelling and classification in automated target recognition (Texas Tech University, 1990-08)
Contributors: Nutter, Brian
Automated target recognition (ATR) using a computer vision system is a problem of extremely high complexity. A 3-D object recognition scheme involves many image analysis and enhancement techniques, including image processing, image segmentation, image registration, and object modeling and projection. This dissertation addresses the problem of 3-D object recognition using five distinct methods of matching image data with model projection-derived data. In analyzing each digitized video image, a variety of techniques, including an optimal gray-level map for correlating binary line drawings with gradient images, were used to enhance the visibility of particular features and to increase signal-to-noise ratios. The shapes extracted from these enhanced images were then analyzed in a number of ways, including the statistical descriptors of the Karhunen-Loeve transformation. The first of the five object identification methods compared descriptors of the object under analysis with those of model projections meeting certain criteria. The second compared the object descriptors to those of a precalculated series of model projections. The third method used the descriptions of the second method as a starting point for a neural network, which then learned the differences between these model projections and actual data. The neural net as realized demonstrated a great reduction in training time over conventional implementations, and its learning capability greatly reduced the calibration difficulties of the other methods. The fourth method cross-correlated the optimally mapped gradient of the object image with a series of model projections. Finally, ways of combining these methods to exploit the strengths of each were investigated. Cross-correlation achieved superior accuracy, and optimal techniques that significantly reduced the number of required correlations, and hence the computational load, were also found to give very accurate results.
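
The dissertation's best accuracy came from cross-correlating optimally mapped gradient images with model projections. As a rough illustration of the scoring step only, here is a minimal normalized cross-correlation sketch in Python; the synthetic arrays and the ranking loop are illustrative assumptions, not the dissertation's implementation.

    import numpy as np

    def normalized_cross_correlation(image, template):
        # Zero-mean, unit-norm correlation score between two
        # same-shape arrays; higher means a better match.
        a = image - image.mean()
        b = template - template.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float((a * b).sum() / denom) if denom else 0.0

    # Rank candidate model projections against an observed gradient
    # image; the best-scoring projection identifies the target pose.
    rng = np.random.default_rng(0)
    observed = rng.random((64, 64))
    projections = [rng.random((64, 64)) for _ in range(5)]
    scores = [normalized_cross_correlation(observed, p) for p in projections]
    best = int(np.argmax(scores))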

Item: A graph model for scene based image analysis and classification using epipolar geometry (Texas Tech University, 2005-05)
Contributors: Muthukumar, Sivabalan; Sinzinger, Eric D.; Hernandez, Hector J.; Lakhani, Gopal
This thesis presents a system that analyzes a collection of images and generates a model describing the relationships between them. The main focus is to develop a system that can answer two questions: Do the images match? Are they part of the same scene? The answers can be used to classify the image collection into distinct groups. This transformation is achieved in three stages of processing. The first stage is concerned with the detection and extraction of features from images. The second stage focuses on matching the extracted features and determining the epipolar geometry. The final stage uses the results of the previous stage to build a graph model. The aim is to efficiently capture and represent the relationships between the images in this model, and to use its answers about the nature of those relationships to classify the collection into distinct groups based on scene analysis.
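
The pairwise "same scene?" test at the heart of this model can be approximated with standard tools: match local features between two images and check whether a fundamental matrix with enough RANSAC inliers exists. A hedged sketch using OpenCV follows; the ORB detector, the inlier threshold, and the edge-set graph representation are assumptions for illustration, and the thesis's own feature and matching criteria may differ.

    import cv2
    import numpy as np

    def same_scene(img1, img2, min_inliers=15):
        # Match ORB features, then test for a consistent epipolar
        # geometry; enough RANSAC inliers suggests a shared scene.
        orb = cv2.ORB_create()
        k1, d1 = orb.detectAndCompute(img1, None)
        k2, d2 = orb.detectAndCompute(img2, None)
        if d1 is None or d2 is None:
            return False
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(d1, d2)
        if len(matches) < 8:          # 8 points needed to estimate F
            return False
        p1 = np.float32([k1[m.queryIdx].pt for m in matches])
        p2 = np.float32([k2[m.trainIdx].pt for m in matches])
        F, mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC)
        return F is not None and int(mask.sum()) >= min_inliers

    def build_graph(images):
        # Graph model: nodes are images, edges link same-scene pairs.
        edges = set()
        for i in range(len(images)):
            for j in range(i + 1, len(images)):
                if same_scene(images[i], images[j]):
                    edges.add((i, j))
        return edges

Connected components of this graph would then give the distinct scene groups.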

Item: Active learning of an action detector on untrimmed videos (2013-05)
Contributors: Bandla, Sunil; Grauman, Kristen Lorraine, 1979-
Collecting and annotating videos of realistic human actions is tedious, yet critical for training action recognition systems. We propose a method to actively request the most useful video annotations among a large set of unlabeled videos. Predicting the utility of annotating an unlabeled video is not trivial, since any given clip may contain multiple actions of interest, and it need not be trimmed to temporal regions of interest. To deal with this problem, we propose a detection-based active learner to train action category models. We develop a voting-based framework to localize likely intervals of interest in an unlabeled clip, and use them to estimate the total reduction in uncertainty that annotating that clip would yield. On three datasets, we show our approach can learn accurate action detectors more efficiently than alternative active learning strategies that fail to accommodate the "untrimmed" nature of real video data.

Item: Advanced techniques for digital image processing (Texas Tech University, 1986-05)
Contributors: Tarng, Jaw-horng
A new algorithm for enhancing a degraded grey-scale image is proposed. The enhancement algorithm is a locally adaptive Fourier filter that locates and analyzes the Fourier spectral information and then enhances the identifying features, achieving better results than conventional homomorphic FFT techniques. By using a short-space basis implementation, a large amount of memory is saved and the computation speed is greatly improved. The primary objective of the algorithm is to extract linear features from a noisy image, but it can also be modified to enhance other kinds of features. Its main advantages are:
1. It requires a small amount of computer memory, making it easy to implement on small computers.
2. It has fast processing speed.
3. It is powerful in extracting local linear features.
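
The short-space idea above, processing the image block by block so only one block's spectrum is in memory at a time, can be sketched as follows. The block size and the crude low-frequency mask are illustrative assumptions, not the thesis's actual filter design.

    import numpy as np

    def block_fourier_filter(image, block=32, keep_low=4):
        # Process the image in small tiles so each tile's spectrum
        # can be shaped independently; memory use stays proportional
        # to one tile rather than the whole image.
        h, w = image.shape
        out = np.zeros_like(image, dtype=float)
        for y in range(0, h, block):
            for x in range(0, w, block):
                tile = image[y:y+block, x:x+block].astype(float)
                spec = np.fft.fftshift(np.fft.fft2(tile))
                # Illustrative mask: suppress the lowest frequencies
                # to emphasize local linear features (edges, lines).
                cy, cx = np.array(spec.shape) // 2
                spec[cy-keep_low:cy+keep_low, cx-keep_low:cx+keep_low] = 0
                out[y:y+block, x:x+block] = np.fft.ifft2(
                    np.fft.ifftshift(spec)).real
        return out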

Item: Analog VLSI implementation of a Gabor convolution for real time image processing (Texas Tech University, 1996-05)
Contributors: Moldovan, Laszlo
Not available.

Item: Automated Registration of Point Clouds with High Resolution Photographs and Rendering Under Novel Illumination Conditions (2010-12)
Contributors: Srisinroongruang, Rattasak; Sinzinger, Eric D.; Hoo, Karlene A.; Youn, Eunseog; Lakhani, Gopal
With the increased computing power of modern technology, it has become feasible to digitally capture real-world scenes and objects, preserving them indefinitely. Digital capture also provides the flexibility to re-visualize a scene under novel illumination conditions that may never occur at its real location. These two tools, scene capture and redisplay, are the focus of this work. Scene capture requires recording the spatial and intensity data of a real-world scene, accomplished using LIDAR (a method of laser positioning) and photographic cameras respectively. Once acquired, the data sets need to be registered together: the computation of a mathematical transform that maps the photographic images onto the spatial data. Typically this has required a significant amount of user intervention or the placement of distinguishing markers in the real scene. To remove these requirements and handle large data sets, this research presents methods to automatically compute the transforms between data sets with minimal manual intervention, posing the problem as an optimization over an objective function based on a novel error metric. The redisplay portion of the research presents a novel rendering equation that takes cues from a photograph and realistically inserts a synthetic object into the environment depicted in it, allowing the object to react realistically to illumination conditions that may differ substantially from those under which the object or scene was captured.

Item: Automatic fabric dimensional distortion measurement and wrinkle evaluation (Texas Tech University, 2001-05)
Contributors: Dai, Yongmei
Dimensional change and wrinkling are two of the most significant fabric quality factors, so their measurement and evaluation are crucial to the textile industry. Currently, industrial fabric dimensional-change measurements are done manually by technicians with rulers, and wrinkling is evaluated by technicians visually comparing the perceived wrinkling to a set of visual standards. Both methods are highly subjective and, therefore, not very accurate. In recent years there has been increasing demand from industry for objective, automatic approaches to replace them. This thesis describes a novel computer-aided method developed to automatically measure fabric shrinkage and wrinkling with very high accuracy. Fabric shrinkage is measured as the percentage change between the initial and final dimensions of the fabric specimen. To measure shrinkage, a pattern of benchmarks is drawn on fabric samples before they are washed. These marks serve as registration points to facilitate comparison before and after washing. Digital images of the marked fabric samples are obtained by scanning the samples with a standard flatbed scanner. The problem is to develop a computer program that can accurately locate the marks in the digital images regardless of the color of both the marks (which is chosen at random) and the samples, and that is robust to noise in the fabric such as lint and dirt spots. By obtaining the position translation of the marks before and after washing, shrinkage can be calculated. A detailed description of the methods used to accomplish these tasks is given in the thesis.
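
Once the benchmarks have been located in the before and after scans, the shrinkage computation itself is simple. A minimal sketch, assuming matched mark centroids are already available; the example coordinates are fabricated for illustration.

    import numpy as np

    def shrinkage_percent(marks_before, marks_after):
        # Percent dimensional change from matched benchmark pairs.
        # Each input is an (N, 2) array of mark centroids in pixels,
        # with row i of marks_after matching row i of marks_before.
        before = np.asarray(marks_before, dtype=float)
        after = np.asarray(marks_after, dtype=float)

        def pairwise(p):
            # All mark-to-mark Euclidean distances.
            diff = p[:, None, :] - p[None, :, :]
            return np.sqrt((diff ** 2).sum(-1))

        d0, d1 = pairwise(before), pairwise(after)
        mask = d0 > 0                    # skip self-distances
        return float(100.0 * np.mean((d0[mask] - d1[mask]) / d0[mask]))

    # Illustrative marks: a square of benchmarks that contracted by 3%.
    before = np.array([[0, 0], [100, 0], [0, 100], [100, 100]])
    print(shrinkage_percent(before, before * 0.97))   # ~3.0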

Item: Calibration and three-dimensional reconstruction using epipolar constraints on a structured light computer vision system (Texas Tech University, 1997-05)
Contributors: Lin, Changxing
A new structured light computer vision system was developed to determine three-dimensional geometry information of objects. The system is composed of a dot-matrix-pattern laser projector and two cameras, labeled A and B. Camera A is the main camera; camera B functions as a checking device to determine the correct image matching between the main image and the projector, and is therefore called the checking camera. This dissertation makes three contributions. First, a new camera calibration technique is provided, in which the image center, uncertainty scale factor, camera focal length, rotation matrix, and translation vector can be determined using at least seven noncoplanar calibration points; the orthogonality of the rotation matrix is satisfied not only theoretically but also numerically in actual calibration; all intrinsic and extrinsic parameters are determined using the same set of data; no assumption is needed for the world coordinate system setup; and no nonlinear techniques are required. Second, a new linear approach is developed for estimating the epipolar lines on the main camera relative to the projector. Existing methods cannot guarantee that all image points on the same epipolar line on the main camera have the same corresponding epipolar line on the projector, which violates the epipolar geometric constraints; the approach developed here does guarantee this. Third, two checking-point equations are given to determine the correct image matching among the main image, the checking image, and the projector. These methods require only the epipolar lines on the projector relative to the main camera; calibration of the projector is not required. A review of the state of the art is given in the first three chapters, and all methods developed here were verified experimentally.
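
The dissertation's linear calibration is its own formulation, but the flavor of recovering camera parameters from noncoplanar points without nonlinear techniques can be conveyed by the standard direct linear transform, sketched below. This is the textbook DLT, not the author's method.

    import numpy as np

    def dlt_projection_matrix(world_pts, image_pts):
        # Direct linear transform: recover the 3x4 camera projection
        # matrix P from >= 6 noncoplanar world/image correspondences
        # by solving the homogeneous system A p = 0 with an SVD.
        rows = []
        for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
            rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
            rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
        A = np.asarray(rows, dtype=float)
        _, _, vt = np.linalg.svd(A)
        return vt[-1].reshape(3, 4)   # right singular vector of smallest sigma

The intrinsic and extrinsic parameters can then be factored out of P; the dissertation's contribution is a formulation in which the recovered rotation is numerically orthogonal as well.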

Item: CCD imaging with a TMS34010 graphics system processor (Texas Tech University, 1990-05)
Contributors: Mueller, Curtis Wayne
Not available.

Item: Developing computer-generated stereoscopic haptic images (Texas Tech University, 1998-12)
Contributors: Watson, Kirk L
Not available.

Item: Discriminative object categorization with external semantic knowledge (2013-08)
Contributors: Hwang, Sung Ju; Grauman, Kristen Lorraine, 1979-
Visual object category recognition is one of the most challenging problems in computer vision. Even assuming a near-perfect instance-level representation from advances in visual input devices and low-level vision techniques, object categorization remains difficult because it requires drawing boundaries between instances in a continuous world, where the boundaries are defined solely by human conceptualization. Object categorization is essentially a perceptual process that takes place in a human-defined semantic space, in which categories reside not in isolation but in relation to others: some categories are similar, grouped, or co-occurring, and some are not. Despite this semantic nature of object categorization, most of today's automatic visual category recognition systems rely only on the category labels when training discriminative recognition models with statistical machine learning techniques. In many cases this misleads the recognition model into learning incorrect associations between visual features and semantic labels, essentially overfitting to training-set biases, which limits the model's predictive power on new test instances.

Using semantic knowledge has great potential to benefit object category recognition. First, semantic knowledge can guide the model toward correct associations between visual features and categories. Second, semantics provide much richer information than the membership information given by labels, in the form of inter-category and category-attribute distances, relations, and structures. Finally, semantic knowledge scales well, as the relations between categories grow with an increasing number of categories. My goal in this thesis is to learn discriminative models for categorization that leverage semantic knowledge, with a special focus on the semantic relationships among different categories and concepts. To this end, I explore three semantic sources, namely attributes, taxonomies, and analogies, and show how to incorporate them into the discriminative model as a form of structural regularization. For each form of semantic knowledge I present a feature learning approach that defines a semantic embedding to support the categorization task; the regularization penalizes models that deviate from the structures given by the semantic knowledge.

The first semantic source is attributes, which are human-describable semantic characteristics of an instance. While existing work treated them as mid-level features that introduced no new information, I focus on their potential to better guide the learning of object categories, by enforcing that object category classifiers share features with attribute classifiers in a multitask feature learning framework. This approach discovers the common low-dimensional features that support predictions in both semantic spaces. I then move to semantic taxonomies, whose merging and splitting criteria are human-defined and thus carry implicit semantic knowledge. Specifically, I propose a tree of metrics (ToM) that learns metrics capturing granularity-specific similarities at different nodes of a given semantic taxonomy, with a regularizer that isolates granularity-specific disjoint features. This captures the intuition that the features used to discriminate a parent class should differ from those used for its children classes, and the learned metrics can be used for hierarchical classification. Because a single taxonomy's structure may not be optimal for hierarchical classification, and no single semantic taxonomy may align perfectly with visual distributions, I next propose to leverage multiple taxonomies, combining the complementary information acquired across multiple semantic views and granularities; this allows us, for example, to synthesize semantics from both 'Biological' and 'Appearance'-based taxonomies when learning visual features. Finally, going beyond the pairwise similarities of the previous two models, I exploit analogies, which encode the relational similarities between two related pairs of categories. Analogies regularize a discriminatively learned semantic embedding space for categorization, such that the displacements between the two category embeddings in both category pairs of the analogy are enforced to be the same; this lets a more confusable pair of categories benefit from the clear separation in a matched pair sharing the same relation.

All of these methods are evaluated on challenging public datasets and shown to effectively improve recognition accuracy over purely discriminative models, while also guiding recognition to be more semantically aligned with human perception. The applications are not limited to visual object categorization in computer vision: the methods apply to any classification problem with domain knowledge about the relationships or structures between classes, such as document classification in natural language processing or gene-based animal and protein classification in computational biology.

Item: DSP-enhanced vision recognition (Texas Tech University, 1990-12)
Contributors: Sharbutt, Albert C
This thesis addresses the time involved in performing image recognition and image processing. Digital signal processing chips can be used alone or as additions to existing processors to increase the throughput and versatility of these systems. The thesis does not seek to develop a vision recognition system, but examines several common vision recognition tasks to determine how much improvement a digital signal processing chip could offer and how much effort is required to achieve this benefit.

Item: En-co.de : a web service for augmenting physical objects with an active digital presence (2013-08)
Contributors: Westing, Brandt Michael; Aziz, Adnan
It is now possible for physical objects to have a dynamic digital presence via active identification codes that can be scanned by ubiquitous devices such as smartphones or tablets. En-co.de is a web service for the generation, storage, retrieval, and augmentation of metadata associated with physical objects. The service is built on the concept that metadata associated with an object can be retrieved through a Quick Response (QR) coded URL. En-co.de links a physical entity to a digital archive of information and provides services such as geolocation and SMS text alerts when an object's identifier, or tag, is scanned. This report provides an analysis of QR code qualitative characteristics, the architecture of the en-co.de web service, a scalability study of that architecture, and results from the completed service in production, complemented by an evaluation of comparable identification schemes and web services.
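
The core mechanism, a QR-coded URL that resolves a physical object to its metadata record, takes only a few lines with the widely used qrcode package. The URL below is a placeholder for illustration, not a live en-co.de endpoint.

    import qrcode  # pip install qrcode

    # Encode an object's metadata URL; scanning the printed tag
    # resolves the physical object to its digital record.
    img = qrcode.make("https://en-co.de/objects/example-id")  # placeholder
    img.save("object_tag.png")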

Item: Face recognition from video (2011-12)
Contributors: Harguess, Joshua David; Aggarwal, J. K. (Jagdishkumar Keshoram), 1936-; Bovik, Al; Ghosh, Joydeep; Grauman, Kristen; Ryoo, Michael
While the area of face recognition has been extensively studied in recent years, it remains a largely open problem, despite what movie and television studios would lead you to believe. Frontal, still face recognition research has seen much success in recent years from many different researchers. However, the accuracy of such systems can be greatly diminished by increasing the variability of the database, occluding the face, or varying the illumination of the face. Varying the pose of the face (yaw, pitch, and roll) and the facial expression (smile, frown, etc.) adds even more complexity to the face recognition task, as in the case of face recognition from video. In a realistic video surveillance setting, a face recognition system should be robust to scale, pose, resolution, and occlusion, and should successfully track the face between frames; a more advanced system should also be able to improve the recognition result by utilizing the information present in multiple video cameras. We approach the problem of face recognition from video in the following manner. We assume that the training data consists only of still images, such as passport photos or mugshots in a real-world system, and we transform the problem of face recognition from video into a still face recognition problem. Our research focuses on detecting, tracking, and extracting face information from video frames so that it can be utilized effectively in a still face recognition system. We have developed four novel methods that assist in face recognition from video and multiple cameras. The first uses a patch-based method to handle recognition when only patches, or parts, of the face are seen in a video, such as when the face is frequently occluded. The second fuses the recognition results of multiple cameras to improve recognition accuracy. The third utilizes multiple overlapping video cameras to improve the face tracking result, and thus the recognition accuracy, with an additional methodology to detect and handle occlusion so that unwanted information is not used in the tracking algorithm. Finally, we introduce the average-half-face, which is shown to improve still face recognition by utilizing the symmetry of the face; to help understand its use, an analysis of the effect of face symmetry on face recognition results is presented.
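
The average-half-face construction can be sketched directly: mirror one half of an aligned face onto the other and average the two. A minimal sketch, assuming the face image is already aligned with the symmetry axis at the vertical midline, which the thesis's pipeline would need to establish first.

    import numpy as np

    def average_half_face(face):
        # Average the left half of an aligned face with the mirrored
        # right half, halving the input width while exploiting the
        # face's left/right symmetry.
        h, w = face.shape
        half = w // 2
        left = face[:, :half].astype(float)
        right_mirrored = face[:, w - half:][:, ::-1].astype(float)
        return (left + right_mirrored) / 2.0

The reduced representation is then fed to a standard still-image recognizer in place of the full face.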

Item: Foveated object recognition by corner search (2008-05)
Contributors: Arnow, Thomas Louis, 1946-; Bovik, Alan C. (Alan Conrad), 1958-; Geisler, Wilson S.
Here we describe a gray-scale object recognition system based on foveated corner finding, the computation of sequential fixation points, and elements of Lowe's SIFT transform. The system achieves rotation-, translation-, and limited scale-invariant object recognition, producing recognition decisions from data extracted at sequential fixation points. It is broken into two logical steps. The first is to develop principles of foveated visual search and automated fixation selection to accomplish corner search. The result is a new algorithm for finding corners which is also a corner-based algorithm for aiming computed foveated visual fixations: long saccades move the fovea to previously unexplored areas of the image, while short saccades improve the accuracy of putative corner locations. The system is tested on two natural scenes. As an interesting comparison study, we compare fixations generated by the algorithm with those of subjects viewing the same images while their eye movements are recorded by an eyetracker, using an information-theoretic measure. Results show that the algorithm is a good locator of corners, but does not correlate particularly well with human visual fixations. The second step uses the located corners that meet certain goodness criteria as keypoints in a modified version of the SIFT algorithm with two scales. This implementation creates a database of SIFT features of known objects. To recognize an unknown object, a corner is located and a feature vector created, then compared with those in the database of known objects; the process continues for each corner in the unknown object until enough information has accumulated to reach a decision. The system was tested on 78 gray-scale objects, hand tools and airplanes, and shown to perform well.
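
As a rough stand-in for the thesis's saccade-driven corner search, a standard corner detector can supply candidate fixation points; the detector choice and parameters below are assumptions for illustration only.

    import cv2
    import numpy as np

    def candidate_fixations(gray, max_corners=25):
        # Detect strong corners to serve as candidate fixation points,
        # a stand-in for the thesis's foveated, saccade-based search.
        corners = cv2.goodFeaturesToTrack(
            gray, maxCorners=max_corners, qualityLevel=0.05, minDistance=10)
        if corners is None:
            return np.empty((0, 2))
        return corners.reshape(-1, 2)  # one (x, y) per corner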

Item: Fusion of depth and intensity data for three-dimensional object representation and recognition (Texas Tech University, 1991-12)
Contributors: Ramirez Cortes, Juan Manuel
For humans, retinal images provide sufficient information for the complete understanding of three-dimensional (3-D) shapes in a scene. The ultimate goal of computer vision is to develop an automated system able to reproduce some of the tasks performed naturally by human beings, such as recognition, classification, or analysis of the environment as a basis for further decisions. At the first level, referred to as early computer vision, the task is to extract symbolic descriptive information about a scene from a variety of sensory data. The second level is concerned with classification, recognition, or decision systems and the related heuristics that aid the processing of the available information. This research concerns a new approach to 3-D object representation and recognition using an interpolation scheme applied to the fusion of range and intensity data. The range image acquisition uses a methodology based on a passive stereo-vision model originally developed for use with a sequence of images. However, curved features, large disparities, and noisy input images are some of the problems associated with real imagery that must be addressed before applying matching techniques in the spatial frequency domain, and some of these problems can only be solved by computationally intensive spatial-domain algorithms. Regularization techniques are explored for surface recovery from sparse range data, and intensity images are incorporated in the final representation of the surface. As an important application, the 3-D representation of retinal images for extraction of quantitative information is addressed. Range information is also combined with intensity data to provide a more accurate numerical description based on aspect graphs; this representation is used as input to a three-dimensional object recognition system, resulting in improved performance of 3-D object classifiers.

Item: A hierarchical graphical model for recognizing human actions and interactions in video (2004)
Contributors: Park, Sangho; Aggarwal, J. K.
Understanding human behavior in video data is essential in numerous applications including smart surveillance, video annotation/retrieval, and human-computer interaction. Recognizing human interactions is a challenging task due to ambiguity in body articulation, mutual occlusion, and shadows. Past research has focused on coarse-level recognition of human interactions or on the recognition of a specific gesture of a single body part; our objective is to recognize human actions and interactions at a detailed level. This dissertation presents a hierarchical graphical model that unifies multiple-level processing in video computing. The video, a color image sequence, is processed at four levels: pixel level, blob level, object level, and event level. A mixture of Gaussians (MOG) model is used at the pixel level to train and classify individual pixel colors. Relaxation labeling with an attributed relational graph (ARG) is used at the blob level to merge pixels into coherent blobs and to register inter-blob relations. At the object level, the poses of individual body parts including head, torso, arms, and legs are recognized using individual Bayesian networks (BNs), which are then integrated to obtain an overall body pose. At the event level, the actions of a single person are modeled using a dynamic Bayesian network (DBN) with temporal links between identical nodes of the Bayesian network at times t and t+1, and the object-level descriptions for each person are juxtaposed along a common timeline to identify interactions between two persons. The linguistic 'verb argument structure' is used to represent human action in terms of triplets, yielding a meaningful semantic description, and spatial and temporal constraints drive a decision tree that recognizes specific interactions. Our method provides a user-friendly natural-language description of various human actions and interactions using event semantics. The system correctly recognizes various human actions involving motions of the torso, arms, and/or legs, and achieves semantic descriptions of positive interactions (hand-shaking, standing hand-in-hand, hugging), neutral interactions (approaching, departing, pointing), and negative interactions (pushing, punching, kicking) between two persons.
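
The pixel-level stage trains a mixture-of-Gaussians color model. OpenCV's MOG2 background subtractor, which maintains per-pixel Gaussian mixtures, is a readily available analogue; it is a stand-in for the dissertation's trained per-class MOG, and the video file name below is a placeholder.

    import cv2

    # Mixture-of-Gaussians pixel modeling, as at the dissertation's
    # pixel level; MOG2 keeps per-pixel Gaussian mixtures up to date.
    subtractor = cv2.createBackgroundSubtractorMOG2(history=200,
                                                    detectShadows=True)

    cap = cv2.VideoCapture("interaction.avi")  # placeholder file name
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)  # foreground blobs plus shadows
    cap.release()

The resulting foreground masks would then feed the blob-level grouping stage.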

Item: Interactive image search with attributes (2014-08)
Contributors: Kovashka, Adriana Ivanova; Grauman, Kristen Lorraine, 1979-
An image retrieval system needs to be able to communicate with people using a common language if it is to serve its user's information need. I propose techniques for interactive image search with the help of visual attributes, which are high-level semantic visual properties of objects (like "shiny" or "natural") that are understandable by both people and machines. My thesis explores attributes as a novel form of user input for search. I show how to use attributes to provide relevance feedback for image search; how to optimally choose what to seek feedback on; how to ensure that the attribute models learned by a system align with the user's perception of these attributes; how to automatically discover the shades of meaning that users employ when applying an attribute term; and how attributes can help learn object category models. I use attributes to give the user of an image retrieval system a channel on which to communicate her information need precisely and with as little effort as possible.

One-shot retrieval is generally insufficient, so interactive retrieval systems seek feedback from the user on the currently retrieved results and adapt their relevance ranking function accordingly. In traditional interactive search, users mark some images as "relevant" and others as "irrelevant", but this form of feedback is limited. I propose a novel mode of feedback where a user directly describes how high-level properties of retrieved images should be adjusted in order to more closely match her envisioned target images, using relative attribute feedback statements. For example, when conducting a query on a shopping website, the user might state: "I want shoes like these, but more formal." I demonstrate that relative attribute feedback is more powerful than traditional binary feedback.

The images believed to be most relevant need not be most informative for reducing the system's uncertainty, so it can be beneficial to seek feedback on something other than the top-ranked images. I propose to guide the user through a coarse-to-fine search using a relative attribute image representation. At each iteration of feedback, the user provides a visual comparison between the attribute in her envisioned target and a "pivot" exemplar, where a pivot separates all database images into two balanced sets. The system actively determines along which of multiple such attributes the user's comparison should next be requested, based on the expected information gain that would result. The proposed attribute search trees allow us to limit the scan for candidate images on which to seek feedback to just one image per attribute, so the approach is efficient for both the system and the user.

No matter how powerful the form of feedback, search efficiency will suffer if there is noise on the communication channel between the user and the system. Therefore, I also study ways to capture the user's true perception of the attribute vocabulary used in the search. Existing work assumes an image has a single "true" label for each attribute that objective viewers could agree upon; however, multiple objective viewers frequently have slightly different internal models of a visual property. I pose user-specific attribute learning as an adaptation problem: the system leverages any commonalities in perception to learn a generic prediction function, then uses a small number of user-labeled examples to adapt that model into a user-specific prediction function. To further lighten the labeling load, I introduce two ways to extrapolate beyond the labels explicitly provided by a given user. While users differ in how they use the attribute vocabulary, there exist commonalities and groupings of users around their attribute interpretations, and automatically discovering and exploiting these groupings can help the system learn more robust personalized models. I propose an approach that discovers the latent factors behind how users label images with the presence or absence of a given attribute, from a sparse label matrix, then clusters users in this latent space to expose the underlying "shades of meaning" of the attribute and learns personalized models for these user groups. Discovering the shades of meaning also serves to disambiguate attribute terms and expand a core attribute vocabulary with finer-grained attributes.

Finally, I show how attributes can help learn object categories faster. I develop an active learning framework where the computer vision learning system actively solicits annotations from a pool of both object category labels and the objects' shared attributes, depending on which will most reduce total uncertainty for multi-class object predictions in the joint object-attribute model. Knowledge of an attribute's presence in an image can immediately influence many object models, since attributes are by definition shared across subsets of the object categories. The resulting object category models can be used when the user initiates a search via keywords such as "Show me images of cats" and then (optionally) refines that search with the attribute-based interactions I propose. My thesis exploits properties of visual attributes that allow search to be both effective and efficient, in terms of both user time and computation time, and shows how the search experience for each individual user can be improved by modeling how she uses attributes to communicate with the retrieval system. By integrating the computer vision and information retrieval perspectives on image search, the techniques I propose are a promising step toward closing the semantic gap.
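
A minimal sketch of relative attribute feedback: each statement like "more formal than image 3" becomes a constraint, and database images are re-ranked by how many constraints their precomputed attribute scores satisfy. The constraint-counting scheme here is a simplification of the thesis's ranking-function updates, and the data is synthetic.

    import numpy as np

    def rerank(attr_scores, feedback):
        # Re-rank database images from relative attribute feedback.
        # attr_scores: (n_images, n_attributes) predicted strengths.
        # feedback: list of (attribute_index, reference_image, direction)
        #           with direction +1 for "more" and -1 for "less".
        satisfied = np.zeros(len(attr_scores))
        for attr, ref, direction in feedback:
            diff = attr_scores[:, attr] - attr_scores[ref, attr]
            satisfied += (np.sign(diff) == direction)
        return np.argsort(-satisfied)   # most-satisfying images first

    # "I want shoes like image 3, but more formal" (attribute 0 = formal).
    scores = np.random.default_rng(1).random((100, 5))
    ranking = rerank(scores, [(0, 3, +1)])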

Item: Learning to recognize egocentric activities using RGB-D data (2015-12)
Contributors: Wan, Shaohua; Aggarwal, J.K. (Jagdishkumar Keshoram), 1936-; Swartzlander, Earl E., Jr., 1945-; Grauman, Kristen; Geisler, Wilson; de Veciana, Gustavo; Dhillon, Inderjit
Two recent trends are changing the landscape of vision-based activity recognition. On one hand, wearable cameras have become widely used for recording daily-life activities; with a growing number of egocentric videos generated, there is an increasing need for computer vision algorithms tailored to the egocentric paradigm. On the other hand, advances in sensing technologies, especially the introduction of Kinect-style depth sensors, have greatly facilitated the measurement of distance information in the 3D world. The aim of my work is to develop algorithms for egocentric activity recognition using RGB-D data. Compared to conventional approaches to third-person activity recognition, which commonly use local space-time features to represent activities, my approach is novel in three aspects.

First, my approach is context-aware and automatically discovers the scene attributes that characterize the context. Egocentric activities tend to co-occur with certain types of scene context, e.g., cooking in the kitchen or driving in the car. To model the scene context, I propose a novel latent topic model, Supervised Block Latent Dirichlet Allocation (sBlock-LDA), to discover the semantic attributes of the scene context; the standard LDA model is a special case of sBlock-LDA with the correlation between different latent topics set to zero. To ensure that a scene is only a sparse mixture of latent topics, a Gini-impurity-based regularizer controls the freedom of visual words to take on different latent topics, and the model extends easily to account for the global spatial layout of the latent topics by treating latent topic positions as hidden variables.

Second, my approach is object-centric and robust to object appearance variations. Since egocentric activities heavily involve manipulating objects, object features are another important source of information for recognizing egocentric activities. To effectively exploit the varied object appearance in a video, I take a set-based recognition approach and represent the target object using the set of frames contained in the video. A novel kernel function, the Sparse Affine Hull kernel, measures the similarity of two sets by the minimum distance between their sparse affine hulls, and it also allows convenient integration of heterogeneous data modalities beyond RGB and depth.

Third, my approach is state-specific and automatically learns the importance of each state. An egocentric activity by its nature involves a series of maneuvers that result in changes to the object, so effectively encoding the state transitions in terms of hand maneuvers and object changes is key to successful recognition. While existing algorithms commonly train action classifiers on manually defined states, I present a novel model that automatically mines discriminative states for recognizing egocentric actions, drawing on the Sparse Affine Hull kernel in a Multiple Kernel Learning framework to learn adaptive weights for different states.

Finally, I also propose a novel algorithm for segmenting long-scale activities into short, atomic sub-activities. Hidden Markov Models (HMMs) have been the state-of-the-art technique for modeling human activities despite their unrealistic first-order Markov assumptions and the limited representational capacity of their hidden states. I propose two enhancements that significantly improve HMM-based activity segmentation and recognition: (1) Deep Neural Nets (DNNs) are used to model the observations in each state, motivated by the recent success of deep architectures in learning complex statistical correlations from high-dimensional data; and (2) state-duration variables are incorporated to explicitly address the temporal span of each state, improving contextual compatibility and eliminating incoherent activity segments. In summary, I have developed a series of algorithms aimed at automatic interpretation of egocentric activity videos, demonstrating that depth data benefits egocentric activity recognition in terms of target localization and feature representation, and that the proposed algorithms are significantly more robust than traditional algorithms when applied to the egocentric domain.
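
The Sparse Affine Hull kernel compares two frame sets by the minimum distance between their sparse affine hulls. Dropping the sparsity term for brevity, the distance between plain affine hulls has a closed-form least-squares solution, sketched below; this is a simplification of the thesis's kernel, and the RBF wrapper and gamma value are assumptions.

    import numpy as np

    def _affine_basis(n):
        # Orthonormal basis of the subspace {u : sum(u) = 0} in R^n.
        _, _, vt = np.linalg.svd(np.ones((1, n)))
        return vt[1:].T                      # n x (n-1)

    def affine_hull_distance(X, Y):
        # Minimum distance between the affine hulls of two frame sets.
        # X: (d, n) and Y: (d, m), columns are per-frame feature vectors.
        # Points in each hull are X @ a with sum(a) = 1, so we optimize
        # over the sum-zero deviations from the set means.
        n, m = X.shape[1], Y.shape[1]
        c = X.mean(axis=1) - Y.mean(axis=1)  # difference of hull centers
        M = np.hstack([X @ _affine_basis(n), -Y @ _affine_basis(m)])
        w, *_ = np.linalg.lstsq(M, -c, rcond=None)
        return float(np.linalg.norm(c + M @ w))

    def hull_kernel(X, Y, gamma=0.1):
        # RBF-style set kernel from the hull distance (gamma illustrative).
        return float(np.exp(-gamma * affine_hull_distance(X, Y) ** 2))

The sparse version constrains the hull coefficients with an L1 penalty, which makes the comparison robust to outlier frames; an iterative solver replaces the single least-squares step in that case.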