Real-time object detection for autonomous driving based on deep learning

Liu, Guangrui

Description

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science from Texas A&M University-Corpus Christi in Corpus Christi, Texas.
Optical vision is an essential component of autonomous cars. Accurate detection of vehicles, street buildings, pedestrians, and road signs could help self-driving cars drive as safely as humans. However, object detection has been a challenging task for decades because images of objects in real-world environments are affected by illumination, rotation, scale, and occlusion. In recent years, many Convolutional Neural Network (CNN) based classification-after-localization methods have improved detection results in various conditions. However, the slow recognition speed of these two-stage methods limits their use in real-time situations. Recently, a unified object detection model, You Only Look Once (YOLO) [20], was proposed, which directly regresses from an input image to object class scores and positions. Its single network structure processes images at 45 fps on the PASCAL VOC 2007 dataset [7] and has higher detection accuracy than other current real-time methods. However, when applied to object detection tasks in autonomous driving, this model still has limitations. It processes each image individually even though an object's position changes continuously in a driving scene, so the model ignores a lot of important information shared between consecutive frames. In this research, we applied YOLO to three different datasets to test its general applicability. We analyzed its performance from various aspects on the KITTI dataset [10], which is specialized for autonomous driving. We proposed a novel technique called memory map, which considers inter-frame information, to strengthen YOLO's detection ability in driving scenes. We broadened the model's scope of applicability by applying it to a new orientation estimation task. KITTI is our main dataset. Additionally, the ImageNet [5] dataset was used for pre-training, and three other datasets, PASCAL VOC 2007/2012 [7], Road Sign [2], and the Face Detection Dataset and Benchmark (FDDB) [15], were used for other class domains.
Computing Sciences
College of Science and Engineering