Browsing by Subject "Hadoop"
Now showing 1 - 5 of 5
Item: A framework for processing connected vehicle data in transportation planning applications (2016-12)
Deering, Amanda Marie; Bhat, Chandra R. (Chandrasekhar R.), 1964-
This thesis presents a framework for processing connected vehicle data into a format that is practical for use in the transportation planning field. Whereas prior research on connected vehicles has relied on theoretical models or small data samples, this study uses the largest public connected vehicle dataset currently available – the Sample Data Environment from the Safety Pilot Model Deployment project in Ann Arbor, Michigan. The data include basic safety messages and driving data for 2,800 vehicles over two months. An algorithm for processing basic safety message data into a trip-level dataset is presented (a rough sketch of this step appears after the next item), along with a process for spatially aggregating trips into origin and destination zones using a hexagonal grid. These processes are implemented through a combination of open-source tools, including Hadoop and PostgreSQL. Excerpts from the processed data are provided, as are sample analysis applications for the trip and spatial data. Recommendations and guidance are offered on handling the details of such an immense dataset. Since similar vehicle-to-vehicle communications datasets are likely in the future, it is imperative to develop methods to process and analyze this rich data effectively.

Item: Design and implementation of scalable hierarchical density based clustering (2010-05)
Dhandapani, Sankari; Ghosh, Joydeep; Gupta, Gunjan
Clustering is a useful technique that divides data points into groups, known as clusters, such that data points in the same cluster exhibit similar properties. Typical clustering algorithms assign each data point to at least one cluster. In practical datasets such as microarray gene data, however, only a subset of the genes is highly correlated, and the dataset is often polluted with a huge volume of irrelevant genes. In such cases, it is important to ignore the poorly correlated genes and cluster only the highly correlated ones. Automated Hierarchical Density Shaving (Auto-HDS) is a non-parametric, density-based technique that partitions only the relevant subset of the dataset into multiple clusters while pruning the rest. Auto-HDS performs a hierarchical clustering that identifies dense clusters of different densities and finds a compact hierarchy of the clusters identified. Key features of Auto-HDS include selection and ranking of clusters using a custom stability criterion, and a topologically meaningful 2D projection and visualization of the clusters discovered in the higher-dimensional original space. A key limitation of Auto-HDS, however, is that it requires O(n^2) storage and O(n^2 log n) computation, so it scales to only a few tens of thousands of points. In this thesis, two extensions to Auto-HDS are presented for lower-dimensional datasets that generate clustering identical to Auto-HDS but scale to much larger datasets. We first introduce Partitioned Auto-HDS, which provides a significant reduction in time and space complexity and makes it possible to generate the Auto-HDS cluster hierarchy on much larger datasets with hundreds of millions of data points. Then, we describe Parallel Auto-HDS, which takes advantage of the inherent parallelism in Partitioned Auto-HDS to scale to even larger datasets without a corresponding increase in actual run time when a group of processors is available for parallel execution. Partitioned Auto-HDS is implemented on top of GeneDIVER, a pre-existing Java-based streaming implementation of Auto-HDS, and thus retains all the key features of Auto-HDS, including ranking, automatic selection of clusters, and 2D visualization of the discovered cluster topology.
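
As promised above, the following is a minimal, hypothetical sketch of the kind of Hadoop MapReduce job the first item describes: grouping basic safety messages (BSMs) by vehicle and cutting them into trips at large time gaps. The CSV layout ("vehicleId,epochSeconds,lat,lon"), the field order, and the 300-second gap threshold are illustrative assumptions, not the thesis's actual schema or parameters.

    // Sketch: segmenting basic safety messages (BSMs) into trips with Hadoop MapReduce.
    // Assumes a simplified CSV input "vehicleId,epochSeconds,lat,lon".
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TripSegmentation {

        // Map: key each BSM record by vehicle id.
        public static class BsmMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = value.toString().split(",");
                ctx.write(new Text(f[0]), new Text(f[1] + "," + f[2] + "," + f[3]));
            }
        }

        // Reduce: sort one vehicle's messages by time and start a new trip
        // wherever consecutive messages are more than GAP_SECONDS apart.
        public static class TripReducer extends Reducer<Text, Text, Text, Text> {
            private static final long GAP_SECONDS = 300; // hypothetical threshold

            @Override
            protected void reduce(Text vehicle, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                List<String[]> msgs = new ArrayList<>();
                for (Text v : values) msgs.add(v.toString().split(","));
                msgs.sort((a, b) -> Long.compare(Long.parseLong(a[0]), Long.parseLong(b[0])));

                int trip = 0;
                long prev = Long.MIN_VALUE;
                for (String[] m : msgs) {
                    long t = Long.parseLong(m[0]);
                    if (prev != Long.MIN_VALUE && t - prev > GAP_SECONDS) trip++;
                    ctx.write(new Text(vehicle + "-trip" + trip),
                              new Text(m[0] + "," + m[1] + "," + m[2]));
                    prev = t;
                }
            }
        }
    }

In a pipeline like the one the abstract outlines, the per-trip output would then be loaded into PostgreSQL for the hexagonal-grid origin-destination aggregation step.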
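
For the Auto-HDS item above, here is a minimal sketch of the density-shaving step at the heart of HDS-style clustering: rank points by the radius needed to enclose their n_eps nearest neighbors, then shave off the least dense fraction at each level. The names, parameters, and brute-force O(n^2) distance computation are illustrative only; Auto-HDS itself builds a compact hierarchy over repeated shavings and ranks the resulting clusters, and the partitioned and parallel variants are what let it scale.

    // Sketch of a density-shaving step: points with a large k-NN radius are
    // low-density and get shaved; the surviving dense points form clusters.
    import java.util.Arrays;

    public class DensityShaving {

        // Radius of the ball around each point that encloses nEps neighbors.
        static double[] kNnRadii(double[][] pts, int nEps) {
            int n = pts.length;
            double[] radii = new double[n];
            for (int i = 0; i < n; i++) {
                double[] d = new double[n];
                for (int j = 0; j < n; j++) {
                    double s = 0;
                    for (int k = 0; k < pts[i].length; k++) {
                        double diff = pts[i][k] - pts[j][k];
                        s += diff * diff;
                    }
                    d[j] = Math.sqrt(s);
                }
                Arrays.sort(d);     // d[0] == 0 (the point itself)
                radii[i] = d[nEps]; // distance to the nEps-th nearest neighbor
            }
            return radii;
        }

        // Keep only points dense enough to survive this shaving level:
        // those whose radius lies within the smallest (1 - fShave) fraction.
        static boolean[] shave(double[] radii, double fShave) {
            double[] sorted = radii.clone();
            Arrays.sort(sorted);
            int keep = (int) Math.ceil((1 - fShave) * radii.length);
            double cutoff = sorted[Math.max(keep - 1, 0)];
            boolean[] dense = new boolean[radii.length];
            for (int i = 0; i < radii.length; i++) dense[i] = radii[i] <= cutoff;
            return dense;
        }
    }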
Item: Fusion-based Hadoop MapReduce job for fault tolerance in distributed systems (2013-05)
Ho, Iat-Kei; Garg, Vijay K. (Vijay Kumar), 1963-
The standard recovery mechanism for a failed task in Hadoop is to execute the task again; after a configured number of retries, the task is marked as failed. With significant amounts of data and complicated Map and Reduce functions, recovering corrupted or unfinished data from a failed job can be more efficient than re-executing the same job. This paper extends [1] by applying the fusion-based technique [7][8] to Hadoop MapReduce task execution to enhance its fault tolerance. Multiple datasets are run through Hadoop MapReduce with and without fusion in various pre-defined failure scenarios for comparison. As the complexity of the Map and Reduce functions grows relative to the Recover function, fusion becomes more efficient, and users can tolerate faults while incurring less than ten percent extra execution time. (A minimal sketch of the fusion arithmetic appears after the last item below.)

Item: Hadoop MapReduce for Mobile Cloud (2014-04-17)
George, Johnu
New generations of mobile devices have high processing power and storage, but they lag behind in software systems for big data storage and processing. Hadoop is a scalable platform that provides distributed storage and computational capabilities on clusters of commodity hardware. Building Hadoop on a mobile network enables the devices to run data-intensive computing applications without direct knowledge of the underlying distributed systems' complexities. However, these applications have severe energy and reliability constraints (e.g., caused by unexpected device failures or topology changes in a dynamic network). Because mobile devices are more susceptible to unauthorized access than traditional servers, security is also a concern for sensitive data. Hence, it is paramount to consider reliability, energy efficiency, and security for such applications. The goal of this thesis is to bring the Hadoop MapReduce framework to a mobile cloud environment in a way that resolves these bottlenecks in big data processing. The Mobile Distributed File System (MDFS) addresses these issues for big data processing in mobile clouds. We have developed the Hadoop MapReduce framework over MDFS and have evaluated its performance by varying input workloads in a real heterogeneous mobile cluster. Our evaluation shows that the implementation addresses all of these constraints in processing large amounts of data in mobile clouds. Thus, our system is a viable solution to meet the growing demands of data processing in a mobile environment.
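
Returning to the fusion-based fault-tolerance item above: the toy sketch below illustrates the basic fusion idea of maintaining a fused backup of several tasks' outputs (here an element-wise sum) so that a single failed task's output can be recovered arithmetically from the backup and the surviving outputs, rather than by re-executing the job. The actual fused backups of [7][8] use more general codes and tolerate more faults; this is only the simplest possible illustration.

    // Toy fusion sketch: one fused backup (element-wise sum) lets us recover
    // any single failed task's output without re-running Map and Reduce.
    public class FusionRecovery {

        // Fused backup: element-wise sum over all task outputs.
        static long[] fuse(long[][] taskOutputs) {
            long[] fused = new long[taskOutputs[0].length];
            for (long[] out : taskOutputs)
                for (int i = 0; i < out.length; i++) fused[i] += out[i];
            return fused;
        }

        // Recover a lost output: subtract the surviving outputs from the backup.
        static long[] recover(long[] fused, long[][] surviving) {
            long[] lost = fused.clone();
            for (long[] out : surviving)
                for (int i = 0; i < out.length; i++) lost[i] -= out[i];
            return lost;
        }

        public static void main(String[] args) {
            long[][] outputs = { {4, 7}, {1, 0}, {5, 2} }; // three tasks' outputs
            long[] fused = fuse(outputs);                  // kept as the backup
            // Task 1 fails; recover its output from the backup and tasks 0 and 2.
            long[] recovered = recover(fused, new long[][] { outputs[0], outputs[2] });
            System.out.println(java.util.Arrays.toString(recovered)); // prints [1, 0]
        }
    }

This is why the abstract's break-even argument holds: recovery costs a little arithmetic per fault, while re-execution costs a full Map and Reduce pass.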
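
Finally, for the mobile-cloud item: part of the appeal of running MapReduce over MDFS is that a standard Hadoop job driver needs essentially no changes once the underlying file system is swapped. The sketch below is an ordinary word-count driver using Hadoop's stock TokenCounterMapper and IntSumReducer; the "mdfs://" URI scheme and the idea of binding MDFS through fs.defaultFS are assumptions made for illustration, not the project's documented configuration.

    // Sketch: an unmodified Hadoop word-count driver; only the file-system
    // binding (hypothetically, MDFS instead of HDFS) changes.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    public class MobileJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical: point the default FS at MDFS instead of HDFS, so
            // blocks are stored across the mobile devices' local storage.
            conf.set("fs.defaultFS", "mdfs://cluster-head:9000");

            Job job = Job.getInstance(conf, "wordcount-on-mdfs");
            job.setJarByClass(MobileJobDriver.class);
            job.setMapperClass(TokenCounterMapper.class); // emits (word, 1)
            job.setCombinerClass(IntSumReducer.class);    // local partial sums
            job.setReducerClass(IntSumReducer.class);     // final counts
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }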