Browsing by Subject "Big data"
Now showing 1 - 9 of 9
Item: A framework for processing connected vehicle data in transportation planning applications (2016-12)
Deering, Amanda Marie; Bhat, Chandra R. (Chandrasekhar R.), 1964-
This thesis presents a framework to process connected vehicle data into a format that is practical for implementation in the transportation planning field. Whereas prior research on connected vehicles has used theoretical models or small data samples for analysis, this study uses the largest public connected vehicle dataset currently available: the Sample Data Environment from the Safety Pilot Model Deployment project in Ann Arbor, Michigan. This dataset includes basic safety messages and driving data for 2,800 vehicles over two months. An algorithm to process basic safety message data into a trip-level dataset is presented. This thesis also includes a process for spatial aggregation of trips into origin and destination zones using a hexagonal grid. These processes are implemented through a combination of open-source tools, including Hadoop and PostgreSQL. Excerpts from the processed data are provided, as well as sample analysis applications for the trip and spatial data. Recommendations and guidance are provided on handling the details of such an immense dataset. Since similar vehicle-to-vehicle communications datasets are likely in the future, it is imperative to develop methods to process and analyze this rich data effectively.

Item: Analyzing databases using data analytics (2015-12)
Lee, Boum Hee; Lake, Larry W.; Mohanty, Kishore K.
There are many public and private databases of oil field properties whose analysis could lead to insights in several areas. The recent trend of Big Data has given rise to novel analytic methods to effectively handle multidimensional data and to visualize them to discover new patterns. The main objective of this research is to apply some of the methods used in data analytics to datasets with reservoir data.
Using a commercial reservoir properties database, we created and tested three data analytic models to predict ultimate oil and gas recovery efficiencies, using the following methods borrowed from data analytics: linear regression, linear regression with feature selection, and Bayesian networks. We also adopted similarity ranking with principal component analysis to create a reservoir analog recommender system, which recognizes and ranks reservoir analogs from the database.

Among the models designed to estimate recovery factors, the linear regression model created with variables selected by the sequential feature selection method performed best, showing strong positive correlations between actual and predicted values of reservoir recovery efficiencies. Compared to this model, the Bayesian network model and the simple linear regression model performed poorly.

For the reservoir analog recommender system, an arbitrary reservoir is selected and different distance metrics are used to rank analog reservoirs. Because no single distance metric (and hence no single reservoir analog list) is superior to the others, the recommended lists are compared along with the characteristics of the distance metrics.

Item: Contextualizing privacy concerns within mobile engagement: a comparative investigation of escalating risk among general, e-commerce and health-related use (2016-05)
Doorey, Alexandra Michelle; Eastin, Matthew S.; Wilcox, Gary B.
New marketing paradigms constructed around capabilities for data collection, dissemination and analysis offer conveniences and benefits to consumers but also pose actual and perceived threats to privacy. As advertisers increasingly rely on individuals' personal data captured from mobile devices, consumer perceptions and acceptance of advertising personalization practices become of critical importance and interest to communications scholars, practitioners and policymakers.
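The reservoir analog recommender described in the reservoir analytics item above, which ranks records by distance in principal-component space, can be sketched roughly as follows. This is a generic illustration with made-up feature data and a hypothetical helper name, not the study's actual model:

```python
import numpy as np

def rank_analogs(features, query_idx, n_comp=3):
    """Rank records by Euclidean distance to a query record in PCA
    score space: standardize, project onto top components, measure."""
    X = (features - features.mean(axis=0)) / features.std(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:n_comp].T          # project onto top n_comp components
    d = np.linalg.norm(scores - scores[query_idx], axis=1)
    return np.argsort(d)                # the query itself ranks first (d = 0)
```

Swapping the Euclidean norm for another metric in score space yields the alternative rankings the abstract compares.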
New data streams, especially those generated by mobile devices, wearable technologies, and mobile health and fitness tracking applications, offer unparalleled opportunities for behavioral targeting and personalization services. At the same time, sensitive personal information can support companies' overall market goals by attracting and retaining consumers who view mobile advertising as value rather than annoyance or threat. Comparing consumer adoption of and attitudes toward mobile advertising with the practical application of different models in their respective markets offers valuable insight into which practices are favored and welcomed by consumers and most likely to promote the growth of the mobile advertising industry. Utilizing the theoretical foundations of communication privacy management (CPM), the current study investigates dimensions of consumers' information privacy concerns (collection, control, awareness, unauthorized secondary use, improper access, and location tracking) to predict user engagement in generalized mobile activities, mobile commerce activities, and mobile health and fitness tracking activities. Data from this study indicate that privacy concerns are significant predictors of mobile engagement in contexts where information is perceived to be more sensitive to users. Moreover, this research suggests that across mobile activities, the privacy dimensions of unauthorized access and location tracking most significantly influence use.

Item: CPU performance in the age of big data: a case study with Hive (2016-12)
Shulyak, Alexander Cole; John, Lizy Kurian
Distributed SQL Query Engines (DSQEs), like Hive, Shark, and Impala, have become the de facto database setup for decision support systems with large database sizes. Unlike their single-threaded counterparts, such as MySQL, DSQEs experience inefficiencies related to the algorithm, code base, OS, and CPU microarchitecture that limit throughput despite the speedup from distributed execution.
In my thesis, I present a detailed performance analysis of a DSQE called Hive, comparing it to MySQL, a single-threaded database application. Hive has difficulty converting queries into a set of MapReduce jobs for distributed execution. Hive also experiences a startup phase that is a significant overhead for short-running queries. Additionally, both Hive and MySQL, like other server applications, experience high L1I miss rates due to a large code footprint. However, because MySQL is algorithmically efficient and traverses the database at a faster rate, it incurs a larger back-end bottleneck from LLC misses, which hides the front-end bottleneck. In contrast, Hive does not hide its high L1I cache miss rate with back-end stalls. Additionally, the higher context switch rates experienced by multi-process Hive setups thrash the first-level caches, further inflating the L1I cache miss rate. To address this microarchitectural inefficiency, I propose an instruction prefetch mechanism called Runahead Prefetch. It is similar to previously proposed branch prediction based prefetchers [19], but designed to easily extend modern Intel microarchitectures. Despite newer instruction prefetch mechanisms that discount the potential of branch prediction based prefetching [8] [9] [12], I show Runahead Prefetch can eliminate 92% of L1I misses and 96% of icache stalls on average, given modern branch misprediction rates and sufficient runahead.

Item: Nonparametric Inference for High Dimensional Data (2013-04-23)
Mukhopadhyay, Subhadeep
Learning from data, especially "Big Data", is becoming increasingly popular under names such as Data Mining, Data Science, Machine Learning, Statistical Learning and High Dimensional Data Analysis. In this dissertation we propose a new related field, which we call "United Nonparametric Data Science": applied statistics with "just in time" theory.
It integrates the practice of traditional and novel statistical methods for nonparametric exploratory data modeling, and it is applicable to teaching introductory statistics courses that are closer to the modern frontiers of scientific research. Our framework includes small data analysis (combining traditional and modern nonparametric statistical inference) and big and high dimensional data analysis (by statistical modeling methods that extend our unified framework for small data analysis). The first part of the dissertation (Chapters 2 and 3) is oriented toward the goal of developing a new theoretical foundation to unify many cultures of statistical science and statistical learning methods using the mid-distribution function, custom-made orthonormal score functions, comparison density, copula density, and LP moments and comoments. We also examine how this elegant theory yields solutions to many important applied problems. In the second part (Chapter 4) we extend the traditional empirical likelihood (EL), a versatile tool for nonparametric inference, to the high dimensional context. We introduce a modified version of the EL method that is computationally simpler and applicable to a large class of "large p small n" problems, allowing p to grow faster than n. This is an important step in generalizing the EL in high dimensions beyond the p ≥ n threshold where the standard EL and its existing variants fail. We also present a detailed theoretical study of the proposed method.

Item: OATS, CAT, and CARDS: financial regulation in the era of big data (2015-05)
Moore, Peter Austin; Flamm, Kenneth, 1951-; Von Hippel, Paul
The explosion of data in the financial industry has led regulators to seek better ways to utilize big data analytics. This paper analyzes the inception and development of three major regulatory programs born of market failures. These programs represent the promise of big data, but have had to withstand criticisms of their cost, effectiveness, and necessity.
The focus is on the twin goals of these programs, reconstructing the market and detecting market abuse, and on how the promises have been met and the criticisms answered.

Item: ProGENitor: an application to guide your career (2014-12)
Hauptli, Erich Jurg; Aziz, Adnan
This report introduces ProGENitor, a system to empower individuals with career advice based on vast amounts of data. Specifically, it develops a machine learning algorithm that shows users how to efficiently reach specific career goals based upon the histories of other users. A reference implementation of this algorithm is presented, along with experimental results showing that it provides quality, actionable intelligence to users.

Item: Robust network compressive sensing (2015-12)
Chen, Yi-Chao, Ph. D.; Qiu, Lili, Ph. D.; Lam, Simon; Lee, Sung-Ju; Mok, Aloysius; Ravikumar, Pradeep
Networks are constantly generating an enormous amount of rich and diverse information. Such information creates exciting opportunities for network analytics and provides deep insights into the complex interactions among network entities. However, network analytics often faces the problem of (i) under-constraint, where there is too little data due to the feasibility or cost of collecting it; or (ii) over-constraint, where there is so much data that the analytics become unscalable. Compressive sensing is an effective technique for solving both problems. It leverages the underlying data structure for analysis. To address the under-constraint problem, we can apply compressive sensing to reconstruct missing elements or predict future data. To address the over-constraint problem, we can apply compressive sensing to identify important factors. Compressive sensing has many applications. In this thesis, we apply compressive sensing to missing data interpolation, anomaly detection, data segmentation, and activity recognition, and show their benefits.
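The under-constraint case above, reconstructing missing elements by exploiting low-rank structure, can be sketched as an iterative truncated-SVD imputation. This is a textbook-style sketch on synthetic data with a hypothetical helper name, not the thesis's actual algorithm:

```python
import numpy as np

def complete_low_rank(M, mask, rank, n_iters=300):
    """Fill missing entries of M (where mask is False) by repeatedly
    replacing them with a rank-`rank` SVD estimate, keeping the
    observed entries fixed on every iteration."""
    X = np.where(mask, M, 0.0)                       # start with zeros in the gaps
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # best rank-r approximation
        X = np.where(mask, M, low_rank)              # re-impose observed data
    return X
```

On a genuinely low-rank matrix with enough random observations, the imputed entries converge toward the true missing values; when the low-rank assumption is violated, as the abstract discusses, this kind of reconstruction degrades.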
To demonstrate the feasibility of compressive sensing in network analytics, we first apply it to detect anomalies in a customer care call dataset. The dataset was collected by a tier-1 ISP in the US and includes calls labeled with categories representing customers' problems. Customer care calls reveal the major events and problems observed by customers. We use a regression-based approach to find the relationship between calls and events. We show that compressive sensing is effective in identifying important factors and can leverage the low-rank structure and temporal stability of the data to improve detection accuracy. While applying compressive sensing to real-world data, we identified several challenges. One challenge is that real-world data are complicated and heterogeneous, and often violate the low-rank assumption required by existing compressive sensing techniques. Such violation significantly reduces the applicability and effectiveness of existing compressive sensing approaches. It is important to understand the reasons behind the violation in order to design methods that mitigate its impact. We therefore analyze a wide range of real-world traces, and our analysis reveals that several factors contribute to the violation of the low-rank property in real data. In particular, we find that (i) noise, errors, and anomalies, and (ii) the lack of synchronization in the time and frequency domains lead to network-induced blurring and can easily turn a low-rank matrix into one of much higher rank. To address the problem of noise, errors, and anomalies, we present a robust compressive sensing technique. It explicitly accounts for anomalies by decomposing real-world data, represented in the form of a matrix, into a low-rank matrix, a sparse anomaly matrix, an error term, and a small noise matrix. To address the lack of synchronization, we present a data-driven synchronization algorithm.
It removes misalignment while accounting for the time- and frequency-domain heterogeneity of real-world data. The data-driven synchronization can be applied to any compressive sensing technique and is general enough for any real-world trace. We show that the combination of the two techniques can reduce the ranks of real-world data, improve the effectiveness of compressive sensing, and support a wide range of applications.

Item: Visualization of multivariate process data for fault detection and diagnosis (2014-05)
Wang, Ray Chen; Baldea, Michael; Edgar, Thomas F.
This report introduces the concept of three-dimensional (3D) radial plots for the visualization of large-scale multivariate datasets in plant operations. A key concept of this representation is the introduction of time as the third dimension in a two-dimensional radial plot, which allows for the display of time series data for any number of process variables. This report shows the ability of 3D radial plots to conduct systemic fault detection and classification in chemical processes through the use of confidence ellipses, which capture the desired operating region of process variables during a defined period of steady-state operation. Principal component analysis (PCA) is incorporated into the method to reduce multivariate interactions and the dimensionality of the data. The method is applied to two case studies with systemic faults present (compressor surge and column flooding), as well as to data obtained from the Tennessee Eastman simulator, which contains localized faults. Fault classification using the interior angles of the radial plots is also demonstrated.
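The fault-detection recipe in the last item, PCA scores checked against a confidence ellipse fitted on steady-state data, can be sketched as follows. This is a generic two-component illustration with synthetic data and assumed function names, not the report's implementation:

```python
import numpy as np

# 95% quantile of the chi-square distribution with 2 degrees of freedom,
# hard-coded to avoid a scipy dependency.
CHI2_95_2DOF = 5.991

def fit_pca_ellipse(train, n_comp=2):
    """Fit PCA on steady-state training data; return the mean, the
    projection basis, and per-component score variances that define
    the 95% confidence ellipse in score space."""
    mean = train.mean(axis=0)
    Xc = train - mean
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:n_comp].T                        # columns = principal directions
    var = (s[:n_comp] ** 2) / (len(train) - 1)   # score variance per component
    return mean, basis, var

def flag_faults(X, mean, basis, var):
    """Flag samples whose PCA scores fall outside the 95% ellipse."""
    scores = (X - mean) @ basis
    d2 = np.sum(scores ** 2 / var, axis=1)       # Mahalanobis distance in score space
    return d2 > CHI2_95_2DOF
```

Fitting the ellipse only on a defined steady-state window, as the report describes, is what makes excursions such as compressor surge show up as points outside it.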