Browsing by Subject "Big Data"
Now showing 1 - 3 of 3
Item: Combining Strategies for Parallel Stochastic Approximation Monte Carlo Algorithm of Big Data (2014-10-15)
Lin, Fang-Yu

Modeling and mining massive volumes of data have become popular in recent decades. However, such data are difficult to analyze on a single commodity computer because of their size, so parallel computing is widely used. The divide-and-combine (D&C) method is a natural methodology for parallel computing; its general form runs an MCMC algorithm on each divided data set. However, MCMC algorithms are computationally expensive, requiring a large number of iterations, and are prone to getting trapped in local optima. The Stochastic Approximation Monte Carlo (SAMC) algorithm, by contrast, is well developed in both theory and applications: it can avoid getting trapped in local optima and produces more accurate estimates than conventional MCMC does. Motivated by the success of SAMC, we propose a parallel SAMC algorithm that can be applied to massive data and is workable in parallel computing; it can also be applied to model selection and optimization problems. The main challenge for the parallel SAMC algorithm is how to combine the results from the parallel subsets. In this work, three strategies for overcoming the combining difficulties are proposed. Simulation results show that these strategies yield significant time savings and accurate estimation. Synthetic Aperture Radar Interferometry (InSAR) is a technique for analyzing deformation caused by geophysical processes, but it is limited by signal losses arising from topographic residuals. To analyze the surface deformation, these signal losses must be distinguished from the signal. Many methods assume the noise has a second-order stationary structure without testing that assumption.
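The divide-and-combine workflow the abstract describes can be sketched as follows. This is a minimal illustration, not the thesis's method: the sample mean stands in for the per-subset SAMC sampler, simple averaging stands in for the proposed combining strategies, and all names are hypothetical.

```python
import numpy as np

def divide_and_combine(data, n_subsets, estimate, combine=np.mean):
    # Split the data into subsets, estimate on each subset independently
    # (this loop is trivially parallelizable), then combine the results.
    # `estimate` is a stand-in for the per-subset sampler (e.g. SAMC/MCMC).
    subsets = np.array_split(data, n_subsets)
    sub_estimates = [estimate(s) for s in subsets]
    return combine(sub_estimates)

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)
# With the sample mean as the per-subset estimator and equal-sized
# subsets, averaging the subset estimates recovers the full-data mean.
est = divide_and_combine(data, n_subsets=10, estimate=np.mean)
```

The point of the skeleton is that only the cheap `combine` step touches all subsets at once; everything expensive happens on the small pieces.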
The objective of this study is to examine the second-order stationarity assumption for InSAR noise and to develop a parametric nonstationary model in order to demonstrate the effect of an incorrect assumption about the random field. The results indicate that a wrong stationarity assumption leads to biased estimation and large variation.

Item: Crunch the market: a Big Data approach to trading system optimization (2013-12)
Mauldin, Timothy Allan; Aziz, Adnan

Because of the volume of data involved, running software to analyze and tune intraday trading strategies can take large amounts of analysts' time, yet analysts would like to evaluate strategies and optimize strategy parameters very quickly, ideally in the blink of an eye. Fortunately, Big Data technologies are evolving rapidly and can be leveraged for these purposes; they include software systems for distributed computing, parallel hardware, and on-demand computing resources in the cloud. This report presents a distributed software system for trading strategy analysis. It also demonstrates the effectiveness of machine learning techniques in reducing the parameter optimization workload. Results from tests run on two different commercial cloud service providers show linear scalability when analyzing intraday trading strategies.

Item: Variable Selection for Ultra High Dimensional Data (2014-05-29)
Song, Qifan

Variable selection plays an important role in high dimensional data analysis. In this work, we first propose a Bayesian variable selection approach for ultra-high dimensional linear regression based on a split-and-merge strategy. The proposed approach consists of two stages: (i) split the ultra-high dimensional data set into a number of lower dimensional subsets and select relevant variables from each subset, and (ii) aggregate the variables selected from each subset and then select relevant variables from the aggregated data set.
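The two-stage split-and-merge procedure can be sketched roughly as below. This is an illustration of the structure only: marginal-correlation screening stands in for the Bayesian selection performed at each stage, and every function name and parameter here is hypothetical, not from the thesis.

```python
import numpy as np

def split_and_merge_select(X, y, n_splits=4, k_per_split=5, k_final=5):
    # Rank a set of candidate columns by absolute marginal correlation
    # with y and keep the top k (stand-in for a real selection method).
    def top_k(cols, k):
        scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in cols]
        order = np.argsort(scores)[::-1][:k]
        return [cols[i] for i in order]

    # Stage (i): split the predictors and select within each split
    # (the splits can be processed in parallel).
    splits = np.array_split(np.arange(X.shape[1]), n_splits)
    survivors = [j for s in splits for j in top_k(list(s), k_per_split)]
    # Stage (ii): select again from the aggregated survivors.
    return sorted(top_k(survivors, k_final))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))
beta = np.zeros(100)
beta[[3, 40, 77]] = 3.0                      # the true variables
y = X @ beta + rng.normal(size=200)
selected = split_and_merge_select(X, y, n_splits=4, k_per_split=3, k_final=3)
```

Each split is low dimensional, so the within-split selection is cheap, and only the short survivor list is examined jointly in stage (ii).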
Since the proposed approach has an embarrassingly parallel structure, it can easily be implemented on a parallel architecture and applied to big data problems with millions of explanatory variables or more. Under mild conditions, we show that the proposed approach is consistent: asymptotically, the true explanatory variables will be correctly identified as the sample size becomes large. Extensive comparisons have been made with penalized likelihood approaches such as Lasso, elastic net, SIS, and ISIS. The numerical results show that the proposed approach generally outperforms the penalized likelihood approaches; the models it selects tend to be sparser and closer to the true model. In the frequentist realm, penalized likelihood methods have been widely used for variable selection, with penalty functions that are typically symmetric about 0, continuous, and nondecreasing on (0, ∞). The second contribution of this work is a new penalized likelihood method, the reciprocal Lasso (rLasso for short), based on a new class of penalty functions which are decreasing on (0, ∞), discontinuous at 0, and diverge to infinity as the coefficients approach zero. The new penalty functions give nearly-zero coefficients infinite penalties; in contrast, conventional penalty functions give nearly-zero coefficients nearly-zero penalties (e.g., Lasso and SCAD) or constant penalties (e.g., the L0 penalty). This distinguishing feature makes rLasso very attractive for variable selection: it can effectively avoid selecting overly dense models. We establish the consistency of rLasso for variable selection and coefficient estimation in both the low and high dimensional settings. Since the rLasso penalty functions induce an objective function with multiple local minima, we also propose an efficient Monte Carlo optimization algorithm to solve the minimization problem.
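The penalty shape described above can be made concrete with a small sketch. The reciprocal form lam / |beta| used here is an assumption chosen to match the stated properties (zero at exactly 0, decreasing on (0, ∞), diverging as the coefficient approaches 0), not necessarily the thesis's exact definition; the Lasso penalty is shown for contrast.

```python
def rlasso_penalty(beta, lam=1.0):
    # Reciprocal-style penalty (illustrative form): exactly-zero
    # coefficients cost nothing, while nearly-zero coefficients are
    # penalized enormously, discouraging overly dense models.
    return 0.0 if beta == 0 else lam / abs(beta)

def lasso_penalty(beta, lam=1.0):
    # Conventional Lasso penalty for contrast: nearly-zero
    # coefficients receive a nearly-zero penalty.
    return lam * abs(beta)

# A coefficient of 0.01 is heavily penalized by the reciprocal form
# but almost free under Lasso.
heavy = rlasso_penalty(0.01)
light = lasso_penalty(0.01)
```

This inversion of the usual penalty shape is what pushes small coefficients all the way to zero rather than merely shrinking them.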
Our simulation results show that rLasso outperforms other popular penalized likelihood methods, such as Lasso, SCAD, MCP, SIS, ISIS, and EBIC: it produces sparser and more accurate coefficient estimates and has a higher probability of recovering the true model.