Browsing by Subject "Regression"
Item: A New Method for History Matching and Forecasting Shale Gas/Oil Reservoir Production Performance with Dual and Triple Porosity Models (2012-10-19). Samandarli, Orkhan.
Several methods have been proposed for history matching the production of shale gas/oil wells, which are drilled horizontally and usually hydraulically fractured in multiple stages: simulation, analytical models, and empirical equations. Among these, analytical models are the most favorable for application to field data for two reasons: they are faster than simulation, and they are more rigorous than empirical equations. The production behavior of horizontally drilled shale gas/oil wells had never been completely matched with models such as those described in this thesis. For shale gas wells, the correction for adsorption is explained with derived equations. The history-matching and forecasting algorithm is explained in detail, together with an implementation written in Excel VBA. The objective of this research is a robust method, delivered as a computer program and applied to field data. The method is used to analyze the production performance of gas wells from the Barnett, Woodford, and Fayetteville shales, and it is shown to work well for understanding the reservoir description and predicting the future performance of shale gas wells. Moreover, a synthetic shale oil well was used to validate the method's application to oil wells. Given the huge unconventional resource potential and increasing energy demand in the world, the method described in this thesis aims to be a "game changing" technology for understanding reservoir properties and making predictions in a short period of time.

Item: A Statistical and Taguchi Process Analysis as Applied to Cotton Fiber Properties and White Speck Occurrence (2010-12). Altintas, Pelin Z.; Beruvides, Mario G.; Simonton, James L.; Smith, Milton L.; Fedler, Clifford B.
Cotton containing immature fibers is a major concern in the dyeing and finishing of textile products. In an undyed state, entangled fiber clusters are generically classified as neps. It is only after the application of dye, when some neps remain undyed, that the more specific classification of "white speck" is used. The High Volume Instrument (HVI) fiber property measurement system is important in marketing and general quality assessment of the cotton crop; however, HVI is not precise enough to address immature fiber content. The purpose of this research was to examine the relationship of Advanced Fiber Information System (AFIS) fiber properties to white speck counts of dyed yarn, through three sequential studies. The first study examined within-bale and between-bale differences while establishing a regression model relating white speck count to AFIS fiber properties of bale and sliver cotton. Ten bales of cotton spanning a range of micronaire were sampled (10 samples per bale) and analyzed using AFIS with three replications, each counting 3,000 fibers. Each sample was then processed into yarn and dyed using the same procedure, and white specks were quantified on the dyed yarn using a white speck yarn counting method. Regression results indicated that fiber fineness, neps per gram, and immature fiber content were the influential indicators of white speck count in dyed yarn.
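A minimal sketch of that style of analysis, with hypothetical column names (the thesis's actual AFIS data are not reproduced here): ordinary least squares relating white speck count to fiber properties.

```python
# Sketch only: regress white speck count on AFIS fiber properties.
# File and column names are hypothetical stand-ins for the study's data.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("afis_samples.csv")  # hypothetical: one row per sample
X = sm.add_constant(df[["fineness", "neps_per_gram", "immature_fiber_pct"]])
y = df["white_speck_count"]

model = sm.OLS(y, X).fit()
print(model.summary())  # coefficients show which properties drive white specks
```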
However, the small sample size and possible AFIS bias in the fineness and maturity measurements mean that a larger sample is required for further investigation. The second study analyzed the relationship between AFIS fiber properties and yarn white speck count using statistical analysis. The treatments were harvest-aid chemical termination with varied harvest dates and two levels of field cleaning. Cotton from two crop years was sampled and analyzed using AFIS with three replications, each counting 3,000 fibers. Each sample was processed into yarn and dyed with the same procedure, and white speck counts were conducted for each sample using the white speck yarn methodology. The harvest date treatment influenced white speck count more than the fiber properties did, and nep count by weight was also found to be one of the predictors of white speck count. However, the prediction model was not as strong as in the first study. The third study applied the Taguchi method to the second study, investigating which harvest techniques minimize white speck count in dyed yarn through their effect on fiber properties. The signal-to-noise (S/N) ratio represented the white speck count response; the smaller-the-better formulation was chosen for this study. Among the control factors (harvest date, defoliation, and field cleaner), harvest date was found to have the significant effect on the S/N ratio of white speck count. The desirable outcome for the white speck response was early-season harvesting combined with field cleaning and defoliation: removing smaller, less mature bolls at an early harvest date with the field cleaner reduced the white speck count.

Item: Accounting for the effects of rehabilitation actions on the reliability of flexible pavements: performance modeling and optimization (2009-05-15). Deshpande, Vighnesh Prakash.
A performance model and a reliability-based optimization model for flexible pavements that account for the effects of rehabilitation actions are developed. The performance model can be implemented in any application that requires pavement reliability (performance) before and after rehabilitation actions. Response surface methodology, in conjunction with Monte Carlo simulation, is used to evaluate pavement fragilities. To provide more flexibility, a parametric regression model that expresses fragilities in terms of decision variables is developed. The fragilities are then used as performance measures in a reliability-based optimization model. Three decision policies for rehabilitation actions are formulated and evaluated using a genetic algorithm, and a multi-objective genetic algorithm is used to obtain the optimal trade-off between performance and cost. A numerical study illustrates the developed model. The performance model describes the behavior of flexible pavement well, both before and after rehabilitation actions. The sensitivity measures suggest that the reliability of flexible pavements before and after rehabilitation actions can be improved effectively by providing as thick an asphalt layer as possible in the initial design and by improving the subgrade stiffness. The importance measures suggest that the asphalt layer modulus at the time of rehabilitation represents the principal uncertainty for performance after rehabilitation actions.
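To make the fragility machinery concrete: a response surface (here an assumed quadratic in two inputs) stands in for the expensive pavement-response computation, and Monte Carlo sampling of the random inputs estimates the probability that a response exceeds a limit. All coefficients, distributions, and the limit below are invented for illustration, not the dissertation's calibrated model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed quadratic response surface for rut depth (coefficients invented);
# in practice such a surface is fit to runs of the pavement-response model.
def rut_depth(h_ac, e_sg):
    return 20.0 - 0.8 * h_ac - 0.05 * e_sg + 0.01 * h_ac**2

# Random inputs: asphalt thickness (cm) and subgrade modulus (MPa).
h_ac = rng.normal(15.0, 1.5, size=100_000)
e_sg = rng.lognormal(np.log(80.0), 0.2, size=100_000)

limit = 8.0  # assumed serviceability limit (mm)
fragility = np.mean(rut_depth(h_ac, e_sg) > limit)
print(f"P(rut depth > {limit} mm) = {fragility:.3f}")
```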
Statistical validation of the developed response model shows that response surface methodology can be used efficiently to describe pavement responses. The results for the parametric regression model indicate that the developed regression models are able to express the fragilities in terms of the decision variables. The numerical illustration of the optimization shows that the cost-minimization and reliability-maximization formulations can be used efficiently to determine optimal rehabilitation policies, and that the Pareto optimal solutions obtained from the multi-objective genetic algorithm can be used to trade off cost against performance and avoid possible conflicts between the two decision policies.

Item: Bayesian Methods in Nutrition Epidemiology and Regression-based Predictive Models in Healthcare (2012-02-14). Zhang, Saijuan.
This dissertation has two main parts. In the first part, we propose a bivariate nonlinear measurement error model to understand the distribution of dietary intake, and extend it to a multivariate model to capture dietary patterns in nutrition epidemiology. In the second part, we propose regression-based predictive models to accurately predict surgery duration in healthcare. Understanding the distribution of episodically consumed dietary components is an important problem in public health. Short-term measurements of episodically consumed dietary components have zero-inflated, skewed distributions, and so-called two-part models have been developed for such data. However, there is much greater public health interest in the usual intake adjusted for caloric intake. Recently a nonlinear mixed effects model has been developed and fit by maximum likelihood using nonlinear mixed effects programs, but the fitting is slow and unstable. We develop a Monte-Carlo-based fitting method in Chapter II and demonstrate numerically that our methods lead to increased speed of computation, converge to reasonable solutions, and have the flexibility to be used in either a frequentist or a Bayesian manner. Diet consists of numerous foods, nutrients, and other components, each of which has distinctive attributes, and nutritionists are increasingly interested in exploring them collectively to capture overall dietary patterns. In Chapter III we therefore extend the bivariate model to the multivariate level. We use survey-weighted MCMC computations to fit the model, with uncertainty estimation coming from balanced repeated replication. The methodology is illustrated through an application estimating the population distribution of the Healthy Eating Index-2005 (HEI-2005), a multi-component dietary quality index, among children aged 2-8 in the United States. The second part of this dissertation concerns accurate prediction of surgery duration. Prior research has identified current procedural terminology (CPT) codes as the most important factor in predicting surgical case durations, but there has been little reporting of a general predictive methodology that uses them effectively. In Chapter IV, we propose two regression-based predictive models. However, the naively constructed design matrix is singular, so we devise a systematic procedure to construct a full-rank design matrix.
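The singularity arises whenever dummy-coded categories such as CPT codes overlap or nest. One standard remedy, which may or may not mirror the dissertation's own procedure, is rank-revealing QR with column pivoting to keep only a linearly independent subset of columns:

```python
import numpy as np
from scipy.linalg import qr

def full_rank_columns(X, tol=1e-10):
    """Return indices of a maximal linearly independent subset of columns."""
    # Rank-revealing QR: pivoting orders columns by decreasing contribution.
    _, R, piv = qr(X, mode="economic", pivoting=True)
    diag = np.abs(np.diag(R))
    rank = int(np.sum(diag > tol * diag[0]))
    return np.sort(piv[:rank])

# Example: the third column duplicates the first, so one of them is dropped.
X = np.column_stack([np.ones(5), np.arange(5.0), np.ones(5)])
print(full_rank_columns(X))  # e.g. [0 1]
```

Keeping the pruned column set fixed between training and prediction preserves interpretable coefficients.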
Using surgical data from a central Texas hospital, we compare the proposed models with several benchmark methods and demonstrate that our models lead to a remarkable reduction in prediction errors.

Item: Econometric analysis of the impact of market concentration on prices in the offshore drilling rig market (2010-12). Onwuka, Amanda Chiderah; Jablonowski, Christopher J.; Groat, Charles G.
This thesis presents an econometric methodology for analyzing the impact of market concentration, measured by the Herfindahl-Hirschman Index (HHI), on the day rates paid by E&P operators for the lease of drilling rigs. It extends the work of Lee (2008), "Measuring the Impact of Concentration in the Drilling Rig Market." Specifically, the work entailed using more detailed (quarterly) time series data than the original study, analyzing the impact of concentration on day rates by rig water-depth specification, and accounting for the impact of autocorrelation on the analysis. The results for jack-ups, without adjustment for autocorrelation, supported the prior study's results, i.e., that an increase in HHI raises rig day rates. However, the results for semi-submersibles were inconclusive: they varied from region to region and ran contrary to this study's assumption of a positive relationship between HHI and day rates. These results imply that market concentration can cause prices to either rise or fall within the industry, depending on whether it increases market power or improves cost efficiency and technological capability.

Item: Efficient Estimation in a Regression Model with Missing Responses (2012-10-19). Crawford, Scott.
This article examines methods to efficiently estimate the mean response in a linear model with an unknown error distribution, under the assumption that the responses are missing at random. We show how the asymptotic variance is affected by the estimator of the regression parameter and by the imputation method. For estimating the regression parameter, the Ordinary Least Squares method is efficient only if the error distribution happens to be normal; if the errors are not normal, we propose a One Step Improvement estimator or a Maximum Empirical Likelihood estimator to estimate the parameter efficiently. To investigate the impact that imputation has on estimation of the mean response, we compare the Listwise Deletion method and the Propensity Score method (which do not use imputation at all) with two imputation methods, and show that Listwise Deletion and the Propensity Score method are inefficient. Partial Imputation, where only the missing responses are imputed, is compared to Full Imputation, where both missing and non-missing responses are imputed. Our results show that in general Full Imputation is better than Partial Imputation; however, when the regression parameter is estimated very poorly, Partial Imputation will outperform Full Imputation. The efficient estimator for the mean response is the Full Imputation estimator that uses an efficient estimator of the parameter.

Item: Error analysis for randomized uniaxial stretch test on high strain materials and tissues (Texas A&M University, 2006-08-16). Jhun, Choon-Sik.
Many different types of hyperelastic models for high strain materials and biotissues have been suggested since the 1940s without being validated. But there is no agreement on those models, and no model can be judged better than another, because of the ambiguity.
The ambiguity exists because the necessary error analysis has not yet been done (Criscione, 2003). The error analysis is motivated by the fact that no physical quantity can be measured without some degree of uncertainty. Inelastic behavior is inevitable in high strain materials and biotissues, and a model's validity should be justified by understanding the uncertainty due to that behavior. We applied fundamental statistical theory to data obtained from randomized, stretch-controlled uniaxial tests, employing the goodness-of-fit measure (R²) and a test of significance (t-test). We initially presumed that the factors giving rise to inelastic deviation are time spent testing, stretch rate, and stretch history, and we found that these factors characterize the inelastic deviation in a systematic way. A large inelastic deviation was found at a stretch ratio of 1.1 for both specimens. The significance of this finding is that the inelastic uncertainties at low stretch ranges of rubber-like materials and biotissues are primarily related to entropy. This is why the strain energy can hardly be determined experimentally at low strain ranges, and why understanding of the exclusive nature of the strain energy function at low strains of rubber-like materials and biotissues has been deficient (Criscione, 2003). We also answered questions about the significance, effectiveness, and differences of the presumed factors above. Lastly, we checked predictive capability by comparing unused deviation data to the predicted deviation. To check whether any predictive variables had been missed, we defined the prediction deviation as the difference between the observed deviation and the point-forecast deviation. We found that the prediction deviation is off in a random way: what remains unexplained is random, which means no factors needed to predict the degree of inelastic deviation were missing from our fitting.

Item: Exploring the incentive effects of food aid on crop production in Zambia (2009-05). Sikombe, Derrick; Knight, Thomas; Rejesus, Roderick M.; Lyford, Conrad.
Understanding the effects of food aid on crop production is very valuable, especially in Zambia, where food aid distribution to rural households has become a common phenomenon in recent years. In many respects, food aid can be considered an important enabler of food production, while at the same time it can potentially act as an impediment to sustainable agricultural growth. Integrating the effects of food aid on smallholder productivity into the design of agricultural programs could therefore provide decision makers with the right choices for sustaining agricultural growth in Zambia. This study analyzed the effects of food aid on the average quantities of maize produced by farmers in a community using two complementary estimation procedures: OLS and quantile regression. The OLS results show a mean effect of food aid on average household maize production that is negative and significant (holding other observable factors constant). However, the quantile regression results show that food aid has distinct impacts at different points of the conditional maize production distribution: communities producing small quantities of maize are affected by food aid differently than communities producing large quantities.
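The contrast between mean and quantile effects is easy to see in code. A schematic sketch with hypothetical variable names, not the study's actual Zambian survey data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("maize_communities.csv")  # hypothetical community-level data
spec = "maize_kg ~ food_aid_kg + rainfall + hh_size"  # invented controls

# Mean effect of food aid, holding the controls fixed.
ols = smf.ols(spec, data=df).fit()
print("OLS:", ols.params["food_aid_kg"])

# Distinct effects at different points of the conditional distribution.
for q in (0.10, 0.50, 0.90):
    fit = smf.quantreg(spec, df).fit(q=q)
    print(f"quantile {q}:", fit.params["food_aid_kg"])
```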
The quantile regression results show that communities at the lower end of the conditional maize production distribution (and in the region of the mean) tend to experience stronger negative effects of food aid. The effect, however, shrinks in magnitude at the extreme upper end of the distribution (the 90th quantile), where it is not statistically significant. Both the OLS and quantile regression results provide evidence that food aid distributed to communities significantly reduces household maize production (at least at many points of the maize production distribution, in the case of the quantile regression results). These results suggest that continuation of food aid programs in agricultural development should be evaluated carefully, as the estimates imply an average reduction in maize production of 2,000 kg for every 1,000 kg of food aid received by a community in the previous season. While the results suggest a negative effect of food aid at the community level, it should be recognized that the available data did not support panel estimation, which would have allowed us to correct for fixed or random productivity effects. We compensated for this data limitation by including province-level dummy variables and a lagged dependent variable; however, panel estimation would still be preferred. Future work could strengthen the implications of these results by using panel data at the household level.

Item: Hybrid Rocket Burning Rate Enhancement by Nano-Scale Additives in HTPB Fuel Grains (2014-12-10). Thomas, James C.
Low regression rates in hybrid rockets limit their use and capability, but aluminum nano-particle additives represent a possible solution to this problem. In this thesis, aluminum nano-particles were characterized and added to hybrid motor grains to assess their effects on the combustion behavior of hybrid rocket fuel grains. Procedures were developed for fabricating 6-inch-long motors with combustion port diameters of 1 cm and 2.54 cm (1 inch), for formulations with and without additive particles. The use of commercial aluminum particles at a mass loading of 5% as a burning rate enhancer was assessed on a lab-scale burner. Traditional temporally and spatially averaged techniques were applied to determine the regression rates of plain and aluminized HTPB motors burning in gaseous oxygen. Resistance-based regression sensors were embedded in motor grains and used to determine instantaneous and averaged burning rates; these sensors exhibited good accuracy and unique capabilities not achievable with other regression measurement techniques, but they still have limitations. The addition of commercial nano-aluminum with a diameter of 100 nm to hybrid motors increased the motor surface regression rate for oxidizer mass fluxes in the range of 0-15 g/cm²-s. Future testing will focus on evaluating motors containing novel aluminum particles manufactured in situ with the HTPB at a mass loading of 5%, which are expected to perform better than similar commercially aluminized motors.

Item: Large-scale network analytics (2011-08). Song, Han Hee; Zhang, Yin.
Scalable and accurate analysis of networks is essential to a wide variety of existing and emerging network systems. Specifically, network measurement and analysis help us understand networks, improve existing services, and enable new data-mining applications.
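A note on the hybrid-rocket item above, for readers outside propulsion: "regression rate" there is the recession rate of the fuel surface. It is conventionally reduced from initial and final port radii and correlated against oxidizer mass flux with a power law. These are standard forms from the hybrid-rocket literature, not results specific to that thesis:

```latex
% Averaged regression rate from port radii R_i, R_f over burn time t_b,
% and the customary flux correlation (a, n empirical; G_ox in g/cm^2-s):
\bar{\dot{r}} = \frac{R_f - R_i}{t_b}, \qquad \dot{r} = a \, G_{ox}^{\,n}
```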
To support various services and applications in large-scale networks, network analytics must address the following challenges: (i) how to conduct scalable analysis in networks with a large number of nodes and links; (ii) how to flexibly accommodate various objectives from different administrative tasks; and (iii) how to cope with dynamic changes in the networks. This dissertation presents novel path analysis schemes that effectively address these challenges in analyzing pair-wise relationships among networked entities. In doing so, we make three major contributions, to large-scale IP networks, social networks, and application service networks. For IP networks, we propose an accurate and flexible framework for path property monitoring. Analyzing the performance of paths between pairs of nodes, our framework incorporates approaches that perform both exact and approximate reconstruction of path properties. The framework scales to measurement experiments spanning thousands of routers and end hosts, and it is flexible enough to accommodate a variety of design requirements. For social networks, we present scalable and accurate graph embedding schemes. Aimed at analyzing the pair-wise relationships of social network users, we present three dimensionality reduction schemes leveraging matrix factorization, count-min sketch, and graph clustering paired with spectral graph embedding. As concrete applications showing the practical value of our schemes, we apply them to the important social analysis tasks of proximity estimation, missing link inference, and link prediction. The results clearly demonstrate the accuracy, scalability, and flexibility of our schemes for analyzing social networks with millions of nodes and tens of millions of links. For application service networks, we provide a proactive service quality assessment scheme. Analyzing the relationship between the satisfaction level of subscribers to an IPTV service and network performance indicators, our scheme proactively (i.e., before IPTV subscribers complain) assesses user-perceived service quality using performance metrics collected from the network. In an evaluation using network data collected from a commercial IPTV service provider, our scheme predicts 60% of the service problems reported by customers with only 0.1% false positives.

Item: Low pH waters in the vicinity of Oak Hill Mine: a statistical evaluation of water quality (2014-08). Mercier, Lilith Joy; Sharp, John Malcolm, Jr.
Lignite (brown coal) mine-mouth power plants supply a significant portion of the electricity generated annually in Texas. Most lignite is produced from the Wilcox Group at surface mines located near a power plant. At the Oak Hill Mine, a lignite mine in the Sabine Uplift area of northeast Texas, the presence of low pH seeps has delayed the release from bond of some portions of the reclaimed land until all surface water bodies achieve a stable pH between 6 and 9. But this federal requirement may force an artificial elevation of surface water pH above the natural range for low-volume, groundwater-fed surface water bodies in that region. The primary objective of this thesis is to determine whether the distribution of groundwater pH at Oak Hill Mine has become more acidic as a result of mining activity.
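The comparison that follows rests on confidence intervals for the median pH in each well group. One common, distribution-free way to produce such an interval (not necessarily the method used in the thesis) is a bootstrap over wells; the pH readings below are invented for illustration:

```python
import numpy as np

def median_ci(x, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the median."""
    rng = np.random.default_rng(seed)
    meds = np.median(rng.choice(x, size=(n_boot, len(x)), replace=True), axis=1)
    return np.quantile(meds, [alpha / 2, 1 - alpha / 2])

# Hypothetical pH readings: pre-disturbance (OP) vs reclamation (OR) wells.
op_ph = np.array([4.9, 4.6, 4.8, 4.7, 4.8, 4.5, 4.9, 4.7])
or_ph = np.array([4.2, 4.0, 4.1, 4.3, 4.1, 4.2, 4.0, 4.1])
print("OP median pH 95% CI:", median_ci(op_ph))
print("OR median pH 95% CI:", median_ci(or_ph))
```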
This study shows that low pH (<6.0) groundwater was common within the mine permit area prior to mining activities; the 95% confidence interval for the median pH of overburden pre-disturbance (OP) wells is 4.7 to 4.8. This naturally occurring low pH groundwater is produced by the weathering (oxidative dissolution) of pyrite in the Carrizo Sand and the overburden Wilcox Group. Although low pH groundwater occurs naturally within the Oak Hill Mine permit area, groundwater pH has also decreased (groundwater has become more acidic) as a result of mining activities: the 95% confidence interval for the median pH of overburden reclamation (OR) wells is 4.1 to 4.2, indicating that mining activities have changed the median groundwater pH by approximately -0.5 standard units. Underburden groundwater is less acidic than overburden groundwater, but it also becomes more acidic after mining activities: underburden pre-disturbance (UP) groundwater has a median pH of 6.2 to 6.3 at the 95% confidence level, whereas underburden reclamation (UR) groundwater has a median pH of 5.6 to 5.8.

Item: Methods for analyzing proportions (2013-08). Moeller, Megan Michelle; Jessee, Stephen A.
The analysis of proportions is noteworthy in that there are no commonly accepted regression models for analyzing them; indeed, researchers most often use ordinary least squares to estimate the parameters of a linear regression model for proportional data. Such an approach, however, violates several assumptions of the Classical Linear Regression Model. This report outlines the general linear model and the problems associated with using it to model proportions, and considers a variety of alternative approaches that researchers have taken: transforming the dependent variable, a censored regression (Tobit) model, a Fractional Logit model, and Beta Regression. All of the approaches considered are implemented in a case study analyzing Rice party difference scores in the 93rd to 108th Congresses. A comparison of the results from each approach confirms the finding of other researchers that Beta Regression is the preferred approach for modeling proportions.

Item: Municipal economic growth through green projects and policies (2012-05). Lindner, Harry Dreyfus; Gamkhar, Shama.
Cities generally need economic growth. Green policies and projects are environmentally beneficial, desirable, expected by the public, and pragmatic in the long term. However, there is insufficient research on which municipal green projects and policies, if any, generate economic growth. To address this question, the author created a database of green and economic indicators and modeled the green indicators to predict the economic indicators. The database included carbon usage, transportation metrics, water usage, the number of green jobs, and gross domestic product (GDP) for the 100 largest U.S. cities, defined by metropolitan statistical area (MSA). To gather data on green indicators, existing green rankings, indices, and reports were evaluated for methodology and usability. The results of the data-gathering step show the need for more and better data collection: more green indicators should be collected, and data should be collected at the MSA (or county) level for more of the largest cities.
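To illustrate the approach the proportions item above ends up recommending: beta regression models a proportion's conditional mean through a logit link with a beta likelihood. A from-scratch sketch on synthetic data (the report itself analyzes Rice scores, not reproduced here; all parameter values below are invented):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import beta

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
mu_true = expit(-0.5 + 1.2 * x)   # mean in (0, 1) via logit link
phi_true = 30.0                   # precision parameter
y = rng.beta(mu_true * phi_true, (1 - mu_true) * phi_true)

def negloglik(params):
    b0, b1, log_phi = params
    mu = expit(b0 + b1 * x)
    phi = np.exp(log_phi)
    return -beta.logpdf(y, mu * phi, (1 - mu) * phi).sum()

res = minimize(negloglik, x0=np.zeros(3), method="L-BFGS-B")
print(res.x)  # should recover roughly [-0.5, 1.2, log(30)]
```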
To name some specific green indicators: data collection on energy usage, buildings, waste, land use, air quality, and food could be improved. Such indicators would likely belong in a model that uses green indicators to predict green jobs or GDP, but they were not included in the regressions in this paper. The regressions here show two indicators with promise for predicting economic growth, as defined by GDP and the number of green jobs: (1) the percentage of people using public transportation, biking, or walking to work, and (2) public water consumption per person. The first explanatory variable indirectly measures the adoption of policies that promote public transportation, biking, and walking; the results suggest these policies have a positive effect, i.e., as the percentage increases, so do GDP and the number of green jobs. The second explanatory variable measures water conservation policies; the results suggest it has a negative, albeit weaker, relationship with GDP per person, i.e., as water conservation increases (less water usage per person), GDP per person increases. This paper offers a methodology and some of the groundwork for building a model to show which municipal green projects and policies, if any, predict economic growth.

Item: Open source software maturity model based on linear regression and Bayesian analysis (2009-05-15). Zhang, Dongmin.
Open Source Software (OSS) is widely used and is becoming a significant and irreplaceable part of the software engineering community. Today a huge number of OSS projects exist, which becomes a problem when one needs to choose from such a large pool of candidates in the same category. An OSS maturity model that facilitates software assessment and helps users make a decision is needed. A few maturity models have been proposed in the past; however, their parameters are assigned based not on experimental data but on human experience, feelings, and judgment. Such models are subjective and can provide only limited guidance for users at best. This dissertation proposes a quantitative, objective model built from a statistical perspective. In this model, seven metrics are chosen as criteria for OSS evaluation, and a linear multiple-regression model assigns a final score based on these seven metrics. This final score provides a convenient and objective way for users to make a decision. The coefficients in the linear multiple-regression model are calculated from 43 OSS projects. From the statistical perspective, these coefficients are random variables, and their joint distribution is discussed based on Bayesian statistics. More importantly, an updating rule is established through Bayesian analysis to improve the joint distribution, and thus the objectivity of the coefficients in the linear multiple-regression model, as new data arrive. The updating rule gives the model the ability to learn and improve itself continually.

Item: An overview of multilevel regression (2010-12). Kaplan, Andrea Jean; Smith, Martha K.; Luecke, John Edwin.
Due to the inherently hierarchical nature of many natural phenomena, collected data often rest in nested entities: students are nested in schools, schools are nested in districts, districts are nested in counties, and counties are nested within states.
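The updating rule in the open-source maturity item above is, in spirit, the standard conjugate update for the coefficients of a linear model under a Gaussian prior. A generic numpy sketch of that update, with invented prior and noise settings rather than the dissertation's calibrated values:

```python
import numpy as np

def bayes_update(mu0, Sigma0, X, y, sigma2):
    """Posterior mean/covariance of regression coefficients given new data."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    Sigma_n = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
    mu_n = Sigma_n @ (Sigma0_inv @ mu0 + X.T @ y / sigma2)
    return mu_n, Sigma_n

# Prior standing in for a fit to the original 43 projects (values invented),
# then an update as metric scores for new projects arrive.
p = 7                                 # seven evaluation metrics
mu0, Sigma0 = np.zeros(p), np.eye(p)
rng = np.random.default_rng(2)
X_new = rng.normal(size=(10, p))
y_new = X_new @ np.ones(p) + 0.1 * rng.normal(size=10)
mu1, Sigma1 = bayes_update(mu0, Sigma0, X_new, y_new, sigma2=0.01)
print(np.round(mu1, 2))  # pulled from the prior toward the new evidence
```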
Multilevel models provide a statistical framework for investigating and drawing conclusions about the influence of factors at differing hierarchical levels of analysis. This paper serves as an introduction to multilevel models and their comparison to Ordinary Least Squares (OLS) regression. We overview three basic model structures (the variable intercept model, the variable slope model, and the hierarchical linear model) and illustrate each with an example using student data. We then contrast the three multilevel models with the OLS model and present a method for producing confidence intervals for the regression coefficients.

Item: Quantification of stock option risks and returns (2010-05). Feng, Haoqi; Greenberg, Betsy S.; Brockett, Patrick L.
Under mild assumptions, the expected returns of call options increase as the strike price becomes higher. Two ways to define option moneyness are the ratio of strike price to stock price (the K/S ratio) and log(K/S)/σ. This paper examines the positive relationship between call option returns and the corresponding risks by fitting linear models of option returns on the two ratios. The regression models also allow these ratios to be used in practice to predict option returns.

Item: Regression: when a nonparametric approach is most fitting (2012-05). Claussen, Pauline Elma Clara; Brockett, Patrick.
This paper aims to demonstrate the benefits of adopting a nonparametric regression approach when the standard regression model is not appropriate; it also provides an overview of circumstances where a nonparametric approach might be not only beneficial but necessary. It begins with a historical background on regression, leading into a broad discussion of the standard linear regression model assumptions. Particular methods for handling assumption violations follow, including nonlinear transformations, nonlinear parametric model fitting, and, finally, nonparametric methods. The software package R is used to illustrate nonparametric regression techniques for continuous variables, and a brief overview is given of procedures for handling nonparametric regression models that include categorical variables.

Item: Regression model ridership forecasts for Houston light rail (2012-12). Sides, Patton Christopher; Evans, Angela M.; McCray, Talia.
The 4-step process has been the standard procedure for transit forecasting for over 50 years. In recent decades, researchers have developed ridership forecasting regression models as alternatives to the costly and time-consuming 4-step process. The model created by Lane, DiCarlantonio, and Usvyat in 2006 (the LDU model) is among the most recent and most widely accepted; it uses station area demographics, central business district (CBD) employment, and the station areas' built environments to estimate ridership. This report applies the LDU model to the North Line of Houston's Metropolitan Transit Authority of Harris County (METRO), comparing METRO's 2030 ridership forecast created with the 4-step process against the LDU model forecasts. For the 2030 projections, this report obtained population and employment estimates from the Houston-Galveston Area Council and analyzed the data using Esri ArcMap and Caliper TransCAD GIS software. The LDU model produced unrealistically high ridership numbers for the North Line, estimating 108,430,481 daily boardings, whereas METRO's 4-step process predicted 29,900 daily boardings.
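The LDU model's published specification is not reproduced here, but a generic sketch conveys the flavor of this class of direct-demand models: log boardings regressed on station-area covariates. All file, variable, and column names below are hypothetical, and these are emphatically not the LDU variables or coefficients:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

stations = pd.read_csv("station_areas.csv")  # hypothetical station-level data

# Generic direct-demand form: log boardings on demographics, CBD employment,
# and built-environment measures (illustrative specification only).
fit = smf.ols(
    "np.log(boardings) ~ np.log(pop_half_mile) + np.log(cbd_employment)"
    " + pct_zero_car_hh + is_terminal",
    data=stations,
).fit()
print(fit.params)
```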
The results suggest that the LDU model is not applicable to the Houston light rail system and is not a viable alternative to the 4-step process for this specific metropolitan area. The LDU method for defining Houston's CBD was the main problem in applying the model: it calculated an extremely high CBD employment density compared to other cities of similar size. Even when the CBD size was manipulated to decrease employment density, the model still predicted 212,210 daily boardings for the North Line, roughly seven times METRO's 4-step process estimate. In addition to the problems with the definition of the CBD, the creators of the LDU model did not specifically explain how to define a metropolitan area, and multiple inconsistent, subjective definitions of a metro area can be used. This report employs three different definitions of the Houston metro area, which produced three significantly different ridership forecasts in the LDU model. As a result of these flaws, the LDU model does not accurately apply to METRO's North Line, and it does not serve as a viable alternative to METRO's 4-step process.

Item: Simultaneous partitioning and modeling: a framework for learning from complex data (2010-05). Deodhar, Meghana; Ghosh, Joydeep; John, Lizy; Chase, Craig; Dhillon, Inderjit; Saar-Tsechansky, Maytal.
While a single learned model is adequate for simple prediction problems, it may not be sufficient to represent the heterogeneous populations that difficult classification or regression problems often involve. In such scenarios, practitioners often adopt a "divide and conquer" strategy that segments the data into relatively homogeneous groups and then builds a model for each group. This two-step procedure usually results in simpler, more interpretable and actionable models without any loss in accuracy. We consider prediction problems on bi-modal or dyadic data with covariates, e.g., predicting customer behavior across products, where the independent variables can be naturally partitioned along the modes. A pivoting operation can then place the target variable as the entries of a "customer by product" data matrix. We present a model-based co-clustering framework that interleaves partitioning (clustering) along each mode with the construction of prediction models, iteratively improving both the cluster assignments and the fit of the models. This Simultaneous CO-clustering And Learning (SCOAL) framework generalizes co-clustering and collaborative filtering to model-based co-clustering, and is shown to be better than independently clustering the data first and then building models. Our framework applies to a wide range of bi-modal and multi-modal data, and can easily be specialized to address classification and regression problems in domains like recommender systems, fraud detection, and marketing. Further, we note that in several datasets not all the data is useful for the learning problem, and ignoring outliers and non-informative values may lead to better models. We explore extensions of SCOAL that automatically identify and discard irrelevant data points and features while modeling, in order to improve prediction accuracy. Next, we leverage the multiple models provided by the SCOAL technique to address two prediction problems on dyadic data: (i) ranking predictions based on their reliability, and (ii) active learning. We also extend SCOAL to predictive modeling of multi-modal data where one of the modes is implicitly ordered, e.g., time series data.
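The interleaving at the heart of SCOAL can be written compactly: alternate between (a) fitting one regression per co-cluster and (b) reassigning each row to the row-cluster whose models explain it best. A much-simplified sketch under a squared-error objective, reassigning rows only and assuming every co-cluster stays non-empty; the full framework also reassigns columns and supports other model families:

```python
import numpy as np

def scoal_rows(Z, X, row_k, col_labels, n_iter=20, seed=0):
    """Simplified SCOAL pass. Z: (rows, cols) targets; X: (rows, cols, f)
    covariates; col_labels: fixed column-cluster labels."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(row_k, size=Z.shape[0])
    col_ids = np.unique(col_labels)
    for _ in range(n_iter):
        # (a) fit a least-squares model inside every co-cluster
        models = {}
        for r in range(row_k):
            for c in col_ids:
                mask = np.ix_(rows == r, col_labels == c)
                F = X[mask].reshape(-1, X.shape[-1])
                models[r, c], *_ = np.linalg.lstsq(F, Z[mask].ravel(), rcond=None)
        # (b) move each row to the row-cluster minimizing its squared error
        for i in range(Z.shape[0]):
            errs = []
            for r in range(row_k):
                e = 0.0
                for c in col_ids:
                    cols = col_labels == c
                    e += np.sum((Z[i, cols] - X[i, cols] @ models[r, c]) ** 2)
                errs.append(e)
            rows[i] = int(np.argmin(errs))
    return rows, models
```

Each sweep weakly decreases the total squared error, since both the refit and the reassignment steps are greedy improvements of the same objective.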
Finally, we illustrate our implementation of a parallel version of SCOAL based on Google's MapReduce framework and developed on the open-source Hadoop platform. We demonstrate the effectiveness of specific instances of the SCOAL framework on prediction problems through experimentation on real and synthetic data.