Browsing by Subject "Logistic regression"

Now showing 1 - 13 of 13

A landscape approach to reserving farm ponds for wintering bird refuges in Taoyuan, Taiwan
(Texas A&M University, 2006-08-16) Fang, Wei-Ta
Man-made farm ponds are unique geographic features of the Taoyuan Tableland. Besides irrigation, they provide refuges for wintering birds. The issue at hand is that these features are disappearing, and with them the refuge function they provide. This loss is ecologically significant because one fifth of all the bird species in Taiwan find a home on these ponds. This study aims at characterizing the diversity of bird species associated with these ponds, whose likelihood of survival was assessed along the gradient of land development intensities. Such characterization helps establish decision criteria needed for designating certain ponds for habitat preservation and developing their protection strategies. A holistic model was developed by incorporating logistic regression with error back-propagation into the paradigm of artificial neural networks (ANN). The model considers pond shape, size, neighboring farmlands, and developed areas in calculating parameters pertaining to their respective and interactive influences on avian diversity, among them the Shannon-Wiener diversity index (H'). Results indicate that ponds with a regular shape or a larger size show a strong positive correlation with H'. Farm ponds adjacent to farmland benefited waterside bird diversity. On the other hand, urban development was shown to reduce farmland and pond numbers, which in turn reduced waterside bird diversity. By running the ANN model with four neurons, the resulting H' index shows a good-fit prediction of bird diversity against pond size, shape, neighboring farmlands, and neighboring developed areas with a correlation coefficient (r) of 0.72, in contrast to the results from a linear regression model (r < 0.28). Analysis of historical pond occurrence to the present showed that ponds with larger size and a long perimeter were less likely to disappear. Smaller (< 0.1 ha) and more curvilinear ponds had a more drastic rate of disappearance.
Based on this finding, a logistic regression was constructed to predict pond-loss likelihood in the future and to help identify ponds that should be protected. Overlaying results from the ANN and from the logistic regression enabled the creation of pond-diversity maps for these simulated scenarios of development intensities with respect to pond-loss trends and the corresponding dynamics of bird diversity.
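For readers unfamiliar with the Shannon-Wiener index named in the abstract above, the standard definition is H' = -sum(p_i * ln p_i), where p_i is the share of individuals belonging to species i. The following is a minimal sketch of that formula only. The function name and the species counts are hypothetical illustrations, not data or code from the thesis.

```python
import math


def shannon_wiener(counts):
    """Return H' = -sum(p_i * ln p_i) over species with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one individual")
    h = 0.0
    for c in counts:
        if c > 0:  # species with zero individuals contribute nothing
            p = c / total
            h -= p * math.log(p)
    return h


# Hypothetical bird counts per species at two ponds (illustrative only).
even_pond = [10, 10, 10, 10]   # four equally abundant species
skewed_pond = [37, 1, 1, 1]    # same species count, one dominant species

print(shannon_wiener(even_pond))    # equals ln(4) for four equally abundant species
print(shannon_wiener(skewed_pond))  # lower, because one species dominates
```

An even community yields a higher H' than a dominated one with the same number of species, which is why the index is used as a diversity measure rather than a simple species count.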
Comparison of prediction methods for batter-pitcher matchups
(2016-05) Thakur, Siddhartha; Bickel, J. Eric; Hasenbein, John J.
Baseball is full of confrontations, and these confrontations between a batter and a pitcher are what make the game. If a formula could correctly predict the probability of each outcome when they meet, wouldn’t it give the head coach (or you, if you play fantasy baseball) the confidence to select a player likely to come out on the winning end? We would like to know for sure which of our batters are good and which of the small number of possible outcomes will result when a batter faces a good pitcher from the next opposing team. It seems the past performance of the batter against this pitcher can be a good indicator, and that is presumably what the methods currently in use rely on. But the utility of batter vs. pitcher data in predicting future outcomes has been debated for quite some time. The debate stems from the fact that the sample size of this data is so small that it is hard to know when to prefer information from a sample of thousands of at-bats against all pitchers over perhaps a few dozen against specific individuals. The report discusses one well-known method, Log5 [1], that has been used so far to measure the outcomes of these confrontations. It also discusses other methods, such as logistic regression based on past data and the new and upcoming Morey-Z [3].
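The abstract above names Log5 without stating it. As an illustration only, the sketch below uses the league-adjusted form of Log5 that is commonly attributed to Bill James. The report's own formulation may differ, and the function name and the sample rates are hypothetical.

```python
def log5(batter_rate, pitcher_rate, league_rate):
    """Estimate the probability of an outcome (e.g. a hit) in one matchup.

    batter_rate:  the batter's rate for the outcome against all pitchers
    pitcher_rate: the rate the pitcher allows against all batters
    league_rate:  the league-wide rate for the same outcome
    """
    for r in (batter_rate, pitcher_rate, league_rate):
        if not 0.0 < r < 1.0:
            raise ValueError("rates must be strictly between 0 and 1")
    num = batter_rate * pitcher_rate / league_rate
    den = num + (1.0 - batter_rate) * (1.0 - pitcher_rate) / (1.0 - league_rate)
    return num / den


# Hypothetical rates: a .300 hitter facing a pitcher who allows a .220 average
# in a league that hits .250 overall.
print(log5(0.300, 0.220, 0.250))
```

A useful sanity check on this form is that a league-average pitcher leaves the batter's own rate unchanged, and a league-average batter reproduces the pitcher's rate.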
Data mining techniques for classifying RNA folding structures
(2016-08) Kim, Vince; Garg, Vijay K. (Vijay Kumar), 1963-; Gutell, Robin R
RNA is a biological molecule that is critical for protein synthesis. Significant research has been done on folding algorithms for RNA, in particular the 16S rRNA of bacteria and archaea. Rather than modifying current work on these folding algorithms, this report ventures into pioneering work on data mining the same 16S rRNA. Initial work was based on a single complex helix across seven organisms. However, classification analysis proved to be inaccurate due to severe multicollinearity in the data set. A secondary data mining analysis was done on the entire RNA sequence of the same seven organisms, and was used successfully to predict, sequentially and categorically, the characteristic of a given nucleotide in the RNA sequence.
Exploration of statistical approaches to estimating the risks and costs of fire in the United States
(2012-08) Anderson, Austin David; Ezekoye, Ofodike A.
Knowledge of fire risk is crucial for manufacturers and regulators to make correct choices in prescribing fire protection systems, especially flame retardants. Methods of determining fire risk are bogged down by a multitude of confounding factors, such as population demographics and overlapping fire protection systems. Teasing out the impacts of one particular choice or regulatory change in such an environment is crucial. Teasing out such detail requires statistical techniques, and knowledge of the field is important for verifying potential methods. Comparing the fire problems between two states might be one way to identify successful approaches to fire safety. California, a state with progressive fire prevention policies, is compared to Texas using logistic regression modeling to account for various common factors such as percentage of rural population and percentage of population in ‘risky’ age brackets. Results indicate that living room fires, fires in which the first item ignited is a flammable liquid, piping, or filter, and fires started by cigarettes, pipes, and cigars have significantly higher odds of resulting in a casualty or fatality than fires started by other areas of origin, items first ignited, or heat sources. Additionally, fires in Texas have roughly 1.5 times higher odds of resulting in casualties than fires in California for certain areas of origin, items first ignited, and heat sources. Methods of estimating fire losses are also examined. The potential of using Ramachandran’s power-law relationship to estimate fire losses in residential home fires in Texas is examined, and determined to be viable but not discriminating. CFAST is likewise explored as a means to model fire losses. Initial results are inconclusive, but Monte Carlo simulation of home geometries might render the approach viable.
Identifying historical financial crisis: Bayesian stochastic search variable selection in logistic regression
(2009-08) Ho, Chi-San; Damien, Paul, 1960-; Greenberg, Betsy S.
This work investigates the factors that contribute to financial crises. We first study the Dow Jones index performance by grouping the daily adjusted closing value into two-month windows and finding several critical quantiles in each window. Then, we identify severe downturns in these quantiles and find that the 5th quantile is the best for identifying financial crises. We then match these quantiles with historical financial crises and give a basic explanation of them. Next, we introduce all exogenous factors that could be related to the crises. Then, we apply a rapid Bayesian variable selection technique - Stochastic Search Variable Selection (SSVS) - using a Bayesian logistic regression model. Finally, we analyze the result of SSVS, leading to the conclusion that the dummy variable we created for disastrous hurricanes, the crude oil price, and the gold price (GOLD) should be included in the model.
A logistic regression analysis for potentially insolvent status of life insurers in the United States
(2011-05) Xue, Xiaolei; Sager, Thomas W.; Myers, Margaret E.
This study focused on identifying factors that significantly affect the potentially insolvent status of life insurers. The potentially insolvent status is indicated based on the insurer’s risk-based capital ratio (RBC ratio) reported in the National Association of Insurance Commissioners (NAIC) database of life insurers’ annual statements. A logistic regression analysis is performed to explore the relationship between the RBC insolvent indicator and a set of explanatory variables including insurer’s size, capital, governance structure, membership in a group of affiliated companies, and various risk measures during the 2006-2008 period. The results suggest that the probability of potential insolvency for an individual insurer is significantly affected by its size, capital-to-asset ratio, returns on capital, health product risk and proportion of products reinsured. It may also be affected by the insurer’s regulatory asset risk. However, the results indicate that the probability is not significantly related to the insurer’s annuity product risk, opportunity asset risk, governance structure and its membership in a group of affiliated companies. On average, holding all other explanatory variables constant, every 1% increase in total assets will result in a decrease of 0.19 to 0.36% on the odds of potentially insolvent rates; every 0.01 unit increase in capital-to-asset ratio will result in a decrease of a multiplicative factor of 0.951 to 0.956 on the odds; every 0.01 unit increase in return on capital will result in a decrease of a multiplicative factor of 0.984 to 0.985 on the odds; every 0.01 unit increase in health product risk will result in an increase of a multiplicative factor of 1.021 to 1.031 on the odds; and every 0.01 unit increase in proportion of products reinsured will result in an increase of a multiplicative factor of 1.015 to 1.026 on the odds.
The assumptions of independence and absence of harmful multicollinearity are both valid for this logistic model, suggesting that the model is adequate and the conclusion is warranted. Although the potentially insolvent indicator is used instead of the real insolvent indicator, this model could still be useful for identifying the significant factors that affect life insurers’ potentially insolvent status.
Logistic Regression in Predictive Modeling of Admitted Student Enrollment
(2010-12) Logan, Ethan; Shonrock, Michael D.; Lan, William; Burkhalter, James P.
The application of predictive modeling within enrollment management provides a tremendous tool for building and shaping future enrollments for institutions of higher education. Though predictive modeling is a well-known practice in private business, it has been employed in enrollment management only since the 1990s. The independent higher education consulting firm Noel-Levitz popularized predictive modeling as a method of giving enrollment management professionals in higher education the opportunity to forecast the possible enrolling classes of students in their institutions. This study followed a recommended strategy for application of predictive modeling for enrollment management. Stephen DesJardins, Ph.D., from the University of Michigan published a methodology for applying predictive modeling to the process of recruiting and admitting students in an attempt to serve institutions that were actively involved in predictive modeling programs, or those that could not afford independent consulting organizations that provided predictive modeling services. The method of predictive modeling prescribed by DesJardins will be adapted to an entering class of freshmen at a large, public 4-year institution of higher education in the Southwest. The class of 2009 will be analyzed in order to build a predictive model, which will then be applied to the class of 2010. The effectiveness of both the model and its application was analyzed, since both of these classes have already matriculated.
Modeling roadway incident risk using aggregated real-time detector data
(2015-12) Gold, Andrea Lynn; Press, William H.; Walton, C. Michael
This report applies a methodology previously developed by Abdel-Aty et al. in a 2005 Institute of Transportation Engineers (ITE) Journal article to predict roadway conditions with high risk of incidents. The methodology, which includes logistic regression modeling and hazard ratio estimation, is applied to a large, high-frequency dataset generated by roadway detectors in the I-80 corridor in the San Francisco-Oakland-Berkeley area. Results differ from the original ITE paper, and model features do not show strong relationships with increased incident risk on the I-80 corridor, with the possible exception of standard deviation in speeds. Concluding thoughts offer insights into reasons the methodology may have failed.
New Advances in Logistic Regression for Handling Missing and Mismeasured Data with Applications in Biostatistics
(2014-05-30) Miao, Jingang
As a probabilistic statistical classification model, logistic regression (or logit regression) is widely used to model the outcome of a categorical dependent variable based on one or more predictor variables/features. We study two problems related to logistic regression with applications in biostatistics. In the first problem, we study multivariate disease classification in the presence of partially missing disease traits. In modern cancer epidemiology, diseases are classified based on pathologic and molecular traits, and different combinations of these traits give rise to many disease subtypes. The effect of predictor variables can be measured by fitting a polytomous logistic model to such data. The differences (heterogeneity) among the relative risk parameters associated with subtypes are of great interest to better understand disease etiology. Due to the heterogeneity of the relative risk parameters, when a risk factor is changed, the prevalence of one subtype may change more than that of another subtype does. Estimation of the heterogeneity parameters is difficult when disease trait information is only partially observed and the number of disease subtypes is large. We consider a robust semiparametric approach based on the pseudo conditional likelihood for estimating these heterogeneity parameters. Through simulation studies, we compare the robustness and efficiency of our approach with the maximum likelihood approach. The method is then applied to analyze data from the American Cancer Society Cancer Prevention Study (CPS) II Nutrition Cohort. Weight gain was associated with the risk of breast cancer and the association varies by disease subtype. In the second problem, we use a semiparametric Bayesian method to handle measurement errors. In nutritional epidemiological studies, nutrient intakes are often measured via food frequency questionnaires and 24-hour dietary recalls. 
Due to self-reporting, recall error, and other reasons, the measured nutrient intakes can involve a substantial amount of noise. While independence between the measurement error and the true predictor is likely to be a reasonable assumption for the main effect of the predictors, this assumption is not tenable for the interaction effect of two predictors measured with error. Although there are a number of flexible methods for handling additive, homogeneous measurement error in predictors in logistic regression models, relatively less attention has been paid to handling measurement error that depends on the unobserved predictor. Therefore, we propose a semiparametric Bayesian method for handling this unorthodox measurement error scenario in logistic regression models in the presence of the interaction term. The proposed method is also designed to handle partially missing values for the error-prone surrogate variables. Through simulation studies, we assess some operating characteristics of the proposed method and compare it with the simulation extrapolation and the regression calibration methods. Our method has smaller biases than the other methods. In addition, we analyze the NHANES data and assess the association between some important nutrients and high cholesterol level. Total fat and protein reinforce each other's association with the risk of having high cholesterol level.
Predicting success of bank telemarketing with classification trees and logistic regression
(2016-05) Yang, Chuanfeng; Zhou, Mingyuan (Assistant professor); Gawande, Kishore
The success of a bank marketing campaign is predicted from customer features, campaign information and economic attributes. To predict whether or not clients will subscribe to a long-term deposit, logistic regression is applied with backward variable selection and principal components analysis. Random forests and stochastic gradient boosting, both tree-based classifiers, are also built for comparison. Based on visualization and quantitative predictive performance, gradient boosting (AUC = 0.791) is slightly better than the other two models. Variable importance across the 3 models remains consistent for most variables. Social and economic attributes, such as euribor3m, are among the most important variables.
Statistical analysis of variables associated with convective initiation along the dryline
(Texas Tech University, 2006-08) Griesinger, Michael Patrick; Weiss, Christopher C.; Chang, Chia-Bo; Peterson, Richard E.
This thesis deals with the issue of convective initiation along the Southern Great Plains Dryline. Multiple variables were calculated for dryline days during the springs of 2004 and 2005 in West Texas through the use of surface observations obtained from the West Texas Mesonet and composite soundings generated through interpolation between soundings released in Midland and Amarillo, TX. With these data, a stepwise logistic regression process was used to generate a forecasting equation for the likelihood that a present dryline will be associated with deep moist convection. This equation is then tested in four case studies to determine its accuracy and potential to be used in operational forecasting of convective initiation along the West Texas dryline.
A study of courteous behavior on the University of Texas campus
(2010-12) Lu, Zhou, 1978-; Stolp, Chandler; Powers, Daniel A.
This study focused on measuring the courteous behavior of University of Texas at Austin (UT) students on campus. This behavior was measured by analyzing various factors involved when a person opened the door for another. The goal was to determine which factors would significantly affect the probability that a person would hold a door for another. Three UT buildings with no automatic doors were selected (RLM, FAC and GRE), and 200 pairs of students at each location were observed to see whether they would open doors for others. These subjects were not disturbed during the data collection process. For each observation, the door holding conditions, genders, position (whether the subject was the one who opened the door or the recipient of this courteous gesture, abbreviated as recipient), distance between the person opening the door and the recipient, and the number of recipients were recorded. Descriptive statistics and logistic regression were used to analyze the data. The results showed that the probability of people opening the doors for others was significantly affected by gender, position, distance between the person opening the door and the recipient, the number of recipients, and the interaction term between gender and position. The study revealed that men had a slightly higher propensity for opening the doors for the recipients. The odds for men were a multiplicative factor of 1.09 of those for women on average, holding all other factors constant. However, women had a much higher probability of having doors held open for them. The odds for men were a multiplicative factor of 0.55 of those for women on average, holding all other factors constant. In terms of the distance between the person opening the door and the recipient, for each meter increase in distance, the odds that the door would be held open would decrease by a multiplicative factor of 0.40 on average.
Additionally, for each increase in number of recipients, the odds that the door would be held open would increase by a multiplicative factor of 1.32 on average.
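The multiplicative factors reported in the abstract above act on odds, not on probabilities, and they compound across units. The sketch below shows that arithmetic. The factors 0.40 and 1.32 are taken from the abstract, while the baseline probability and the function names are hypothetical choices made for illustration.

```python
def prob_to_odds(p):
    """Convert a probability in (0, 1) to odds p / (1 - p)."""
    return p / (1.0 - p)


def odds_to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def apply_odds_factor(p, factor, units=1):
    """Return the probability after multiplying the odds by factor**units."""
    return odds_to_prob(prob_to_odds(p) * factor ** units)


BASELINE = 0.80       # hypothetical probability that the door is held
PER_METER = 0.40      # odds factor per additional meter (from the abstract)
PER_RECIPIENT = 1.32  # odds factor per additional recipient (from the abstract)

# One and two extra meters of distance: the odds shrink by 0.40 and 0.40**2.
print(apply_odds_factor(BASELINE, PER_METER, units=1))
print(apply_odds_factor(BASELINE, PER_METER, units=2))

# One extra recipient raises the odds by a factor of 1.32.
print(apply_odds_factor(BASELINE, PER_RECIPIENT, units=1))
```

Because the factor applies to odds, the change in probability depends on the baseline: the same 0.40 factor moves a probability near 0.5 by more percentage points than one near 0.99.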
Use of statins and the development of incident diabetes mellitus : a retrospective cohort study
(2015-05) Olotu, Busuyi Sunday; Shepherd, Marvin D.; Lawson, Kenneth A; Wilson, James P; Richards, Kristin M; Novak, Suzanne
Statins are pharmaceutical agents used in lowering blood cholesterol levels. Several landmark statin trials have demonstrated the beneficial effects of statins in both primary and secondary prevention of cardiovascular disease. Although statins are generally safe and well tolerated, several studies have suggested that statins are associated with a moderate increase in risk of new-onset diabetes. These observations prompted the FDA to revise statin labels to include a warning of an increased risk of incident diabetes mellitus as a result of increases in glycosylated hemoglobin (A1C) and fasting plasma glucose (FPG). However, few studies have used US-based data to investigate this statin-associated increased risk of diabetes. Thus, the purpose of this study was to evaluate whether statin use was associated with an increased risk of new-onset diabetes. In addition, this study evaluated whether diabetes risk was increased when patients received intensive statin doses. This study was a retrospective cohort analysis that utilized data from the Thomson Reuters MarketScan® Commercial Claims Database for the period of 2003 - 2004. The study population included new statin users who were aged 20 - 63 years at index and who did not have a history of diabetes. Among the study population (N=116,224), 6.5% (or N=7,593) had incident diabetes. Compared to no statin use, statin use was significantly associated with increased risk of incident diabetes (HR=2.752; 99% C.I.=2.535 - 2.987; p<0.0001). In addition, each statin type (i.e., atorvastatin, fluvastatin, lovastatin, pravastatin, rosuvastatin, and simvastatin) was associated with about a two-fold increase in risk of diabetes. Diabetes risk was highest among lovastatin users and lowest among rosuvastatin users. Furthermore, diabetes risk was higher among intensive-dose statin users compared to moderate-dose statin users (HR=1.540; 99% C.I.=1.393 - 1.704; p<0.0001).
Because of the proven benefits of statins in both primary and secondary prevention of cardiovascular disease, and because the potential for diabetogenicity differs among statins, health care professionals should individualize statin therapy by identifying patients who would benefit from less diabetogenic statin types. This could help optimize treatment by providing the highest benefit achievable while reducing the number of patients developing diabetes under statin therapy.
