Study of the relationship of training set size to error rate in Yet Another Decision Tree (YaDT) and Random Forest algorithms

Date

2006-05

Publisher

Texas Tech University

Abstract

Classification algorithms are among the most widely used data mining techniques for prediction. Among them, the decision tree is a predictive classification model with significant advantages over other techniques: it is easy to interpret, quick to construct, highly accurate, and economical in its use of computational resources. Decision tree models can be built by algorithms such as C4.5, CART, YaDT, and Random Forest, whose performance is evaluated by error rates.

This thesis research studies the relationship of training data size to error rate for the YaDT and Random Forest algorithms, and compares the performance of both with the results of C4.5 and CART. The research supports several conclusions. For example, the widely accepted 66.7:33.3 splitting ratio in the literature can be increased to 80:20 for large data sets (more than 1,000 samples) to generate more accurate decision tree models. The stability of all the algorithms studied degrades beyond a 90:10 ratio because very little testing data remains. The research also reveals that while YaDT performs similarly to C4.5 and CART, Random Forest performs significantly better than the other three. Model performance can be assessed most reliably with large data sets.
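The split-ratio experiment described above can be sketched in Python with scikit-learn. This is not the thesis's actual setup: YaDT is a separate C++ tool, so scikit-learn's CART-style DecisionTreeClassifier stands in for the single-tree algorithms, and a synthetic data set replaces the thesis's data. The ratios 66.7:33.3, 80:20, and 90:10 come from the abstract; everything else (sample size, features, hyperparameters) is an illustrative assumption.

```python
# Sketch of a training-set-size vs. error-rate experiment, assuming
# scikit-learn as a stand-in for the algorithms named in the thesis.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data set with more than 1,000 samples, per the abstract's
# threshold for "large" data sets (parameters are illustrative).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)

def error_rates(train_size):
    """Return (single-tree error, random-forest error) for one split ratio."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_size, random_state=0)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=100,
                                    random_state=0).fit(X_tr, y_tr)
    # Error rate = 1 - accuracy on the held-out test partition.
    return 1.0 - tree.score(X_te, y_te), 1.0 - forest.score(X_te, y_te)

# The three splitting ratios discussed in the abstract.
for ratio in (0.667, 0.80, 0.90):
    tree_err, forest_err = error_rates(ratio)
    print(f"train {ratio:.1%}: tree error {tree_err:.3f}, "
          f"forest error {forest_err:.3f}")
```

At the 90:10 ratio the test partition holds only 200 samples here, which illustrates why error estimates become unstable as the split grows more extreme.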
