Study on the relationship of training data size to error rate and the performance comparison for two decision tree algorithms

Date

2004-08

Publisher

Texas Tech University

Abstract

The decision tree model is a well-accepted and widely used classification technique in data mining because of its fast construction, accuracy, and understandability. Decision tree models can be induced by algorithms such as C4.5 and CART. This thesis studies the relationship between training data size and error rate for the C4.5 and CART algorithms, and compares the performance of the two.

Several conclusions are drawn from the results. For example, the 66.7:33.3 splitting ratio that is well accepted in the literature can be increased to 80:20 for large data sets with more than 1000 samples to generate more accurate decision tree models. The results also show that the performance of C4.5 and CART is similar on small data sets but differs on large data sets; therefore, large data sets are more suitable for comparing different algorithms.
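As an illustration of the two splitting ratios discussed above, the following minimal Python sketch partitions a data set by a holdout split and computes an error rate. It is not the thesis's experimental code: the function names `holdout_split` and `error_rate`, the fixed seed, and the 1200-sample placeholder data set are all assumptions made for this example, and the ratio is read here as training share to test share.

```python
import random


def holdout_split(samples, train_fraction, seed=0):
    """Shuffle a copy of `samples` and split it into (train, test).

    `train_fraction` is the share of samples assigned to training,
    e.g. 2/3 for a 66.7:33.3 split or 0.8 for an 80:20 split.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible (an assumption of this sketch).
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def error_rate(predicted, actual):
    """Return the fraction of predictions that differ from the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


# Hypothetical data set of 1200 sample indices (more than 1000 samples).
data = list(range(1200))
train_a, test_a = holdout_split(data, 2 / 3)  # 66.7:33.3 split
train_b, test_b = holdout_split(data, 0.8)    # 80:20 split
print(len(train_a), len(test_a))
print(len(train_b), len(test_b))
```

In an actual comparison, a decision tree would be induced on the training portion under each ratio and `error_rate` applied to its predictions on the corresponding test portion.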
