Study on the relationship of training data size to error rate and the performance comparison for two decision tree algorithms

dc.creator: Zheng, Jianjun
dc.date.accessioned: 2016-11-14T23:28:00Z
dc.date.available: 2011-02-18T22:06:24Z
dc.date.available: 2016-11-14T23:28:00Z
dc.date.issued: 2004-08
dc.degree.department: Computer Science
dc.description.abstract: The decision tree model is a well-accepted and widely used classification technique in the data mining field because of its advantages of fast construction, accuracy, and understandability. A decision tree model can be induced by algorithms such as C4.5 and CART. This thesis research studies the relationship of training data size to error rate for the C4.5 and CART algorithms and also compares the performance of the two algorithms. Several conclusions are drawn from the results; for example, the widely accepted 66.7:33.3 training-to-test splitting ratio in the literature can be increased to 80:20 for large data sets with more than 1000 samples to generate more accurate decision tree models. The research also shows that the performance of C4.5 and CART is similar on small data sets but differs on large data sets; therefore, large data sets are more suitable for comparing different algorithms.
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/2346/17571
dc.language.iso: eng
dc.publisher: Texas Tech University
dc.rights.availability: Unrestricted.
dc.subject: Errors -- Measurement -- Analysis
dc.subject: Decision making -- Mathematical models
dc.subject: Decision logic tables
dc.subject: Chi-square test
dc.subject: Distribution (Probability theory)
dc.subject: Decision trees
dc.subject: Entropy -- Measurement
dc.subject: Algorithms
dc.title: Study on the relationship of training data size to error rate and the performance comparison for two decision tree algorithms
dc.type: Thesis
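
To illustrate the kind of experiment the abstract describes, the sketch below trains a CART-style decision tree on a 66.7:33.3 split and then on an 80:20 split and reports the held-out error rate for each. This is not the thesis's own experimental setup: scikit-learn's DecisionTreeClassifier (an optimized CART variant), the breast-cancer sample data set, and the fixed random seed are illustrative assumptions, and scikit-learn has no C4.5 implementation, so only the CART side is sketched.

# Minimal sketch (not from the thesis): compare held-out error rates of a
# CART-style tree under two training-to-test splitting ratios.
# Assumptions: scikit-learn's DecisionTreeClassifier stands in for CART, and
# the bundled breast-cancer data set (~569 samples) stands in for the thesis data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for train_fraction in (0.667, 0.80):
    # Hold out the remainder of the data as the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_fraction, random_state=0
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    error_rate = 1.0 - tree.score(X_test, y_test)  # misclassification rate
    print(f"train fraction {train_fraction:.1%}: test error rate {error_rate:.3f}")

A single run like this is noisy; a fair comparison of splitting ratios or of two algorithms would average the error rate over many random splits, which is the spirit of the study summarized above.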
