Ariel Linden & Paul R. Yarnold

Linden Consulting Group, LLC & Optimal Data Analysis, LLC

In a recent paper, we assessed the ability of several classification algorithms (logistic regression, random forests, boosted regression, support vector machines, and classification tree analysis [CTA]) to correctly conclude that no relationship exists between the dependent variable and ten covariates generated completely at random. Only CTA correctly found that no relationship existed. In this study, we examine whether randomly drawn subsets of the original N=1000 dataset change the ability of these models to correctly find that no relationship exists. The random subsets comprised 250 and 500 observations. We further test the hold-out validity of each model by applying the logic of the model fitted on the subset to the remaining observations and computing the area under the receiver operating characteristic curve (AUC). Our results indicate that reducing the sample size has no effect on whether classification algorithms correctly determine that no relationship exists between variables in randomly generated data. Only CTA consistently identified that the data were random.
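The subsample-and-hold-out procedure described above can be illustrated with a minimal sketch. This is not the authors' code: the seed, the standard normal covariates, the binary coin-flip outcome, the gradient-descent logistic fit, and the helper names `auc`, `fit_logistic`, and `predict_score` are all assumptions made for illustration, and only one of the five algorithms (logistic regression) and one subset size (250) is shown.

```python
import numpy as np


def auc(y, score):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores without ties."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient-descent logistic regression with an intercept (illustrative)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_score(w, X):
    """Linear predictor; its ordering is all that the AUC depends on."""
    return w[0] + X @ w[1:]


# Assumed data-generating process: outcome and covariates are independent noise.
rng = np.random.default_rng(0)           # arbitrary seed, chosen for reproducibility
N, K, n_train = 1000, 10, 250            # N and K follow the abstract; 250 is one subset size
X = rng.normal(size=(N, K))              # ten covariates generated at random
y = rng.integers(0, 2, size=N)           # binary outcome unrelated to X

# Draw a random subset for model building; the remainder is the hold-out sample.
idx = rng.permutation(N)
train, hold = idx[:n_train], idx[n_train:]

w = fit_logistic(X[train], y[train])
train_auc = auc(y[train], predict_score(w, X[train]))
holdout_auc = auc(y[hold], predict_score(w, X[hold]))

print(f"training AUC (n={len(train)}): {train_auc:.3f}")
print(f"hold-out AUC (n={len(hold)}): {holdout_auc:.3f}")
```

Because the outcome is independent of the covariates in this sketch, any apparent discrimination in the training subset is fitted noise, and the hold-out AUC should sit close to 0.5, the value expected from chance.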