# Optimizing Suboptimal Classification Trees: Matlab® CART Model Predicting Probability of Lower Limb Prosthesis User’s Functional Potential

Paul R. Yarnold & Ariel Linden

Optimal Data Analysis, LLC & Linden Consulting Group, LLC

After any algorithm that controls the growth of a classification tree model has finished, the resulting model must be pruned to explicitly maximize predictive accuracy normed against chance. This article illustrates manually conducted maximum-accuracy pruning of a classification and regression tree (CART) model developed to predict the functional capacity of lower limb prosthesis users.


# Regression vs. Novometric Analysis Predicting Income Based on Education

Paul R. Yarnold

Optimal Data Analysis, LLC

This study compares linear regression vs. novometric models of the association between education and income for a sample of 32 observations. Regression analysis identified a relatively strong effect (R-squared=56.4%), but only 25% of point predictions fell within a 20% band of actual income. Novometric analysis identified a strong effect (ESS=81.7%) which was stable in jackknife validity analysis: the model correctly classified 91.7% of observations earning income less than \$12,405, and 90.0% of those earning greater income. Within each group (income below the optimal threshold, and income above it), factors other than the number of years of education influenced earned income.
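The reported ESS can be checked against the two classification rates given above. The sketch below assumes the usual definition of ESS in the ODA literature: mean per-class sensitivity rescaled so that chance performance is 0 and perfect classification is 100. The function name `ess` is illustrative only and is not taken from the article.

```python
def ess(sensitivities):
    """Effect strength for sensitivity (ESS), in percent.

    Assumed definition: the mean per-class sensitivity (in %) is rescaled
    so that chance-level accuracy maps to 0 and perfect accuracy to 100.
    """
    n_classes = len(sensitivities)
    chance = 100.0 / n_classes          # 50% for a two-class problem
    mean_sensitivity = sum(sensitivities) / n_classes
    return 100.0 * (mean_sensitivity - chance) / (100.0 - chance)


# Per-class accuracies reported in the abstract: 91.7% and 90.0%.
value = ess([91.7, 90.0])
print(round(value, 1))
```

Under this definition the two rates reproduce the ESS stated in the abstract, which suggests the reported figures are internally consistent.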


# Effect of Sample Size on Discovery of Relationships in Random Data by Classification Algorithms

Ariel Linden & Paul R. Yarnold

Linden Consulting Group, LLC & Optimal Data Analysis, LLC

In a recent paper, we assessed whether several classification algorithms (logistic regression, random forests, boosted regression, support vector machines, and classification tree analysis [CTA]) correctly find no relationship between the dependent variable and ten covariates generated completely at random. Only classification tree analysis correctly observed that no relationship existed. In this study, we examine whether randomly drawn subsets of the original N=1000 dataset change the ability of these models to correctly observe that no relationship exists. The randomly drawn samples contained 250 and 500 observations. We further test the hold-out validity of these models by applying each generated model's logic to the remaining sample and computing the area under the receiver operating characteristic curve (AUC). Our results indicate that limiting the sample size has no effect on whether classification algorithms correctly determine that a relationship does not exist between variables in randomly generated data. Only CTA consistently identified that the data were random.
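As a rough illustration of the hold-out check described above, the following sketch computes AUC by direct pairwise comparison for scores that are unrelated to class membership. It is not the authors' analysis: the scores are a random stand-in for a fitted model's output, and the seed, the 250/750 split, and the helper name `auc` are assumptions made for the example.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)                     # arbitrary seed for repeatability
n_total, n_train = 1000, 250               # training subset size is an assumption

labels = [rng.randint(0, 1) for _ in range(n_total)]
scores = [rng.random() for _ in range(n_total)]   # stand-in scores, unrelated to labels

# Evaluate only on the observations held out of the training subset.
holdout_auc = auc(labels[n_train:], scores[n_train:])
print(round(holdout_auc, 3))
```

When scores carry no information about the class, the hold-out AUC should land close to 0.5, which is the benchmark against which a claimed relationship in random data would be judged.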


# ODA vs. χ2, r, and τ: Trauma Exposure in Childhood and Duration of Participation in Eating-Disorder Treatment Program

Paul R. Yarnold

Optimal Data Analysis, LLC

This note illustrates the disorder and confusion attributable to an analytic ethos in which a smorgasbord of different statistical tests is used to test identical or parallel statistical hypotheses. Herein four classic methods are used for an application with a binary class (dependent) variable and an ordered attribute (independent variable) measured using a five-point scale. Legacy methods reach different conclusions: which is correct? In absolute contrast, for a given sample and hypothesis, novometric analysis identifies every statistically viable model (models vary as functions of precision and complexity) which reproducibly maximizes the predictive accuracy for the sample.


# Novometric Stepwise CTA Analysis Discriminating Three Class Categories Using Two Ordered Attributes

Paul R. Yarnold & Ariel Linden

Optimal Data Analysis, LLC & Linden Consulting Group, LLC

The adaptability of novometric analysis is illustrated for an example involving three class categories and two ordered attributes.


# Some Machine Learning Algorithms Find Relationships Between Variables When None Exist — CTA Doesn’t

Ariel Linden & Paul R. Yarnold

Linden Consulting Group, LLC & Optimal Data Analysis, LLC

Automated machine learning algorithms are widely promoted as the best approach for estimating propensity scores, because these methods detect patterns in the data which manual efforts fail to identify. If classification algorithms are indeed ideal for identifying relationships between treatment group participation and covariates which predict participation, then it stands to reason that these algorithms would also be unable to find relationships when none exist (i.e., covariates do not predict treatment group assignment). Accordingly, we compare the predictive accuracy of maximum-accuracy classification tree analysis (CTA) vs. classification algorithms most commonly used to obtain the propensity score (logistic regression, random forests, boosted regression, and support vector machines). However, here we use an artificial dataset in which ten continuous covariates are randomly generated and by design have no correlation with the binary dependent variable (i.e., treatment assignment). Among all of the algorithms tested, only CTA correctly failed to discriminate between treatment and control groups based on the covariates. These results lend further support to the use of CTA for generating propensity scores as an alternative to other common approaches which are currently in favor.
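The design described above (ten random continuous covariates, a binary outcome unrelated to them) can be mimicked in a few lines. The sketch below is only an illustration of why in-sample fit can mislead on such data. It uses a 1-nearest-neighbour rule, which is not one of the algorithms compared in the article and was chosen here solely because it needs no external libraries. Sample sizes, the seed, and all function names are assumptions.

```python
import random

rng = random.Random(42)  # arbitrary seed for repeatability


def make_random_data(n, n_covariates=10):
    """Covariates and binary labels drawn independently of one another."""
    X = [[rng.gauss(0, 1) for _ in range(n_covariates)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y


def predict_1nn(train_X, train_y, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]


def accuracy(train_X, train_y, X, y):
    """Proportion of cases in (X, y) classified correctly by the 1-NN rule."""
    hits = sum(predict_1nn(train_X, train_y, x) == t for x, t in zip(X, y))
    return hits / len(y)


train_X, train_y = make_random_data(200)
test_X, test_y = make_random_data(200)

train_acc = accuracy(train_X, train_y, train_X, train_y)  # each point matches itself
test_acc = accuracy(train_X, train_y, test_X, test_y)     # expected to be near chance
print(train_acc, test_acc)
```

A flexible classifier can reproduce its training labels perfectly even when the covariates are pure noise, while accuracy on fresh random data stays near 50%. This is the kind of spurious "relationship" that the comparison in the abstract is designed to expose.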


# Optimal Markov Model Relating Two Time-Lagged Outcomes

Paul R. Yarnold

Optimal Data Analysis, LLC

This paper demonstrates the use of maximum-accuracy weighted Markov analysis to model the relationship between two time-lagged variables—serial ratings of pain during the day and subsequent quality of sleep at night—for an individual.
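To make the data structure concrete, the sketch below tabulates how often each daytime pain rating is followed by each rating of sleep quality that night. This is a plain, unweighted transition table and not the maximum-accuracy weighted procedure the paper demonstrates. The ratings are invented for illustration, and the function name `transition_table` is hypothetical.

```python
from collections import Counter, defaultdict

# Invented ratings for one individual: pain on day t, sleep quality on night t.
pain = ["high", "low", "low", "high", "high", "low", "high", "low"]
sleep = ["poor", "good", "good", "poor", "good", "good", "poor", "poor"]


def transition_table(earlier, later):
    """Estimate P(later state | earlier state) from paired observations."""
    counts = defaultdict(Counter)
    for a, b in zip(earlier, later):
        counts[a][b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }


table = transition_table(pain, sleep)
for state, row in table.items():
    print(state, row)
```

Each row of the table is a conditional distribution over the later outcome given the earlier one, which is the basic object a Markov model of two time-lagged variables works with.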
