Globally Optimal Statistical Classification Models, I: Binary Class Variable, One Ordered Attribute

Paul R. Yarnold & Robert C. Soltysik

Optimal Data Analysis, LLC

Imagine a random sample S consisting of a class variable (“dependent measure”), one or more attributes (“independent measures”), a weight (under unit weighting, all observations are equally valued), and a number of observations N yielding at least minimally adequate statistical power for testing the a priori or post hoc hypothesis that the attributes predict the class variable. The null hypothesis is that the attributes cannot predict the class variable. A statistical model is identified for S that optimally (most accurately) predicts the (weighted) class variable on the basis of the attributes. This is the first axiom underlying novometrics, a term meaning new (Latin: novo) measurement and connoting a newly discovered, theoretically motivated algorithm that explicitly identifies the globally optimal (GO) statistical model(s) underlying S. Originating from operations research, “optimal” as used here denotes explicitly maximized (weighted) classification accuracy for S: that is, predicting the class category of (weighted) observations in S as accurately as is theoretically possible for S. Novometry identifies the nature and strength of the GO relationship(s) between a class variable and one or more attributes for S, where nature and strength are characterized by the number and homogeneity of discrete sample strata identified by the GO statistical model(s). Models that maximize ESS (a normed index of effect strength) and efficiency (ESS divided by the number of strata, a normed index of parsimony) prevent over-fitting and promote cross-generalizability when the model is used to classify an independent random sample. This article demonstrates novometry for elemental applications involving one binary class variable and one ordered attribute.
Using data from the Surveillance, Epidemiology, and End Results (SEER) Program, cancer incidence is parsed separately by sex (male, female) and by race (white, African American) to ascertain whether these class variables identify discrete patient strata differing in cancer incidence.
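
To make the two indices named in the abstract concrete, the following is a minimal sketch for the elemental case of one binary class variable and one ordered attribute. It is not the authors' algorithm: it is a plain exhaustive search over single cutpoints (two strata only), and it assumes that ESS for a binary class variable is the mean per-class sensitivity rescaled so that 0 corresponds to chance and 100 to perfect classification. The function names `ess_binary` and `best_two_strata_model` are hypothetical and used only for illustration.

```python
from typing import Optional, Sequence, Tuple


def ess_binary(sens_class0: float, sens_class1: float) -> float:
    """Assumed ESS definition for two classes: mean sensitivity
    rescaled so that chance (0.5) maps to 0 and perfect (1.0) to 100."""
    mean_sens = (sens_class0 + sens_class1) / 2.0
    return 100.0 * (mean_sens - 0.5) / 0.5


def best_two_strata_model(
    x: Sequence[float],
    y: Sequence[int],
    w: Optional[Sequence[float]] = None,
) -> Tuple[float, int, float]:
    """Exhaustively search single cutpoints on an ordered attribute x
    for a binary class variable y (coded 0/1), with optional weights w.

    Returns (cutpoint, class predicted at or below the cutpoint, ESS).
    """
    if w is None:
        w = [1.0] * len(x)  # unit weighting: every observation counts equally

    # Total weight per class, used as the sensitivity denominators.
    total = {0: 0.0, 1: 0.0}
    for yi, wi in zip(y, w):
        total[yi] += wi
    if total[0] == 0.0 or total[1] == 0.0:
        raise ValueError("both classes must be present in the sample")

    values = sorted(set(x))
    if len(values) < 2:
        raise ValueError("the attribute must take at least two distinct values")

    best: Optional[Tuple[float, int, float]] = None
    # Candidate cutpoints are midpoints between adjacent distinct values.
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2.0
        below = {0: 0.0, 1: 0.0}  # class weight falling at or below the cut
        for xi, yi, wi in zip(x, y, w):
            if xi <= cut:
                below[yi] += wi
        # Try both directions: which class is predicted below the cut.
        for low_class in (0, 1):
            high_class = 1 - low_class
            sens_low = below[low_class] / total[low_class]
            sens_high = (total[high_class] - below[high_class]) / total[high_class]
            ess = ess_binary(sens_low, sens_high)
            if best is None or ess > best[2]:
                best = (cut, low_class, ess)
    return best


if __name__ == "__main__":
    cut, low_class, ess = best_two_strata_model([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
    n_strata = 2  # a single cutpoint always yields two strata
    print(cut, low_class, ess, ess / n_strata)  # efficiency = ESS / number of strata
```

A single cutpoint always produces two strata, so efficiency in this sketch is simply ESS/2. Models with more strata, which the abstract allows for, are outside what this example covers.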
