Paul R. Yarnold & Robert C. Soltysik
Optimal Data Analysis, LLC
Imagine a random sample S consisting of a class variable (“dependent measure”), one or more attributes (“independent measures”), a weight (unit-weighted observations are equally valued), and a number of observations N yielding at least minimally adequate statistical power for testing the a priori or post hoc hypothesis that the attributes predict the class variable. The null hypothesis is that the attributes cannot predict the class variable. A statistical model is identified for S that optimally (most accurately) predicts the (weighted) class variable on the basis of the attributes. This is the first axiom underlying novometrics, a term meaning new (Latin: novo) measurement and connoting a newly discovered, theoretically motivated algorithm that explicitly identifies the globally optimal (GO) statistical model(s) underlying S. Originating from operations research, “optimal” as used here denotes explicitly maximized (weighted) classification accuracy for S: that is, predicting the class category of (weighted) observations in S as accurately as is theoretically possible for S. Novometry identifies the nature and strength of the GO relationship(s) between a class variable and one or more attributes for S, where nature and strength are characterized by the number and homogeneity of the discrete sample strata identified by the GO statistical model(s). Models that maximize the effect strength for sensitivity (ESS, a normed index of effect strength) and efficiency (ESS divided by the number of strata, a normed index of parsimony) prevent over-fitting and promote cross-generalizability when the model is used to classify an independent random sample. This article demonstrates novometry for elemental applications involving one binary class variable and one ordered attribute. Using data from the Surveillance, Epidemiology, and End Results (SEER) Program, cancer incidence is parsed separately by sex (male, female) and by race (white, African American) to ascertain whether these class variables identify discrete patient strata differing in cancer incidence.
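To make the ESS and efficiency indices concrete for the elemental case of one binary class variable and one ordered attribute, the following is a minimal sketch and not the novometric algorithm itself. It assumes unit weights and assumes that ESS rescales the mean sensitivity across the two class categories so that 0 represents chance and 100 represents perfect classification. The function names `ess_binary` and `best_cutpoint` and the toy data are hypothetical, introduced only for illustration. The sketch exhaustively tries every cutpoint and both directions, and it omits the statistical significance testing that a full analysis would require.

```python
from typing import List, Tuple


def ess_binary(y_true: List[int], y_pred: List[int]) -> float:
    """ESS (%) for a class variable coded 0/1, assuming unit weights.

    Assumed definition: mean sensitivity across the two class categories,
    rescaled so that 0 = chance (50%) and 100 = perfect classification.
    """
    sensitivities = []
    for category in (0, 1):
        preds = [p for t, p in zip(y_true, y_pred) if t == category]
        if not preds:
            raise ValueError("both class categories must be present")
        sensitivities.append(sum(1 for p in preds if p == category) / len(preds))
    mean_sensitivity = sum(sensitivities) / 2
    return 100.0 * (mean_sensitivity - 0.5) / 0.5


def best_cutpoint(x: List[float], y: List[int]) -> Tuple[float, int, float]:
    """Exhaustively search for the single cutpoint on x that maximizes ESS.

    Returns (cutpoint, class predicted above the cutpoint, ESS).
    Ties are resolved in favor of the lowest cutpoint.
    """
    values = sorted(set(x))
    best = None
    # Candidate cutpoints are midpoints between adjacent distinct values.
    for low, high in zip(values, values[1:]):
        cut = (low + high) / 2
        for class_above in (1, 0):  # try both directions of the rule
            pred = [class_above if xi > cut else 1 - class_above for xi in x]
            ess = ess_binary(y, pred)
            if best is None or ess > best[2]:
                best = (cut, class_above, ess)
    if best is None:
        raise ValueError("attribute must take at least two distinct values")
    return best


# Hypothetical toy data: one ordered attribute and one binary class variable.
x = [1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 1, 0, 1, 1]

cut, class_above, ess = best_cutpoint(x, y)
n_strata = 2  # a single cutpoint partitions the sample into two strata
efficiency = ess / n_strata
print(f"cutpoint={cut}, class above cutpoint={class_above}, "
      f"ESS={ess:.1f}, efficiency={efficiency:.1f}")
```

In this toy sample the rule "predict class 1 if x > 3.5" classifies all class 1 observations and three of the four class 0 observations correctly, giving a mean sensitivity of 87.5%, an ESS of 75, and an efficiency of 37.5 for the two-strata model.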