The initial report on concurrent Phase-3 clinical trials assessing the safety and efficacy of the Oxford-AstraZeneca (Ox-AZ) COVID-19 vaccine is a superb example of Simpson’s Paradox, arguably the greatest threat to the validity of statistical analysis in empirical science. I discuss the confounded Ox-AZ analysis in the context of the prevention of paradoxical findings.
The Insidious Problem of Simpson’s Paradox
Simpson’s Paradox (SP) is easy to understand. Consider a statistical analysis using a sample combining two or more different groups of subjects.
For example, imagine that one group is the two full doses (2D) group in the Ox-AZ Phase-3 trial, and that a second group is the half-dose-then-full-dose (1.5D) group.
By definition, SP occurs when the result of an analysis conducted on a sample that combines two or more different groups differs from the results of the same analysis conducted separately for each group. The problem is that the result obtained using the combined sample does not generalize across all of the groups included in it—for some of the groups the finding of the combined analysis is invalid.
SP is a well-known problem which has been discussed over many decades in the statistical literature.
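In its best-known form, SP actually reverses the direction of an effect when groups are pooled. A minimal sketch using the widely cited kidney-stone treatment counts (illustrative only; unrelated to the Ox-AZ trial):

```python
# Classic reversal form of Simpson's Paradox, using the well-known
# kidney-stone treatment data (illustrative; unrelated to the vaccine trial).
def success_rate(successes, total):
    return successes / total

# Treatment A wins within each subgroup...
assert success_rate(81, 87) > success_rate(234, 270)    # small stones
assert success_rate(192, 263) > success_rate(55, 80)    # large stones

# ...yet Treatment B appears to win once the subgroups are pooled.
pooled_a = success_rate(81 + 192, 87 + 263)   # 273/350 = 0.78
pooled_b = success_rate(234 + 55, 270 + 80)   # 289/350 ~ 0.83
assert pooled_a < pooled_b
```

The Ox-AZ case discussed below is a milder form of the same problem: the pooled estimate does not reverse direction, but it accurately describes neither of the groups that produced it.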
As an example of SP, consider that in the Ox-AZ trial the 2D group had 62% efficacy, and the 1.5D group had 90% efficacy. These results were weighted by the number of observations in the 2D and 1.5D groups, and efficacy for the combined group was reported as 70%.
The “combined” estimate of 70% is an inaccurate estimate of efficacy for both the 2D group (62%) and the 1.5D group (90%—itself modestly lower than the efficacy reported earlier for two competing vaccines).
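The pooling arithmetic can be reproduced with hypothetical case counts chosen so the per-group efficacies match the reported 62% and 90%; the actual trial counts are not given here and differ from these:

```python
# Hypothetical case counts chosen so that per-group efficacies match the
# reported 62% (2D) and 90% (1.5D); the real trial counts differ.
def efficacy(vacc_cases, vacc_n, placebo_cases, placebo_n):
    """Vaccine efficacy = 1 - relative risk (vaccinated vs. placebo)."""
    return 1 - (vacc_cases / vacc_n) / (placebo_cases / placebo_n)

two_dose = dict(vacc_cases=38, vacc_n=8000, placebo_cases=100, placebo_n=8000)
half_dose = dict(vacc_cases=4, vacc_n=2700, placebo_cases=40, placebo_n=2700)

eff_2d = efficacy(**two_dose)     # ~0.62
eff_15d = efficacy(**half_dose)   # ~0.90

# Pooling the groups yields an estimate that matches neither group.
eff_pooled = efficacy(
    two_dose["vacc_cases"] + half_dose["vacc_cases"],
    two_dose["vacc_n"] + half_dose["vacc_n"],
    two_dose["placebo_cases"] + half_dose["placebo_cases"],
    two_dose["placebo_n"] + half_dose["placebo_n"],
)                                  # ~0.70
```

The pooled 70% is simply an artifact of the relative sizes of the two groups: change the group sizes and the “combined efficacy” changes, while the per-group efficacies do not.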
Adjust for Between-Sample Baseline Differences to Reduce Threats to Causal Inference
In the Ox-AZ Phase-3 study, the 2D and 1.5D samples differed on multiple dimensions—such as age, gender, and cultural milieu—which may theoretically (and/or empirically) be associated with outcome. For example, age and cultural factors, and associated physical, physiological, and behavioral decision-making co-factors, may affect recruitment into, retention in, and/or response to the Ox-AZ vaccine trial.
It is thus imperative to consider possible threats to causal inference. One standard approach is to weight subjects by propensity scores in order to adjust for baseline factors that differentiate the samples. If more than one propensity-score formulation can be identified, it is imperative to ensure that the model(s) used are correctly specified.
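As a sketch of what such an adjustment looks like in practice, here is a minimal inverse-probability-of-treatment weighting (IPTW) example on synthetic data. The covariate (age), the logistic propensity model, and all numbers are illustrative assumptions, not the trial’s actual procedure:

```python
# Minimal IPTW sketch on synthetic data; every number here is an
# illustrative assumption, not the Ox-AZ analysis itself.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(45.0, 12.0, n)              # baseline covariate

# Group membership (1 = 1.5D, 0 = 2D) depends on age -> baseline imbalance
group = rng.binomial(1, 1.0 / (1.0 + np.exp(0.08 * (age - 45.0))))

# Fit a logistic propensity model P(group = 1 | age) by Newton's method
X = np.column_stack([np.ones(n), age])
beta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += np.linalg.solve((X.T * (p * (1.0 - p))) @ X, X.T @ (group - p))
ps = 1.0 / (1.0 + np.exp(-X @ beta))

# Stabilized inverse-probability weights
w = np.where(group == 1, group.mean() / ps, (1.0 - group.mean()) / (1.0 - ps))

# Weighting shrinks the between-group difference in mean age
raw_gap = abs(age[group == 1].mean() - age[group == 0].mean())
weighted_gap = abs(np.average(age[group == 1], weights=w[group == 1])
                   - np.average(age[group == 0], weights=w[group == 0]))
```

After weighting, `weighted_gap` is far smaller than `raw_gap`: the weighted pseudo-population is balanced on the covariate, so a between-group comparison is no longer confounded by it—provided, as the text stresses, that the propensity model is correctly specified.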
An Ox-AZ spokesperson commented that the highest research standards were used in their study. However, in the Ox-AZ Phase-3 study, the inclusion of the 1.5D sample was due to a manufacturing error. Given the worldwide focus on this research, why did this fundamental failure to follow the experimental design occur? What role did the FDA have in inspecting and supervising the Ox-AZ studies—and should that role be expanded to help maximize experimental rigor? Policy changes should be implemented to prevent such errors of experimental methodology going forward.
It is imperative that Ox-AZ explain the underlying process whereby the 1.5D protocol produced a response nearly 50% greater than that of the 2D protocol actually intended by the Ox-AZ scientific team.
It is also crucial to explain why the science that recommended the 2D plan yielded such a comparatively ineffective regimen (62% efficacy).
Undermining Public Confidence in Science
The Ox-AZ collaboration was closely followed worldwide. The inexplicable error in experimental method and the subsequent invalid statistical analysis are a substantial setback for the Ox-AZ vaccine effort. Much worse, such a high-profile fiasco fuels further erosion of public confidence in science—starting with the “top institutions”.
To Learn More
Free-to-view references are cited where available; otherwise, pay-to-view references are given.
SP has been widely studied in the ODA lab; for general background on the paradox, see https://www.britannica.com/topic/Simpsons-paradox.
SP occurs in many research areas involving time series, such as predicting monthly precipitation and temperature anomalies: https://odajournal.com/2013/09/19/the-use-of-unconfounded-climatic-data-improves-atmospheric-prediction/.
Confounding can occur in single-subject designs, for example evaluating symptom ratings made by one person across time: https://odajournal.com/2013/11/07/ascertaining-an-individual-patients-symptom-dominance-hierarchy-analysis-of-raw-longitudinal-data-induces-simpsons-paradox/
In studies involving combined groups and two or more variables, some forms of SP can be resolved, but other forms are unresolvable. The following article discusses every type of paradox which can occur for every possible result involving two groups:
- Yarnold PR (1996). Characterizing and circumventing Simpson’s paradox for ordered bivariate data. Educational and Psychological Measurement, 56, 430-442.
SP can occur when combining summary statistics from independent studies: https://odajournal.com/2015/04/21/estimating-inter-rater-reliability-using-pooled-data-induces-paradoxical-confounding-an-example-involving-emergency-severity-index-triage-ratings/.
ODA is used to study mediation, in order to identify causal pathways to outcomes:
- Linden A, Yarnold PR (2018). Identifying causal mechanisms in health care interventions using classification tree analysis. Journal of Evaluation in Clinical Practice, 24, 353-361. DOI: 10.1111/jep.12848
Results of paradoxical confounding which cause harm to people can be successfully litigated: https://odajournal.com/2013/09/19/junk-science-test-validity-and-the-uniform-guidelines-for-personnel-selection-procedures-the-case-of-melendez-v-illinois-bell/.
Results obtained when studying the impact of mask mandates on COVID-19 cases vary as a function of the heterogeneity of the sample being considered, providing clear evidence of Simpson’s Paradox and confounded findings.
- Maloney MJ, Rhodes NJ, Yarnold PR. Mask mandates can limit COVID spread: Quantitative assessment of month-over-month effectiveness of governmental policies in reducing the number of new COVID-19 cases in 37 US States and the District of Columbia. medRxiv Posted October 08, 2020. doi: https://doi.org/10.1101/2020.10.06.20208033
Novometric ODA is used to identify every statistically viable propensity score model which exists for the sample and varies as a function of complexity—thereby rendering model misspecification impossible:
- Linden A, Yarnold PR (2017). Using classification tree analysis to generate propensity score weights. Journal of Evaluation in Clinical Practice, 23, 703-712. DOI: 10.1111/jep.12744
Metric confounding occurs in multiple subject temporal designs evaluating change over time: https://odajournal.com/2013/10/23/ipsative-standardization-is-essential-in-the-analysis-of-serial-data/, and also in dose-response studies: https://odajournal.com/2016/05/31/using-machine-learning-to-model-dose-response-relationships-via-oda-eliminating-response-variable-baseline-variation-by-ipsative-standardization/.
Novometric theory (the current theoretical elaboration of the ODA paradigm), whose name denotes a new standard or system of measurement, illuminates factors affecting vaccination decision-making: https://odajournal.com/2020/06/28/what-is-novometric-data-analysis/
Initial Findings of the Oxford-AstraZeneca COVID-19 Vaccine Efficacy Trial Are Confounded
Paul R. Yarnold, Ph.D.
December 9, 2020