# FORUM: Initial Findings of the Oxford-AstraZeneca COVID-19 Vaccine Efficacy Trial Are Confounded

The initial report on the concurrent Phase-3 clinical trials assessing the safety and efficacy of the Oxford-AstraZeneca (Ox-AZ) COVID-19 vaccine is a textbook example of Simpson’s Paradox, arguably the greatest threat to the validity of statistical analysis in empirical science. I discuss the confounded Ox-AZ analysis in the context of preventing paradoxical findings.

The Insidious Problem of Simpson’s Paradox

Simpson’s Paradox (SP) is easy to understand. Consider a statistical analysis using a sample combining two or more different groups of subjects.

For example, imagine that one group received two full doses (2D) in the Ox-AZ Phase-3 trial, and that a second group received a half dose followed by a full dose (1.5D).

By definition, SP occurs when the result of an analysis conducted on a combined sample of two or more different groups differs from the results of the same analysis conducted separately for each group. The problem is that the result obtained from the combined sample does not generalize across all of the constituent groups: for some of the groups, the finding of the combined analysis is invalid.
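A minimal numeric sketch makes the paradox concrete. The counts below are illustrative (the classic kidney-stone pattern from the statistical literature, not Ox-AZ data): one treatment has the higher success rate in both subgroups, yet the lower success rate once the subgroups are pooled.

```python
def rate(successes, total):
    """Success rate for a (successes, total) pair."""
    return successes / total

# Illustrative counts, not Ox-AZ data: treatment A has the higher success
# rate in BOTH subgroups, yet the lower rate once subgroups are pooled.
A = {"mild": (81, 87), "severe": (192, 263)}
B = {"mild": (234, 270), "severe": (55, 80)}

for grp in ("mild", "severe"):
    print(f"{grp:>6}:  A {rate(*A[grp]):.1%}   B {rate(*B[grp]):.1%}")

def pooled(d):
    """Success rate after collapsing all subgroups into one sample."""
    return rate(sum(s for s, _ in d.values()), sum(t for _, t in d.values()))

print(f"pooled:  A {pooled(A):.1%}   B {pooled(B):.1%}")
```

The per-subgroup comparison and the pooled comparison point in opposite directions, which is exactly the failure of generalization described above.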

SP is a well-known problem that has been discussed for many decades in the statistical literature.

As an example of SP, consider that in the Ox-AZ trial the 2D group had 62% efficacy and the 1.5D group had 90% efficacy. Weighting these results by the number of observations in each group, efficacy for the combined sample was reported as 70%.

The “combined” estimate of 70% is an inaccurate estimate of efficacy both for the 2D group (62%) and for the 1.5D group (90%, modestly lower than the efficacy reported earlier for two competing vaccines).
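As a sketch of the pooling arithmetic, the case counts below are hypothetical, chosen only to reproduce the reported efficacies (they are not the actual trial tallies); efficacy is computed as one minus the relative risk in the vaccinated versus control arms:

```python
def efficacy(vacc_cases, vacc_n, ctrl_cases, ctrl_n):
    """Vaccine efficacy = 1 - relative risk (vaccinated vs. control)."""
    return 1 - (vacc_cases / vacc_n) / (ctrl_cases / ctrl_n)

# Hypothetical counts chosen only to reproduce the reported efficacies;
# equal arm sizes within each group keep the arithmetic transparent.
two_dose  = dict(vacc_cases=27, vacc_n=4400, ctrl_cases=71, ctrl_n=4400)
half_dose = dict(vacc_cases=3,  vacc_n=1370, ctrl_cases=30, ctrl_n=1370)

print(f"2D efficacy:     {efficacy(**two_dose):.0%}")   # ~62%
print(f"1.5D efficacy:   {efficacy(**half_dose):.0%}")  # 90%

# Pooling simply adds the counts; the "combined" estimate describes
# neither group, which is exactly the paradox.
combined = {k: two_dose[k] + half_dose[k] for k in two_dose}
print(f"Pooled efficacy: {efficacy(**combined):.0%}")   # ~70%
```

Because the 2D group contributes far more observations, the pooled figure sits much closer to 62% than to 90%, yet it is a valid estimate for neither group.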

Adjust for Between-Sample Baseline Differences to Reduce Threats to Causal Inference

In the Ox-AZ Phase-3 study, the 2D and 1.5D samples differed on multiple dimensions, such as age, gender, and cultural milieu, which may be associated, theoretically and/or empirically, with outcome. For example, age and cultural factors, and associated physical, physiological, and behavioral decision-making co-factors, may affect recruitment into, retention in, and/or response to the Ox-AZ vaccine trial.

It is thus imperative to consider possible threats to causal inference. One way to address such threats is to weight subjects by propensity scores, adjusting for the baseline factors that differentiate the samples. If more than one propensity-score formulation can be identified, it is essential to ensure that the model(s) used are correctly specified.
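A minimal sketch of propensity-score weighting on simulated data (the ages, assignment mechanism, and propensity model are all hypothetical, chosen purely for illustration): inverse-probability weights rebalance a baseline covariate that differs between the two dosing groups.

```python
import math
import random

random.seed(0)
ages = [random.gauss(50, 15) for _ in range(20000)]

# Hypothetical assignment mechanism: younger subjects are more likely
# to land in the 1.5D group, creating a baseline imbalance.
def p_15d(age):
    return 1 / (1 + math.exp((age - 45) / 10))

groups = [1 if random.random() < p_15d(a) else 0 for a in ages]  # 1 = 1.5D

# In practice the propensity score is estimated (e.g. via logistic
# regression); here the true assignment probability stands in for brevity.
weights = [1 / p_15d(a) if g == 1 else 1 / (1 - p_15d(a))
           for a, g in zip(ages, groups)]

def wmean(xs, ws):
    """Weighted mean of xs with weights ws."""
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

for g, name in ((1, "1.5D"), (0, "2D")):
    a = [x for x, gg in zip(ages, groups) if gg == g]
    w = [x for x, gg in zip(weights, groups) if gg == g]
    print(f"{name}: raw mean age {sum(a) / len(a):.1f}, "
          f"weighted mean age {wmean(a, w):.1f}")
```

Before weighting, the two groups differ substantially in mean age; after inverse-probability weighting, both weighted means converge toward the overall population mean, so age no longer confounds a between-group comparison.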

Experimental Rigor

An Ox-AZ spokesperson commented that the highest research standards were used in their study. However, in the Ox-AZ Phase-3 study, the inclusion of the 1.5D sample was due to a manufacturing error. In light of the worldwide focus on this research, why did this fundamental failure to follow the experimental design occur? What role did the FDA have in inspecting/supervising the Ox-AZ studies, and should the role of the FDA be expanded to help maximize experimental rigor? Policy changes should be implemented to prevent such experimental methodology errors going forward.

Vaccine Theory

It is imperative that Ox-AZ explain the underlying process whereby the 1.5D protocol produced a response nearly 50% greater than that of the 2D protocol actually intended by the Ox-AZ scientific team.

It is also crucial to explain why the science that recommended the 2D plan yielded such a comparatively ineffective vaccine (62% efficacy).

Undermining Public Confidence in Science

The Ox-AZ collaboration was closely followed around the globe. The inexplicable experimental-method error and the subsequent invalid statistical analysis are a substantial setback for the Ox-AZ vaccine effort. Much worse, such a high-profile fiasco fuels further erosion of public confidence in science, starting with its “top institutions”.

Free-to-view references are cited if available; otherwise pay-to-view references are given.

SP has been widely studied in the ODA lab (https://www.britannica.com/topic/Simpsons-paradox).

SP occurs in many research areas involving time series, such as predicting monthly precipitation and temperature anomalies: https://odajournal.com/2013/09/19/the-use-of-unconfounded-climatic-data-improves-atmospheric-prediction/.

Confounding can occur in single-subject designs, for example evaluating symptom ratings made by one person across time: https://odajournal.com/2013/11/07/ascertaining-an-individual-patients-symptom-dominance-hierarchy-analysis-of-raw-longitudinal-data-induces-simpsons-paradox/

In studies involving combined groups and two or more variables, some forms of SP can be resolved, but other forms are unresolvable. The following article discusses every type of paradox that can occur for every possible result involving two groups:

• Yarnold PR (1996). Characterizing and circumventing Simpson’s paradox for ordered bivariate data. Educational and Psychological Measurement, 56, 430-442.

SP can occur when combining summary statistics from independent studies: https://odajournal.com/2015/04/21/estimating-inter-rater-reliability-using-pooled-data-induces-paradoxical-confounding-an-example-involving-emergency-severity-index-triage-ratings/.

ODA is used to study mediation in order to identify causal pathways to outcomes:

• Linden A, Yarnold PR (2018). Identifying causal mechanisms in health care interventions using classification tree analysis. Journal of Evaluation in Clinical Practice, 24, 353-361. DOI: 10.1111/jep.12848

Results of paradoxical confounding which cause harm to people can be successfully litigated: https://odajournal.com/2013/09/19/junk-science-test-validity-and-the-uniform-guidelines-for-personnel-selection-procedures-the-case-of-melendez-v-illinois-bell/.

Results obtained when studying the impact of mask mandates on COVID-19 cases vary as a function of the heterogeneity of the sample being considered, providing clear evidence of Simpson’s Paradox and confounded findings.

• Maloney MJ, Rhodes NJ, Yarnold PR. Mask mandates can limit COVID spread: Quantitative assessment of month-over-month effectiveness of governmental policies in reducing the number of new COVID-19 cases in 37 US States and the District of Columbia. medRxiv Posted October 08, 2020. doi: https://doi.org/10.1101/2020.10.06.20208033

Novometric ODA is used to identify every statistically viable propensity-score model that exists for the sample as a function of model complexity, thereby rendering model misspecification impossible:

• Linden A, Yarnold PR (2017). Using classification tree analysis to generate propensity score weights. Journal of Evaluation in Clinical Practice, 23, 703-712. DOI: 10.1111/jep.12744

Metric confounding occurs in multiple-subject temporal designs evaluating change over time: https://odajournal.com/2013/10/23/ipsative-standardization-is-essential-in-the-analysis-of-serial-data/, and also in dose-response studies: https://odajournal.com/2016/05/31/using-machine-learning-to-model-dose-response-relationships-via-oda-eliminating-response-variable-baseline-variation-by-ipsative-standardization/.

Novometric theory (the current theoretical elaboration of the ODA paradigm, denoting a new standard or system of measurement) illuminates factors affecting vaccination decision-making: https://odajournal.com/2020/06/28/what-is-novometric-data-analysis/

Paul R. Yarnold, Ph.D.

December 9, 2020