By Ben Holmes, senior clinical data analyst, Syapse.
When it comes to getting a clear picture from real-world data, breadth of view and careful analysis matter equally.
Interpreting data is always a challenge; it’s a problem space with high dimensionality, deeply interrelated variables, and countless competing definitions of data completeness. Separating actionable insights from mountains of data requires rigorous statistical validation, thoughtful modeling, and a variety of analytic approaches. Biostatisticians take these steps to avoid biasing results, to ensure that samples are truly representative, and to account for relationships between variables.
But even with all possible care and due diligence, results can still be skewed if the view from the included data sources is limited by their inherent biases. For example, mortality is an important data element in oncology research that helps oncologists communicate chances of remission to their patients. Yet in the real-world setting, there is no single complete source of mortality data that can be used to better understand remission and survival rates.
This is partly because many traditional mortality data sources cover only certain groups of patients. For example, death data from hospital registries exists only for patients treated at institutions that maintain a registry. Additionally, registries tend to rely on electronic health record (EHR) and obituary data to capture deceased status, and these do not capture all patients equally: women and minorities, for instance, are less likely to have obituaries. Datasets that rely heavily on obituary data alone will therefore under-report deaths among women and minorities and distort the corresponding survival curves. This is consistent with recently published studies of digitized obituaries, which found that women received significantly fewer obituaries than men.
To illustrate the shortcomings of single data sources, and to provide a strategy for overcoming them, a 2021 study published in the American Society of Clinical Oncology’s JCO-CCI examined the reliability of dates of death for a random selection of cancer patients diagnosed between 2011 and 2017. Crucially, the study drew on several sources: obituaries, hospital EHR feeds, Social Security Death Index files, data from the National Cancer Institute’s (NCI) SEER program, tumor registry data, and chart abstraction conducted by Certified Tumor Registrars (CTRs). A composite score was created from all sources using a waterfall method: sources were arranged in order of trust, and the date of death from the most trusted source that reported one was taken as the date of death for that patient. No further comparisons were made; any patient with no reported date of death was assumed to be alive. This composite score, along with each individual source, was then compared against the National Death Index (NDI).
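The waterfall logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the study’s actual implementation; the source names, trust ordering, and date values below are hypothetical placeholders.

```python
# Hypothetical trust ordering, most trusted first. The real study's
# ordering is not specified here; this is an illustrative assumption.
TRUST_ORDER = ["ctr_abstraction", "tumor_registry", "seer",
               "ssdi", "ehr", "obituary"]

def composite_death_date(patient_records):
    """Waterfall method: return the date of death from the most trusted
    source that reports one.

    patient_records maps source name -> date-of-death string (or None).
    Returning None means no source reported a death, so the patient is
    assumed to be alive; no further cross-source comparisons are made.
    """
    for source in TRUST_ORDER:
        date = patient_records.get(source)
        if date is not None:
            return date
    return None

# Example: the tumor registry and EHR disagree by a day; the
# more trusted registry value wins under the waterfall rule.
records = {"tumor_registry": "2016-03-14", "ehr": "2016-03-15"}
print(composite_death_date(records))  # 2016-03-14
print(composite_death_date({}))       # None -> assumed alive
```

Note that the waterfall never reconciles conflicting dates; trust order alone decides, which is what makes the ordering of sources so consequential.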
Remarkably, the study found that data quality was solid across the board; notable disagreement between sources was rare. Yet when the patient population was divided along demographic lines (race, age, sex, socioeconomic status), every source turned out to under-report some particular demographic. Obituaries, for instance, were more likely to miss female patients than male patients. Just as notably, no single demographic was under-reported by every source.
The analogy of Swiss cheese helps explain why this is significant. Every source has gaps, just as every slice of Swiss cheese has holes, but stack enough slices and the holes no longer line up. This is exactly what happened with the composite score: demographics under-reported by one source were captured by another. These gaps in reporting would have been impossible to overcome without the strategic use of a variety of sources.
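The gap-filling effect can be shown with a toy example. The source names and coverage sets below are entirely hypothetical; the point is only that a union of partial sources can cover what no single source does.

```python
# Toy Swiss-cheese illustration: each hypothetical source misses a
# different subset of deceased patients, but their union misses none.
deaths = {"A", "B", "C", "D"}  # patients who actually died

captured_by = {
    "obituary": {"A", "C"},       # misses B and D
    "ehr":      {"B", "C"},       # misses A and D
    "registry": {"A", "B", "D"},  # misses C
}

# No single source is complete...
assert all(deaths - captured for captured in captured_by.values())

# ...but the combined view captures every death.
combined = set().union(*captured_by.values())
print(sorted(combined))  # ['A', 'B', 'C', 'D']
```

The construction only works because the sources’ blind spots differ; stacking sources that all miss the same demographic would leave that hole open.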
Altitude matters in data analysis: we must be able to see as much of the landscape as possible. The lesson of this study should be a question on everyone’s mind in medical data analysis: how complete a picture am I getting of the data? And, more importantly: how do we overcome the intrinsic biases of our sources?