A New Framework for Handling Missing Data as RWD Sources Rise

Interview on Ontada research presented at ISPOR 2024.

The growing variety of real-world data (RWD) sources can give scientists a richer picture than ever of what’s happening to patients. But for every new wave of insurance claims, patient surveys, or Apple Watch data, there are questions:

What’s missing? And how can these disparate data sources be used together while ensuring accuracy?

Zhaohui Su, PhD, vice president of biostatistics at Ontada, examined these issues in a presentation May 6, 2024, during the annual meeting of ISPOR—The Professional Society for Health Economics and Outcomes Research—held in Atlanta, Georgia. In their poster, “Development and Application of a Framework for Addressing New Challenges of Missing Data in Real-World Research,” Su and his coauthors built on existing principles for handling missing data and presented a new framework designed for today’s bumper crop of RWD.

Ontada, which is part of McKesson, uses RWD generated from multiple sources, including oncology practices that are members of The US Oncology Network. Studies based on these data aid in the development of tools for providers to use in practice. As more practices join The US Oncology Network, Ontada gains opportunities to access new data, along with the challenge of figuring out how to integrate them with existing sources.

As Su explained in an interview with Evidence-Based Oncology, scientists who work with RWD have always had to deal with gaps in data. But increasingly diverse data sources call for moving beyond the traditional clinical trial framework, in which certain variables are targeted for collection.

“The landscape is changing,” Su said. “People are no longer collecting data based on a target.”

Instead of missing data being treated as an exception, it’s assumed there will be some missing data for almost every patient. That’s because so much data collection is voluntary, he said. A person’s watch can count their steps, for example, but it might not count them every day: The wearer might remove the watch or forget to charge it.

Some data are not collected by design, and different collection methods may be deployed, which calls for reconciling multiple sources. Said Su, “Do you still call this a ‘missing data’ issue?”

The framework gives scientists a road map for selecting data that will answer the questions of a particular study. There are 5 domains, which were applied across several large studies integrating RWD from electronic medical records, claims data, and data on social determinants of health. They are as follows:

Data relevance and representativeness. When integrating multiple data sources, researchers must examine what happens when there are data for “overlapping patients,” meaning some individuals may have more than 1 health record.

Data quality. What is the data source? Were these data generated through technology, such as machine learning (ML) or natural language processing (NLP)? If so, are the data complete for their intended purpose? Are the data current enough to answer the questions being asked?

Data correlation. Investigators must assess how different collection methods for various elements affect the results. “If you train the machine with a good-quality data, the machine will return high-quality data for you to use,” Su explained. If data are obtained from multiple ML sources and “they are trained with different data,” then researchers need to account for this. Integrating data from multiple sources can create “a richer source for research” but then researchers must account for different data collection methods.

Using the same algorithm on ML sources that were “trained” differently can produce errors, Su said. “We have to pay attention.”

Intended collection. The framework authors warn, “Clinical expertise must be applied to differentiate between missing data and data not intended to be collected.”

Quantitative bias and sensitivity analyses. These analyses should always be performed, and the poster features a flowchart of the questions investigators should ask as they work through the domains to ensure high quality and accuracy.

Data should be tested against existing sources, Su said, to weed out instances of data misclassification—where the machine “reads” data but does not correctly identify it, such as pulling the wrong date for a cancer diagnosis. Testing the robustness of the data is part of the bias analysis.

This process is important as new data sources are added, which can occur when a practice joins The US Oncology Network, for example. Challenges can arise if one practice relies on PDFs with handwritten notes that the machine has a hard time reading, while another has a more modern system for collecting notes. Artificial intelligence (AI) can help fill some of these gaps, he said.

Su emphasized that the framework presented at ISPOR is “not a complete contradiction” with existing methods for handling missing data. Rather, it builds on current research methods. “Those measures are still valid,” he said. “We just want to make sure because of new challenges, we have added a couple of layers of additional considerations.”

Reference
Su Z, O’Sullivan A, Dwyer K, Paulus J. Development and application of a framework for addressing new challenges of missing data in real-world research. Value Health. 2024;27(6):S1,MSR21.