Can You Trust Your Real-World Data?


By Ken Tarkoff


Understanding the Questions You Should Be Asking to Assess The Quality of Your RWD

In my last blog post, I talked about the three key imperatives that the industry faces in realizing the full potential of real-world evidence (RWE) to impact care––promising to dig a little deeper in subsequent blog posts. So let’s start this time with the first imperative: ensuring that the underlying real-world data is capable of answering the questions that matter most. 

First and foremost, when using RWE, you need to make sure you can trust the answers you are getting from the underlying data. In my twenty years working in healthcare interoperability, I have found that questions about the quality of real-world data (RWD) range from the quantitative (e.g. numbers, patient counts, and data elements) to the qualitative (e.g. data quality and population similarity). The bottom line is that you need both, though not enough organizations are demanding high quality or trustworthiness from their RWD.

Cancer care has become incredibly complex as a myriad of new biomarkers, targeted therapies, and immunotherapies have been introduced to an already complex array of data points that inform a patient’s care. RWD has the potential to support stakeholders across the healthcare continuum by:

  • Helping clarify how therapies perform in real-world populations that are underrepresented in clinical trials, e.g. minority communities, patients with comorbidities, and the aging population
  • Helping providers identify and close gaps in care to ensure every patient is given their best shot at managing their disease effectively

In order to address these challenges, however, RWD must be sufficiently high quality to provide trustworthy answers. So, if you are a researcher considering the quality of a real-world dataset that will be used to derive critical insights or identify unmet needs, what questions should you be asking?

Understanding data sources: What data sources do you leverage and how much overlap is there between them? How do you bring these sources together to develop a complete picture of the patient journey? How do you measure the quality of your sources?

All data sources have limitations, particularly in healthcare. Even carefully curated clinical trial datasets have gaps. Real-world data, which is derived from the natural documentation of patient care, is no exception. Data is typically siloed in multiple distinct clinical systems spread throughout the health system, and much of the richness of the patient record is often trapped in free text notes rather than documented in structured fields––making it effectively invisible for real-world databases fed by structured feeds alone. Curated sources such as hospital tumor registries are able to provide some of those difficult-to-find data points, but they too are limited in both the scope of the patient journey, and the portion of the population that is included.

At Syapse, we address this complex challenge not just by understanding the strengths and limitations of individual data sources, but also by leveraging that information to bring multiple disparate, overlapping data sources together into a single, comprehensive view of the patient journey.

Our approach requires a sophisticated real-world data engine able to:

  • Scalably normalize and harmonize multiple datasets
  • Measure quality by source and element
  • Employ complex merge logic by element 
  • Develop a harmonized view while maintaining traceability
  • Measure accuracy and completeness throughout the process
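
As an illustration only, the idea of element-level merge logic with traceability can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of the Syapse engine: the source names ("registry", "ehr", "claims"), the element names, and the priority table are all hypothetical, and a real system would rank sources from measured quality rather than a hand-written list.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Observation:
    patient_id: str
    element: str   # e.g. "stage"
    value: str
    source: str    # kept on the record so the merged value stays traceable


# Hypothetical ranking: which source to prefer for each data element.
SOURCE_PRIORITY: Dict[str, List[str]] = {
    "stage": ["registry", "ehr"],
    "date_of_death": ["registry", "claims", "ehr"],
}


def merge_element(observations: List[Observation], element: str) -> Optional[Observation]:
    """Pick the observation for one element from the highest-ranked source."""
    candidates = [o for o in observations if o.element == element]
    if not candidates:
        return None
    priority = SOURCE_PRIORITY.get(element, [])

    def rank(o: Observation) -> int:
        # Sources missing from the ranking sort after all ranked sources.
        return priority.index(o.source) if o.source in priority else len(priority)

    return min(candidates, key=rank)


def merge_patient(observations: List[Observation]) -> Dict[str, Optional[Observation]]:
    """Build one harmonized record; assumes all observations belong to one patient."""
    elements = sorted({o.element for o in observations})
    return {e: merge_element(observations, e) for e in elements}
```

Because each merged value is still an Observation, the harmonized record carries its source along with it, which is what makes the result traceable back to where the value was documented.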

It should be noted that gathering multiple, large datasets onto a single platform is not the same as linking and integrating information into a single patient record. Accumulating and linking data sources only provides incremental value if the patient populations significantly overlap, allowing the sources to jointly populate the records of the same patients. Layering multiple sources together is a more complex, challenging undertaking than relying on a single RWD source for insights, but it enables us to develop a much more complete picture of each patient’s cancer care journey. Our first peer-reviewed manuscript quantifying the impact of this approach for mortality is due to be published in the coming weeks. Keep an eye out for further discussion once it is published.
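
The point about overlapping populations is easy to check before investing in linkage. The sketch below is a hypothetical helper (the function name and the idea of a shared patient identifier are assumptions; in practice, matching patients across sources is itself a hard problem) that reports how many patients two sources have in common.

```python
from typing import Dict, Iterable


def overlap_report(source_a: Iterable[str], source_b: Iterable[str]) -> Dict[str, float]:
    """Summarize how much two sources' patient populations overlap."""
    a, b = set(source_a), set(source_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        # Fraction of each source's patients that also appear in the other source.
        "share_of_a": len(shared) / len(a) if a else 0.0,
        "share_of_b": len(shared) / len(b) if b else 0.0,
        # Jaccard index: shared patients over all distinct patients.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

If the shared fraction is small, most records will still be populated by a single source after linkage, and the added source contributes little beyond raw patient counts.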

Understanding manual curation: Are you leveraging manual curation to close data gaps? If so, what is your source material? How are you ensuring consistency and enabling scalability?  

Leveraging human intelligence to manually curate data elements that are typically not structured is a valuable but resource-intensive technique to augment real-world data sets. Due to the expense of the approach, it is most often used to support custom studies on small patient populations that can be pre-identified using commonly structured data elements. 

To enhance efficiency when leveraging manual curation to develop larger, more generally useful real-world datasets, potential data elements may be parsed out of free-text notes and quickly verified by curators who focus on only one data element at a time. Although it enhances efficiency, this technique limits both the breadth of data that may be curated (only data contained within the platform, not the full EHR) and the context that curators have on other aspects of the patient journey (only viewing one data element for a patient at a time).

At Syapse, we solve the scalability problem a little differently. We employ an in-house team of highly skilled Certified Tumor Registrars (CTRs) with access to the complete patient record. We take advantage of structured data feeds whose quality we have verified, using them to reduce the manual curation burden. Our CTRs, however, focus on the entire record for a patient at any given time, which allows them to do much more than pull out the hard-to-find elements that are often hiding in free text notes. CTRs act as detectives, piecing together the comprehensive patient journey and ensuring that data anomalies and inconsistencies (e.g. conflicting treatments or missing tests) are resolved. This holistic approach elevates the quality of the entire patient record.

Understanding artificial intelligence: How are you validating the artificial intelligence (AI) algorithms you use to enhance your RWD? Can you explain how you came to specific conclusions, or do your algorithms take a black box approach?

AI has tremendous potential to transform the development and use of RWD. The field has made incredible strides in the last few years, to the point that things that were “impossibly hard” only a few years ago are now routine. However, routine does not equate to trustworthy.

There is a tendency for RWD developers to over-rely on AI algorithms to “fill in the gaps” of incomplete datasets. AI is often described as “layered on top” of large, incomplete datasets in order to impute data points and provide a more complete picture. These algorithms, however, are often black boxes with limited “explainability.” Essentially, there is little, if any, visibility into how certain conclusions were drawn, or ability to evaluate their provenance or trustworthiness. It becomes unclear which values are imputed based on pattern recognition, and which ones were documented in the data.
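
One way to keep documented and imputed values distinguishable is to record the origin of every data point alongside its value. The following is a minimal sketch of that idea; the class and field names are hypothetical and do not describe any particular vendor's schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Origin(Enum):
    DOCUMENTED = "documented"  # recorded during patient care
    IMPUTED = "imputed"        # inferred by an algorithm


@dataclass(frozen=True)
class DataPoint:
    element: str
    value: str
    origin: Origin
    evidence: str  # pointer to the source note, or the name of the model that imputed it


def imputed_fraction(points: List[DataPoint]) -> float:
    """Share of data points that were inferred rather than documented."""
    if not points:
        return 0.0
    return sum(p.origin is Origin.IMPUTED for p in points) / len(points)
```

A researcher evaluating a dataset could ask for this kind of breakdown by element, so that analyses can be rerun on documented values alone.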

At Syapse, we focus on the targeted application of validated AI algorithms to close gaps based on documented data points. We focus the application of AI on very specific data improvement goals, rather than layer AI on top of an entire RWD dataset as a blanket. We develop applications of natural language processing (NLP) and machine learning (ML) tailored to our specific data sources and goals, and then leverage our trusted data sources (including, but not limited to, manually curated data) to validate the outputs of our algorithms before integrating them into our platform.
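
Validating an algorithm against trusted data can be as simple as comparing its outputs with curated values for the same patients. The sketch below assumes a yes/no data element (for example, whether a biomarker is positive) and uses hypothetical names throughout; it shows the general technique of computing precision and recall against a curated reference, not Syapse's validation process.

```python
from typing import Dict, Optional


def validate_against_curated(
    predicted: Dict[str, bool], curated: Dict[str, bool]
) -> Dict[str, Optional[float]]:
    """Compare algorithm outputs with curated values, keyed by patient ID."""
    # Only patients present in both sets can be scored.
    shared = predicted.keys() & curated.keys()
    tp = sum(1 for p in shared if predicted[p] and curated[p])
    fp = sum(1 for p in shared if predicted[p] and not curated[p])
    fn = sum(1 for p in shared if not predicted[p] and curated[p])
    return {
        "n_compared": len(shared),
        # Of the positives the algorithm reported, how many did curators confirm?
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # Of the positives curators found, how many did the algorithm catch?
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }
```

An algorithm would be integrated only if these figures clear a threshold agreed in advance for that specific data element.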

Creating a Virtuous Cycle to Build High Quality Data at Scale

Establishing high quality data at scale––like building your dream home––requires the integration of a lot of moving parts. Most players in the industry focus on one, or perhaps two, of the above approaches to develop RWD. At Syapse, our approach harmonizes across all three techniques in concert, creating a self-reinforcing virtuous cycle that supports high quality RWD at greater and greater scale.

  • Technology-enabled curation provides trusted data to evaluate the quality of both structured data feeds and AI algorithms
  • Scaled data feeds contribute free text notes that can be extracted via NLP and high quality structured fields that can reduce the curation burden
  • Each newly validated AI algorithm reduces the burden of manual curation required to develop a comprehensive view of the patient journey 

Now you have the questions you need to evaluate a prospective real-world dataset, but what are you doing to move your data from an impactful study to impacting patient care? More on this topic in my next blog post.