3 Sources of Real-World Data

3.1 Overview

This chapter introduces the primary sources of real-world data (RWD), each with unique features, strengths, and limitations. Because these data were not collected for research purposes, understanding their origin—or data provenance—is essential to designing robust studies, selecting appropriate data for a given question, and interpreting findings responsibly.

3.2 Learning Objectives

By the end of this chapter, readers will be able to:

Understand the major categories of RWD sources.
Compare the advantages and limitations of different data sources.
Identify appropriate use cases for each source of RWD.
Recognize the implications of data provenance for study design and interpretation.

3.3 Electronic Health Records (EHR)

Electronic health records are a foundational source of real-world data. They are generated during the routine delivery of care and contain both structured elements—such as diagnosis codes, vital signs, lab results, medication lists, and procedure codes—and unstructured text, including clinical notes, imaging reports, and narratives.

Unlike claims data, which primarily reflect services rendered and billed, EHRs capture both the intent of care and, in many cases, the actions that were actually carried out. For instance, a clinician may order a medication (documenting their intent), and the Medication Administration Record (MAR) may indicate whether and when that medication was administered (reflecting the completed action). This dual representation allows for richer analysis of clinical decision-making and care delivery pathways, which is especially valuable for studies of quality, safety, and adherence.
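
The distinction between an order (intent) and an administration (completed action) can be illustrated with a minimal Python sketch. The record layout, field names such as `order_id` and `administered_at`, and all values below are hypothetical; real EHR schemas differ by vendor and institution.

```python
from datetime import datetime

# Hypothetical medication orders (clinician intent). Field names are illustrative.
orders = [
    {"order_id": "O1", "drug": "heparin", "ordered_at": datetime(2024, 3, 1, 8, 0)},
    {"order_id": "O2", "drug": "warfarin", "ordered_at": datetime(2024, 3, 1, 9, 30)},
    {"order_id": "O3", "drug": "insulin", "ordered_at": datetime(2024, 3, 2, 7, 15)},
]

# Hypothetical MAR rows (completed action), keyed back to the originating order.
mar = [
    {"order_id": "O1", "administered_at": datetime(2024, 3, 1, 8, 45)},
    {"order_id": "O3", "administered_at": datetime(2024, 3, 2, 7, 40)},
    {"order_id": "O3", "administered_at": datetime(2024, 3, 2, 12, 5)},
]


def first_administration(mar_rows):
    """Map each order_id to its earliest recorded administration time."""
    first = {}
    for row in mar_rows:
        oid = row["order_id"]
        if oid not in first or row["administered_at"] < first[oid]:
            first[oid] = row["administered_at"]
    return first


def summarize(orders, mar_rows):
    """Report, per order, whether it was administered and the delay in minutes."""
    first = first_administration(mar_rows)
    summary = []
    for order in orders:
        given_at = first.get(order["order_id"])
        delay = None
        if given_at is not None:
            delay = (given_at - order["ordered_at"]).total_seconds() / 60
        summary.append({
            "order_id": order["order_id"],
            "drug": order["drug"],
            "administered": given_at is not None,
            "delay_minutes": delay,
        })
    return summary


order_summary = summarize(orders, mar)
for row in order_summary:
    print(row)
```

In this toy example, order O2 has no matching MAR row, so it represents intent without a documented action. Claims data alone would not reveal that such an order ever existed.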

EHR data are typically confined to a specific healthcare system, meaning they may not include care received at outside facilities unless records are explicitly exchanged. This can result in missing data for patients who receive fragmented care. To address this gap, a growing number of data repositories are being established that combine the deep granularity of EHR data with wider population coverage. In addition, because EHRs are not designed for research, they often contain inconsistencies, variation in documentation practices, and unstructured information that requires preprocessing to be useful for analysis.

EHRs also vary widely in their structure and content across systems and vendors, limiting interoperability. Moreover, key elements may be absent or only partially recorded—for example, over-the-counter medications, social determinants of health, or reasons for non-treatment.

However, EHRs are increasingly linked to claims or pharmacy fill data through a variety of mechanisms. Internally, billing records from hospitals or clinics may be accessible alongside clinical documentation, enabling the study of both care provided and charges incurred. Externally, health systems may partner with data vendors to access pharmacy benefit data, allowing researchers to identify whether medications ordered within the EHR were actually dispensed by retail or mail-order pharmacies. These linkages help bridge the gap between what was ordered, what was billed, and what was actually received by the patient.
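
As a rough illustration of how an EHR-to-pharmacy linkage might be used, the Python sketch below checks whether each prescription written in the EHR has a matching pharmacy fill within a fixed window. The 30-day window, the field names, and the matching on patient identifier plus drug name are all simplifying assumptions; real linkages typically rely on validated patient matching and standardized drug codes.

```python
from datetime import date, timedelta

FILL_WINDOW_DAYS = 30  # assumed window; choose per study protocol

# Hypothetical prescriptions recorded in the EHR.
prescriptions = [
    {"patient_id": "P1", "drug": "atorvastatin", "prescribed_on": date(2024, 1, 10)},
    {"patient_id": "P2", "drug": "metformin", "prescribed_on": date(2024, 1, 12)},
    {"patient_id": "P3", "drug": "lisinopril", "prescribed_on": date(2024, 1, 15)},
]

# Hypothetical pharmacy fill records obtained through an external linkage.
fills = [
    {"patient_id": "P1", "drug": "atorvastatin", "filled_on": date(2024, 1, 13)},
    {"patient_id": "P3", "drug": "lisinopril", "filled_on": date(2024, 3, 20)},
]


def was_dispensed(rx, fills, window_days=FILL_WINDOW_DAYS):
    """True if a fill for the same patient and drug falls within the window."""
    deadline = rx["prescribed_on"] + timedelta(days=window_days)
    return any(
        f["patient_id"] == rx["patient_id"]
        and f["drug"] == rx["drug"]
        and rx["prescribed_on"] <= f["filled_on"] <= deadline
        for f in fills
    )


dispensed_flags = {rx["patient_id"]: was_dispensed(rx, fills) for rx in prescriptions}
fill_rate = sum(dispensed_flags.values()) / len(dispensed_flags)

print(dispensed_flags)
print(f"Share of prescriptions filled within {FILL_WINDOW_DAYS} days: {fill_rate:.2f}")
```

Here P2 has no fill at all and P3 has a fill that falls outside the window, so both count as not dispensed under this definition. The choice of window materially changes the result and should be justified for the question at hand.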

For studies that require detailed clinical information—such as disease phenotyping, adverse event detection, care pathway analysis, or treatment response—EHRs provide unmatched depth and context. However, careful attention to data quality, completeness, and provenance is essential to ensure valid inferences.

3.4 Administrative Claims Data

Administrative claims data are generated during the process of billing and reimbursement for healthcare services. Unlike EHRs, which document the full clinical encounter, claims are structured around the submission of charges for diagnoses, procedures, and medications, and therefore reflect the services that were actually delivered and paid for. The Centers for Medicare & Medicaid Services (CMS) maintains one of the most commonly used claims databases in the United States and provides freely accessible synthetic data files that researchers can use to learn and explore the structure of claims data.

Because they are used for reimbursement, claims data capture performed actions—for example, procedures completed, medications dispensed, and services rendered. This makes them particularly reliable for assessing healthcare utilization, treatment patterns, and costs. Importantly, they do not capture clinical intent—there is no record of what a clinician considered or ordered unless that action resulted in a reimbursable service. For instance, a medication prescribed in the EHR will only appear in claims data if it was actually filled and billed through a pharmacy benefit manager.

Claims data are typically highly structured and use standardized coding systems such as ICD (diagnoses), CPT or HCPCS (procedures), and NDC (medications). This standardization makes claims well suited for large-scale, multisite, or longitudinal studies, especially those focused on service utilization, payment, or population health management.
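
Because the codes are standardized, simple tallies over claims can already describe utilization and payment. The Python sketch below is a simplified illustration: the claim layout, field names, and paid amounts are invented for the example and do not reflect any particular payer's file format.

```python
from collections import Counter, defaultdict

# Hypothetical claims. Codes are shown only to illustrate the coding systems:
# dx_codes hold ICD-10-CM diagnosis codes, procedure_codes hold CPT codes.
claims = [
    {"claim_id": "C1", "patient_id": "P1", "dx_codes": ["E11.9"],
     "procedure_codes": ["99213"], "paid": 92.50},
    {"claim_id": "C2", "patient_id": "P1", "dx_codes": ["E11.9", "I10"],
     "procedure_codes": ["99213"], "paid": 92.50},
    {"claim_id": "C3", "patient_id": "P2", "dx_codes": ["I10"],
     "procedure_codes": ["99214"], "paid": 131.25},
]

# Utilization: how often each procedure code was billed.
procedure_counts = Counter(
    code for claim in claims for code in claim["procedure_codes"]
)

# Population view: which patients have at least one claim with each diagnosis.
patients_by_dx = defaultdict(set)
for claim in claims:
    for dx in claim["dx_codes"]:
        patients_by_dx[dx].add(claim["patient_id"])

# Cost: total paid amount per patient.
paid_by_patient = defaultdict(float)
for claim in claims:
    paid_by_patient[claim["patient_id"]] += claim["paid"]

print(dict(procedure_counts))
print({dx: sorted(ids) for dx, ids in patients_by_dx.items()})
print(dict(paid_by_patient))
```

The same three tallies would work unchanged on claims from any site that uses the same coding systems, which is what makes claims attractive for multisite studies.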

Another key strength of claims data is that they often capture care delivered across multiple health systems. A patient’s claims history, particularly in datasets from large payers like Medicare or Medicaid, may include encounters from hospitals, clinics, pharmacies, and other providers—regardless of affiliation—so long as the services were reimbursed. This gives claims a broader view of healthcare delivery than a single EHR system can typically provide.

However, claims data come with important limitations. They lack clinical granularity: vital signs, laboratory results, disease severity, imaging findings, and patient-reported outcomes are not available. In addition, claims data usually lack detailed timestamps. Diagnoses and procedures are often recorded for the entire encounter, with little information about their precise timing or sequence. For example, a patient hospitalized for two weeks may have a billing claim that includes a diagnosis of venous thromboembolism (VTE), but the diagnosis code alone does not indicate whether the VTE was present on admission or occurred later during the hospital stay. This limitation makes it difficult to establish temporal relationships or assess causality without supplementary data sources.

Furthermore, because claims are submitted primarily for payment, they can be subject to miscoding, omissions, or up-coding, either intentionally or through automated billing software. Elements originally documented in the EHR may be altered or added during the billing process.

Many healthcare organizations now link claims and EHR data internally, allowing researchers to study both the clinical context and the financial record of care. Some organizations also purchase or license external pharmacy claims data, enabling insight into whether prescriptions written in the EHR were filled elsewhere. These linkages can provide a more complete picture of patient care, adherence, and outcomes.

Claims data are particularly valuable for studies focused on healthcare utilization, adherence, economic evaluations, comparative effectiveness, and longitudinal follow-up across sites of care. However, for research questions that depend on fine-grained clinical detail, event timing, or insight into provider intent, claims should be supplemented with EHR or other data sources when possible.

3.5 Patient-Generated Health Data (PGHD)

Patient-generated health data refer to health-related data collected directly from individuals, typically outside of clinical settings. This includes data from wearable devices (e.g., step counters, heart rate monitors), mobile health apps, home monitoring tools (e.g., blood pressure cuffs, glucose meters), and patient surveys or diaries.

PGHD offers real-time insights into behavior, symptoms, and adherence in the patient’s everyday environment. It enables continuous, fine-grained monitoring and can fill critical gaps in understanding the patient experience between healthcare encounters.

Despite its promise, PGHD poses several challenges. Data quality can be variable, devices and platforms lack interoperability, and integration into clinical workflows or EHR systems is often limited. Questions of accuracy, standardization, and reliability must be addressed before PGHD can be routinely used in research or care.
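
One common first step in handling variable data quality is to screen out implausible device readings before summarizing. The Python sketch below shows this for hypothetical wearable heart rate data. The plausibility range of 30 to 220 beats per minute is an assumption chosen for illustration, not a clinical standard, and the readings are invented.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

PLAUSIBLE_BPM = (30, 220)  # assumed screening range, not a clinical standard

# Hypothetical wearable readings: (ISO timestamp, heart rate in beats per minute).
readings = [
    ("2024-05-01T08:00:00", 62),
    ("2024-05-01T12:00:00", 0),    # likely sensor dropout
    ("2024-05-01T18:00:00", 88),
    ("2024-05-02T09:00:00", 70),
    ("2024-05-02T09:05:00", 400),  # likely motion artifact
]


def daily_summary(readings, plausible=PLAUSIBLE_BPM):
    """Drop readings outside the plausible range, then average per calendar day."""
    low, high = plausible
    kept = defaultdict(list)
    dropped = 0
    for timestamp, bpm in readings:
        if low <= bpm <= high:
            day = datetime.fromisoformat(timestamp).date().isoformat()
            kept[day].append(bpm)
        else:
            dropped += 1
    summary = {
        day: {"n": len(values), "mean_bpm": mean(values)}
        for day, values in kept.items()
    }
    return summary, dropped


summary, dropped = daily_summary(readings)
print(summary)
print(f"Readings dropped as implausible: {dropped}")
```

Reporting the number of dropped readings alongside the summary keeps the screening step transparent, since the share of discarded data is itself an indicator of device and data quality.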

PGHD is best suited for studies of health behavior, self-management, and intervention adherence, especially when combined with clinical data sources.

3.6 Other Sources

In addition to the primary sources above, several other types of data can contribute to real-world evidence generation:

Public health surveillance systems collect population-level data on disease incidence, outbreaks, and health behaviors.
Social determinants of health (SDOH) datasets capture contextual factors such as housing, education, and food security, which can be critical for understanding health disparities.
Biobanks and genomic databases link biological samples with clinical and demographic information, enabling genotype-phenotype studies and precision medicine research.
Data from pragmatic clinical trials and learning health systems offer hybrid approaches, combining the rigor of trials with the flexibility and scale of real-world data infrastructure.

These sources often serve as valuable supplements or linkage targets, expanding the breadth and depth of available information for clinical and translational research.

3.7 Matching the Data Source to the Research Question

Choosing the right source of real-world data is one of the most critical decisions in study design. The ideal data source depends on the research question, the variables needed, the expected frequency of outcomes, and the importance of patient follow-up and timing. No dataset is perfect, and each comes with tradeoffs between sample size, data richness, and completeness.

For example, a study on the incidence of secondary skin cancers in transplant patients, especially among those with skin of color, requires reliable documentation of transplant history, dermatologic diagnoses, race and ethnicity, and long-term follow-up. While large registries like the Surveillance, Epidemiology, and End Results (SEER) program or the National Cancer Database (NCDB) offer broad coverage and high-quality cancer data, they often lack transplant history and may not distinguish skin cancer subtypes like squamous or basal cell carcinoma. EHRs may capture these details but only for patients who continue to receive care within the same health system, potentially leading to incomplete follow-up. Claims data, by contrast, may offer broader capture of services across settings, but lack clinical detail and precise timestamps.

Timing is another key consideration. Claims data typically summarize services at the level of an encounter or hospital stay, often without fine-grained timing of when diagnoses occurred. For example, if a venous thromboembolism is coded during a two-week hospitalization, it’s unclear whether it was the reason for admission or developed later as a complication. EHRs often provide more precise time-stamped documentation, such as when a medication was ordered and administered, which can support more detailed temporal analyses.
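
The difference in temporal resolution can be made concrete with a short Python sketch. It contrasts a hypothetical claim, which carries only stay-level dates, with a hypothetical time-stamped EHR diagnosis event. The 24-hour threshold used to label an event as present on admission is an assumption for illustration only, and the time a diagnosis is recorded is not necessarily the time of clinical onset.

```python
from datetime import datetime, timedelta

POA_WINDOW = timedelta(hours=24)  # assumed threshold, for illustration only

# Hypothetical claim: diagnoses are attached to the whole two-week stay.
claim = {
    "admit_date": "2024-06-01",
    "discharge_date": "2024-06-15",
    "diagnoses": ["VTE"],
}

# Hypothetical EHR data for the same stay, with a time-stamped diagnosis event.
admitted_at = datetime(2024, 6, 1, 14, 0)
ehr_events = [
    {"diagnosis": "VTE", "recorded_at": datetime(2024, 6, 9, 10, 30)},
]


def classify_onset(recorded_at, admitted_at, window=POA_WINDOW):
    """Label an event by when it was recorded relative to admission."""
    if recorded_at - admitted_at <= window:
        return "present on admission"
    return "developed during stay"


# From the claim alone we only know the diagnosis occurred sometime in the stay.
print(claim["diagnoses"], "between", claim["admit_date"], "and", claim["discharge_date"])

# With EHR timestamps, each event can be placed relative to admission.
onset_labels = [classify_onset(e["recorded_at"], admitted_at) for e in ehr_events]
print(onset_labels)
```

In this example the VTE was recorded eight days after admission, so it is labeled as having developed during the stay, a distinction the stay-level claim record cannot support on its own.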

In general, EHRs offer clinical depth, claims provide population scale and continuity, and registries ensure data quality for focused conditions. Understanding these differences—and aligning the data source to the needs of the question—is essential for designing rigorous and feasible real-world evidence studies.

Data Source	Strengths	Limitations	Best Use Cases
EHR	Rich clinical data, longitudinal	Incomplete, messy, variable	Clinical outcomes, phenotyping
Claims	Large scale, consistent coding	Limited clinical granularity	Utilization, economic outcomes
Registries	Focused, high-quality data	Selection bias, limited population scope, limited number of variables	Rare diseases, descriptive studies, linking to more comprehensive datasets
PGHD	Real-world behavior, continuous data	Variable quality, integration challenges	Adherence, behavior monitoring
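
The comparison in the table can also be expressed as a simple lookup that scores each source against the needs of a study. The Python sketch below is a deliberately coarse simplification: the feature labels are invented shorthand for the strengths described in this chapter, and a real source selection involves far more nuance than counting matched features.

```python
# Shorthand features per source, loosely summarizing this chapter (assumed labels).
SOURCE_FEATURES = {
    "EHR": {"clinical_detail", "precise_timing", "clinician_intent"},
    "Claims": {"cross_system_capture", "utilization", "costs", "standardized_coding"},
    "Registries": {"curated_quality", "condition_focus"},
    "PGHD": {"daily_behavior", "continuous_monitoring"},
}


def rank_sources(needs):
    """Rank sources by how many required features each one covers."""
    scored = []
    for source, features in SOURCE_FEATURES.items():
        scored.append({
            "source": source,
            "covered": len(needs & features),
            "missing": sorted(needs - features),
        })
    # sorted() is stable, so ties keep the order in which sources are listed above.
    return sorted(scored, key=lambda row: -row["covered"])


# Example: a study needing clinical detail, precise timing, and cross-system follow-up.
needs = {"clinical_detail", "precise_timing", "cross_system_capture"}
ranked = rank_sources(needs)
for row in ranked:
    print(row)
```

For this example no single source covers every need: EHR data rank first but lack cross-system capture, which is the feature claims data supply. That pattern is what motivates linking sources rather than choosing only one.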

3.8 Summary

Each source of real-world data has its own strengths and limitations, shaped by how and why the data were originally collected. Understanding these differences is essential for selecting the right data for a given research question, designing methodologically sound studies, and interpreting findings appropriately. In some cases, combining data from multiple sources may enhance the validity and generalizability of results—but this comes with increased complexity, requiring careful attention to linkage, harmonization, and potential sources of bias.

3.9 Suggested Readings

Selection of Data Sources. In: Developing an Observational CER Protocol: A User’s Guide, AHRQ.
Electronic Health Record Data for Research at UCSF (Video)
Gliklich et al. Registries for Evaluating Patient Outcomes: A User’s Guide, AHRQ.
Synthetic Medicare Enrollment, Fee-for-Service Claims, and Prescription Drug Event