Before conducting a health equity data analysis, you’ll want to assess whether your data set gives a complete and accurate picture of the demographic groups you’re trying to compare. You can generally expect age to be recorded accurately. But other categories like race, ethnicity, sexual orientation, and socioeconomic status are less reliable because there’s less consistency in how, and even whether, that demographic data is collected.
This challenge is compounded by the fact that there are few reliable data sets that can be used as benchmarks to validate the accuracy of the data you’re working with. For example, Medicare claims and enrollment data is typically relied on to provide accurate benchmarks for metrics like length of stay. But Medicare data isn’t as reliable for benchmarking demographic information.
Our experts’ experience
Before embarking on a race-stratified outcomes analysis, our experts Leena and Alex sought to first validate the reported race field in Medicare’s published enrollment tables. They compared race-stratified distributions of 65+ Medicare beneficiaries to similarly stratified estimates of the 65+ population in the U.S. census. They found the Hispanic population to be dramatically undercounted in Medicare’s enrollment data.
They identified the main driver of this issue: While the census asks separate questions for race and ethnicity, Medicare’s enrollment data uses only six categories that combine race and ethnicity and does not allow multiple choices. This means that patients can be identified as, for example, either “white” or “Hispanic,” but not both. As a result, many Hispanic Medicare patients are likely grouped under another category, which makes it difficult to accurately identify or measure a disparity for Medicare patients who are Hispanic.
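A comparison like the one Leena and Alex ran can be sketched in a few lines of Python. This is a minimal illustration, not their actual method: the function names are hypothetical, the counts are placeholders rather than real Medicare or census figures, and it assumes the two sources have already been mapped to the same set of categories (which, as described above, is itself a nontrivial step).

```python
def to_shares(counts):
    """Convert raw counts per category into proportions that sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {group: n / total for group, n in counts.items()}


def share_gaps(dataset_counts, benchmark_counts):
    """Return (data set share - benchmark share) per group, in percentage points.

    A negative value means the group makes up a smaller share of the data set
    than of the benchmark, which may signal undercounting.
    """
    dataset = to_shares(dataset_counts)
    benchmark = to_shares(benchmark_counts)
    groups = sorted(set(dataset) | set(benchmark))
    return {
        g: round(100 * (dataset.get(g, 0.0) - benchmark.get(g, 0.0)), 1)
        for g in groups
    }


# Placeholder counts for illustration only; NOT real Medicare or census data.
enrollment_counts = {"White": 800, "Black": 100, "Hispanic": 30, "Other": 70}
census_counts = {"White": 750, "Black": 100, "Hispanic": 90, "Other": 60}

gaps = share_gaps(enrollment_counts, census_counts)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.1f} percentage points")
```

Looking at gaps in percentage points, rather than raw counts, keeps the comparison meaningful when the two sources cover populations of different sizes.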
How to navigate this challenge
It’s safe to assume there are some problems in your demographic data set. To move forward, investigate the sources of the data. Validate the data as much as possible to identify the specific ways in which your data is imperfect—even if it means going back to the source, form, or template with which the data was collected.
1. Understand the real-world context surrounding your demographic data and the ways that context introduces bias.
For example: How was the race field in your data set completed? How was the question phrased? Was it self-reported by the patient or filled in by someone else? At what point in the patient’s journey? Was it validated? Who is likely to be missing from or underrepresented in this data set?
Data is not objective. Data reflects the real-world environment and is influenced by how it’s collected. Accounting for that real-world context can help identify where the data may be biased. For example, if the race field was completed through a patient’s primary care physician, anyone who doesn’t have a PCP will be missed. Further, a well-resourced doctor’s office may have more capacity to sensitively gather patients’ demographic information than an understaffed clinic—and that will skew the information. And when race and ethnicity data isn’t self-reported, staff may draw conclusions based on someone’s skin tone rather than how a patient identifies. Ignoring this context and failing to acknowledge these biases can lead to skewed analyses and harmful decisions.
2. Go beyond your typical benchmark sources to validate your data set.
You cannot rely on typical foundational sources, like Medicare data, as a source of truth when it comes to certain demographic fields. You may need to use data sources you wouldn’t turn to normally. In the example above, when Alex and Leena identified issues with Medicare’s race distribution, they turned to the census as a comparison point.
You should also factor in your on-the-ground experience. For example, Natalie notes that using ZIP code as a proxy for a patient’s social determinants of health (such as income level or access to healthy food) is not always reliable because neighborhoods can vary significantly within a single ZIP code. Use your personal experience as a local resident to think critically about the details of your data set and reach a higher level of precision.
The goal is not to search for one benchmark that agrees with your data set. Rather, look at a range of data sets—including qualitative inputs—and how their demographic distributions agree or disagree to shed light on how representative your sample is.
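One way to look across several benchmarks at once is to check, group by group, whether your data set’s share falls inside the range spanned by the benchmarks. The sketch below is only illustrative: the function names, group labels, and counts are hypothetical placeholders, and it assumes every source uses the same category definitions. Qualitative inputs would still need to be weighed separately.

```python
def shares(counts):
    """Convert raw counts per group into proportions."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {group: n / total for group, n in counts.items()}


def compare_to_benchmarks(dataset_counts, benchmarks):
    """Label each group by where its share sits relative to all benchmarks.

    benchmarks maps a benchmark name to its counts per group.
    """
    if not benchmarks:
        raise ValueError("at least one benchmark is required")
    data = shares(dataset_counts)
    bench_shares = [shares(counts) for counts in benchmarks.values()]
    report = {}
    for group, share in data.items():
        values = [b.get(group, 0.0) for b in bench_shares]
        low, high = min(values), max(values)
        if share < low:
            report[group] = "below all benchmarks"
        elif share > high:
            report[group] = "above all benchmarks"
        else:
            report[group] = "within benchmark range"
    return report


# Placeholder values for illustration only; not drawn from any real source.
my_data = {"Group A": 50, "Group B": 30, "Group C": 20}
benchmark_sources = {
    "benchmark_1": {"Group A": 40, "Group B": 35, "Group C": 25},
    "benchmark_2": {"Group A": 45, "Group B": 25, "Group C": 30},
}

report = compare_to_benchmarks(my_data, benchmark_sources)
for group, status in report.items():
    print(f"{group}: {status}")
```

A group that sits below every benchmark is a stronger signal of underrepresentation than a group that disagrees with just one source, which is the point of consulting a range rather than a single comparison.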
3. Get specific about how your data set is bad or incomplete.
Once you’ve found the inevitable imperfections in your data, go beyond describing your data as “bad” or “imperfect.” This language makes the problem feel insurmountable. (After all, how can you make “good” comparisons based on “bad” data?) However, pinpointing the specific ways in which the data is imperfect pushes you toward action rather than frustration. For example: “Medicare claims data underrepresents Hispanic patients because race/ethnicity data is not accurately and discretely captured from source form through final data set.”
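Putting numbers on a specific gap can help make that kind of statement concrete. As one hedged example, the sketch below measures how often each demographic field is missing or recorded as unknown. The field names, the list of values treated as missing, and the sample records are all assumptions made for illustration; your own data set will have its own conventions.

```python
# Values treated as "missing" here are an assumption; adjust to your data.
MISSING_MARKERS = {"", "unknown", "declined"}


def is_missing(value):
    """Treat None, blanks, and placeholder strings as missing."""
    if value is None:
        return True
    return str(value).strip().lower() in MISSING_MARKERS


def missing_rate_by_field(records, fields):
    """Return the fraction of records with a missing value, per field."""
    if not records:
        raise ValueError("records must not be empty")
    return {
        field: sum(1 for r in records if is_missing(r.get(field))) / len(records)
        for field in fields
    }


# Made-up sample records for illustration only.
records = [
    {"age": 70, "race": "White", "ethnicity": "Not Hispanic"},
    {"age": 82, "race": "Unknown", "ethnicity": ""},
    {"age": 67, "race": None, "ethnicity": "Hispanic"},
    {"age": 75, "race": "Black", "ethnicity": None},
]

rates = missing_rate_by_field(records, ["age", "race", "ethnicity"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing or unknown")
```

A result like this supports a statement such as “race is missing or unknown for half of records” in place of “the data is bad.” It only captures missingness, though; misclassification, like the Hispanic undercount described earlier, requires comparison against outside sources.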
4. Improve the accuracy of your demographic data collection going forward.
We’re unlikely to see the government enforce standards for how to collect demographic information anytime soon. As a result, every provider organization, payer, and life sciences company is responsible for ensuring their data sets are as accurate and complete as possible.
Involve your data team, clinicians, and health equity experts in designing how your organization collects demographic data. Clinicians can make sure the process integrates smoothly into clinical workflows, health equity experts can illuminate important societal context, and data experts can minimize bias introduced during the collection process.
For more guidance on improving data collection and analysis protocols to minimize bias, use the Health equity measurement discussion guide.