
IMPROVING AI IN HEALTHCARE BY DETECTING AND FIXING BIAS

By DR PHILIPP DIESINGER

DR PHILIPP DIESINGER is a seasoned data scientist and AI expert with a PhD in theoretical physics from Heidelberg University. His career began with a postdoctoral position at the Massachusetts Institute of Technology (MIT). He has since held significant roles, including Head of Global Data Science at Boehringer Ingelheim and serving in a partner role at BCG.X. Currently, Philipp serves as a senior partner at Rewire, leading expert teams across the healthcare sector and beyond.

Artificial intelligence (AI) is increasingly shaping drug discovery, clinical trials, and healthcare decision-making. However, when AI models are trained on biased data, they can reinforce or even worsen existing healthcare disparities. Bias in healthcare data refers to systematic errors or skewed representations, and it can lead to unfair or inaccurate outcomes for certain groups of patients. If left undetected, such biases can exacerbate existing health disparities and undermine trust in medical systems [1]. Global health authorities emphasise the importance of identifying and addressing bias in health data to ensure that advances like AI benefit all populations equitably [1]. Biases can enter at many stages – from how data are collected (e.g. which populations are included) to how algorithms interpret that data – making detection a critical step toward safer, more effective healthcare interventions.

In statistical terms, bias arises when a dataset is not representative of the true population, causing analyses or models built on that data to systematically deviate from reality [2]. In healthcare, this often means certain demographic groups (defined by race, gender, age, socioeconomic status, etc.) are underrepresented or misrepresented in the data. Social biases and systemic inequities can thus become encoded in healthcare datasets and algorithms, leading to worse outcomes for marginalised groups [2]. For example, an AI model trained mostly on data from one group may perform poorly on others, simply because it learned patterns that don’t generalise.

Detecting these biases involves scrutinising both the data and the outcomes of any algorithms using the data to ensure no group is being inadvertently harmed or overlooked.

Common sources of bias in healthcare data include the following:

1 | Sample Bias (Underrepresentation): Certain groups may be underrepresented in clinical trials, electronic health records, or genomic databases. For instance, over 90% of participants in genome-wide association studies have been of European descent, which limits the applicability of genetic findings to other ethnic groups [3]. Such sampling bias means medical knowledge and tools might be less effective for underrepresented populations.

2 | Measurement Bias: The way health data are obtained can introduce bias. A notable example is the pulse oximeter, a routine device that estimates blood oxygen. Studies found that pulse oximeters overestimate oxygen saturation in patients with darker skin, resulting in Black patients being nearly three times more likely than white patients to have low oxygen levels go undetected [4]. This bias in a common medical device illustrates how flawed data can directly impact care decisions.

3 | Algorithmic Bias or Feature Bias: Bias can also arise when algorithms use proxies in data that reflect societal inequities. A landmark investigation revealed that a widely used health risk prediction algorithm systematically underestimated the risk for Black patients compared to equally sick white patients because it used healthcare spending as a proxy for health. Due to unequal access to care (and thus lower healthcare spending for Black patients), the algorithm falsely judged Black patients to be healthier, resulting in fewer referrals for advanced care until the bias was detected and addressed [5].

4 | Observer/Recording Bias: Human biases by clinicians or data recorders can creep into healthcare data. For example, if pain levels of certain patients are consistently underestimated due to implicit bias, those biases become part of the recorded data. Similarly, missing or inconsistent data (such as incomplete recording of a patient’s ethnicity or gender identity) can mask true patterns and make it harder to detect when outcomes differ among groups.

Undetected bias in healthcare data can lead to inequitable care and poorer outcomes for certain groups. As the above examples show, biased data or algorithms might mean a serious condition goes unrecognised in a minority patient, or resources are allocated unfairly. These issues compound existing health disparities. For instance, biases in diagnostic tools or decision support systems can further disadvantage populations already facing barriers to care. Research has documented that AI systems, if unchecked, may amplify societal biases – one study warned that biased AI could misdiagnose or mismanage care for underrepresented groups, potentially leading to higher error rates or even fatal outcomes [2]. Detecting and correcting bias is thus essential to ensure patient safety and fairness. Moreover, from an ethics standpoint, leading organisations like the World Health Organization (WHO) stress ‘inclusiveness and equity’ as a core principle for health AI – meaning technologies should be designed and evaluated so they work well for all segments of the population [1]. Detecting bias is a prerequisite to building trust in data-driven healthcare innovations and ensuring they improve health for everyone, not just a subset.

Detecting bias requires a systematic and proactive approach to analysing both data and model outcomes. Key methods include:

A | Thorough Data Audits and Representation Analysis: Examine the composition of healthcare datasets to check whether key demographic groups are adequately represented. This involves comparing dataset demographics (e.g. race, gender, age distribution) against the relevant patient population. Any major imbalance or gap (such as a lack of data on a certain group) is a red flag for potential bias. For example, auditing a national health record database might reveal that ethnicity data is missing for 10% of patients [6]. Such gaps need to be identified, as they can hide disparities or make certain groups ‘invisible’ in analyses. Proper dataset documentation can help record these characteristics and alert researchers to possible biases in the data.
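
To make this concrete, the sketch below shows what a simple representation audit could look like in Python. The dataset file, column names, reference population shares, and flagging threshold are hypothetical placeholders, not taken from any real health record system.

```python
# A minimal sketch of a representation audit, assuming a pandas DataFrame
# `records` with an 'ethnicity' column and a hypothetical dictionary of
# reference population shares. All names and thresholds are illustrative.
import pandas as pd

records = pd.read_csv("patient_records.csv")          # hypothetical dataset
population_share = {"Group A": 0.60, "Group B": 0.25, "Group C": 0.15}

# 1. Quantify missingness: missing demographic fields can hide disparities.
missing_rate = records["ethnicity"].isna().mean()
print(f"Ethnicity missing for {missing_rate:.1%} of patients")

# 2. Compare observed dataset shares with the reference population.
observed_share = records["ethnicity"].value_counts(normalize=True)
for group, expected in population_share.items():
    observed = observed_share.get(group, 0.0)
    if observed < 0.5 * expected:                      # flag large gaps
        print(f"{group}: {observed:.1%} observed vs {expected:.1%} expected "
              "-- possible underrepresentation")
```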

B | Performance Evaluation Across Subgroups: When developing predictive models or AI in healthcare, it is crucial to evaluate their performance separately for different patient subgroups. Metrics like accuracy, error rates, sensitivity, or treatment recommendation rates should be compared across categories such as race, sex, or age. A significant disparity – for instance, an algorithm with a much higher false-negative rate for women than for men – would indicate bias. In practice, the racial bias in the health risk algorithm [5] was detected by observing that Black patients with the same predicted risk score had worse actual health outcomes than white patients, prompting investigators to dig deeper. Similarly, checking a diagnostic AI on diverse test images might uncover that it performs poorly on images from older machines or from certain hospitals – pointing to bias from differing data sources or quality [7]. Regularly reporting model performance by subgroup is now considered a minimum requirement in best practices for medical AI [7].
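
A simple way to operationalise this is to compute the same error metric separately for each subgroup. The Python sketch below illustrates the idea with invented labels, predictions, and group memberships; in practice these would come from a held-out test set.

```python
# A minimal sketch of subgroup performance evaluation, assuming arrays of
# true labels, model predictions, and a sensitive attribute (e.g. sex).
# The data below are illustrative placeholders, not from a real model.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 1])
group  = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])

for g in np.unique(group):
    mask = group == g
    tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask],
                                      labels=[0, 1]).ravel()
    # False-negative rate: how often a truly positive case is missed.
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    print(f"Group {g}: false-negative rate = {fnr:.2f}")
```

A large gap between the per-group rates would be the signal to investigate further, exactly as in the risk-algorithm example above.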

C | Statistical Fairness Metrics: Researchers can apply formal fairness metrics to quantify bias in healthcare models. These metrics (borrowed from the field of machine learning fairness) include tests for ‘disparate impact’ (i.e. does a model’s predictions affect one group disproportionately?), ‘equalised odds’ (are error rates equal across groups?), or ‘calibration’ (are risk scores equally meaningful for each group?). For example, one might calculate whether a diagnostic test has the same sensitivity for minority patients as for others. If not, bias is present. Statistical tests, like chi-square or z-tests for differences in proportions, can flag when differences between groups are unlikely to be due to chance. Using such metrics provides a more objective way to identify bias beyond just anecdotal observations.
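
As a minimal illustration of such metrics, the sketch below computes a disparate impact ratio and a two-proportion z-test using the statsmodels library. The counts are invented placeholders; the ratio simply compares the selection rate of one group against the other.

```python
# A minimal sketch of two common fairness checks: the disparate impact
# ratio and a two-proportion z-test for differing positive-prediction
# rates. Counts below are illustrative placeholders.
from statsmodels.stats.proportion import proportions_ztest

# Positive predictions (e.g. "refer to specialist care") per group.
positives = [120, 45]        # group 1, group 2
totals    = [1000, 600]      # patients screened per group

rate_1 = positives[0] / totals[0]
rate_2 = positives[1] / totals[1]
disparate_impact = rate_2 / rate_1   # ratio of group 2's rate to group 1's
print(f"Selection rates: {rate_1:.2%} vs {rate_2:.2%}, "
      f"disparate impact ratio = {disparate_impact:.2f}")

# Test whether the difference in rates is unlikely to be due to chance.
stat, p_value = proportions_ztest(positives, totals)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
```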

D | Reviewing Proxy Variables and Model Inputs: Detecting bias also means scrutinising which variables (or features) are used in algorithms and whether they could be acting as proxies for protected characteristics. In the case of the biased risk algorithm, the use of ‘healthcare cost’ as a proxy for health status was the culprit [5]. By reviewing model features, analysts can sometimes spot features that correlate strongly with race, gender, or other sensitive attributes. If a feature is contributing to unequal outcomes, it may need adjustment or removal. Feature importance analyses and sensitivity tests (evaluating model output after toggling certain inputs) are useful techniques to uncover proxy bias.
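
One straightforward screening approach is to test how well each candidate feature predicts the sensitive attribute itself; a feature that predicts it strongly deserves closer scrutiny as a possible proxy. The sketch below assumes a hypothetical input file and column names, and is only one of several ways to probe for proxy effects.

```python
# A minimal sketch of a proxy-variable check: measure how well each model
# feature predicts a sensitive attribute. File, feature, and column names
# are illustrative; a strong association flags a potential proxy.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("model_inputs.csv")                  # hypothetical data
sensitive = (df["race"] == "Black").astype(int)       # binary sensitive attribute
candidate_features = ["annual_healthcare_cost", "num_visits", "age"]

for feature in candidate_features:
    X = df[[feature]].fillna(df[feature].median())
    # AUC near 0.5 means the feature carries little information about the
    # sensitive attribute; values well above 0.5 suggest a proxy.
    auc = cross_val_score(LogisticRegression(max_iter=1000), X, sensitive,
                          cv=5, scoring="roc_auc").mean()
    print(f"{feature}: AUC for predicting sensitive attribute = {auc:.2f}")
```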

E | Human-in-the-Loop Evaluation: Finally, involving domain experts and stakeholders can aid bias detection. Clinicians, for example, might recognise when an AI’s recommendations consistently underserve a group of patients, triggering a closer look at the data. Patient advocacy groups can also provide insight into whether a model’s behaviour aligns with real-world experiences of different communities. This qualitative feedback can guide quantitative checks and vice-versa, creating a more robust bias detection process.

Identifying bias is only the first step – once detected, steps can be taken to correct or mitigate it. Approaches like collecting more diverse data, rebalancing datasets, or adjusting algorithms can help make healthcare data and tools fairer. For instance, if a dataset audit finds underrepresentation, targeted efforts can be made to include more data from the missing groups, such as launching studies in under-served communities or updating data collection practices. If performance disparities are found, developers might retrain models with bias mitigation techniques or introduce calibration factors to even out the outcomes. There is a growing movement in healthcare informatics to establish standards for fairness. An international consortium of researchers and clinicians recently published consensus recommendations on improving transparency and diversity in health datasets to combat algorithmic bias [8]. They call for measures like documenting dataset demographics, evaluating AI tools for bias before deployment, and involving diverse stakeholders in development [8]. Such guidelines echo the WHO’s principles and provide concrete steps for organisations to follow.
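
As one concrete illustration of the rebalancing idea mentioned above, the sketch below reweights training samples so that each demographic group contributes equally to model fitting. The data, group labels, and estimator are illustrative, and this is only one of many possible mitigation techniques rather than a prescribed method.

```python
# A minimal sketch of one mitigation step: reweighting training samples so
# that each demographic group contributes equally to the loss. Assumes a
# scikit-learn style estimator; data and names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def group_balanced_weights(group: np.ndarray) -> np.ndarray:
    """Give each group a total weight inversely proportional to its frequency."""
    groups, counts = np.unique(group, return_counts=True)
    weight_per_group = {g: len(group) / (len(groups) * c)
                        for g, c in zip(groups, counts)}
    return np.array([weight_per_group[g] for g in group])

# Illustrative training data: features X, labels y, group membership.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)
group = np.where(rng.random(200) < 0.85, "majority", "minority")

weights = group_balanced_weights(group)
model = LogisticRegression().fit(X, y, sample_weight=weights)
```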

It is also worth noting that regulatory bodies and journals are increasingly urging bias evaluations as part of clinical AI validation [7]. The field is moving toward a future where claims of algorithm performance must be accompanied by evidence that the model was tested for bias and is safe for all patient groups. By integrating bias detection into the standard workflow – from data gathering to model training to deployment – healthcare providers can catch problems early and avoid propagating injustices. Detecting bias in healthcare data is essential to ensure equitable and effective care. By understanding the sources of bias and diligently auditing data and algorithms, healthcare researchers and professionals can uncover hidden disparities.

REFERENCES

[1] World Health Organization (WHO). Ethics and governance of artificial intelligence for health: WHO guidance. Geneva: WHO; 2021.

[2] Norori N, Hu Q, Aellen FM, Faraci FD, Tzovara A. Addressing bias in big data and AI for health care: a call for open science. Patterns (N Y). 2021;2(10):100347. doi:10.1016/j.patter.2021.100347.

[3] Bustamante CD, Burchard EG, De la Vega FM. Genomics for the world. Nature. 2011;475(7355):163-165. doi:10.1038/475163a.

[4] Sjoding MW, Dickson RP, Iwashyna TJ, Gay SE, Valley TS. Racial bias in pulse oximetry measurement. N Engl J Med. 2020;383(25):2477-2478. doi:10.1056/NEJMc2029240.

[5] Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342.

[6] Pineda-Moncusí M, Allery F, Delmestri A, et al. Ethnicity data resource in population-wide health records: completeness, coverage and granularity of diversity. Sci Data. 2024;11:221.

[7] Chen RJ, Wang JJ, Williamson DFK, et al. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng. 2023;7:719-742.

[8] Alderman JE, Palmer J, Laws E, et al. Tackling algorithmic bias and promoting transparency in health datasets: the STANDING Together consensus recommendations. Lancet Digit Health. 2024. doi:10.1016/S2589-7500(24)00224-3.
