The Equity Test · BML-13.01

Summary: The AI That Hears You Wrong

Series 13: The Equity Test

By Syam Adusumilli · 5 min read · Cross-Cutting

Denise Watkins is 68, a retired schoolteacher from Atlanta, and she is not losing her mind. Her neurologist has followed her for twelve years without concern. She walks three miles a day and runs a reading group at her church. Last month she corrected her grandson’s algebra homework over the phone while making dinner. She is sharp, active, and fully herself.

Eighteen months ago, her health AI began flagging anomalies in her speech. The system monitors daily check-ins for changes in word-finding speed, sentence complexity, and fluency. The flags accumulated. After six months, the risk score crossed the threshold that triggers cognitive screening. The screening was administered by an AI assessment tool. She scored in the range that generates a referral to a memory clinic.

She spent a weekend she will not get back wondering whether the mind she trusted had started to leave without telling her.

Dr. Yolanda James, the clinician at the memory clinic, reviewed the AI assessments before Denise arrived. She had seen this before. She conducted her own evaluation. Denise is not cognitively impaired. She was never cognitively impaired. The speech patterns flagged as anomalies are features of African American Vernacular English: habitual “be,” consonant cluster reduction, copula deletion. Patterns Denise has used her entire life. Patterns that reflect the grammar of a language community with over thirty million speakers. The AI heard them as errors because the AI learned what normal sounds like from people who do not talk like Denise.

How this happens is not complicated, once stated. An AI speech analysis system learns from a training corpus. The composition of the corpus determines what the system treats as baseline. If the corpus consists primarily of standard American English from white, college-educated adults, the system will flag deviations from that pattern as potential signals of concern. It is not biased in the way people usually mean the word. It is accurate about the patterns it was trained on. It is inaccurate about everyone else.
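The mechanism above can be made concrete with a minimal sketch. It assumes a single hypothetical speech feature (copula omissions per 100 clauses) and a simple z-score rule; the numbers are invented for illustration and are not drawn from any real system or study. A commercial product would be far more complex, but the logic of learning "normal" from one group and measuring everyone against it is the same.

```python
from statistics import mean, stdev

def fit_baseline(values):
    """Learn 'normal' as the mean and sample standard deviation of the training corpus."""
    return mean(values), stdev(values)

def is_flagged(value, baseline, z_threshold=2.0):
    """Flag a speaker whose feature value sits far from the learned baseline."""
    mu, sigma = baseline
    return abs(value - mu) / sigma > z_threshold

# Hypothetical training corpus: speakers of one dialect only (illustrative values).
training_corpus = [0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 1.0, 1.0]
baseline = fit_baseline(training_corpus)

# Two cognitively healthy speakers (illustrative values).
same_dialect_speaker = 1.0    # matches the training population
other_dialect_speaker = 12.0  # copula deletion is a regular feature of her grammar

print(is_flagged(same_dialect_speaker, baseline))
print(is_flagged(other_dialect_speaker, baseline))
```

The detector does exactly what it was built to do. The second speaker is flagged not because anything changed in her speech, but because her lifelong grammar was never part of what the baseline learned.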

The cognitive screening tools most widely used in clinical practice compound the problem. The Montreal Cognitive Assessment, the most commonly administered screening tool for mild cognitive impairment, was developed and validated with a predominantly white, English-speaking, well-educated sample. Research has documented consistent performance differences across racial, educational, and linguistic groups that reflect the test’s construction, not the cognitive capacities of the people taking it. Black older adults score lower on several standard assessments even after adjusting for education and socioeconomic factors. The gap narrows significantly when culturally appropriate norms are applied. An AI system built on these tools inherits every limitation they carry. The baseline is wrong for anyone who does not match the population the tool was calibrated on. Changes measured against a wrong baseline produce wrong conclusions.

Fall prediction algorithms carry the same structural problem. Systems that learn gait patterns and balance characteristics from training populations will be miscalibrated for bodies not represented in those populations. The evidence base here is early, but the concern is not: any system that learns normal from a narrow population will misperform for a broader one.

The populations most affected are the populations that already face the largest health disparities. A false referral from an AI cognitive screening system lands in a healthcare system where Black patients are already more likely to have their symptoms dismissed and their treatment delayed. Non-native English speakers face compounded error: language fluency gaps layer over whatever the system is trying to measure. People with hearing impairments, whose speech reflects their hearing and not their cognition, are flagged by systems that were never taught the difference. Indigenous elders face the full range of these errors simultaneously, and the research documenting AI performance in Indigenous health contexts is thin, which is not the same as the absence of the problem.

Regulation and research are moving in the right direction. The FDA’s framework for AI medical devices is heading toward requirements for diverse training data and bias testing before deployment. Dialect-aware natural language processing is advancing in academic research, but community-informed design remains the exception in commercial health AI. The gap between the research and the products is measured in years, and the deployment speed of the products is outpacing the regulatory standards designed to govern them.

Dr. James has received five referrals in two years for Black patients whose AI-generated results reflected dialect and cultural factors rather than cognitive impairment. Five patients who spent days or weeks believing they might be losing their minds. Five families who reorganized their anxiety around diagnoses that did not exist. She wrote to the AI vendor after the third case. The vendor has not responded. She is documenting each case where she overrides an AI referral, because the research literature that will eventually quantify this problem will need the cases she is collecting now.

The minimum required is bias testing before clinical deployment, not after adverse events accumulate. A system not validated for performance across racial, linguistic, and cultural populations should not be deployed in clinical settings serving those populations. The training corpus for any speech analysis system used in health monitoring should include the speech patterns of the populations it will monitor. These are engineering specifications, not philosophical positions. Clinicians using AI health assessment tools for patients whose demographics differ from the training population should apply the standard Dr. James applies: review the AI’s work, question the referral, conduct their own evaluation. The machine is measuring the patient against someone else’s normal. Finding hers is the job.
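As an engineering specification, pre-deployment bias testing could look something like the sketch below. It assumes a validation set in which each record carries a group label, the AI's flag, and a clinician's independent finding. The function names, the thresholds (`max_rate`, `max_gap`), and the sample records are all hypothetical, chosen only to show the shape of such a check.

```python
from collections import defaultdict

def false_positive_rates(records):
    """Per-group share of unimpaired patients whom the AI flagged anyway.

    records: iterable of (group, flagged_by_ai, impaired_per_clinician).
    """
    flagged = defaultdict(int)
    unimpaired = defaultdict(int)
    for group, ai_flag, impaired in records:
        if not impaired:
            unimpaired[group] += 1
            if ai_flag:
                flagged[group] += 1
    return {group: flagged[group] / n for group, n in unimpaired.items()}

def passes_audit(rates, max_rate=0.10, max_gap=0.05):
    """Hypothetical gate: no group above max_rate, and no gap between groups above max_gap."""
    worst, best = max(rates.values()), min(rates.values())
    return worst <= max_rate and (worst - best) <= max_gap

# Invented validation records: every patient here is unimpaired per the clinician.
records = (
    [("group_a", True, False)] * 1 + [("group_a", False, False)] * 19
    + [("group_b", True, False)] * 6 + [("group_b", False, False)] * 14
)

rates = false_positive_rates(records)
print(rates)
print(passes_audit(rates))
```

In this invented example the system looks acceptable for one group and fails badly for the other, which is the result a single pooled accuracy figure would hide. Reporting error rates per population, before deployment, is what makes the gap visible.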

Not every patient has a Dr. James. Not every wrong gets caught.

Read the full article at BlueMirror.life.