The Equity Test · BML-13.01

The AI That Hears You Wrong

Series 13: The Equity Test

By Syam Adusumilli · 9 min read · Cross-Cutting

Denise Watkins is 68 years old, a retired schoolteacher from Atlanta, and she is not losing her mind. Her neurologist has followed her for twelve years. He has never expressed concern. She walks three miles a day, runs a reading group at her church, and last month corrected her grandson’s algebra homework over the phone while making dinner. She is sharp, active, and fully herself.

Eighteen months ago, her health AI began flagging anomalies in her speech. The system monitors daily check-ins for changes in word-finding speed, sentence complexity, and speech fluency. The flags accumulated. After six months, the system’s risk score crossed the threshold that triggers cognitive screening. The screening was administered by an AI assessment tool. Denise scored in the range that generates a referral to a memory clinic.

The referral arrived on a Thursday. She read it twice, called her daughter, and spent a weekend she will not get back wondering whether the mind she trusted had started to leave without telling her.

Dr. Yolanda James, the clinician at the memory clinic, reviewed the AI assessments before Denise arrived. She had seen this before. She conducted her own evaluation. Denise is not cognitively impaired. She was never cognitively impaired. The speech patterns the AI flagged as anomalies are features of African American Vernacular English. Habitual “be.” Consonant cluster reduction. Copula deletion. Patterns Denise has used her entire life, patterns her mother used, patterns that reflect the grammar of a language community with over thirty million speakers. The AI heard them as errors because the AI learned what normal sounds like from people who do not talk like Denise.

How Speech Analysis Systems Learn

An AI speech analysis system learns from a training corpus: a large collection of transcribed speech from which the system extracts the patterns it treats as baseline. The composition of the corpus determines what the system considers normal and what it considers anomalous. If the corpus consists primarily of standard American English spoken by white, college-educated adults, the system will treat that speech as the norm and flag deviations from it as potential signals of concern.

Features of AAVE will register as anomalies in such a system. So will features of Appalachian English, Chicano English, Native American English varieties, and the speech patterns of non-native English speakers. The system is not biased in the way people usually mean the word. It is accurate about the patterns it was trained on. It is inaccurate about everyone else.

The problem compounds when the system is used for health monitoring. A dialect feature flagged as a word-finding delay is not a word-finding delay. A grammatical feature flagged as a syntactic error is not a syntactic error. But the system cannot tell the difference between a dialect feature and a cognitive symptom, because it was never taught that the dialect exists.
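A minimal sketch of that mechanism, under loud assumptions: the feature (the share of clauses with no overt copula), every number, and the simple z-score rule are invented for illustration and do not describe any vendor's system. The only thing the sketch shows is that the same speaker, with the same stable speech, is flagged or not flagged depending on who was in the training corpus.

```python
from statistics import mean, stdev


def fit_baseline(samples):
    """Return (mean, standard deviation) of one speech feature across a corpus."""
    return mean(samples), stdev(samples)


def is_flagged(value, baseline, z_threshold=2.0):
    """Flag a value more than z_threshold standard deviations from the corpus mean."""
    mu, sigma = baseline
    return abs(value - mu) / sigma > z_threshold


# Hypothetical feature: share of clauses with no overt copula ("she at work").
# All values are invented for illustration.
narrow_corpus = [0.01, 0.02, 0.00, 0.03, 0.02, 0.01]  # a single dialect
broad_corpus = narrow_corpus + [0.18, 0.22, 0.25, 0.20, 0.15, 0.24]  # adds AAVE speakers

speaker_value = 0.21  # a lifelong, stable rate for this speaker

flag_narrow = is_flagged(speaker_value, fit_baseline(narrow_corpus))
flag_broad = is_flagged(speaker_value, fit_baseline(broad_corpus))

print(flag_narrow, flag_broad)
```

Against the narrow corpus the speaker sits far outside the learned range and is flagged. Against the broader corpus she is not. Nothing about her speech changed between the two runs.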

The Cognitive Screening Problem

The cognitive screening tools most widely used in clinical practice were developed with normative data from specific populations. The Montreal Cognitive Assessment, one of the most widely administered screening tools for mild cognitive impairment, was developed and validated with a predominantly white, English-speaking, well-educated sample. Research has documented consistent performance differences across racial, educational, and linguistic groups that reflect the test’s construction, not the cognitive capacities of the people taking it.

Black older adults score lower, on average, on several standard cognitive assessments even after adjusting for education and socioeconomic factors. The gap narrows significantly when culturally appropriate norms are applied. It narrows further when testing is administered in ways that account for test-taking experience, comfort with clinical environments, and the historical relationship between Black communities and medical institutions that have not always acted in their interest.

An AI system that uses these screening tools to establish cognitive baselines inherits every limitation the tools carry. The baseline is wrong for anyone who does not match the population the tool was calibrated on. Changes measured against a wrong baseline produce wrong conclusions. Denise’s screening score was not evidence of impairment. It was evidence that the test was measuring her against someone else’s normal.
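The arithmetic behind "someone else's normal" can be sketched in a few lines. The norms, the raw score, and the 1.5 standard deviation cutoff below are invented for illustration; they are not published norms for the MoCA or any other instrument, and `Norm` and `below_cutoff` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Norm:
    """Reference-group statistics for a screening test (illustrative values only)."""
    mean: float
    sd: float


def below_cutoff(raw_score, norm, sd_units=1.5):
    """True when the score falls more than sd_units SDs below the reference mean."""
    return raw_score < norm.mean - sd_units * norm.sd


# Invented numbers, not norms for any real test.
default_norm = Norm(mean=27.0, sd=2.0)  # the tool's original reference sample
matched_norm = Norm(mean=24.0, sd=3.0)  # a reference group matched to the patient

raw_score = 23

refer_default = below_cutoff(raw_score, default_norm)  # cutoff is 24.0
refer_matched = below_cutoff(raw_score, matched_norm)  # cutoff is 19.5

print(refer_default, refer_matched)
```

The same raw score generates a referral under one set of norms and none under the other. The patient is identical in both cases; only the comparison group moved.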

Fall Prediction and Body Type

The bias in speech analysis has a growing research literature behind it. The bias in fall prediction algorithms is less studied but structurally identical. Fall prediction systems learn gait patterns, balance characteristics, and movement signatures from training populations. If those populations lack diversity in body type, movement culture, and physical conditioning patterns, the system’s predictions will be calibrated for the bodies it studied and miscalibrated for the bodies it did not.

Research is beginning to examine whether fall prediction algorithms trained primarily on white older adults perform differently for Black, Hispanic, and Asian older adults whose average body composition, gait patterns, and movement characteristics may differ from the training population. The evidence base is early. The structural concern is not. Any system that learns normal from a narrow population will perform worse for a broader one.

The Populations Most Affected

The groups most affected by training data bias in health AI are the groups that already face the largest health disparities.

Black older adults whose speech patterns include AAVE features face misidentification in speech-based cognitive monitoring. The health system they enter through an AI-generated referral is the same system where Black patients are already more likely to have their pain underestimated, their symptoms dismissed, and their treatment delayed. The false referral lands in a context that is already hostile.

Non-native English speakers face compounded error. Their speech-based AI interactions layer language fluency gaps over whatever the system is trying to measure. A Japanese American elder whose English is fluent but accented may trigger the same false flags as a person experiencing genuine word-finding difficulty. The system cannot distinguish accent from impairment because it was not trained to hear the difference.

People with hearing impairments whose speech patterns differ from hearing norms face a version of the same problem. Their speech reflects their hearing, not their cognition. The AI does not know this unless someone taught it, and in most current systems, no one did.

Indigenous elders whose speech patterns, movement characteristics, and relationship to clinical environments differ from training populations face the full range of these biases simultaneously. The research documenting AI performance in Indigenous health contexts is thin. The absence of research is not the absence of the problem.

What Is Being Done

The FDA’s regulatory framework for AI medical devices is moving toward requirements for diverse training data and bias testing before clinical deployment. The direction is right. The pace is not matched to the deployment speed of the products. AI health monitoring systems are reaching consumers faster than the regulatory standards designed to govern them are being finalized.

Bias testing protocols exist but are inconsistently applied. Some AI health companies conduct demographic performance validation before deployment. Many do not. There is no consistent requirement, no standard reporting format, and no public registry of testing results that would allow a consumer or a clinician to evaluate whether a system has been validated for their patient population.
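As a rough sketch of what demographic performance validation could report, the snippet below computes a false positive rate per group: the share of people a clinician found unimpaired whom the system flagged anyway. The record format, group labels, and data are all invented; a real validation protocol would need far larger samples and more than one metric.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """records: iterable of (group, flagged_by_ai, clinically_impaired) tuples.

    Returns {group: share of clinically unimpaired people the system flagged}.
    """
    flagged = defaultdict(int)
    unimpaired = defaultdict(int)
    for group, flagged_by_ai, impaired in records:
        if not impaired:
            unimpaired[group] += 1
            if flagged_by_ai:
                flagged[group] += 1
    return {group: flagged[group] / count for group, count in unimpaired.items()}


# Invented validation records: (group, flagged by the AI, impaired per clinician).
records = [
    ("group_a", False, False), ("group_a", False, False),
    ("group_a", True, False), ("group_a", False, False),
    ("group_a", True, True),
    ("group_b", True, False), ("group_b", True, False),
    ("group_b", False, False), ("group_b", True, False),
    ("group_b", True, True),
]

rates = false_positive_rate_by_group(records)
print(rates)
```

In this made-up sample the system catches the one impaired person in each group, so a single overall accuracy figure would hide the fact that it wrongly flags three of four unimpaired people in one group and one of four in the other. Reporting the rate per group is what makes the disparity visible.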

Dialect-aware natural language processing, meaning speech analysis that recognizes the difference between a dialect feature and a cognitive symptom, is advancing in academic research. It is not yet available in any commercial AI health monitoring product that has achieved clinical deployment at scale. The gap between the research and the product is measured in years.

Community-informed design, the practice of developing AI health tools with the communities the tool will serve rather than for them, is the exception in commercial health AI. Pilot programs exist. Standard practice does not.

What Dr. James Did

Dr. Yolanda James conducted her own evaluation. She recognized the pattern because she had seen it before. Five times in two years, she has received referrals from AI cognitive screening systems for Black patients whose test results reflected dialect and cultural factors, not cognitive impairment. Five referrals, five patients who spent days or weeks believing they might be losing their minds, five sets of families who reorganized their anxiety around a diagnosis that did not exist.

She wrote to the AI vendor after the third case. The vendor has not responded. She is now required by her institution to document each case where she overrides an AI-generated referral, which means the system creates more work for the clinician who catches its error. She is building the adverse case log because someone has to. The research literature that will eventually quantify this problem will need the cases she is collecting now.

Dr. James does not want to eliminate AI health monitoring. She wants it to work for her patients. The distance between those two positions is a training dataset and a validation protocol. It is not a large distance. It has not been crossed.

The Systematic Fix

Bias testing before clinical deployment, not after adverse events accumulate. This is the minimum. A system that has not been validated for performance across racial, linguistic, and cultural populations should not be deployed in clinical settings serving those populations. The standard is not novel. It parallels the one applied to pharmaceuticals, whose sponsors must present efficacy and safety data by demographic subgroup before approval. AI health systems have been held to a lower standard because the regulatory framework has not yet caught up.

Diverse training data as a requirement, not a recommendation. The training corpus for any speech analysis system used in health monitoring should include the speech patterns of the populations it will monitor. This is an engineering requirement with a clear specification. It is not yet a consistent industry practice.

Community validation with the populations the system will serve, before deployment in those communities. The people who will be monitored by the system should participate in its design, its testing, and its validation. If the system cannot pass muster with a community advisory board that includes a retired schoolteacher from Atlanta who speaks the way her family has spoken for generations, the system is not ready for her.

Until these standards are met, every clinician using AI health assessment tools for patients whose demographics differ from the training population should apply the standard Dr. James applies. Review the AI’s work. Question the referral. Conduct your own evaluation. The machine is measuring your patient against someone else’s normal. Your job is to find hers.

Denise Watkins is 68. She walks three miles a day. She runs a reading group. She corrects algebra homework over the phone. She is sharp, active, and fully herself. An AI that could not hear her correctly told her she might not be. The wrong was correctable. Not every patient has a Dr. James. Not every wrong gets caught.

How this article connects to others in Blue Mirror.

BML-01.02 (The Baseline That Saves Your Life) describes the health AI baseline monitoring system as a tool for everyone; this article documents what happens when the baseline is established using normative data from a population that does not include the person being monitored, making it the equity test of the assumption 01.02 builds on.

BML-04.03 (What AI Can See That You Cannot) presents speech pattern and cognitive monitoring AI as detection tools for early cognitive change; this article documents the specific failure mode when those tools were not trained on the speech patterns of the person they are monitoring, and the clinical harm that misidentification produces.

BML-13.03 (The AI That Doesn't Speak Your Language) covers the parallel failure mode for bilingual elders. Where this article documents how dialect features within English trigger false cognitive flags, 13.03 documents how second-language performance under testing conditions produces false cognitive flags through a different mechanism. Together, the two articles cover the full range of speech and language failures in AI cognitive monitoring.

BML-13.SYN (Design With, Not For) synthesizes all four equity failures this series documents and traces the design process that produces them; Denise Watkins's story opens that synthesis as the first of four named exclusions, and the reader who has spent time with the full account of her referral and Dr. James's documentation will understand why the synthesis opens with the room's composition rather than the product's code.

BGM's coverage of algorithmic bias in healthcare (BGM-9B, The Bias in the Machine) documented the structural conditions that allow biased AI to deploy at scale — the regulatory gaps, the incentive structures, the absence of diverse training data requirements; readers who want the diagnosis of why bias persists in AI healthcare systems will find that foundation in BGM.

Sources cited in this article.

Nasreddine, Ziad S., et al. "The Montreal Cognitive Assessment, MoCA: A Brief Screening Tool for Mild Cognitive Impairment." Journal of the American Geriatrics Society, vol. 53, no. 4, 2005, pp. 695-699.
Manly, Jennifer J., et al. "Effect of Literacy on Neuropsychological Test Performance in Nondemented, Education-Matched Elders." Journal of the International Neuropsychological Society, vol. 5, no. 3, 1999, pp. 191-202.
Obermeyer, Ziad, et al. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science, vol. 366, no. 6464, 2019, pp. 447-453.
Koenecke, Allison, et al. "Racial Disparities in Automated Speech Recognition." Proceedings of the National Academy of Sciences, vol. 117, no. 14, 2020, pp. 7684-7689.
