The Borrowed Voice

When AI Speaks, Whose Words Does It Use?
#

Margaret sits across from her daughter Sarah, frustrated. The AI assistant on her tablet just suggested she might enjoy a podcast about cryptocurrency investing and intermittent fasting.

Margaret is 78. She has never owned cryptocurrency. She eats breakfast at 7am because her diabetes medication requires it. The suggestion makes no sense.

But it makes perfect sense if you understand where the AI learned to speak.

The Training Data Problem
#

Large language models learn from text. Billions of words scraped from the internet. The question nobody asks often enough: whose words?

The answer reveals something uncomfortable.

Common Crawl provides the foundation for most major AI models. It contains over 80% of the training tokens for GPT-3. Common Crawl is a nonprofit that archives billions of web pages. Sounds democratic. Sounds comprehensive.

It is neither.

Common Crawl captures what exists on the crawlable web. This excludes content behind paywalls, in private communities, in languages with less digital presence, from people who never had the access or inclination to publish online.

Reddit serves as a quality filter for many training datasets. OpenWebText extracts text from URLs that received at least three upvotes on Reddit. The logic seems sound: humans curated this content, so it must be good.

But who are these humans?

Reddit users skew heavily male. They skew white. They skew young. They skew toward certain countries, certain income levels, certain educational backgrounds. When Reddit becomes the arbiter of quality, these demographics become the implicit standard for what “quality” means.

BookCorpus contains 11,000 books. Unpublished books, scraped from self-publishing platforms. These authors represent a particular slice of humanity: those who wrote books, those who chose to self-publish, those who used specific platforms that happened to be crawled.

Margaret has never written a book. Her voice is not in the training data.

The Statistical Average That Does Not Exist
#

Here is the uncomfortable truth: AI models treat you as a statistical average of the people who created their training data.

When an AI generates a response, it draws on patterns learned from millions of documents. These patterns encode assumptions. The assumptions reflect the demographics of the data sources.

The “average person” implied by Reddit plus Common Crawl plus BookCorpus is younger than Margaret. More male than Margaret. More urban than Margaret. More digitally native than Margaret. More interested in cryptocurrency than Margaret.

This average person does not exist. No individual human matches the statistical composite. Yet every interaction with the AI begins from this phantom baseline.

Margaret receives cryptocurrency recommendations because the statistical average person in the training data talks about cryptocurrency. The AI has no mechanism to know that Margaret is not this person.

The Epistemology of Borrowed Knowledge
#

There is a philosophical problem here that goes deeper than demographics.

When you learn something from experience, you know it in a particular way. You know that fire is hot because you felt heat. You know that loss hurts because you grieved. This knowledge carries the texture of lived experience.

When you learn something from reading, you know it differently. You know that the Battle of Hastings was in 1066 because a book told you. This knowledge is borrowed. It lacks the texture of direct experience.

AI models have only borrowed knowledge. They have never experienced anything. Every word they generate comes from patterns in text created by others.

This creates a strange epistemological situation. The AI speaks with confidence about topics it has never encountered. It offers advice on grief without having grieved. It discusses aging without having aged. It recommends medications without having a body.

The borrowed knowledge works reasonably well when the AI talks to people similar to those who created the training data. The patterns transfer. The assumptions align.

The borrowed knowledge fails when the AI encounters someone outside that demographic center. Margaret’s experience of aging as a rural woman with diabetes and limited digital access does not appear frequently in Reddit posts or self-published novels.

The AI literally does not have the words.

What the Algorithm Cannot See
#

Consider what the training data excludes.

Oral cultures. Billions of people share knowledge through speech, not writing. Their wisdom never enters the training corpus. AI models cannot learn from traditions passed through storytelling, from recipes shared across generations, from medical knowledge held by community healers.

Private communication. The training data captures public text. It misses the conversations that matter most: the family discussions, the private struggles, the intimate moments where humans reveal their actual needs and preferences.

Marginalized voices. Those with less access to digital platforms, less time for online participation, less comfort with written English contribute less to training corpora. The AI learns their existence as statistical noise, not as full humans with complex needs.

Contextual knowledge. The training data contains words stripped from context. The AI does not know who wrote each document, what they were experiencing, what they needed. It learns patterns without understanding situations.

Margaret’s complexity disappears into averages. Her specific combination of identities, barriers, preferences, and needs matches no training document closely enough to generate appropriate responses.

The Personalization Paradox
#

AI companies promise personalization. They claim their systems will learn your preferences, adapt to your needs, serve you specifically.

But personalization built on biased foundations amplifies the bias.

If the base model assumes you match the training demographic, personalization learns your deviations from that assumed baseline. Margaret is not treated as Margaret. She is treated as “average person plus adjustments.”

This approach has a ceiling. The adjustments can only go so far. The underlying architecture still speaks with a borrowed voice trained on borrowed words from a narrow slice of humanity.

True personalization would start from Margaret as Margaret. It would build her profile from her actual context, not from her deviation from a statistical phantom. It would learn her preferences directly, not as corrections to assumptions that never fit.

Such personalization requires different architecture entirely.

The Philosophical Stakes
#

Why does this matter?

Because AI systems increasingly mediate our relationship with institutions. Healthcare systems use AI to triage, recommend, communicate. Financial systems use AI to assess, advise, decide. Educational systems use AI to teach, evaluate, guide.

If these systems see everyone through the lens of training data demographics, they will systematically misserve those outside the demographic center.

This is not a bug to be fixed with better data cleaning. It is a structural feature of how current AI systems work. They approximate minds. But whose mind are they approximating?

The answer: a composite ghost assembled from the writings of those privileged enough to contribute to the crawlable web.

What Would It Mean to Know Someone?
#

Imagine a different approach.

Imagine an AI system that began not with population statistics but with individual context. One that asked Margaret about her situation rather than assuming it. One that learned her preferences from her actual interactions rather than inferring them from demographic proxies.

Imagine a system that treated her intersecting identities as information to be understood, not as deviations from a norm to be corrected.

Imagine a system that acknowledged the barriers she faces rather than optimizing around an assumption that those barriers do not exist.

Such a system would not speak with a borrowed voice. It would learn a new voice for each person it served. Not perfect knowledge, but honest approximation grounded in actual relationship.

The technology for this exists. The architecture is possible. The question is whether we choose to build it.

Margaret’s Quiet Resistance
#

Margaret closes the tablet. She does not want cryptocurrency advice. She does not need intermittent fasting recommendations. She wants an AI that knows she takes her medication with breakfast, that her daughter Sarah helps with medical decisions, that she prefers phone calls to text messages, that her rural location limits her transportation options.

None of this appears in Reddit posts or Common Crawl archives.

Margaret’s life is not a deviation from the average. It is her life. The statistical phantom should adjust to her, not the reverse.

Until AI systems can start from the individual rather than the population, they will continue speaking with borrowed voices about borrowed experiences to people who were never included in the conversation.

The approximate mind approximates something. The question is what. The question is whom.

For Margaret, the answer matters.

When AI Speaks, Whose Words Does It Use?#

The Training Data Problem#

The Statistical Average That Does Not Exist#

The Epistemology of Borrowed Knowledge#

What the Algorithm Cannot See#

The Personalization Paradox#

The Philosophical Stakes#

What Would It Mean to Know Someone?#

Margaret’s Quiet Resistance#