The Borrowed Voice — Summary · The Approximate Mind

When an AI system suggests cryptocurrency investing and intermittent fasting to Margaret, a 78-year-old woman who has never owned cryptocurrency and eats breakfast at 7am because her diabetes medication requires it, the mismatch is not a glitch. It is a structural feature of how the system was built.

Large language models learn from text. The uncomfortable question is whose text. Common Crawl, which supplies a large share of the training tokens for many models, captures only the crawlable web. It excludes content behind paywalls, content in private communities, content in languages with less digital presence, and anything from people who never published online. Reddit, used as a quality filter for many datasets, skews heavily male, white, and young, and toward certain income and education levels. When Reddit becomes the arbiter of quality, these demographics become the implicit standard for what “quality” means. Margaret has never written a Reddit post. Her voice is not in the training data.

The result: AI models treat every user as a statistical average of the people who created their training data. This average person does not exist, but every interaction begins from this phantom baseline. Margaret receives cryptocurrency recommendations because the statistical composite talks about cryptocurrency. She is not treated as Margaret. She is treated as “average person plus corrections.”

The epistemological problem runs deeper. AI models have only borrowed knowledge: every word they generate comes from patterns in text created by others. This borrowed knowledge works reasonably well when the AI talks to people who resemble those who created the training data. It fails when the AI encounters someone outside the demographic center. Margaret’s experience as a rural woman with diabetes and limited digital access rarely appears in self-published novels or Reddit threads. The AI literally does not have the words.

What would it mean to start from Margaret as Margaret, rather than from the phantom baseline? To build her profile from her actual context rather than from her deviation from assumed norms? This is technically feasible, and the architecture already exists. The question is whether we choose to build it, or continue speaking with borrowed voices about borrowed experiences to people who were never included in the conversation.