
AI Companies Have Exhausted Human-Made Data. They’re Now Using AI to Make More

They scraped the entire web and now sell products that pollute it

Alberto Romero
5 min read · Oct 19, 2023

Here’s a not-so-surprising consequence of commercializing models like ChatGPT: We, the users, are polluting the internet to the point that neither we nor the companies can trust the otherwise “real” human-made data on the web anymore.

This is a major hurdle — and one that had never materialized until now. Companies urgently need a new source of data to train their next-generation models.

This great flood of AI-generated pollution is the inevitable effect of the collective use of this technology. We are trading a long-term cost (internet data becoming highly unreliable, to put it lightly) for a short-term benefit (e.g., a productivity boost at work). Is it worth it?

One solution that immediately comes to mind is AI detectors: classifiers that can distinguish between human-made and AI-made content with high accuracy and reliability. I’ve written about this a few times in the past — the short answer is they won’t work.

Besides independent researchers, companies like OpenAI and Google have tried to build a detector they could trust, to no avail. Could it be that they haven’t tried hard enough? In another article, I explained why companies would benefit just as much as users from reliable AI detectors:

“Random AI-generated text spread all over the internet by millions of daily users is bad for teachers but also for the companies making the models. Yet, regardless of such a strong incentive to get it right, not even the most talent-dense, deep-pocketed, AI-savvy companies have been capable of solving this problem infallibly and reliably (OpenAI removed its AI detector for lack of accuracy, which I consider the adequate move; the page now returns a “not found” error).”

Companies need to make this work for their own benefit. In the face of undetectability, if they were to train on polluted data (i.e., data that’s believed to be human-made but is, at least partly, AI-generated), they would risk degrading the performance of their models — a phenomenon known as “model collapse.”
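To see why model collapse happens, consider a toy sketch (my own illustration, not any company’s actual pipeline): each “generation” of a model is trained only on samples produced by the previous generation. Because every round samples with replacement from the previous round’s output, rare items get lost and never come back, so the diversity of the data can only shrink over time:

```python
import random

def collapse_demo(vocab_size=100, sample_size=50, generations=10, seed=0):
    """Toy illustration of model collapse.

    Generation 0 is the 'human' data: a vocabulary of distinct tokens.
    Each later generation is 'trained' only on samples drawn from the
    previous generation's output, so anything it never saw is gone for
    good. Returns the number of surviving tokens per generation.
    """
    rng = random.Random(seed)
    support = list(range(vocab_size))  # generation 0: full human-made diversity
    sizes = [len(support)]
    for _ in range(generations):
        # The next generation only ever sees a finite sample of the last one.
        samples = [rng.choice(support) for _ in range(sample_size)]
        support = sorted(set(samples))  # tokens that were never sampled vanish
        sizes.append(len(support))
    return sizes

print(collapse_demo())  # support size per generation: strictly non-increasing
```

Running this shows the support shrinking every generation — the model’s world keeps narrowing to whatever its predecessor happened to emit, which is the intuition behind why training on polluted web data worries these companies.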


Written by Alberto Romero

AI & Tech | Weekly AI Newsletter: https://thealgorithmicbridge.substack.com/ | Contact: alber.romgar at gmail dot com
