AI Companies Have Exhausted Human-Made Data. They’re Now Using AI to Make More

They scraped the entire web and now sell products that pollute it

Alberto Romero
5 min read · Oct 19, 2023

Here’s a not-so-surprising consequence of commercializing models like ChatGPT: We, the users, are polluting the internet to the point that neither we nor companies can trust the otherwise “real” human-made data on the web anymore.

This is a very big hurdle — one that never materialized until now. Companies are in urgent need of a new source of data to train their next-generation models.

This great flood of AI-generated pollution is the inevitable effect of the collective use of this technology. We are accepting a long-term cost (e.g., internet data becomes highly unreliable), to put it lightly, in exchange for a short-term benefit (e.g., a productivity boost at work). Is it worth it?

One solution that immediately comes to mind is AI detectors: classifiers that can distinguish between human-made and AI-made content with high accuracy and reliability. I’ve written about this a few times in the past, and the short answer is that they won’t work.

Besides independent researchers, companies like OpenAI and Google have tried to build a detector they could trust, to no avail. Could it be that they haven’t tried hard enough? In…
