Generative AI Could Pollute the Internet to Death
Talking about long-term problems
Generative AI is the most transformative application the field has ever seen. It will redefine how we create but also how we interact with and relate to the creations of others.
Whereas traditional AI allows us to extract patterns and insights from data, shaping them into new knowledge, generative AI goes beyond that. It uses that data to generate more data.
And that isn’t even its most profound implication. The fact that its usefulness manifests at the consumer level will change everything.
Anyone can use generative AI to create new data.
We’re living in an unprecedented era of creative expansion. What historically has been reserved for the few is now within reach for anyone with a computer and internet access.
Most people are still unaware this technology exists, but it won’t be long before it becomes mainstream. It’s easy to access and use, super cheap, and extremely versatile. And it improves fast.
Generative AI’s potential at the individual level is huge, but at the collective level it’s life-changing.
At that level, what matters most is scale—not as in “large enough to solve a problem,” but as in “large enough to cause one.”
The fast-paced development, broad usefulness, and inherent scalability (easy to use and cheap) of generative AI are both its greatest strength and its greatest weakness.
It’s not the tools, it’s how we use them
First, as I’ve written in the past, I think generative AI tools can help enhance human ability—writing, painting, coding, and anything else that may come next.
Second, not everyone uses these tools to mindlessly generate content. Some truly explore their creative selves. They imbue their creations with intent and personality (even if it’s impossible to capture them fully with words).
These caveats reveal that this “weakness” isn’t intrinsic to the tools—it’s not about “they lack intent,” “AI art isn’t art” or anything of the sort.
Instead, the problem emerges where these tools intersect with our lack of a sense of measure and the external incentives we’re all subject to. When the goal is to generate as much content as possible to obtain some benefit, the story changes.
Many people won’t just enhance their abilities; they’ll replace their own involvement, using the tools at every chance. If we can use these tools for any creative activity, many (if not most) will use them for all creative activities.
Also, people will use them as surprise boxes rather than as instruments of creative exploration: “let’s see what comes out the other side and hope it’s good enough.”
The problem is not in the tools but in our use of them.
The Algorithmic Bridge is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.
The ultimate digital catastrophe
The conclusion that follows from the above is that, eventually—and inevitably—we’ll flood the internet with AI-generated data.
(I’ll use data/info/content/media to refer to anything that can be accessed on the internet, whatever its origin, form, or purpose.)
In an attempt to debunk this hypothesis, tech blogger Gwern Branwen came up with a series of arguments explaining why this is unlikely to happen.
He argued against the idea that future generative AI will “choke to death on their own excreta.” Although his focus is more specific than mine, it’s close enough.
His is the most exhaustive and well-formulated case against this conclusion that I’ve found, so I’ll use it to stress-test my own argument.
Let’s see why and how this digital catastrophe may happen.
The rise of AI-generated data
My belief is that we’ll reach a point, however far in the future, when most text, code, images, videos, etc. won’t be human-made but AI-generated (I’ve written about this here, through a more philosophical lens).
Of course, this doesn’t imply that AI systems will create, or have created, all the data on the internet.
The data that’s already there won’t be deleted, and it’s mostly human-made. Also, people will still create data through alternative methods, however much slower they are.
However, as generative AI improves and becomes more accessible to the general public, the difference in speed at which we, collectively, create data one way vs the other will only increase—the percentage of human-made data will only get smaller.
The question is how fast and to what degree.
Against this, Gwern says it doesn’t matter because there’s “no such thing as 'natural' media on the Internet.” Everything we see or read on the web is heavily treated by algorithms that partially remove the “human-made” label.
This is undoubtedly true (at least for visual media). But I think it’s important to factor in the degree of “syntheticism”.
While an Instagram picture may have filters or reformatting applied, a DALL·E image belongs to a different category of synthetic media.
With text-to-image models, the human input is a text string. It’s the AI that comes up with a visual representation—which isn’t just a minimal transformation—and the process in between is opaque (impenetrable) and stochastic (hardly repeatable).
The degree to which data is more or less human-made matters: AI-generated data is the furthest from “natural” we can get, because the human input is minimal.
While this doesn’t necessarily reduce the value of the result, it puts it in a category of its own.
It’s already happening—fast
I don’t know how deep the hole will be, but it’s happening quite fast. Gwern said “it hasn’t happened yet,” but that was 5 months ago. Stable Diffusion was merely a rumor at that time.
The limiting factor now is computing power: How many accelerators (e.g. GPUs) do generative AI companies have access to? How powerful are they? How much memory do they have?
If we follow the money (e.g. Stability.ai, Jasper.ai), it’s clear VCs/investors will ensure the generative AI space doesn’t find that limit anytime soon.
And, as long as there are GPUs to generate data, we’ll do it.
Let’s look at the current numbers.
On October 17, TechCrunch reported that 1.5M people had signed up for DreamStudio (the official UI for Stable Diffusion, now one among many). In total, they’ve generated around 170M images. Emad Mostaque said that, “across all channels,” Stable Diffusion has more than 10M daily users.
If we extrapolate the number of images per user from the DreamStudio data to the 10M total users (a conservative assumption, given that those who use other UIs are likely more deeply involved), people have created roughly 1.1B images with Stable Diffusion. In 2.5 months.
Midjourney, another popular text-to-image model, has 3.7M Discord members. It opened the beta in March. If we take DreamStudio’s numbers and assume a similar growth rate (a reasonable assumption given that both subscription models are similar), we have another ~1.3B images.
With OpenAI’s DALL·E (1.5M sign-ups creating 2M images/day since September), we can easily add another 200M (the model was announced in April, but OpenAI let people in slowly over the months).
In total, that’s 2.6B images created in half a year with three models. There are ~750B images on the internet. If my assumptions are correct, around 0.35% of all images on the internet are now AI-generated.
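The back-of-envelope arithmetic above can be reproduced in a few lines. The inputs are the figures quoted in this section; the per-user extrapolation and the Midjourney and DALL·E totals are my own assumptions, as noted:

```python
# Back-of-envelope check of the figures above.
# Inputs come from the article; the extrapolations are assumptions.

dreamstudio_users = 1.5e6      # DreamStudio sign-ups (TechCrunch, Oct 17)
dreamstudio_images = 170e6     # images generated on DreamStudio
sd_total_users = 10e6          # Stable Diffusion daily users "across all channels"

images_per_user = dreamstudio_images / dreamstudio_users   # ~113 images per user
sd_images = images_per_user * sd_total_users               # ~1.1B

midjourney_images = 1.3e9      # assumed, from scaling DreamStudio to 3.7M members
dalle_images = 0.2e9           # ~2M images/day since September, plus earlier months

total = sd_images + midjourney_images + dalle_images       # ~2.6B
internet_images = 750e9
share = total / internet_images                            # ~0.35%

print(f"{sd_images/1e9:.1f}B SD images, {total/1e9:.1f}B total, {share:.2%} of the web")
```

Running this recovers the numbers in the text: about 1.1B Stable Diffusion images, 2.6B in total, and roughly 0.35% of all images on the internet.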
Not many. Yet.
Now, imagine that Stable Diffusion grows in a linear fashion (probably a conservative assumption) during the next 5 years to 1B daily users (a plausible total growth if we assume the tech will mature and be integrated into popular products and services).
Under those conditions, assuming everything else is constant and accepting my previous assumptions, we’d have ~2.7T AI-generated images by mid-2027.
That’s four times all human-made images on the Internet. And a similar future can be expected of text, code, etc.
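The projection can be reconstructed the same way. The daily per-user rate here is inferred from the previous section, not reported anywhere, and the result depends on an accounting choice: applying the 2027 user count to the whole five-year window yields the ~2.7T figure above, while averaging over the linear ramp yields roughly half that. Either way, the total dwarfs the 750B human-made images:

```python
# Rough reconstruction of the 5-year projection. Inputs are the article's
# figures; the daily per-user rate is inferred, not reported.

days_observed = 75                       # ~2.5 months of Stable Diffusion
sd_images_so_far = 1.1e9                 # estimate from the previous section
daily_users_now = 10e6
rate = sd_images_so_far / (daily_users_now * days_observed)  # ~1.5 images/user/day

horizon_days = 5 * 365
users_2027 = 1e9

# If the 2027 user base generated images at that rate for the whole window:
upper = users_2027 * rate * horizon_days                     # ~2.7T

# Averaging over a linear ramp from 10M to 1B daily users instead:
lower = 0.5 * (daily_users_now + users_2027) * rate * horizon_days  # ~1.4T

print(f"between {lower/1e12:.1f}T and {upper/1e12:.1f}T images by mid-2027")
```

Even the lower bound is nearly twice all human-made images on the internet today.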
But, why is this a problem?
I’ve argued why and how this is likely to happen. Now I have to answer the most relevant question: is it really a problem?
On the one hand, it’s unnecessary. This is the weaker form of the argument, which concerns indirect effects like attention scarcity and information overload.
On the other hand, it’s detrimental to the health of our digital town square. This is the stronger argument, which concerns direct effects like the overflow of low-quality data, AI’s unreliability, misinformation…
For this second argument, I won’t focus on unreliability or AI’s tendency to make mistakes. I’ll set those aside, as a lot has been written about them already. Instead, I’ll go for what I think is the novel argument: the hegemony of low-quality data.
Do we really need all this AI-generated data?
The Internet already contains more info than any human (or all combined, for that matter) could ever consume in a lifetime. Just on YouTube, users upload 30 years’ worth of videos every day.
It’s hard to argue we need more. Instead of data scarcity (we’d be hungry for more) we have attention scarcity—we’re overwhelmed.
Still, we have the incentive to create more data because the majority isn’t intended to convey or store ideas, thoughts, or feelings, but to attract attention (to achieve some further goal).
The combination of “we already have enough” with “we need to create more” is bad. And generative AI will only worsen this situation (Gwern seems to agree here).
Many people who use generative AI don’t have the goal to transmit their intent—they just want to enlarge the space they occupy in the ultra-competitive field that is the Internet.
There’s a counterpoint here, though. Not everyone matches this description—some people are true creative explorers.
Generative AI is a tool that can allow anyone to explore latent spaces in ways humans can’t. An example is AI’s ability to merge subjects with styles. I can use Dreambooth to transform images of myself in the style of The Simpsons, or fine-tune GPT-3 on Shakespeare’s texts and make it generate rap songs.
In this sense, AI-generated data is totally worth it.
However, I’d argue that this accounts for a minuscule portion of the total AI-generated data.
Most will exist to be seen, as a means to something else (in contrast to being an end in itself), only to eventually pile up in the ever-larger dead corners of the Internet.
The hegemony of low-quality data
The quantity problem is a lesser problem if we compare it with the quality problem.
The Internet is already low quality on average. I won’t argue humans produce super high-quality data whereas AI systems don’t because, actually, generative AI is likely to be biased toward high quality (given that the training datasets are partially curated).
However, the super high-quality minority at the top is exclusively human-made: in part because AI kills intent, in part because we’re more creative, in part because AI content is dull.
AI-generated data is nothing more than a concoction of data that already exists—even if it’s a higher-quality-than-average subset. It’s hard to make a case for it being higher quality data than the highest quality data humans create.
AI-generated data, by sheer quantity, could defeat the very purpose of the Internet: keeping large amounts of high-quality info at the surface, easily accessible.
AI could bury high-quality, intent-driven human-made creations under tons of generic and bland content.
Even if in the absolute sense AI-generated data is of higher quality than average, it’s hardly ever interesting, entertaining, or engaging.
If you’ve tried popular writing tools or AI art models, you’ve probably noticed this, too. Even when they behave as expected, they’re boring.
The reason is that their statistical nature and the objective they’re trained on make them default to the center of the distribution of creative possibility.
Generative models will find the “safest” output given the input whereas human creativity is best defined by risky and innovative expression. You won’t be able to create an outstanding, uniquely insightful essay with GPT-3, because it hasn’t been created to explore those areas of the latent space.
Of course, humans also tend to fall at the center of the distribution (by definition). The argument I’m making here is that, with generative AI, both tendencies reinforce each other, resulting in the most generic and dull content possible—which, in high amounts, will uniformize the Internet, making it asymptotically super boring.
If that’s not low quality, I don’t know what is.
You can always try to force generative models toward the tails of the distribution with clever prompting and fine-tuning, but it’s hard to do reliably because of the models’ disconnection from the world. The cost is often trading coherence for randomness.
Humans can be creative and coherent at the same time.
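The “center of the distribution” tendency and the coherence-for-randomness trade-off can be sketched with temperature sampling, the standard knob in real generative models. The candidate continuations and scores below are made up for illustration:

```python
import math

# Toy illustration: low sampling temperature piles probability on the most
# likely ("safe") output; raising it makes the tails likelier, but at the
# cost of coherence. Scores are invented; the mechanism is the real one.

def softmax(scores, temperature):
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Model scores for candidate continuations: generic ones score highest.
options = ["generic phrasing", "common idea", "unusual angle", "wild leap"]
scores = [4.0, 3.0, 1.0, 0.5]

safe = softmax(scores, temperature=0.5)   # mass concentrates on the mode
risky = softmax(scores, temperature=2.0)  # tails get likelier, randomness too

print([round(p, 2) for p in safe])    # -> [0.88, 0.12, 0.0, 0.0]
print([round(p, 2) for p in risky])   # -> [0.5, 0.3, 0.11, 0.09]
```

At low temperature the model almost always picks the generic option; at high temperature the “wild leap” becomes plausible, but so does every other incoherent direction, since the model has no grounding to tell an insightful tail from a nonsensical one.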
Summarizing: generative AI is gaining traction inside and outside the tech sector. It’s happening very fast. And it will flood the Internet with low-quality content.
I don't know if everything I’ve described here will happen. What I'm sure of is that I’d love to hear counterpoints to my arguments so I can change my mind.
This could well have been written in the mid-90's and it would have been totally correct (as it is now). Anyone who looks at the results of a spam filter can see that the internet is totally dominated by very very very low quality data right now and has been for many years. Fortunately the species turned out to be bright enough to invent filters. If filters need to be tweaked to accommodate the tons and tons of spam that absolutely will be generated by AI, fine. Am I missing the problem? If so, say more.
You are correct that AI content will flood the internet, and you are right to imply that there should be some near-term delineation in search engine results between "human-generated" (i.e. pre-2022) and "AI-generated" content. Your (significant) mistake is in your assertion that AI-generated content is "hardly ever interesting, useful, entertaining, or engaging." This is far far far from the truth.
I am a classically trained artist and painter. I am also an AI software designer. I can tell you factually: what has happened in the generative art space in the past 6 months is a profound tectonic shift. "Deep fakes" were a theoretical concern 5 years ago. Today, it is trivial to generate artificial photographs on MidJourney that are hard if not impossible for even an expert to distinguish from real photographs. And since the invention of the camera, photos have often been seen as "ground truth" for evidence of factual events (regardless of the editorial trickery of altering cropping / exposure, etc). For the past 15 years, we could effectively "PhotoShop" all kinds of falsehoods, but that took time, money, and expert skill. MJ now does it in 60 seconds, no skill required, for pennies. I have shown some of my favorite MJ creations to both professional artists, photographers and laypeople alike, and the reaction is unanimous: "No, they're not 'as good' as a human could create, they're *better* than human."
Get used to that: better than human. In the course of my career I have both hired and been hired to produce commercial photography and illustration, often spending between $5k and $20k for a single photo shoot, that had dozens of specialists on set (lighting, makeup, food prep, set builders), and took months to arrange. MidJourney has produced in many cases *better* results than those shoots, for pennies, in seconds. This is not "junk content." This is mindblowing creativity that is rapidly transcending human capability.
Get ready for a future where your entire internet experience is livestreamed to you, created on the fly *just for you* by generative AI bots. Dangerous? Yes. Echo Chambers? Yes. Addictive? Totally. Profitable? (to the global megacorps and social media companies) Indubitably.