From Siri to Photoshop to Google Search—Large AI Models Will Redefine How We Live
GPT-3 and DALL·E-like AI models will power services and products you use every day. But are we ready?
When we think about the consumer technologies that have redefined how we live and interact with one another in the 21st century, two come to mind immediately: smartphones and social media. We’re about to witness the emergence of the third.
Apple’s announcement of the iPhone in 2007 marked a “before and after” for the internet, advertising, software distribution, and phones themselves (remember the BlackBerry?). The iPhone slowly but surely reshaped our day-to-day relationship with the digital world. We can no longer fathom a world without a smartphone in our pockets. The same happened with social media. From Facebook to TikTok, these algorithm-driven feeds of posts and news govern our virtual relationships—which these days take up even more time and space than our physical ones. The world has changed so much in the last twenty years that it’d be largely unrecognizable to anyone who lived in the 20th century.
During that same period, artificial intelligence (AI) reached a new stage of interest, funding, and development—intrinsically entwined with that of smartphones and social media. Since 2012, deep learning-focused research has yielded impressive results: first computer vision systems and then, after Google introduced the transformer architecture in 2017, large language models. In merely a decade, the AI community has developed a better understanding of neural networks and scaling laws, built larger and better-quality datasets, and designed powerful hardware increasingly well-suited to demanding AI workloads. From 2012 to 2022, the field has evolved at an unprecedented pace.
Today, generative large language models, together with multimodal and art models, dominate the landscape, and tech giants, ambitious startups, and non-profit organizations aim to leverage their potential—either for private benefit or to democratize their promises.
One startup in particular, OpenAI, has been a key player during the last five years. The company didn’t start the fierce race for AI hegemony, but it notably accelerated it in mid-2020 with the release of GPT-3—arguably the best-known AI model of the decade. This 175-billion-parameter monster outsized its predecessors (GPT-2, BERT) by 100X, demonstrating the applicability of scaling laws to language models: bigger was significantly better. News articles about GPT-3’s capabilities traveled around the world and captured the attention of companies, investors, and consumers alike. The two years since have been completely crazy.
From GPT-3 to LaMDA. From DALL·E to Stable Diffusion
Apart from GPT-3, another very well-known language model is LaMDA (137 billion parameters). Announced by Google in May 2021, it caught the public eye earlier this year after ex-Google engineer Blake Lemoine claimed the model was sentient. Of course, it isn’t sentient in any sense of the word, and neither is PaLM (540B), Google’s latest development in language AI, published in April. PaLM currently holds the title of largest dense language model—and that of highest performance across benchmarks. It’s the state of the art (SOTA) in language AI.
Google, although always at the epicenter of AI research, isn’t the only big tech company to publicly showcase its presence in this AI race. Meta, formerly Facebook, has made strides of its own. Three weeks ago, the company announced the third version of BlenderBot (175B), a powerful language-model-based chatbot that it released in the US for people to play with (it didn’t end well, more on this later). Also, and perhaps more surprisingly, in May Meta announced OPT (175B), GPT-3’s open-source sibling. Microsoft (OpenAI’s main funding source) and Nvidia (whose ubiquitous GPUs are used to train and run AI models) joined forces in October 2021 to develop MT-NLG, a 530B model that, together with PaLM, makes the others look small in comparison.
Smaller AI-centered companies like DeepMind and OpenAI have managed to stay on top despite the aggressive competition from wealthier rivals. DeepMind joined the race at the end of 2021. Its first model, Gopher (280B), outperformed all previous models, achieving SOTA status. A few months later, in March 2022, the company announced Chinchilla (70B), significantly smaller than all the other language models but more performant (it achieved a new SOTA, since surpassed only by PaLM). DeepMind used Chinchilla to revisit language models’ scaling laws and showed that data is just as important as size. OpenAI, of course, wasn’t going to be left behind. The company improved GPT-3 into a more aligned version, InstructGPT, and, if the predictions are accurate, GPT-4 should be around the corner. Other startups have copied OpenAI’s business model and offer large language model services in a pay-as-you-go fashion. The most prominent are AI21 Labs and Cohere, whose best models compare favorably to OpenAI’s.
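The Chinchilla result boils down to a simple heuristic: a compute-optimal language model should be trained on roughly 20 tokens per parameter (Chinchilla’s 70B parameters were trained on about 1.4 trillion tokens). Here’s a back-of-the-envelope sketch of that rule of thumb—the function name is mine, and 20 tokens per parameter is an approximation, not an exact law:

```python
# Chinchilla rule of thumb: a compute-optimal language model should be
# trained on roughly 20 tokens per parameter (an approximation).
TOKENS_PER_PARAM = 20

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return n_params * TOKENS_PER_PARAM

# Chinchilla itself: 70B parameters -> ~1.4 trillion tokens.
print(f"{chinchilla_optimal_tokens(70e9):.2e}")   # 1.40e+12
# By the same rule, GPT-3's 175B parameters would call for ~3.5T tokens,
# far more than the ~300B tokens it was actually trained on.
print(f"{chinchilla_optimal_tokens(175e9):.2e}")  # 3.50e+12
```

By this heuristic, a smaller model trained on enough data can match or beat a much larger one trained on too little—which is exactly what Chinchilla demonstrated against Gopher and GPT-3.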
In the non-profit space, collective initiatives focused on open science and open source are also claiming a portion of the pie. BigScience, in collaboration with Hugging Face and with the help of EleutherAI and other organizations, developed BLOOM (176B), a model born from a foundation of ethical and inclusive principles and values. I argued BLOOM is “the most important AI model of the decade”—a bold claim backed by thought-provoking arguments.
What I’ve described here is an illustrative—and incomplete—picture of how large language models have changed the AI landscape and the industry’s goals. However, AI companies didn’t stop at language. The world is multimodal and we’re multisensorial. It makes sense to try to imbue this multidimensionality into AI systems. This is the origin of multimodal models (Google’s MUM), visual language models (DeepMind’s Flamingo), generalist agents (DeepMind’s Gato), and now, the most popular trend in AI: Diffusion-based generative visual models (also called AI art models).
Generative visual models—also partially trained on language—are significantly smaller than their language-centered counterparts but hold comparable world-shaping potential. Again, it was OpenAI that popularized this type of model, with DALL·E and CLIP in early 2021, GLIDE in late 2021, and, earlier this year, DALL·E 2, which sparked the AI art trend we’re immersed in. In the meantime, other companies were developing their own: Microsoft published NUWA in 2021. Meta built Make-A-Scene in March 2022. And Google announced Imagen and Parti in May and June 2022, respectively.
But the most interesting and useful models are those we can use. In the beginning, we only had Google Colab notebooks (Disco Diffusion); then developers started to build no-code, easy-to-use apps on top of diffusion models. The best-known models besides DALL·E are Craiyon (formerly DALL·E mini), Midjourney, and the one on everyone’s lips these days: Stability.ai’s Stable Diffusion, which I recently dubbed “the most important AI art model ever.” These models, some behind paid memberships and others free to use, are redefining the creative process and our understanding of what it means to be an artist (read this if you want to know more).
After these seven long paragraphs revisiting the most impactful news of the last two years, what if I told you that these AI models (mixing language-, multimodal-, and art-based features) are going to become your next virtual assistant (a smart and truly conversational Siri or Alexa), your next search engine (an intuitive and more natural Google Search or Bing), or your next artistic tool (a more versatile and creative Photoshop or GIMP)? Research models are going to become tangible products.
Shifting from research to production is a challenging process. Tech companies already have experience here because computer vision systems enjoy a multi-year head start over language AI. Once companies like Google, Meta, or Amazon considered the technology mature enough, they embedded vision-based AI systems into their existing services and products. Face and emotion recognition, object detection, ID verification, pose detection, and feature extraction software are implemented in services and devices all over the world, across industries and markets. The same thing happened with recommender systems, which now power all social media algorithms, from Facebook to YouTube to TikTok, and many other consumer internet services, like streaming (Netflix), music (Spotify), and shopping platforms (Amazon).
This is what’s going to happen, sooner or later, with language, multimodal, and art models. This shift from research to production will entail a third technological revolution this century. It will complete a trinity formed by smartphones, social media, and large AI models, an interdependent mix of technologies that will have lasting effects on society and its individuals. How is this going to impact the world at large and all of our tiny private worlds? How is it going to redefine our relationship with technology and with one another? In which unforeseen ways will it affect our daily lives? We’ll find out sooner rather than later.
The third technological revolution of the 21st century
Siri was released by Apple in 2011. That was even before the deep learning paradigm took the AI community by storm in 2012. Siri’s most relevant feature is that it’s integrated natively into every iPhone. At the time of this writing, there are around 1.2 billion active iPhones. That’s a lot of people with access to Siri. In contrast, “only” around 1 million people use GPT-3. But unlike GPT-3, Siri isn’t very smart or versatile. It’s so limited that more and more people stop using it every year.
Now, imagine what a Siri-scale multimodal, multitasking AI model could do. It could write your emails, tweets, and even essays (not like this one, though). It could search the internet for you and summarize key news about what’s happening around the world. It could describe in words the picture you just took with your phone camera, and paint it in the style of Rembrandt. It could improve the lighting and make your smile more beautiful (is that possible?). It could then share the pic across social media with the perfect caption for each platform. Wait, you don’t like your outfit? No problem, let’s redesign it until you do… This is just a small fraction of the endless possibilities you’ll have from the comfort of your phone. Truly a revolution in consumer technology.
But there are important challenges ahead. From a hardware perspective, Siri exists because it easily fits into any modern phone chip. GPT-3-like models, however, are significantly larger. The sheer size of these models—and the untapped multi-billion-dollar market they’d open if they were deployed—is the reason behind the emergence of so many AI hardware startups in recent years. They build chips better suited to AI workloads to simplify the transition from research to production, allowing AI companies to reduce time to market and implement their models in everyday devices like smartphones and tablets. We’re not there yet, but millions of dollars move these companies forward (both hardware- and software-focused), and they’ll eventually solve the most pressing deficiencies of current hardware.
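To put concrete numbers on “significantly larger”: at 16-bit precision, a model’s weights take about 2 bytes per parameter. A quick back-of-the-envelope sketch (the function name is mine, and this counts only the weights, ignoring activations and everything else inference needs):

```python
# Rough memory footprint of a model's weights alone,
# assuming 16-bit precision (2 bytes per parameter).
BYTES_PER_PARAM = 2

def weights_gb(n_params: float) -> float:
    """Approximate size of the weights in gigabytes."""
    return n_params * BYTES_PER_PARAM / 1e9

print(weights_gb(175e9))  # GPT-3 (175B): 350.0 GB
print(weights_gb(530e9))  # MT-NLG (530B): 1060.0 GB
# A flagship phone has on the order of 8-12 GB of RAM, so GPT-3's
# weights alone are roughly 30-40x too large to fit on-device.
```

This is why today’s largest models run in data centers spread across many GPUs, and why specialized inference chips and aggressive model compression are the two paths toward running them on everyday devices.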
Stability.ai, the company behind Stable Diffusion, has chosen another approach to overcome hardware limitations: shrinking the models as much as possible so they fit into consumer hardware. Stable Diffusion is the only high-quality AI art model that fits on a single Nvidia GeForce RTX 3090 GPU ($1,500–$3,000). And, as founder Emad Mostaque said on Twitter, they’re trying to reduce Stable Diffusion’s size further, to 100 MB.
Stability.ai is also working on large language models similar to GPT-3. Mostaque said they could train one for less than $1M, but they prefer to leverage Chinchilla’s scaling laws and take advantage of smaller models that show similar performance. Stability.ai is clearly focusing its efforts on consumers—which also hints that we’re very close to having these systems implemented in everyday devices (maybe even natively).
When AI isn’t ready for the world wide wild
But there’s a final and more daunting barrier ahead, at least if we consider the societal and ethical consequences AI systems can have. To understand this problem we don’t have to make predictions or imagine what could be, because it has already happened. Remember that computer vision and recommender systems are already deployed across devices and industries? We can look at their shortcomings and judge for ourselves what could happen if we deploy even more powerful technology before it’s ready. (There are a lot of upsides too, but I’ll focus here on the problems.)
Although computer vision and recommender systems are better understood, easier to control, and more interpretable than transformer-based models like GPT-3 and LaMDA, or diffusion-based models like DALL·E and Stable Diffusion, they can also be unpredictable and have consequences that are often dismissed—and only handled once the harm is done.
In 2015, Google apologized after a newly released feature of its Photos app labeled two Black people as “gorillas.” The system was clearly not ready, not because it was incapable of labeling people, but because of the racist bias inherited from its training data. How did Google solve it? By removing “gorilla” from the app’s labels altogether, a fix that was still in place three years later. Just a band-aid. The same thing happened again in 2021, when Facebook’s automatic recognition system labeled Black men in a video as “primates.”
The problems with recognition software don’t stop at mislabeling. Crime prediction systems, which have been widely criticized, contain the same biases against non-white people found in the datasets companies use to train them. Professor Chris Gilliard wrote an illuminating article in Wired telling the story of Robert McDaniel, who lived in a troubled neighborhood and was shot twice after a predictive policing program marked him as a “person of interest.” About the system, Gilliard writes:
“This is not merely a self-fulfilling prophecy, though it certainly is that: It is a system designed to bring the past into the future, and thereby prevent the world from changing.”
Face and emotion recognition systems are subject to the same issues. As psychologist Lisa Feldman Barrett says, “it is not possible to confidently infer happiness from a smile, anger from a scowl, or sadness from a frown, as much of current technology tries to do when applying what are mistakenly believed to be the scientific facts.” Despite the “shaky scientific ground” under this tech, companies began releasing these systems in 2016. Some examples are Google’s Cloud Vision API, Amazon Rekognition, Microsoft’s Face API, and startups like Affectiva and HireVue. (Recently, Microsoft and HireVue took down their emotion-recognition features, following recommendations from AI ethics researchers.)
Recommender systems also have shortcomings that can devolve into very real harm. Journalist Mitchell Clark wrote for The Verge in July that “the TikTok ‘blackout challenge’ has now allegedly killed seven kids.” They died last year trying to replicate a viral “challenge” that involves choking oneself. Smith and Arroyo, who filed the latest lawsuit against TikTok, argue that the content was promoted to their children through the For You feed.
In case you’re still not convinced because vision is different from language, here are three pioneering cases of AI language models being deployed in the wild. The results speak for themselves. In 2016, Microsoft released Tay, an experimental chatbot meant to learn from conversations. In less than 24 hours, Twitter users turned it completely racist; some of what it ended up saying is plainly illegal in some places.
GPT-4chan’s case was overtly intentional. ML researcher Yannic Kilcher trained the chatbot, built on an open-source model, on 4chan posts (4chan is infamous for the toxicity of its users). Then he released the bot, as a “prank and light-hearted trolling,” on the same 4chan board it was trained on. It “perfectly encapsulated the mix of offensiveness, nihilism, trolling, and deep distrust,” Kilcher said. “The worst AI ever.” The perfect app to have on your kid’s smartphone.
The last example happened just a few weeks ago. Meta released BlenderBot 3 for people in the US to talk to. Within days, users were sharing screenshots of it repeating conspiracy theories and antisemitic remarks.
As one commenter put it, sarcastically: “is that why it’s US-only at the moment? because they can’t even stop it from saying stuff that’s eg illegal in Germany?” It’s quite obvious these models aren’t ready to be deployed in the world unless, as Gilliard says, we want “to prevent [it] from changing.”
As for generative visual models, I’ll just say that people are already creating non-consensual pornography with the faces of celebrities (I’ve seen it; we believe humans are complex, but then you witness simplicity at its finest). That’s without mentioning the well-known biases present in the datasets, which have forced companies like OpenAI to implement guardrails so strict that they’re making DALL·E nearly unusable.
Anyone working in, or keeping an eye on, the AI field knows that language and art models are subject to the same problems that plague computer vision systems. The world is toxic and biased against discriminated minorities. The datasets companies feed these systems reproduce those biases, and the systems perpetuate them once trained and deployed (even ethics-centered projects like BLOOM are victims of this).
We’re on the verge of a technological revolution that will redefine and reshape the way we interact with technology and with one another. But, are we ready for this? Is the technology ready for this?
Some arguments defend the opposite stance: that these models can—and should—be deployed despite the problems they may cause downstream:
“Technology is always a double-edged sword. It can be used for bad but that shouldn’t stop progress because it can also be used for good.”
“These AI systems will never be 100% bias-free. By these standards, they’ll never be ready to deploy.”
“These AI systems simply reflect the world as it is, full of bias, discrimination, and toxicity.”
“Applying filters and censorship to AI systems reduces freedom.”
I won’t refute them in detail now (I’ll do that in a future article). For now, consider them food for thought. Leave me a comment with your thoughts: Do you think the deployment of these large AI models will have the effects I’ve described? Do you agree with these counterarguments?
What’s your craziest prediction for how these AI models will impact our lives once they’re moved into production and embedded into real-world, non-niche products and services?