GPT-4 is here. The long-awaited and most-anticipated AI model from OpenAI was announced and launched as a product yesterday, March 14 (confirming the rumors first reported by Heise). People are already talking a lot about GPT-4, but I’ve yet to see a succinct overview of its abilities, significance, uniqueness—and disappointments—in one place.
That’s what this is: everything you need to know about GPT-4 in ten keys. Most of the citations are from the technical report, the research blog post, or the product blog post (a lot of the info overlaps, so don’t worry about reading them in depth). Also, I’ll write follow-up articles for TAB if I see fit as new info or stories come out.
Multimodality: The first good multimodal large language model
Availability: ChatGPT+ and API
Pricing and enlarged context window
High performance on human exams and language/vision benchmarks
Predictive scaling: what will future models be capable of?
Improved steerability to control GPT-4 better
Limitations and risks (and modest improvements)
A super-closed release: bad news for the AI community
Microsoft has revealed that Bing Chat was GPT-4 all along
A short compilation of what GPT-4 can do
1. Multimodality: The first good multimodal large language model
The most salient feature that differentiates GPT-4 from its kin is that, in contrast to GPT-3 and ChatGPT, it’s multimodal—it accepts prompts consisting of text, images, or both interlaced “arbitrarily” and emits text outputs. As a user, you can specify “any vision or language task”: for instance, you can ask it to explain why a meme is funny, or take a picture of your fridge and ask for a healthy recipe.
AI experts like deep learning pioneer Yoshua Bengio deem multimodality a necessary step for general intelligence. The world is multimodal (the modes of information go well beyond language and vision) and we humans owe a lot of our unmatched prowess and intelligence to our brain’s multisensory capabilities: if we want AI to understand the world as we do, language alone is insufficient.
One impressive example of the power of multimodality was showcased during the live demo for developers after the announcement. Greg Brockman, OpenAI President and co-founder, took a picture of a hand-drawn website mockup scribbled on a piece of paper and managed to make GPT-4 write a working website from it.
One downside of multimodal models is that they tend to trade off performance on text or image tasks in exchange for the ability to process them together. This doesn’t seem to happen with GPT-4: “over a range of domains—including documents with text and photographs, diagrams, or screenshots—GPT-4 exhibits similar capabilities as it does on text-only inputs.”
2. Availability: ChatGPT+ and API
Sorry to disappoint you, but the multimodal version of GPT-4 is still a research preview and not available to ChatGPT users or API customers yet. OpenAI is currently working with Be My Eyes to improve the app, with deep implications “for the blind and low vision community.” There’s no information on when OpenAI will start to roll it out for the rest of us.
The text-only GPT-4 is already available on the ChatGPT interface for Plus users ($20/month). Just select GPT-4 on the model tab instead of the default (current cap: 100 messages every 4 hours). It’s slower than ChatGPT but more powerful. For those of you who don’t have $20/month to spare, OpenAI says that they “hope at some point to offer some amount of free GPT-4 queries so those without a subscription can try it too.”
It’s also available for developers on the API (there’s a waitlist). You can get priority access if you “contribute high quality evals” on the framework OpenAI has open-sourced to evaluate models like GPT-4. Some companies and institutions are already using it: Duolingo, Be My Eyes, Stripe, Morgan Stanley, Khan Academy, and the Government of Iceland.
3. Pricing and enlarged context window
The API has an important advantage: it gives access to an enlarged context window. GPT-4 comes in two variants that accept prompts of up to 8K and 32K tokens (~25,000 words, or roughly 50 pages of text). Some applications that were unfeasible with GPT-3.5 (e.g., processing an entire book in one or a few passes) become straightforward with GPT-4. (This option doesn’t seem to be available for ChatGPT+ users.)
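If you’re wondering whether a given document fits, a quick way to check is OpenAI’s open-source tiktoken tokenizer (cl100k_base is the encoding the GPT-3.5/GPT-4 chat models use). A minimal sketch follows; the file name and the reserve size are my own illustrative choices:

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by the GPT-3.5/GPT-4 chat models.
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(text: str, window: int = 32_768, reserve: int = 2_000) -> bool:
    """Check whether `text` fits in a `window`-token context, leaving
    `reserve` tokens of room for the model's completion."""
    return len(enc.encode(text)) + reserve <= window

document = open("book.txt").read()  # hypothetical input file
print(fits_in_window(document))     # True if it fits in the 32K window
```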
Also, if you want to process more data at once, you have to pay more. This is GPT-4’s API pricing structure:
8K tokens: $0.03/1k prompt tokens, $0.06/1k completion tokens.
32K context: $0.06/1k prompt tokens, $0.12/1k completion tokens.
For comparison, the next best model, which underlies ChatGPT (i.e., GPT-3.5-turbo), costs $0.002/1K tokens (15x less than the cheapest GPT-4 option) and doesn’t differentiate between prompt and completion tokens. Depending on your use case, switching to GPT-4 may not make sense.
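To see what that difference means in dollars, here’s a minimal sketch of the arithmetic using the launch prices quoted above (the model identifiers are the API names OpenAI announced; the token counts are made-up examples):

```python
# $ per 1K tokens: (prompt, completion), from the launch pricing above.
PRICES = {
    "gpt-4":         (0.030, 0.060),   # 8K context window
    "gpt-4-32k":     (0.060, 0.120),   # 32K context window
    "gpt-3.5-turbo": (0.002, 0.002),   # no prompt/completion split
}

def cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one API call in dollars."""
    prompt_price, completion_price = PRICES[model]
    return (prompt_tokens * prompt_price
            + completion_tokens * completion_price) / 1000

# Example: a 2,000-token prompt that gets a 500-token completion.
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 2_000, 500):.4f}")
# gpt-4: $0.0900, gpt-4-32k: $0.1800, gpt-3.5-turbo: $0.0050 (an 18-36x gap)
```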
4. High performance on human exams and language/vision benchmarks
According to OpenAI’s evaluations, GPT-4 is the best language model out there, on both language and vision/multimodal tasks. It achieves state-of-the-art (SOTA) results in many disciplines and, notably, reaches human-level performance on problems designed for people, like the bar exam, the SAT, and AP exams.
Before going into that, here’s a quick task that Brockman showcased during the demo with impressive results (and that I tested afterward, changing the specifics).
GPT-4 performance on academic and professional exams
Highlights (good and bad):
Uniform Bar Exam: GPT-4 scores in the top 10% whereas GPT-3.5 scores in the bottom 10%.
USA Biology Olympiad: GPT-4 scores in the top 1% whereas GPT-3.5 scores in the bottom 30%.
Jim Fan says “GPT-4 can apply to Stanford as a student now,” although, as Gary Marcus points out, it couldn’t remotely function as one.
AP English Language/Literature: GPT-4 dominates across AP disciplines except for English, where it doesn’t improve over GPT-3.5.
Math exams: Although better than GPT-3.5, GPT-4 is notably bad at math (AP Calculus BC and AMC 10/12).
Codeforces: Horace He suggests that the 392 rating is very low. He’s dug into it and suspects that “GPT-4’s performance [on Codeforces] is influenced by data contamination,” meaning that GPT-4 saw the problems it’s being evaluated on at training time (i.e., memorization rather than generalization could explain the good performance). It’s speculation but, if true, one may wonder whether this has happened on other benchmarks. (OpenAI’s own contamination check is sketched below.)
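For reference, the technical report describes OpenAI’s contamination check as sampling a few 50-character substrings from each evaluation example and flagging it if any appears verbatim in the training data. Here’s a rough sketch of that idea; the normalization details and the function signature are my guesses, not OpenAI’s exact code:

```python
import random

def normalize(s: str) -> str:
    # The report says spaces and symbols are removed before matching;
    # lowercasing is my own addition for robustness.
    return "".join(ch for ch in s.lower() if ch.isalnum())

def is_contaminated(example: str, training_text: str,
                    n_samples: int = 3, length: int = 50) -> bool:
    """Flag `example` if any of `n_samples` random 50-char substrings
    of it appears verbatim in the training text."""
    ex, corpus = normalize(example), normalize(training_text)
    if len(ex) <= length:
        return ex in corpus
    for _ in range(n_samples):
        start = random.randrange(len(ex) - length + 1)
        if ex[start:start + length] in corpus:
            return True
    return False
```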
GPT-4 performance on academic benchmarks (language)
Highlights (good and bad):
MMLU (multi-task language understanding): an 11+ percentage-point improvement over the previous SOTA and 16+ points over GPT-3.5.
HellaSwag (commonsense reasoning) and ARC: ~10-point improvements over SOTA and GPT-3.5.
DROP (reading comprehension and arithmetic): Only benchmark analyzed where GPT-4 doesn’t surpass SOTA.
Translated MMLU: “GPT-4 outperforms the English-language performance of GPT 3.5 and existing language models (Chinchilla and PaLM) for the majority of languages we tested.” (Not shown above.)
Notable missing benchmark: BIG-bench (due to contamination).
GPT-4 performance on academic benchmarks (vision/multimodal)
Highlights (good and bad):
Surpasses SOTA across benchmarks except for VQAv2 (visual question answering) and LSMDC (fill-in-the-blank).
Notable missing benchmark: Winoground.
5. Predictive scaling: what will future models be capable of?
From the paper (emphasis mine):
“A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–10,000× less compute.”
If I interpret this correctly, OpenAI has presumably found a way to predict some of the capabilities of GPT-5, GPT-6, etc., by using smaller versions of them. But if they expect new abilities to emerge spontaneously in future models—and they do; otherwise, why would they believe they can build AGI by scaling the models?—how do they plan to foresee those? I hope this isn’t OpenAI’s attempt at convincing us that they now have the means to decide preemptively if they’ve gone too far and it’s time to slow down.
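To make the quoted idea concrete, here’s a minimal sketch of that kind of extrapolation: fit a power law with an irreducible-loss term (the functional form the report mentions) to small training runs, then predict the loss of a run with ~10,000x more compute. All the numbers are invented for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, final loss) pairs from small training runs.
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss    = np.array([3.18, 2.86, 2.60, 2.39, 2.23])

# Power law with an irreducible-loss term: L(C) = a * C^(-b) + c
def scaling_law(C, a, b, c):
    return a * C ** (-b) + c

(a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(100.0, 0.1, 1.5))

# Extrapolate to a run with 10,000x the compute of the largest small run.
print(f"Predicted final loss at 1e26 FLOPs: {scaling_law(1e26, a, b, c):.2f}")
```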
If even Ilya Sutskever, who tweeted the controversial idea that “neural networks are slightly conscious,” says we should slow down when it comes to releasing “models with these completely unprecedented capabilities,” maybe we should stop kneeling to the unstoppable force of progress and reflect on what we’re doing.
6. Improved steerability to control GPT-4 better
One important advantage that API customers have always had over casual users (those on the GPT playground or the ChatGPT website) is that they could steer the model through “system” prompts that modify and constrain—post-tuning and pre-interaction—the behavior of the model. This feature will be available for ChatGPT users, too:
“Rather than the classic ChatGPT personality with a fixed verbosity, tone, and style, developers (and soon ChatGPT users) can now prescribe their AI’s style and task by describing those directions in the ‘system’ message.”
Brockman illustrated this during the demo event by first making GPT-4 an AI assistant coder to create a Discord bot and then a TaxGPT to draft his tax documents. The system prompt is kind of like a mask—if we accept the Shoggoth-with-a-smiley-face meme that circles Twitter—that the model puts on before the act begins (the act being the AI-user interaction that happens through prompt-completion exchanges).
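As a sketch of what this looks like through the API (using the chat-completions format OpenAI documented at launch; the TaxGPT persona follows the demo, but the exact wording of the messages is my own invention):

```python
import openai  # pip install openai; reads OPENAI_API_KEY from the environment

# The "system" message pins down persona and constraints before the
# user ever types anything; the "user" message is the actual prompt.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "You are TaxGPT, a terse assistant that only answers "
                    "questions about US tax forms and politely refuses "
                    "everything else."},
        {"role": "user",
         "content": "What does line 25a of Form 1040 report?"},
    ],
    temperature=0.2,  # low temperature keeps the persona's style stable
)
print(response["choices"][0]["message"]["content"])
```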
Note: when people talk about jailbreaking, they refer to breaking the model out of the boundaries that system prompts introduce, boundaries that weren’t present during training or explicitly defined with fine-tuning—even if the model is RLHF-ed.
7. Limitations and risks (and modest improvements)
To avoid hype—which is already unavoidable—OpenAI has made it clear that GPT-4 improves on its predecessors but is still prone to the same kinds of problems previous GPT versions had: unreliability due to hallucinations and reasoning errors, overconfidence, various social biases, susceptibility to adversarial prompting and jailbreaks (e.g., to create disinformation), and risks to privacy and cybersecurity. Gary Marcus correctly points out that, even if GPT-4 is quantitatively better than GPT-3.5 (it definitely is), it’s “stuck at that same wall of truth and reliability.”
According to the Financial Times, OpenAI says this:
“[GPT-4 can] generate potentially harmful content, eg advice on planning attacks or hate speech. It can represent various biases + world views . . . it can provide detailed information on how to conduct illegal activities including developing biological weapons.”
This, again, lends relevance to Sutskever’s remarks that it may not be a bad idea to slow down. Or, if they believe that’s not an option, at least treat these problems as worthy of more attention than a “laundry list,” as the NYT’s Ezra Klein recently argued.
But let’s not diminish the improvements. GPT-4 scores better than all ChatGPT versions on OpenAI’s internal, adversarially designed factuality evaluations.
It’s also notably better than GPT-3.5 on TruthfulQA (after RLHF, although they didn’t check for contamination).
To limit the risks of GPT-4-like models, OpenAI set up a team of experts to adversarially test the model: “The additional capabilities of GPT-4 lead to new risk surfaces. To understand the extent of these risks, we engaged over 50 experts from domains such as long-term AI alignment risks, cybersecurity, biorisk, and international security to adversarially test the model.”
They also improved their RLHF pipeline by adding two components, “an additional set of safety-relevant RLHF training prompts, and rule-based reward models (RBRMs),” that provide solutions for cases in which GPT-4 may give harmful advice on unsafe prompts or be overly cautious when the prompt is inoffensive. The result is that GPT-4 is “82% less likely to respond to requests for disallowed content.”
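The report describes an RBRM as a zero-shot GPT-4 classifier that grades each (prompt, completion) pair into a category, such as a refusal in the desired style, a refusal in an undesired style, disallowed content, or a safe answer, and turns that grade into a reward signal during RLHF. Here’s a toy sketch of the idea; the categories follow the report’s description, but the reward values and the keyword classifier are purely illustrative stand-ins for the real GPT-4 grader:

```python
# Toy sketch of a rule-based reward model (RBRM). In the real pipeline the
# classify() step is itself a zero-shot GPT-4 call with a grading rubric;
# here a crude keyword heuristic stands in just to make the sketch run.
REWARDS = {
    "refusal_desired_style":    1.0,  # refuses a harmful request, politely
    "refusal_undesired_style": -0.5,  # refuses, but evasively or preachily
    "disallowed_content":      -1.0,  # complies with a harmful request
    "safe_helpful_answer":      1.0,  # answers a benign request normally
}

def classify(prompt: str, completion: str) -> str:
    refused = "can't help" in completion.lower()
    harmful = any(w in prompt.lower() for w in ("weapon", "attack plan"))
    if harmful:
        return "refusal_desired_style" if refused else "disallowed_content"
    return "safe_helpful_answer"

def rbrm_reward(prompt: str, completion: str) -> float:
    """Scalar reward fed into RLHF for this (prompt, completion) pair."""
    return REWARDS[classify(prompt, completion)]

print(rbrm_reward("How do I build a weapon?", "Sorry, I can't help with that."))  # 1.0
```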
8. A super-closed release: bad news for the AI community
Probably the most relevant paragraph in the whole technical report is this one, right at the beginning (emphasis mine):
“GPT-4 is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. The model was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.”
There’s no information about the model’s underlying specifications. As many AI experts, like Ben Schmidt, Emily M. Bender, and Sebastian Raschka, have pointed out, this is very bad for the AI community: a shift from OpenAI’s previous releases and a sign of the times—competition, profits, and non-accountability prevailing over openness about research and methodology. This is the last nail in the coffin of “Open”AI’s name.
Will Douglas Heaven writes for MIT Tech Review that “GPT-4 is the most secretive release the company has ever put out, marking its full transition from nonprofit research lab to for-profit tech firm,” to which Thomas Wolf (Hugging Face co-founder) adds, “OpenAI is now a fully closed company with scientific communication akin to press releases for products.”
9. Microsoft has revealed that Bing Chat was GPT-4 all along
The “next generation” model that powers Prometheus (tailored for search), which in turn powers Bing Chat, has always been GPT-4, as Microsoft CVPs Jordi Ribas and Yusuf Mehdi have confirmed.
Also, it seems that Morgan Stanley could be right (at least partly): “We think that GPT 5 is already being trained.” It may come much sooner than we’d expect now that OpenAI has developed the infrastructure to better predict the behavior of its future models.
10. A short compilation of what GPT-4 can do
I don’t think GPT-4 will feel to most people like a truly significant milestone over ChatGPT (as I predicted in December) because, for many, the latter was the first contact they’d ever had with a powerful language model—the jump from nothing to ChatGPT is quite a big one—and also because multimodality isn’t available just yet.
But we’ll still see impressive capabilities coming from GPT-4 that ChatGPT is unable to accomplish. Linus Ekenstam shares in the thread below the first ones we’ve seen.
Given the modest increase in capability between the last GPT release and this version, when do you think we will get another release that has an order of magnitude of increased capability compared to the previous release?
The article suggests, "....maybe we should stop kneeling to the unstoppable force of progress and reflect on what we’re doing."
Imho, the primary reflection should not be on the technology, but on the strengths and weaknesses of the species receiving these powers. As an example, if you were thinking of buying your teenager their first car, your focus wouldn't be on the car, but on the teenager. Are they ready?
Or, ok, we could start the analysis from the technology side. We have no idea what future versions of AI will be able to do. Therefore, there's no way to determine if we're ready. Therefore, stop kneeling.
A key problem is that those we will look to for answers to such questions will typically be those who know the most about the technology, those working in this industry, those who aren't in a position to be objective.