In 2014, the webcomic XKCD made a joke about how most people don’t understand which problems are hard to solve in computer science, using the example of identifying a bird from a picture. What made the joke funny was that regular people assumed this sort of thing should be easy, while experienced developers knew that a computer vision project like this might take a team of PhDs five years in a research lab to solve (if it was even possible).
OpenAI added vision to ChatGPT in September 2023, then made it available to developers a couple of months later at their first developer conference. Today, something that was generally accepted as “virtually impossible” ten years ago is a trivial task available to anyone for $20/month in ChatGPT (or a fifth of a penny through the API). This is one of those things ChatGPT can do that most people I talk to haven’t tried yet, so they’re unaware of how magical it is.
Models like GPT-4 Vision are referred to as “multimodal” models, or LMMs (Large Multimodal Models, as opposed to Large Language Models, LLMs), because they accept multiple modes of input. Claude 3 by Anthropic can also process images, and Google’s Gemini 1.5 handles both images and video. There are even open-source models becoming available, like LLaVA 1.6 (a fine-tuning of Meta’s open-source LLaMA) and Qwen-VL by Alibaba. These models can be used for classical machine learning use cases like object detection, image classification, and optical character recognition (OCR).
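If you haven’t tried this yet, here’s roughly what a vision request looks like through the OpenAI API. This is a minimal sketch using the Python SDK; the image URL is a placeholder, and the model name reflects the naming at the time of writing, so check the current docs before running it.

```
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision model name at the time of writing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What species of bird is in this photo?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/bird.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```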
Temporal Reasoning
Computer vision has long been a promising field, and many of these image processing tasks had working solutions long before OpenAI added vision to ChatGPT. The current generative AI boom can be traced back to AlexNet, a neural network trained on Nvidia GPUs by a team featuring Ilya Sutskever (OpenAI co-founder), which won an image recognition competition in 2012. What’s different about adding vision to a large language model is that for the first time we can ask questions about an image, and even make predictions about what will happen next.
```
GPT-4V: The images show a person in a motion capture suit in a studio with multiple cameras and sensors. The person is performing a push-up exercise. In the first image, the person is in the starting position with their arms extended and body in a straight line. In the second image, the person is lowering their body towards the ground. In the third image, the person is at the bottom of the push-up with their chest close to the ground. In the fourth image, the person is pushing their body back up to the starting position.
```
As the image above from “The Dawn of LMMs” shows, you can give it a handful of images and it can tell not just what’s in the images, but what is happening and what is likely to happen next (the person will do a push-up). This is a huge leap forward compared to traditional image recognition techniques, which could classify individual images and identify what was in them, but couldn’t easily link sequences of images together. This temporal reasoning ability, being able to understand what’s happening over time, opens up far more sophisticated applications than were previously possible.
As an example, we recently did a project at Brightpool where we had AI ‘watch’ hundreds of hours of Twitch streams of video game play, by splitting the videos into sequences of images and having GPT-4V commentate on what was happening in the game.
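The mechanics of that kind of pipeline are straightforward. Below is a minimal sketch of the general approach (not our actual production code): it pulls one frame every two seconds with OpenCV, base64-encodes the frames, and sends a short batch to the vision model for commentary. The file name, batch size, and prompt wording are all illustrative.

```
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI()

def extract_frames(video_path: str, every_n_seconds: int = 2) -> list[str]:
    """Pull one JPEG frame every N seconds and return them base64-encoded."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % int(fps * every_n_seconds) == 0:
            encoded, buffer = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    cap.release()
    return frames

def commentate(frames: list[str]) -> str:
    """Send a short sequence of frames and ask for play-by-play commentary."""
    content = [{"type": "text", "text": "These are consecutive frames from a gameplay stream. "
                                        "Describe what is happening and what is likely to happen next."}]
    for frame in frames:
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{frame}"}})
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content

frames = extract_frames("gameplay.mp4")[:8]  # keep batches small to stay within limits
print(commentate(frames))
```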
We ran into all the same problems that plague any AI project, like hallucinations and refusals to do the task, as well as occasional issues with quality. The key was developing a good suite of evaluation metrics: we manually reviewed a number of sequences and wrote our own commentary, and flagged specific moments in the gameplay that stood out as interesting. We could then check whether the AI flagged similar moments to ours by calculating a similarity score (in technical terms, the cosine similarity between embedding vectors).
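Cosine similarity just measures how close two pieces of text sit in embedding space, with values near 1.0 meaning very similar. Here’s a rough sketch of scoring a model’s commentary against a human-written reference; the example sentences and the choice of embedding model are illustrative, not what we used in the project.

```
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Turn a piece of text into an embedding vector."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

human = "The streamer clutches a 1v3 to win the round with a few seconds left."
model = "The player defeats three opponents alone just before the timer expires."
print(cosine_similarity(embed(human), embed(model)))  # closer to 1.0 means more similar
```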
Also like any AI project, cost and latency are real factors: although a call to GPT-4V only takes a few seconds and costs less than a penny, these costs add up when you’re processing hundreds of thousands of images. Processing a frame every two seconds means an hour of video needs 1,800 calls to GPT-4V, which costs around $4 per hour of footage, and isn’t much faster than watching the video yourself. There are ways to speed this up, and competition from Anthropic, Google, and open-source models like LLaVA 1.6 (shown below) will help bring down costs, so expect an explosion of use cases in the near future.
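The back-of-the-envelope math is worth doing before committing to a video project. The per-call cost below is a rough figure for illustration; actual pricing depends on image resolution and token counts, so check the current price list.

```
seconds_per_hour = 3600
frame_interval_s = 2              # one frame every two seconds
cost_per_call = 0.0022            # rough per-call estimate; check current API pricing

calls_per_hour = seconds_per_hour // frame_interval_s
print(calls_per_hour)                    # 1800 calls per hour of video
print(calls_per_hour * cost_per_call)    # ~$4 per hour of video
```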
The Five Principles of Prompting
The good news is that if you know how to prompt an LLM, you already know how to prompt an LMM. The same principles apply to vision models, and the same tricks work. To showcase this, I’ll work through the five principles of prompting I developed back when GPT-3 was state-of-the-art, which still work today with GPT-4V (and will continue to work in the future with GPT-5). The examples are taken from “The Dawn of LMMs” paper.
Give Direction
Just like with text models, multimodal models do better with clearer instructions. The way you do that with vision is by drawing on the images to highlight the parts you want to ask questions about. For example, if you want to reason about an object in the image, you can circle it or draw an arrow. This works for tables of data as well, where circling a column or row of data can help improve the accuracy of any questions you ask about it.
Note that this doesn’t have to be done manually: you can use a traditional object detection algorithm, which runs far faster than an LMM, to find the right coordinates for annotation. One common trick is drawing a grid on an image, then asking the LMM to specify which parts of the grid to interact with in the prompt. This technique has been shown to work well with browsing agents, which can then choose where on the grid to ‘click’ to take the next action.
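Drawing the grid is easy to automate. Here’s a minimal sketch using Pillow that overlays a labeled grid (A1, A2, …) on a screenshot so the LMM can reference cells in its answer; the grid size and file names are arbitrary choices for illustration.

```
from PIL import Image, ImageDraw  # pip install pillow

def draw_grid(path: str, rows: int = 4, cols: int = 4) -> Image.Image:
    """Overlay a labeled grid so the LMM can reference cells like 'B3' in its answer."""
    image = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(image)
    width, height = image.size
    cell_w, cell_h = width / cols, height / rows
    # Vertical and horizontal grid lines
    for col in range(1, cols):
        draw.line([(col * cell_w, 0), (col * cell_w, height)], fill="red", width=3)
    for row in range(1, rows):
        draw.line([(0, row * cell_h), (width, row * cell_h)], fill="red", width=3)
    # Cell labels: A1 in the top-left corner of each cell
    for row in range(rows):
        for col in range(cols):
            label = f"{chr(65 + row)}{col + 1}"
            draw.text((col * cell_w + 5, row * cell_h + 5), label, fill="red")
    return image

draw_grid("screenshot.png").save("screenshot_grid.png")
```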
Specify Format
As multimodal models are powered by LLMs, they have all the same capabilities in terms of structured outputs. One common use case is extracting valid JSON from an image, which can then be parsed and displayed in a web interface, or stored in a database. You can also specify other data structures, like ordered lists or YAML.
Typically, adding an example of the structure of the data you want in the prompt helps, as does specifying what should happen when the data for a field is not readable or available. Without covering these cases, you might run into hallucinations, or situations where the LMM “breaks character” and adds an extra disclaimer instead of responding in the right structure.
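As a sketch, a prompt like the following covers both the target structure and the fallback behavior. The receipt use case, field names, and image URL are hypothetical; adapt them to whatever you’re extracting.

```
from openai import OpenAI

client = OpenAI()

prompt = """Extract the following fields from this receipt image and respond with valid JSON only,
matching this structure exactly:

{"merchant": "string", "date": "YYYY-MM-DD", "total": 0.00, "currency": "string"}

If a field is not readable or not present, set it to null. Do not add any text outside the JSON."""

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```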
Provide Examples
One of the things papers always discuss when evaluating models is how they do with zero-shot prompts (no examples provided, just the task) versus one-shot or few-shot prompts. This is because examples make a huge difference to performance, and that’s no different with multimodal models. Happily, you can provide multiple text and image pairs in the prompt, so it’s possible to provide examples of how to do the task. This is demonstrated in the speedometer reading example below, where providing two examples of the task being done helps the model accurately read the speedometer.
In the days of completion models like GPT-3, it became standard practice to include examples of the task being done in the prompt, just before the task you needed done. When we migrated to chat models like GPT-3.5 and GPT-4, the API let you change the system prompt (equivalent to custom instructions in ChatGPT), so most people started putting their examples there. This is a mistake, however, because examples work much better when inserted into the actual message stream, as if you had already run the prompt and received these responses from the assistant. This is doubly important with image prompting, where the model might not know which prompt refers to which image if you send them all at the same time.
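In practice that means building the messages array with your examples as prior user/assistant turns, pairing each example image with the answer you want, before the real task at the end. Here’s a minimal sketch based on the speedometer example; the image URLs and answers are placeholders.

```
from openai import OpenAI

client = OpenAI()

def image_message(url: str, text: str) -> dict:
    """Build a user message containing a text question and an image."""
    return {"role": "user", "content": [
        {"type": "text", "text": text},
        {"type": "image_url", "image_url": {"url": url}},
    ]}

messages = [
    {"role": "system", "content": "You read analog speedometers and reply with the speed in mph only."},
    # Few-shot examples inserted as if the assistant had already answered them correctly
    image_message("https://example.com/speedometer_1.jpg", "What speed is shown?"),
    {"role": "assistant", "content": "38 mph"},
    image_message("https://example.com/speedometer_2.jpg", "What speed is shown?"),
    {"role": "assistant", "content": "72 mph"},
    # The actual task comes last
    image_message("https://example.com/speedometer_3.jpg", "What speed is shown?"),
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview", messages=messages, max_tokens=10
)
print(response.choices[0].message.content)
```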
Evaluate Quality
As is always the case when working with AI, sometimes you get inaccurate results or hallucinations. One surprising example is generating bounding boxes: asking GPT-4V to draw boxes around specific people or objects. It is trained to do this, but regularly gets confused when images are cluttered or boxes need to be close together. You can see this with Andrew Ng’s box on the right, which ends up too far to the right of center for his body.
It’s important to start collecting a list of images that exhibit common errors or edge cases for testing. Once you have a set of images and expected outputs, you can test new versions of the prompt and see how well they mitigate these issues. In addition, you can cherry-pick good answers where the model got something right and add them to your prompt as extra examples. With 50+ positive examples you will be able to run fine-tuning on open-source models, or on GPT-4 when that becomes available.
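Even a crude regression suite beats eyeballing outputs. Here’s a minimal sketch of the idea: each test case pairs an image with keywords a correct answer should mention, and you rerun the suite whenever you change the prompt. The image URLs, keywords, and pass criteria are purely illustrative; in practice you might score answers with the cosine-similarity approach shown earlier instead.

```
from openai import OpenAI

client = OpenAI()

# Hypothetical test cases: each pairs an image with keywords a good answer should contain
test_cases = [
    {"image_url": "https://example.com/frame_001.jpg", "must_mention": ["push-up", "motion capture"]},
    {"image_url": "https://example.com/frame_002.jpg", "must_mention": ["speedometer"]},
]

PROMPT = "Describe what is happening in this image in two sentences."

def run_prompt(image_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": PROMPT},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
        max_tokens=150,
    )
    return response.choices[0].message.content

for case in test_cases:
    answer = run_prompt(case["image_url"]).lower()
    passed = all(keyword in answer for keyword in case["must_mention"])
    print(f"{case['image_url']}: {'PASS' if passed else 'FAIL'}")
```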
Divide Labor
It’s rare that you can complete an entire complex task with a single prompt. Often you need to split the task into multiple smaller tasks and optimize each one before bringing them together. Even if you don’t split the prompt, it can help to instruct the model to think through the steps, which can lead to more accurate responses by giving it more time to ‘think’. In the example with the apples below, asking it to count row by row leads to more accurate results than simply counting.
Given that GPT-4V is expensive and slow relative to GPT-3.5 or open-source models, it can often be helpful to separate out the vision side of the task and then use regular LLMs for follow-on tasks using the results of the vision prompt. For example, when processing video, you can process every second image in a sequence by asking the vision model to describe what’s happening, then use those descriptions to do further analysis with cheaper and faster LLMs that only see the text.
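A minimal sketch of that division of labor: the vision model turns each frame into a short description, and a cheaper text-only model does the analysis over the combined descriptions. The frame URLs, prompts, and model names are placeholders; check current model availability and pricing before relying on them.

```
from openai import OpenAI

client = OpenAI()

def describe_frame(image_url: str) -> str:
    """Stage 1: the slower, pricier vision model turns each frame into text."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Describe what is happening in this frame in two sentences."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
        max_tokens=150,
    )
    return response.choices[0].message.content

def analyze_descriptions(descriptions: list[str]) -> str:
    """Stage 2: a cheaper, faster text-only model works over the descriptions."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
            "Here are frame-by-frame descriptions of a gameplay video:\n\n"
            + "\n".join(descriptions)
            + "\n\nList the three most interesting moments."}],
    )
    return response.choices[0].message.content

frame_urls = ["https://example.com/frame_001.jpg", "https://example.com/frame_002.jpg"]  # placeholders
descriptions = [describe_frame(url) for url in frame_urls]
print(analyze_descriptions(descriptions))
```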
Long Context Windows
GPT-4V can handle up to 128,000 tokens, which we found was enough for about 30 images alongside the rest of our prompt. Anthropic’s Claude can do 200,000 tokens, but is artificially limited to a maximum of five images at a time. With Google’s Gemini 1.5 able to handle up to 1 million tokens, as well as video (done internally by splitting it into one frame per second), you can insert whole movies, series, or books into the prompt and get out structured data.
https://x.com/DynamicWebPaige/status/1763834266696495243?s=20
Nobody knows all of the implications of this yet, but we know from Tesla’s work on self-driving cars that vision is the primary input that matters. Their application of the same self-driving stack to humanoid robots shows that a system that can reason over visual data opens up a world of possibilities for integrating AI into our everyday lives, in ways more useful and natural than typing to a chatbot. Vision data is also likely to be far more valuable than text data for companies that collect it: after all, a picture is worth a thousand words! Yann LeCun makes the point that a four-year-old child has seen 50 times more data than the biggest LLMs, given how much data passes along our optic nerve fibers. Whatever happens with LLMs, vision is likely to play a large role, and it’s worth learning how to interact with these models now.