Forcing AI to watch 100s of hours of video and tell you what you missed

July 17, 2024
Michael Taylor

A picture says a thousand words, and videos contain 24 picture frames per second. Video streaming platforms like YouTube and TikTok have invested billions in infrastructure. Those investments are especially paying off with Gen Z: 60 percent of all zoomers prefer YouTube to books as a learning tool, 40 percent search TikTok or Instagram for restaurant recommendations, while 28 percent use Twitch to watch professional streamers play. With an infinite supply of highly engaging video content on any topic or interest at our disposal, it’s no surprise that younger generations don’t read as much. Can you blame them?

While zoomers are the biggest consumers of video content by far, they also face a big obstacle: Most of the content is boring or irrelevant, and not easily searchable. Users can skim-read article headlines or CTRL+F to search by keyword, but videos don’t offer the same affordances. They have to either sit through hours of footage at normal speed, or skip and fast-forward and hope they don’t miss anything important. There are many social situations where a quick glance at some text is acceptable, but putting in headphones to watch a video is not.

With more of the world’s valuable information being locked away in videos, this is becoming a real problem that future generations of internet users will have to overcome. Or at least this was a problem, until we taught AI to “see.” To illustrate this point, let me tell you about a recent experiment I completed with a colleague: We forced AI to watch hundreds of hours of Twitch streams and generate written commentary based on the footage, so that we could identify what action happened in the games. With that commentary we could provide insights to the product team on how people play, and what they find entertaining about the game, as well as where they get frustrated.

In this post I’ll walk you through the same steps we used to process hundreds of hours of video, which you can replicate on any type of video you want to analyze. Perhaps you organize events and want to extract highlights to promote on social media, or you want to learn everything about a topic but can’t afford to sit through hours of lectures. Whatever your use case, you can do it today with some coding ability, and if you can’t code, it’s useful to get a sneak preview of what will be possible in the near future as LLMs natively support video inputs into prompts.

ChatGPT can ‘see’ now

For many years, developers have been trying to give computers the ability to see and comprehend images. This age-old problem is so notorious that the popular webcomic XKCD made a joke about it back in 2014, labeling the task of identifying a bird from a picture as “virtually impossible.”

Fast forward almost a decade to late 2023, when OpenAI announced GPT-4 Vision at its developer conference. The capability is now out in the world for anyone to use. For $20 a month you can upload pictures to ChatGPT and classify them, or build an application that does this through OpenAI’s GPT-4V API for roughly half a penny per image.

Source: XKCD.

Image labeling and classification aren’t new capabilities, and most machine learning engineers will tell you that these functionalities were available long before ChatGPT. However, something magical happens when you give a language model the ability to see: It can grasp the entire context of the scene, not just identify objects in the image. We can ask questions about an image, ask for clarification, and even have it guess what might happen next.

Take the below example from the paper “The Dawn of LMMs.” Detecting a person in the image is easy, but GPT-4V understands the whole sequence of images, and correctly predicts the person is likely to do a push-up. Here is GPT-4V’s full detailed description of what’s happening in the four images:

The images show a person in a motion capture suit in a studio with multiple cameras and sensors. The person is performing a push-up exercise. In the first image, the person is in the starting position with their arms extended and body in a straight line. In the second image, the person is lowering their body towards the ground. In the third image, the person is at the bottom of the push-up with their chest close to the ground. In the fourth image, the person is pushing their body back up to the starting position. 

Source: “The Dawn of LMMs.”

If AI can understand what’s happening in a brief sequence of images, it follows that it can learn to understand videos—which are much larger image sequences, after all—by taking a snapshot of the frames every few seconds. The first person I saw test this approach was Mervin Praison, who used AI to generate commentary while he played the video game League of Legends. He wrote code with instructions to automatically capture a screenshot every few seconds and pass those images to GPT-4V to get it to commentate on the gameplay in real time. 

Source: YouTube.

Inspired by Praison’s approach, I worked on a project in February 2024 to digest hundreds of hours of pre-recorded Twitch videos of different video games played by popular streamers. Splitting a day’s worth of coding between myself and another developer, we were able to write a script that downloaded the Twitch clips from YouTube and generated commentary on what was happening in each game. We achieved this by dividing each video into one-minute chunks, then further dividing those chunks into 30 images each (one every two seconds).
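Here’s a minimal sketch of what that chunking step looks like in Python, using OpenCV to grab one frame every two seconds and group the frames into one-minute chunks. The file name, intervals, and function names are illustrative rather than our exact script:

```python
import base64
import cv2  # pip install opencv-python

CHUNK_SECONDS = 60      # one-minute chunks
FRAME_INTERVAL = 2      # one frame every two seconds -> 30 frames per chunk

def video_to_chunks(path):
    """Split a video into chunks of base64-encoded JPEG frames."""
    video = cv2.VideoCapture(path)
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_between_samples = int(fps * FRAME_INTERVAL)
    frames_per_chunk = CHUNK_SECONDS // FRAME_INTERVAL

    chunks, current, index = [], [], 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % frames_between_samples == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            current.append(base64.b64encode(buffer).decode("utf-8"))
        if len(current) == frames_per_chunk:
            chunks.append(current)
            current = []
        index += 1
    video.release()
    if current:  # keep the final partial chunk
        chunks.append(current)
    return chunks

chunks = video_to_chunks("twitch_clip.mp4")  # hypothetical downloaded clip
print(f"{len(chunks)} chunks, {len(chunks[0])} frames in the first")
```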

Also in February 2024, Google added video support with its AI model, Gemini 1.5. Although the company billed its model as “next-generation,” we found that it actually uses a similar approach to ours. This confirmed that any multimodal model (a model that can see images), like Anthropic’s Claude 3 or the open-source LLaVA 1.6 (based on Meta’s LLaMA model, but adapted to see images), is capable of supporting video analysis. (The OpenAI Cookbook contains a useful script if you’re planning on doing something similar for your own applications, and you should look at Mervin Praison’s code for League of Legends commentary.)

Source: Author’s Screenshot.

As I explained in a recent talk I gave about my project, the possibilities of this development are far greater than watching people play video games. Imagine if AI could attend every professional conference in your industry for you. That way, you could get the most relevant insights without sitting through days of keynotes and panel discussions. Or, imagine building a traffic monitor that has cameras installed at key intersections and tells users what’s happening on the road in real time using AI. There are many potential applications—the point is that AI vision is how AI can learn from the real world, because it learns via the same kind of digestible visual content that humans do.

Yann LeCun, the chief AI scientist at Meta, estimates that a four-year-old child has seen more data through their optic nerve than the largest of large language models. It makes you wonder: How much better will AI get when it can record visual data first-hand instead of relying on the internet?

Even AI co-workers need a manager

Despite the promise of AI vision technology, the reality of working with AI today is that it’s notoriously unreliable. As we automate more of our work with AI, our role as humans becomes what Dan Shipper calls a “model manager”: someone who is responsible for communicating to AI and validating its work product. Here are some unreliable aspects that you’ll need to anticipate:

Failures at scale

AI models tend to “hallucinate” (confidently tell lies as if they’re facts), which is one of the problems the discipline of prompt engineering has arisen to mitigate. In our project we split each video into ‘chunks’ of one minute each, and then took a frame every two seconds, giving us 30 images per chunk. We found we could fit 30 images comfortably in a prompt, and ChatGPT would give us a reasonable description of what was happening in that scene. However, you don’t want to copy and paste hundreds of times into ChatGPT, which is where knowing how to code is useful: You can send all the chunks for processing to GPT-4 through the API (requesting responses through code, rather than the ChatGPT interface) and get things done far faster.
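Here’s a hedged sketch of what one of those API calls looks like with the openai Python client, passing a chunk’s 30 base64-encoded frames as image inputs. The prompt wording and the model name are placeholders, not our exact production setup:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_chunk(frames):
    """Ask the vision model to commentate on one minute of gameplay (30 frames)."""
    content = [{"type": "text",
                "text": "These frames were taken two seconds apart from one minute "
                        "of a Twitch gameplay stream. Describe what happens."}]
    content += [{"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame}",
                               "detail": "low"}}  # low detail keeps costs down
                for frame in frames]
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # the vision model available at the time
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content

# One minute of commentary per chunk
commentary = [describe_chunk(chunk) for chunk in chunks]
```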

When you’re processing thousands of requests you run into rate limits, which cap how many times you are allowed to prompt the model per minute, and you run into failed requests, where OpenAI’s service goes down temporarily. You can work some logic into your code to retry failed attempts, but these can compound into cascading failures, where you have so many retries happening that the rate limits are constantly being triggered, causing more failures.
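A common mitigation is to retry with exponential backoff and a little random jitter, so retries spread out instead of all hammering the API at once. This is a rough sketch that reuses the describe_chunk function from above; the delays are illustrative, and the error classes come from the openai client:

```python
import random
import time

from openai import APIError, RateLimitError

def with_retries(func, *args, max_attempts=5, base_delay=2.0):
    """Retry a flaky API call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except (RateLimitError, APIError):
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # Sleep 2s, 4s, 8s, ... plus up to 1s of random jitter
            time.sleep(base_delay * (2 ** attempt) + random.random())

commentary = [with_retries(describe_chunk, chunk) for chunk in chunks]
```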

Source: Author’s Screenshot

This is a problem that anyone using LLMs at scale will have encountered, and it involves making tradeoffs based on what your constraints are for the project: 

  • You can change your chunking strategy and take a frame of the video every four seconds instead of two, but then you risk missing some context. 
  • You can attempt to offload some of the workload to a competing model, like Claude Opus, but then your results between the two may not be consistent. 
  • You can split the requests into batches of 1,000 chunks processed at a time, and just wait longer for the job to complete. 

We went with the last option, inspired by OpenAI’s cookbook (a repository of scripts demonstrating how to solve common LLM problems) on this topic. Think of this as checkout lines in a supermarket: fewer lines means it takes longer for everyone to check out, but you don’t need as many workers on staff (the workers are OpenAI’s servers which process the requests, in this analogy).
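A rough sketch of that batching approach follows, using the async version of the openai client and a semaphore to cap how many requests are in flight at once. The batch size, concurrency limit, and prompt are illustrative, not the exact values we used:

```python
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(20)  # cap concurrent requests within a batch
BATCH_SIZE = 1000                  # chunks per batch; wait for each batch to finish

async def describe_chunk_async(frames):
    content = [{"type": "text", "text": "Describe what happens in these gameplay frames."}]
    content += [{"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{f}", "detail": "low"}}
                for f in frames]
    async with semaphore:
        response = await client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": content}],
            max_tokens=500,
        )
    return response.choices[0].message.content

async def process_all(chunks):
    results = []
    for start in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[start:start + BATCH_SIZE]
        # Like a limited number of checkout lines: each batch queues up and we
        # simply wait longer, rather than overwhelming the servers with retries.
        results += await asyncio.gather(*(describe_chunk_async(c) for c in batch))
    return results

commentary = asyncio.run(process_all(chunks))
```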

AI would prefer not to

Another issue we noticed is that even when you do get your responses back, quite often the model will simply refuse to do its job, saying, “I’m sorry, I cannot provide assistance with these requests.” This happens primarily due to the conditioning OpenAI and other vendors have done to improve the safety of their models, such as by training them to refuse to identify people in images. It can be tricky to catch these refusals automatically, because the LLM phrases them slightly differently each time. Often you need to manually review hundreds of responses to catch edge cases like this and work some protection into the prompt—for example, telling the LLM to return a special phrase you can look for when it can’t answer the question.

Source: Author’s Screenshot

The quick check for this is to have some code that counts how many responses start with “I’m sorry” or “I cannot assist,” or any other common refusals you spot. You can then try modifying your prompt to see if you can get the percentage of refusals down, for example by adding “It is very important to my career” to the end of the prompt (a strategy shown in a famous study to make responses more diligent and thorough). Once you have an evaluation metric for measuring refusals, you can even test other models and see if a competitor to OpenAI is less likely to refuse in certain scenarios. If you notice specific scenarios that cause the majority of refusals, for example identifying faces, it makes sense to create a set of video chunks to use as test cases, and specifically try to solve that problem with your prompt testing.
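The check itself can be a few lines of Python over the list of responses. The refusal phrases below are examples you would grow from manual review, and the special “can’t answer” phrase is hypothetical:

```python
# Phrases you'd extend over time as you manually review responses
REFUSAL_PREFIXES = (
    "i'm sorry",
    "i am sorry",
    "i cannot assist",
    "i can't assist",
    "no_commentary_possible",  # hypothetical special phrase we ask the model to return
)

def is_refusal(response):
    text = response.strip().lower()
    return text.startswith(REFUSAL_PREFIXES) or "cannot provide assistance" in text

refusals = [r for r in commentary if is_refusal(r)]
print(f"Refusal rate: {len(refusals) / len(commentary):.1%}")
```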

The cost of it all

Then comes the literal price of paying to use OpenAI’s API. Even though each individual image sent to GPT-4V can cost as little as half a penny, processing 10 hours of video at 30 frames per minute means 18,000 images, costing about $100. That’s assuming you don’t need to retry any failed requests or make additional calls to process the information after you get it. Most AI systems don’t have just one prompt; they have several that handle different tasks, with varying degrees of success depending on which model is being used. 
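The back-of-the-envelope math looks something like this, assuming roughly half a penny per image (the exact price depends on image detail settings and current API pricing):

```python
HOURS = 10
FRAMES_PER_MINUTE = 30
COST_PER_IMAGE = 0.005  # ~half a penny per image; approximate and subject to change

total_frames = HOURS * 60 * FRAMES_PER_MINUTE   # 18,000 frames
estimated_cost = total_frames * COST_PER_IMAGE  # ~$90 before retries and follow-up prompts
print(f"{total_frames:,} frames -> ~${estimated_cost:,.0f}")
```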

Source: Author’s Screenshot.

Much of the work in AI engineering is getting the use case working with a smart/slow/expensive model like GPT-4V, then testing what works so you can offload that work to a dumb/fast/cheap model like Anthropic’s Claude 3 Haiku (currently the best value multimodal model), or an open-source model like LLaVA 1.6. For this particular project we found the open-source models were promising, but they weren’t good enough yet and were actually slower than using OpenAI. The upside is that you have full control over an open-source model, and can fine-tune it on your data to do a better job and avoid problems like refusals, which are more common with mainstream LLM providers that have influential safety teams.

Labels, labels, and more labels

Finally, once you have your AI-generated commentary, what do you do with it? In order to make the information in the commentary actionable, you need to summarize what’s important to look at—that’s where labeling comes in. Thankfully this step is much cheaper: GPT-3.5 can handle adding descriptive labels to text, and it costs orders of magnitude less than GPT-4V. If you do this correctly, you can filter down to just the moments of the video where something interesting happens, like a combat sequence or the use of a specific strategy. However, although AI can come up with good labels, and can apply them fairly accurately and consistently, it’s not very good at knowing which labels are interesting.
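Here’s a hedged sketch of that labeling pass, sending each minute of commentary to a cheaper text-only model with a fixed label set. The label names and prompt are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# An illustrative label set; in practice you'd let the model propose labels first
LABELS = ["active_combat", "strategic_repositioning", "idle_or_menu", "player_frustration"]

def label_commentary(text):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # orders of magnitude cheaper than GPT-4V
        messages=[{"role": "user",
                   "content": f"Pick the single best label for this gameplay commentary "
                              f"from {LABELS}. Reply with the label only.\n\n{text}"}],
        max_tokens=10,
    )
    return response.choices[0].message.content.strip()

labels = [label_commentary(minute) for minute in commentary]
```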

Even though AI has saved you hundreds of hours by reviewing footage for you, it’s still a pain to have to review hundreds of labels to find which ones are interesting. You might think a potential solution would be to ask GPT-4 to give you back an “interesting score” for each label it finds, but LLMs are notoriously unreliable at numeric ratings, so this is still an unsolved problem.

Source: Author’s Screenshot.

It’s also fairly hard to determine whether the AI did a good job at predicting a label or not. A human needs to review parts of the footage and label it manually, then see if the AI gets the answers right. Once you know which labels should apply to which parts of a test video, you can try different prompting techniques or models to see how accurately they guess the labels. This type of eval benchmark works well for labels you know should apply to specific parts of the test video. But how do you check whether it’s coming up with sufficient labels in the first place?

For example, if a human labeled a particular action in a video game as “active_combat” and the AI marked it up as “active_combat_engagement,” it is more or less correct but won’t match the human answer exactly, so it cannot be automatically marked as correct without manual human review (which slows down testing). To solve this you can use vector embeddings to calculate the similarity between the two labels, and mark the prediction as correct if the similarity is above a certain threshold. 
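Here’s a sketch of that fuzzy matching step: embed both labels and count the prediction as correct if the cosine similarity clears a threshold. The embedding model and the 0.8 cutoff are assumptions you’d tune against your own test set:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    """Return the embedding vector for a label."""
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

def labels_match(human_label, ai_label, threshold=0.8):
    """Count the AI label as correct if it's semantically close to the human one."""
    a, b = embed(human_label), embed(ai_label)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine >= threshold

print(labels_match("active_combat", "active_combat_engagement"))  # likely True
```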

Source: Author’s Screenshots.

The fact that I can feed a video into an AI system and have it tell me what happened is amazing, but it takes a lot of work to build that system in a way that works reliably at scale. Prompt engineering work can be frustrating, since it usually entails figuring things out for the first time—but that’s also what keeps it interesting. The wonderful thing about working in AI is that the rapid pace of development ensures that anything that barely works today will suddenly be viable in about three to six months—that is, whenever the next version of ChatGPT or one of its competitors is released.

Watch everything that’s happening in your industry

The promise of AI vision is compelling: Never miss anything important that happens in your industry, because you have an army of AI agents watching everything for you! In order to fully realize that vision of the future, these models need to gain further reasoning ability so they can better understand the preferences and interests of their users. As is often true in AI, all of the hard engineering work we had to do to get this project working will soon be irrelevant, as Google and other providers are starting to support video inputs natively. We can also expect the quality of reasoning and understanding to improve as more powerful models are released, and processing costs go down.

Although most of the labels I got back from my own AI vision project were relatively shallow in terms of the insight they gathered, I did see flashes of brilliance. For example, in the game World of Tanks, the AI model I had coded noticed that players were all employing a similar tactic—my model called it “Strategic Repositioning”—wherein the players were using the terrain to their advantage, hiding behind hills and trees and popping up to shoot enemies, before retreating back to cover. You can’t get this sort of insight from traditional analytics or machine learning, because you need reasoning to connect the dots between multiple actions taken by the player. Knowing this is something players are doing could be valuable to a games publisher developing new features, a marketing team looking for moments of action to share on social media, or even a player trying to identify new strategies to improve their win rate.

Source: Author’s Screenshot.

This project also made me relatively comfortable about the future of human jobs. We still have a role to play even with AI doing most of the heavy lifting. Although in theory we could have hired a human to watch hundreds of hours of video footage, we never would have done this task at all without AI. It’s not as if the task was taken away from humans; it simply wouldn’t have been economically viable otherwise (and far too boring!). Being able to process hundreds of hours of footage for insight gave the client an upper hand in terms of the insights and knowledge they now have, at least until other teams adopt AI and do the same thing. Rather than replacing humans, AI will create an arms race where the stakes are higher given the productivity boost it provides, but we’re all still in the game.

Another thing that struck me in the course of this project was the importance of competition and open source in AI, because relying solely on the OpenAI API was frustrating, particularly when we were hit with low rate limits and high costs. When I built the proof of concept, my OpenAI account was limited to 100 API calls per day, and it was only after I spent $1,000 that I was approved to make the thousands of API requests I needed to complete the project. When I started this project, GPT-4V was the only multimodal model with enough brains to handle the task, but today you can switch effortlessly between OpenAI and Anthropic, or even an open-source alternative like LLaVA, which runs for free on my M2 Mac through LM Studio and actually gave fairly good results! 

Source: Author’s Screenshot.

Unlocking all of the information contained in video and making it searchable helps with the discovery problem of knowing which video to watch with your limited time and attention. AI’s ability to seamlessly translate information from one format to another, from images or audio to text and back again, enables every generation to access the best information in their preferred medium, regardless of how it was originally recorded. It’s remarkable to me how something that only came out in late 2023 is now available for free on consumer-level hardware in 2024.

AI vision wasn’t possible in 2014, and it took billions of dollars of research to get where we are today. However, only when open-source LLMs catch up to GPT-4 levels of intelligence will profitable small businesses be built on top of AI. This has already occurred in the image-generation space with indie products like PhotoAI and ProfilePicture.ai, but LLMs have yet to have their “stable diffusion moment”: the point when an open-source model is released that is as good as the state-of-the-art model. Whether that model comes from LLaMA 3, the next Mistral release, or another vendor, when it happens for AI vision is when this long-awaited capability will truly begin to change the world.
