AI engineering is a new discipline, and everything is constantly changing. I built a simple script to automate some of my work with GPT-3 this time last year, and today I got an email telling me that model is being deprecated! In times of rapid change, it pays to zoom out and look for a set of principles, things that don’t change, because “you can build a business strategy around the things that are stable in time”, as Jeff Bezos says. This was the approach I took to creating my Udemy course and writing my O’Reilly book, because how else was I going to publish something on AI that didn’t go immediately out of date?
My work is in prompt engineering, but from my vantage point I can see a stable pathway coalescing for all of my clients, most of whom are “AI wrapper” companies – integrating AI in their products by calling the OpenAI API. GPT-4 is the best-in-class model, and you can get amazing results in some use cases with very little effort. However, this is a precarious initial position, because you have no moat: anyone can reverse-engineer your prompt and clone your app once you’ve done the hard work of customer discovery. That’s if ChatGPT doesn’t simply add your product as a feature. If you’re playing this game, how do you level up?
I put this visualization together to show the steps people take to go from prompt engineering to building an actual moat as an AI-powered business. I’ll talk through the tradeoffs at each stage, so you can use this as a strategic guide for your own product’s roadmap. From the easy ground of prompting there’s the dangerous hurdle of defining eval metrics, followed by slow but steady progress up through many-shot examples, retrieval augmented generation (RAG), and eventually to chains (or the mystery box of autonomous agents!). As you move up the value chain the quality increases, but so does the cost, until you fine-tune your own model.
Prompts
Every AI company starts as a prompt, and that initial discovery usually happens with the leading foundation model at the time – at present that’s GPT-4. Most of your prompting is likely for throwaway tasks you won’t do again, but occasionally you save a template when you find something valuable and repeatable that the AI does surprisingly well.
Usually I recommend companies productize this in a Google Sheets template, with a custom function to call the GPT-4 API. It lets you iterate faster and work out the kinks before you jump to code and incur the cost of building software. If you find your team using it consistently and getting good results, then it might make sense to turn it into a product, but don’t skip this step.
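When you do make the jump to code, the saved template is only a few lines. Here’s a minimal sketch in Python, assuming the official openai client (v1+) and an OPENAI_API_KEY in your environment – the template text and function names are made-up examples, not a prescription:

```python
# Minimal sketch: a saved prompt template as a reusable function.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A hypothetical template you found valuable and repeatable.
TEMPLATE = "Write a product description for {product}, aimed at {audience}."

def run_template(product: str, audience: str) -> str:
    prompt = TEMPLATE.format(product=product, audience=audience)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(run_template("a standing desk", "remote workers"))
```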
Evals
Once you go from prompting locally to running a prompt in production, you need a set of evals – metrics for measuring accuracy and performance. This usually takes the form of a set of questions to which you know the answers, called a ‘test set’, that you can use to grade the responses. It also helps to set up a system for ‘blind rating’ responses (giving thumbs up/down).
Evals are hard and time consuming, because the best ratings are done by humans, but humans are expensive and unreliable at scale. Of course, you can use AI to automate this task too, and there is evidence GPT-4 is close to human-level at evals. The ideal situation is finding an automated metric, like Levenshtein distance, that you can calculate programmatically at no cost.
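To make that concrete, here’s a minimal sketch of an automated eval in Python: grading responses against a test set of known answers using normalized Levenshtein similarity. The test question and ask_model helper are hypothetical placeholders, and it assumes the official openai client:

```python
from openai import OpenAI

client = OpenAI()

def ask_model(question: str) -> str:
    # Stand-in for your production prompt.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    # 1.0 = identical strings, 0.0 = nothing in common.
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

TEST_SET = [
    {"question": "What is the capital of France? Answer in one word.", "expected": "Paris"},
]

for case in TEST_SET:
    answer = ask_model(case["question"]).strip()
    print(f"{similarity(answer, case['expected']):.2f}  {case['question']}")
```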
Many-Shot
It’s likely you already included examples of how to do the task in your prompt, because going from zero-shot (no examples) to few-shot (2-5 examples) or many-shot (5+) boosts performance significantly. However, providing more examples sometimes comes at the cost of less creativity – the AI leans too much on the examples – so it must be tested against your evals.
Now that you have a clear set of evals, you have a way of surfacing your best responses to use as examples in the prompt for the AI to follow. By filtering for your worst responses, you can also identify edge cases and correct them, giving you more diverse positive examples to include, which can solve the lack-of-creativity issue.
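In code, this usually means replaying those top-rated pairs as prior turns in the conversation, so the model imitates them. A minimal sketch, where the hard-coded examples stand in for the ones your evals surfaced:

```python
from openai import OpenAI

client = OpenAI()

# In practice, pull these from the responses your evals rated highest.
EXAMPLES = [
    ("Summarize: The meeting ran long and nothing was decided.",
     "Long meeting, no decisions made."),
    ("Summarize: Sales grew 40% after we launched the referral program.",
     "Referral program launch drove 40% sales growth."),
]

def few_shot(task: str) -> str:
    messages = [{"role": "system", "content": "Summarize text in one short sentence."}]
    for user_text, assistant_text in EXAMPLES:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": task})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

print(few_shot("Summarize: The server crashed twice during the demo."))
```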
RAG
It’s hard to find an AI product solving real-world use cases that doesn’t employ some form of RAG, or Retrieval Augmented Generation. This typically involves using a tool like a vector database or a search engine to ‘look up’ the relevant information and inject it as context into the prompt, increasing the accuracy of the response and reducing the risk of hallucination.
Vector databases are particularly powerful, because the search can return documents that are similar but not exact matches to the task at hand. This is done through vector embeddings: lists of numbers, produced by the AI model, that represent a location in a high-dimensional space, where text or images that are similar will be ‘close by’. This is what powers all of the ‘read my files’ type AI product experiences.
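Here’s a minimal RAG sketch in Python: embed a handful of documents, find the one closest to the query by cosine similarity, and inject it as context. The documents and embedding model name are placeholder assumptions, and a real product would swap the in-memory list for a vector database:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical knowledge base; in production this lives in a vector DB.
DOCS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available by email, Monday to Friday, 9am-5pm.",
]

def embed(text: str) -> list[float]:
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm

doc_vectors = [(doc, embed(doc)) for doc in DOCS]

def answer(question: str) -> str:
    q_vec = embed(question)
    context = max(doc_vectors, key=lambda pair: cosine(q_vec, pair[1]))[0]
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

print(answer("How long do I have to return an item?"))
```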
Agents
Why take turns prompting and getting a response, when you could run AI in a loop and have it prompt itself? Having an AI agent plan, execute using tools, then evaluate its own work shows great promise and should help us accomplish more complex tasks. The agent space has received a lot of hype; however, as any operator in the space will admit, we’re not quite there yet.
Agents are relatively unreliable and often get stuck in a loop. Mistakes compound, so they can quickly cost you a lot of tokens without accomplishing anything. There’s also a need for more sophisticated evals and monitoring tools if we’re going to give AI access to run code on our computers or execute tasks in the real world on our behalf.
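To get a feel for the mechanics (and the failure mode), here’s a toy agent loop in Python. The one-line text protocol and calculator tool are assumptions purely for illustration – real agents use the API’s structured tool calling – and the step cap exists precisely because mistakes compound:

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = ("Solve the task. Reply with exactly one line: either "
          "'CALC: <python arithmetic expression>' to use the calculator, "
          "or 'FINAL: <answer>' when you are done.")

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": task}]
    for _ in range(max_steps):  # cap steps so a stuck loop can't burn tokens forever
        reply = client.chat.completions.create(
            model="gpt-4", messages=messages
        ).choices[0].message.content.strip()
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("CALC:"):
            try:
                # eval() is acceptable only in a toy; never do this in production.
                result = str(eval(reply[len("CALC:"):], {"__builtins__": {}}))
            except Exception as err:
                result = f"error: {err}"
            messages.append({"role": "user", "content": f"Calculator result: {result}"})
    return "Gave up: step limit reached."

print(run_agent("What is 17% of 2,340?"))
```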
Chains
The workhorse of AI applications, chaining is when the output of one prompt becomes the input of the next prompt, until the task is done. Dividing labor in this way makes evals much easier (you only have to eval one thing at a time). The advantage over agents is that you can specify and optimize each step to increase quality and reliability of the final output.
Chains can become parts of other chains, running in sequence or asynchronously in the background. Applications of arbitrary complexity can be built in this fashion, with each step open to inspection and monitoring, so that you can identify and fix problems. Often a mix of models can be used, as not every step needs to be done by the smartest (most expensive) model.
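A minimal two-step chain sketch: the output of the first prompt becomes the input of the second, and the easier step is handed to a cheaper model. The prompts and model split are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def write_post(topic: str) -> str:
    # Step 1: the smartest (most expensive) model does the hard creative work...
    outline = ask("gpt-4", f"Write a 3-point outline for a blog post on: {topic}")
    # Step 2: ...and a cheaper model expands it. Each step can be evaled on its own.
    return ask("gpt-3.5-turbo", f"Expand this outline into a short post:\n{outline}")

print(write_post("why every AI startup needs evals"))
```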
Fine-Tuning
This is where every AI startup with aspirations of VC funding wants to start, but it actually comes towards the end of the game. You can’t skip a step, because you need the usage data and feedback from evals to achieve good results from fine-tuning. It’s also a waste to jump right to custom models when you can get better results from prompting all the way up to 2,000 datapoints.
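When you do get there, the mechanics are simple compared to collecting the data. A minimal sketch, assuming the OpenAI fine-tuning API: convert your best-rated pairs into the chat-format JSONL it expects, upload the file, and start a job. The training pair here is made up – in practice you’d export these from your eval-filtered production logs:

```python
import json
from openai import OpenAI

client = OpenAI()

BEST_PAIRS = [  # hypothetical; export these from your eval pipeline
    ("Summarize: The demo went well and the client signed.",
     "Successful demo; client signed."),
]

# Write the chat-format JSONL the fine-tuning endpoint expects.
with open("train.jsonl", "w") as f:
    for user_text, assistant_text in BEST_PAIRS:
        f.write(json.dumps({"messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant_text},
        ]}) + "\n")

upload = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)
```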
Don’t fool yourself into thinking you can build a foundation model from scratch. You’d need experienced AI engineers, who cost $5-$10m per year each (and you’d still need to convince Nvidia to sell you an allocation of GPUs). Unless you have the resources to go up against Microsoft and Google, or you have this skill set in your founding team, stick to fine-tuning.