Building your own AI assistant like Iron Man

April 4, 2023
Michael Taylor

Most people assume that new inventions occur as “Eureka!” moments had by scientists in lab coats, which are then commercialized and marketed to the public.

In reality, a remarkable array of new inventions first appeared in the minds of science fiction writers, who possessed little or no scientific expertise.

Behind every breakthrough technology you’ll find a science fiction fan earnestly trying to bring the stories from their childhood to life. 

For example:

  • Jules Verne, “the father of science fiction”, is credited with inspiring Simon Lake to invent the submarine and Igor Sikorsky to invent the helicopter.
  • Douglas Adams’ Hitchhiker’s Guide to the Galaxy inspired Babelfish, a precursor to Google Translate, as well as IBM’s supercomputer Deep Blue.
  • The dystopian novel Snow Crash is responsible for Google Earth and the ‘Metaverse’ that Meta (née Facebook) has funneled billions of dollars into.

Many of the products we know and love, from the iPhone to Uber, were quite literally memed into existence.

Of all the sci-fi tropes, the one gathering the most steam in recent years is the “AI assistant”, whose job is to help its owner with any sort of task, such as hacking, monitoring, analyzing, and searching.

There are many examples in popular culture – HAL in 2001: A Space Odyssey, Joi in Blade Runner 2049, Cortana in Halo – but the first one most people think of today is JARVIS, the benevolent AI assistant of Tony Stark, a.k.a. Iron Man.

Image: How studying AI ruins Marvel movies for you: Tony Stark’s J.A.R.V.I.S. (source: https://www.linkedin.com/pulse/how-studying-ai-ruins-marvel-movies-you-tony-starks-jarvis-feroli/)

This always felt laughably far in the future, until OpenAI released GPT-3 in June 2020 and I got to play around with the beta version.

However, it wasn’t until AI exploded into mainstream usage, with a million people using ChatGPT within a week of its release in November 2022, that I noticed regular people starting to get interested and ask me about AI.

Image: The benefits of ChatGPT (source: https://www.softwebsolutions.com/resources/everything-you-want-to-know-about-chatgpt.html)

Recently, as I was doing some prompt engineering, playing around with Langchain (a tool for orchestrating AI tasks) and Pinecone DB (a vector database AI can use as memory), I realized that all of the pieces needed to build your own AI assistant like JARVIS were in place.

*** I've actually released a prompt engineering course on Udemy with my business partner James Phoenix, so you can get stuck into this right away***

Not only that, but with the open-source Stable Diffusion image model, and Meta’s “leaked” ChatGPT competitor LLaMA, it might actually be possible to get the whole thing running locally on your computer for free (if you have an M1 / M2 Mac or PC with a GPU).

I subscribe to the Mark Watney school of decision making, so I figure why not try to build my own and see where I get to? I’m sharing this as a template for anyone else who wants to be Iron Man, in the hopes that we can share tips on how to improve it.

Why Build Your Own AI Assistant?

Why build your own? Well, your AI assistant deals with lots of sensitive information that you probably don’t want to share with a third party, and AI providers’ content moderation policies are evolving rapidly as they get dunked on by the popular press. You don’t want to become reliant on AI for your work and life, only for it to be hacked or censored.

It’s also just nice to have control over the user interface and functionality, so you can adapt your AI assistant to whatever use cases you need, without worrying about when OpenAI is going to change things or let you off the waitlist. I’ve already had content moderation issues personally: I was banned for 10 days because of a cartoon mouse.

https://twitter.com/hammer_mt/status/1580467310326091776?s=20

What’s the Architecture of Our AI Agent?

All of the pieces are already available to build something damn near the AI assistant portrayed in Iron Man and wider popular culture. To be clear, you’re not required to “train” your own model or understand AI: all you need is a bit of prompt engineering and some code to stitch these various packages together.

To be useful, we need to give our AI agent a functioning long-term memory, plus the ability to “see” images, “hear” audio, and “speak” back to us. We also need a guiding thought → action → observation loop, as well as the ability to equip the AI with tools to use. It might sound like science fiction, but all of this is already possible with off-the-shelf, open-source tools available today.

Chat UI

We’ll start with the chat interface, which is easy because it’s basically a glorified text form. Chatbot UI has already cloned the ChatGPT experience using Next.js, TypeScript, and Tailwind CSS, frameworks I’m already familiar with.

Chatbot UI
https://github.com/mckaywrigley/chatbot-ui
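
To show just how much of a “glorified text form” this is, here’s a minimal sketch of the kind of backend endpoint a chat UI posts messages to. This isn’t Chatbot UI’s actual code (that’s a Next.js/TypeScript app); it’s an illustrative Python version using FastAPI and the pre-1.0 openai client, with gpt-3.5-turbo as an arbitrary model choice.

```python
# Minimal chat backend sketch (not Chatbot UI's actual code): the UI POSTs the
# running message list and renders whatever text comes back.
# Assumes the pre-1.0 `openai` client and an OPENAI_API_KEY environment variable.
from fastapi import FastAPI
from pydantic import BaseModel
import openai

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list[dict]  # e.g. [{"role": "user", "content": "Hello JARVIS"}]

@app.post("/chat")
def chat(request: ChatRequest):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=request.messages,
    )
    return {"reply": response["choices"][0]["message"]["content"]}
```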

Reason and Act

The thought → action → observation loop is also surprisingly easy to do, using another open-source library called Langchain. They implement something called “Reason and Act”, or “ReAct”, which is essentially just a clever prompt for getting the AI to think through actions before taking them. This is also the key to tool use, as the actions can be programmed as a kind of app store for AI, or “Plugins” as OpenAI calls them, of which Langchain has an open-source version.

https://openai.com/blog/chatgpt-plugins
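
Here’s roughly what that loop looks like in code: a minimal sketch using Langchain’s off-the-shelf ReAct agent. The API was current as of early 2023 and moves fast, so check their docs; the SerpAPI search tool and calculator are just stand-ins for whatever tools you’d plug in, and you’d need OPENAI_API_KEY and SERPAPI_API_KEY set.

```python
# Minimal ReAct sketch with Langchain (early-2023 API; imports may have moved).
# The search + calculator tools are stand-ins for whatever "plugins" you add.
from langchain.agents import initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)  # needs OPENAI_API_KEY set
tools = load_tools(["serpapi", "llm-math"], llm=llm)  # needs SERPAPI_API_KEY

agent = initialize_agent(
    tools,
    llm,
    agent="zero-shot-react-description",  # the ReAct-style prompt
    verbose=True,  # prints each thought -> action -> observation step
)

agent.run("Who directed Blade Runner 2049, and how old were they in 2023?")
```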

Long Term Memory

For memory there are two types: short-term memory, which is just the context window of the chat, and long-term memory, in the form of a vector database. Vector databases let us look up documents based on similarity, so for example “a mouse ate some food” would match a search for “cheese”, because mice and cheese appear together often, even though the word we searched for isn’t in the document. LlamaIndex (formerly GPT Index) helps us with this task, paired with Langchain to do the coordination and retrieval.

Image: Semantic vector search (source: https://blog.griddynamics.com/semantic-vector-search-the-new-frontier-in-product-discovery/)
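
To make that concrete, here’s a minimal sketch of semantic lookup using Langchain with a local FAISS index standing in for Pinecone (same idea, no external service). It assumes faiss-cpu is installed and an OpenAI key is set for the embeddings; the example notes are made up.

```python
# Minimal long-term memory sketch: embed documents, then retrieve by meaning.
# FAISS stands in for Pinecone here -- same idea, but it runs locally.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

notes = [
    "A mouse ate some food in the kitchen last night.",
    "The quarterly marketing report is due on Friday.",
    "Remember to renew the car insurance in May.",
]

store = FAISS.from_texts(notes, OpenAIEmbeddings())  # needs OPENAI_API_KEY

# "cheese" appears in none of the notes, but the mouse note is semantically
# closest, so that's what comes back.
results = store.similarity_search("cheese", k=1)
print(results[0].page_content)
```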

Vision, Audio, Speech

How do we teach our AI to “see”, “hear”, and “speak”? Again, we have readily available solutions for all of these seemingly impossible tasks. CLIP is an open-source model used by the image generators DALL-E and Stable Diffusion; it scores how well text matches an image, and tools built on top of it can work backwards from any image to a plausible caption. Whisper is an open-source audio transcription library released by OpenAI. Finally, speech is handled by “Sovits”, short for “Soft-VC and VITS”, which was behind a pretty impressive Kanye West demo.

https://twitter.com/LinusEkenstam/status/1639937115147444224?t=SiN6eR5ePhu6B3s4jScGhQ&s=19
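
As a taste of the “hear” part, here’s a minimal transcription sketch with Whisper (installed via pip install openai-whisper, with ffmpeg on your PATH); voice_note.mp3 is just a placeholder filename. CLIP and Sovits slot into the agent in the same way for the “see” and “speak” parts.

```python
# Minimal "hearing" sketch with OpenAI's open-source Whisper library.
# Install with `pip install openai-whisper` and make sure ffmpeg is available.
import whisper

model = whisper.load_model("base")  # small enough to run on a laptop CPU
result = model.transcribe("voice_note.mp3")  # placeholder audio file
print(result["text"])
```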

Model Inference

We could build all of this infrastructure and get back an impressive amount of flexibility and control over our AI assistant. We can easily replace DALL-E’s image generation capability with open-source Stable Diffusion, and honestly we’d want to anyway because it’s far better. However, we’d still be reliant upon OpenAI for the actual AI itself; we’d have to call GPT-4 for every chat message. That means we’re still stuck at the mercy of their content moderation policy, and still sending sensitive data to their servers.

However, there’s one final development that brought a fully open-source AI stack within reach: somebody leaked Meta’s LLaMA model! Let’s be clear, it’s not as good as ChatGPT, and definitely not within sight of GPT-4, but it’s getting better quickly. Shortly after the leak, a group at Stanford fine-tuned LLaMA on data generated by OpenAI’s GPT-3 for only $600 and released “Alpaca”, which gets tantalizingly close to the performance you get from ChatGPT. Even better, there’s an open-source project, “alpaca.cpp”, that lets you run it on your local devices for free.

Image: LLaMA 7B running locally on a MacBook Air via llama.cpp (source: https://arstechnica.com/information-technology/2023/03/you-can-now-run-a-gpt-3-level-ai-model-on-your-laptop-phone-and-raspberry-pi/)
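
Once you have a local model built, wiring it into the rest of the stack is just another function call. Here’s a rough sketch of wrapping the alpaca.cpp binary from Python; the binary name, model filename, and command-line flags below are assumptions based on the llama.cpp family of tools, so check the alpaca.cpp README for the exact build and run steps on your machine.

```python
# Rough sketch: wrap a locally built alpaca.cpp binary so the agent can call
# the local model like any other function. The binary name, model file, and
# flags are assumptions -- check the alpaca.cpp README for the real invocation.
import subprocess

def ask_local_model(prompt: str) -> str:
    result = subprocess.run(
        ["./chat", "-m", "ggml-alpaca-7b-q4.bin", "-p", prompt],
        capture_output=True,
        text=True,
        timeout=120,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(ask_local_model("Summarize today's to-do list in one sentence."))
```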

Building the AI Agent

Honestly, I haven’t gotten this far yet; this is just an idea that came to me and that I thought would be interesting to work on. I have built most of these things in isolation, but it didn’t occur to me to stitch them all together until recently.

As I get parts of this done I’ll update this part of the post so you can follow along and build your own. I'll open-source the whole thing so that you can clone it and deploy your own without understanding the underlying code.

The space is changing so much right now that I know for sure my technology and tool choices, as well as the prompts I use, will quickly become out of date. If you spot anything that could be improved, please let me know.