ChatGPT doesn't respect word count

September 7, 2023
Michael Taylor

The most common mistake I see in prompts for writing content with ChatGPT is specifying the word count. Not to pick on anybody, but here’s an example from Content Bot, which ranks number 1 for “blog writing prompts chatgpt”:

```Create a blog post about “{blogPostTopic}”. Write it in a “{tone}” tone. Use transition words. Use active voice. Write over 1000 words. Use very creative titles for the blog post. Add a title for each section. Ensure there are a minimum of 9 sections. Each section should have a minimum of two paragraphs. Include the following keywords: “{keywords}”. Create a good slug for this post and a meta description with a maximum of 100 words and add it to the end of the blog post.```

This doesn’t work because LLMs are surprisingly bad at math, so they can’t actually count characters or words. From experience they do get the general ‘vibe’ right, as in a 1000 word request tends to produce a much longer response than a 500 word request, but I’ve seen in my own work with AI that ChatGPT consistently fails to deliver the right number of words.

Results

I’m putting the punchline up front, for those of you who just want to see the stats and move on. For those that are interested in how I tested this, you can keep going after reading this section for more detail.

Across 30 observations per prompt (3 test cases x 10 runs each), we can see that ChatGPT is always off on word count by a significant amount.

  • When asked for 10 words, it gets overexcited and writes about 10x too much.
  • With short-form blog content, it overshoots by 10-30%.
  • For long-form content, it consistently falls short, struggling to make it past 600-700 words.

If setting a word count isn’t the right way to work with ChatGPT to write content, then what is? I find better results asking for a number of paragraphs or lines, something that is easier to get directionally right, with less penalty for being wrong. In my AI writing system I ask it to write a section at a time, sticking to two paragraphs, which it reliably does a good job at. Dividing Labor is one of my Five Principles of Prompting, and it’s the one that non-technical prompters are most likely to get wrong – they try to do everything in one long prompt, instead of breaking the task up into multiple steps then chaining them together.
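To make the section-at-a-time idea concrete, here is a minimal sketch of chaining one prompt per section. It is not my actual AI writing system: `write_post` and the `generate` callable are hypothetical names, and `generate` stands in for whatever function sends a prompt to your model and returns its text.

```python
from typing import Callable, List


def write_post(
    generate: Callable[[str], str],  # hypothetical: sends a prompt, returns the model's text
    topic: str,
    section_titles: List[str],
    paragraphs_per_section: int = 2,
) -> str:
    """Build a post one section at a time instead of in one long prompt."""
    sections = []
    for title in section_titles:
        # Ask for a number of paragraphs, not a number of words.
        prompt = (
            f"Write {paragraphs_per_section} paragraphs for the section "
            f'titled "{title}" of a blog post about "{topic}".'
        )
        sections.append(f"{title}\n\n{generate(prompt)}")
    # Chain the individually generated sections into one document.
    return "\n\n".join(sections)
```

Each call only has to get two paragraphs right, so the total length is controlled by how many sections you ask for rather than by a word count the model can't reliably hit.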

If you are technical, another approach might be to programmatically check the word count and, if the response falls short or runs too long, recursively reprompt with instructions to expand or summarize until it’s within the desired bounds. Either way, monitoring how often your system goes above or below the word count is an important way to spot anomalies and improve the quality of your results and user experience.
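The check-and-reprompt approach could look something like the sketch below. The names `fit_to_word_count` and `generate` are hypothetical, the 10% tolerance is an arbitrary assumption, and I've used a bounded loop rather than true recursion so a stubborn model can't run up your bill forever.

```python
from typing import Callable


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def fit_to_word_count(
    generate: Callable[[str], str],  # hypothetical: sends a prompt, returns the model's text
    prompt: str,
    target: int,
    tolerance: float = 0.1,  # assumed: accept anything within 10% of the target
    max_attempts: int = 5,
) -> str:
    """Reprompt until the draft is within tolerance of target, or attempts run out."""
    low, high = target * (1 - tolerance), target * (1 + tolerance)
    draft = generate(prompt)
    for _ in range(max_attempts):
        words = count_words(draft)
        if low <= words <= high:
            break
        # Too short: ask to expand. Too long: ask to shorten.
        action = "Expand" if words < low else "Shorten"
        draft = generate(
            f"{action} the following text to roughly {target} words, "
            f"keeping the same meaning:\n\n{draft}"
        )
    return draft
```

If the loop hits `max_attempts` it returns the last draft as-is, so it's worth logging how often that happens as part of the monitoring mentioned above.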

Methodology

Because I just finished open-sourcing a simple prompt testing library, `thumb`, I decided this would be a great way to give it a spin, and prove once and for all that dictating a word count to ChatGPT is a noob move. 

I spun up a test with thumb in a Jupyter Notebook, which was as easy as this:

```
import thumb

prompt_a = "write 10 words on {topic}"
prompt_b = "write 300 words on {topic}"
prompt_c = "write 500 words on {topic}"
prompt_d = "write 1000 words on {topic}"
prompt_e = "write 2000 words on {topic}"

cases = [{"topic": "memetics"}, {"topic": "value-based pricing"}, {"topic": "the skyscraper technique"}]
test = thumb.test([prompt_a, prompt_b, prompt_c, prompt_d, prompt_e], cases, 10)
```

Under the hood I’m using LangChain to call ChatGPT, which gives me access to tracing tokens and retry logic right out of the box. Note: token limits by default aren’t set in LangChain chat models as far as I’m aware, but if anyone knows different let me know as it would affect the test.

This test generated 5 x 3 x 10 = 150 calls to OpenAI’s `gpt-3.5-turbo` model, because it’s 5 prompt templates multiplied by 3 test cases (different topics to write about), and 10 runs for each combination. The total cost was about a dollar, and it ran for about 1 hour and 20 mins.

It’s important to run the same prompt and test case combination multiple times because the results from ChatGPT are non-deterministic. You might get lucky and get a rare good response when testing, or unlucky and get a few duds in a row. Only by running it 10 times across 3 scenarios or test cases will you start to get a fair comparison between prompts. I’ve written about prompt optimization before, and I’ve used a similar approach in testing AI writing styles.

The default in the `thumb` library would be to rate the prompts with thumbs up / down (something I call ‘thumb testing’), but I didn’t need that part because I didn’t care about the quality of the results, just the word count. So I aborted by interrupting the kernel, and exported the responses to CSV with `test.export_to_csv("wordcount.csv")`.

That gave me the following spreadsheet, to which I added a word count and calculated the difference between the words requested and how many were delivered. For that I used Pandas, a Python library for spreadsheet-style operations, but you can just as easily do it directly in Excel or GSheets. Counting the words was done by splitting the Content (response from ChatGPT) by the space character and counting the resulting pieces.
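For anyone who wants to reproduce the calculation, here is a rough sketch of the Pandas steps. The column names `requested` and `content` are assumptions rather than the exact headers in the export, and the two rows are toy placeholder text, not real responses from the test.

```python
import pandas as pd

# Toy stand-in for the exported CSV; in practice you would load it with
# pd.read_csv("wordcount.csv") and adapt the column names to match.
df = pd.DataFrame(
    {
        "requested": [10, 10],
        "content": [
            "one two three four five six seven eight",
            "a b c d e f g h i j k l",
        ],
    }
)

# Split each response on the space character and count the pieces.
df["word_count"] = df["content"].str.split(" ").str.len()

# Difference between words delivered and words requested, absolute and in percent.
df["difference"] = df["word_count"] - df["requested"]
df["pct_difference"] = df["difference"] * 100 / df["requested"]

# Average percentage miss for each requested length.
summary = df.groupby("requested")["pct_difference"].mean()
```

Splitting on a single space is crude (newlines and double spaces skew the count slightly), but it is close enough to see whether a prompt lands anywhere near its target.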

Feel free to make a copy of the spreadsheet and do your own analysis, or use `thumb` to run a similar experiment of your own! If you find anything interesting, share with me @hammer_mt.
