During a talk at DevDay, OpenAI claimed their new assistant product had achieved 98% accuracy when answering questions about documents:
Credit: Ben Parr on Twitter
That’s a huge claim to make! Performance of other Q&A systems like LlamaIndex tends to hover at around 80% in my testing. The products you can build around an 80% accurate system and a 98% accurate one are wildly different.
Strong claims require strong evidence, so I turned some of our tooling (and API credits) to the task of evaluating the new assistant system. We won’t get fooled again!1
I want to test across a handful of document lengths and styles, so I grabbed a few ebooks of different lengths2. This isn’t going to be super rigorous – the different styles of writing here are going to affect the results.
For each of these documents I created ~200 questions3. Here are some samples across the different books:
A Modest Proposal:
“What is the weight of a child at birth and after a year as estimated by Swift?”
Metamorphosis:
“What did Gregor's father do for a living after the collapse of his business?”
The Science of Climbing Training:
“Why is static or passive stretching not recommended before or during training?”
The Picture of Dorian Gray:
“What is Dorian Gray doing when Lord Henry and Basil first see him in the studio?”
Grimm’s Fairy Tales:
“What did Snow-white and Rose-red find the dwarf caught in when they first encountered him?”
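The questions themselves came from more LLM calls over each book. Here’s a minimal sketch of that step using the OpenAI Python client – the prompt, chunking, and model name are placeholders rather than the exact setup I used:

```python
# Minimal sketch of the question-generation step, assuming the openai v1 Python
# client and an OPENAI_API_KEY in the environment. The prompt, chunk size, and
# model name are illustrative, not the exact ones used for these books.
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, chunk_chars: int = 6000) -> list[str]:
    """Naive fixed-width chunking; a real pipeline would split on chapters."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def questions_for_chunk(chunk: str, n: int = 3) -> str:
    prompt = (
        f"Write {n} direct factual questions that can be answered from the "
        "passage below. For each question, give the answer and quote the "
        "sentence(s) that support it.\n\n"
        f"Passage:\n{chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

book = open("grimms_fairy_tales.txt").read()  # hypothetical local copy of the ebook
for chunk in chunk_text(book):
    print(questions_for_chunk(chunk))
```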
Alongside each question we save the passage that answers the question and a reference answer. For example:
Question: “What did Snow-white and Rose-red find the dwarf caught in when they first encountered him?”
Answer: “His beard was caught in a crevice of a tree”
Source Passage: “A short time afterwards the mother sent her children into the forest to get firewood. There they found a big tree which lay felled on the ground, and close by the trunk something was jumping backwards and forwards in the grass, but they could not make out what it was. When they came nearer they saw a dwarf with an old withered face and a snow-white beard a yard long. The end of the beard was caught in a crevice of the tree, and the little fellow was jumping about like a dog tied to a rope, and did not know what to do.”
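Concretely, each record in the eval set looks something like this (the exact schema here is illustrative):

```python
# One evaluation record, mirroring the example above (the storage schema is illustrative).
record = {
    "book": "Grimm's Fairy Tales",
    "question": "What did Snow-white and Rose-red find the dwarf caught in "
                "when they first encountered him?",
    "reference_answer": "His beard was caught in a crevice of a tree",
    "source_passage": "A short time afterwards the mother sent her children "
                      "into the forest to get firewood. [...]",
}
```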
This is enough information for us to grade the answers with an LLM. Our grading isn’t perfectly accurate, but it’s competitive with humans.
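The grading step is another LLM call: show a judge model the question, the reference answer, the source passage, and the assistant’s answer, and ask for a verdict. A minimal sketch, with a placeholder prompt and judge model:

```python
# Sketch of LLM-as-judge grading: compare the assistant's answer to the reference.
# Assumes the openai v1 client; the prompt wording and judge model are illustrative.
from openai import OpenAI

client = OpenAI()

def grade_answer(record: dict, candidate_answer: str) -> bool:
    prompt = (
        "You are grading a question-answering system.\n"
        f"Question: {record['question']}\n"
        f"Reference answer: {record['reference_answer']}\n"
        f"Source passage: {record['source_passage']}\n"
        f"Candidate answer: {candidate_answer}\n\n"
        "Does the candidate answer agree with the reference answer and the "
        "source passage? Reply with exactly CORRECT or INCORRECT."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

# Example: grade_answer(record, "The dwarf's beard was stuck in a crack in a felled tree.")
```

Accuracy for a book is then just the fraction of its records graded correct.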
After burning around $200 of OpenAI credits generating questions and grading answers, here are the results:
The only example that matches OpenAI’s 98% claim is “A Modest Proposal” – by far the shortest document in the set, and one of the most extensively analyzed pieces of literature in the world4! It would be hard to find an easier document to ask questions about.
Visually it looks like the quality declines with document length, but let’s graph that directly:
We can’t make really strong claims with only five data points, but on the surface it looks like accuracy goes down with longer documents. Our worst-performing document, The Picture of Dorian Gray, dips down to 78% accuracy5! This probably isn’t accurate enough for most applications – you’d fire a customer support rep who was this bad, let alone a doctor or a lawyer.
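For anyone reproducing this, the chart boils down to per-book accuracy from the graded records plotted against a length proxy like word count. A minimal sketch, with illustrative file and field names:

```python
# Sketch: accuracy per book vs. book length (word count as a rough length proxy).
# File name and record fields are illustrative, not the actual pipeline output.
import json
from collections import defaultdict

import matplotlib.pyplot as plt

correct, total = defaultdict(int), defaultdict(int)
with open("graded_results.jsonl") as f:            # one graded record per line
    for line in f:
        row = json.loads(line)                     # expects {"book": ..., "is_correct": ...}
        total[row["book"]] += 1
        correct[row["book"]] += bool(row["is_correct"])

word_counts = {}                                   # book title -> length in words
for book, path in [("A Modest Proposal", "modest_proposal.txt"),
                   ("Grimm's Fairy Tales", "grimms_fairy_tales.txt")]:  # ...and the rest
    word_counts[book] = len(open(path).read().split())

books = [b for b in word_counts if total[b]]
plt.scatter([word_counts[b] for b in books],
            [correct[b] / total[b] for b in books])
plt.xlabel("Document length (words)")
plt.ylabel("Answer accuracy")
plt.show()
```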
I’m not going to get into a deep analysis of the failure modes, but I did notice a concerning trend: OpenAI’s assistant _always_ produces an answer, even when it has clearly not retrieved the right information.
Whether or not the accuracy continues to get worse with even longer documents, it’s clear that the 98% accuracy claim made by OpenAI does not hold up on documents longer than a few pages. These questions are the simplest possible tests – they ask direct questions about facts from the book. There’s no synthesis, analysis, or reasoning required to answer the questions. If there’s a problem set that produces 98% consistently, it must be either very simple or way overfit.
This isn’t to say that the performance is bad! It outperforms other state-of-the-art approaches from LlamaIndex and LangChain in my testing6 – but it’s nowhere near the revolutionary 98% accuracy claimed. OpenAI has released a well-implemented RAG system built on existing techniques. To hit 98% accuracy on real scenarios, we’re going to need more than that.
Maybe at next year’s DevDay.
Can you guess which one of these I already owned?
The exact number varies because of a post-processing step that removes low-quality questions like “What is the tax identification number of the Project Gutenberg Literary Archive Foundation?” It’s technically a valid question answered in the doc, since every ebook has a Project Gutenberg header page, but we want to remove these in the spirit of the test.
I could also only get 55 unique questions from A Modest Proposal, because it’s so short.
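For the curious, a minimal sketch of that filter (the real heuristics may differ):

```python
# Sketch of the low-quality-question filter described above; the heuristics are my own.
BOILERPLATE_TERMS = (
    "project gutenberg",
    "literary archive foundation",
    "tax identification number",
)

def is_low_quality(question: str) -> bool:
    """Flag questions that target the ebook's license/header pages rather than the text."""
    q = question.lower()
    return any(term in q for term in BOILERPLATE_TERMS)

questions = [
    "What did Gregor's father do for a living after the collapse of his business?",
    "What is the tax identification number of the Project Gutenberg Literary Archive Foundation?",
]
kept = [q for q in questions if not is_low_quality(q)]  # drops the Gutenberg question
```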
There’s even an analysis written by me for a middle school English assignment. Fortunately this is probably crumpled up in a landfill somewhere, and never made it into any training datasets.
Why is performance worse on Dorian Gray than the longer Grimm’s Fairy Tales? I think this has to do with the style of writing making retrieval more difficult. “What is Dorian Gray doing when Lord Henry and Basil first see him in the studio?” has very little information that’s useful for retrieving the right passage – all three characters, as well as the studio, are talked about throughout the book.
On the other hand, any mention of “Snow White” is going to get you (at a minimum) to the correct chapter. A fairer test here would use increasingly long segments of the same document instead of different books.
I think the climbing book benefits from this phenomenon too, since many of the chapters have distinct identifying characteristics or themes that make retrieval much easier.
More on that in later blog posts.