Hallucinations, Fact checking, Entailment and all that. What does it all mean?

Introduction

Generative AI, particularly Large Language Models (LLMs), has revolutionized content creation. However, such models sometimes suffer from a phenomenon called hallucination. These are instances when LLMs generate content that is made up, incorrect, or unsupported. To understand the nuances of this problem, let’s explore some scenarios:

  1. Creative Writing: You ask a model to generate a story for you. Is the model hallucinating? Technically, yes, but this is what you want!

  2. Coding: You want the model to explain how to use a Python function, and in its output it invents an example that calls functions that don’t exist in the current environment. The model is partially hallucinating.

  3. Factual Queries: You ask a model which episodes of a TV show an actor appeared in, and it produces an incorrect list. In this case, the model generates incorrect information: an unambiguous instance of what we call hallucination.

The first two are creative hallucinations that can be useful, depending on the application. The last one is a factual hallucination, which is unacceptable. We will discuss this kind of hallucination in the rest of the post.

Grounded factuality

Factuality refers to the quality of being based on generally accepted facts. For example, stating “Athens is the capital of Greece” is factual because it is widely agreed upon. However, establishing factuality becomes complex when credible sources disagree. Wikipedia gets around such problems by linking each claimed fact to an external source. For example, consider the statement:

“In his book Crossfire, Jim Marrs gives accounts of several people who said they were intimidated by the FBI into altering facts about the assassination of JFK.”

The factuality of the statement “the FBI altered facts about the assassination of JFK” is not what the Wikipedia editors address. Instead, the factual claim is about the content of Marrs' book, something that can easily be checked by going directly to the source.

This leads us to a useful refinement of factuality: grounded factuality. Grounded factuality means checking the output of an LLM against a context that was provided to the LLM.

Consider the following example: while “Athens is the capital of Greece” is factually correct today, from 1827 to 1834 the capital was the city of Nafplio. So, in the context of a historical document about Greece in that period, stating that Athens was the capital in 1828 would be a hallucination.

In grounded factuality, the source of truth is given in a context document. Then, the factuality of a claim is checked with respect to that given context. In the NLP and linguistics literature, this problem is also called textual entailment.
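To make this concrete, here is a minimal illustration (in Python, with made-up field names) of what a grounded factuality check operates on: a context document, a claim, and a label that depends only on that context, not on general world knowledge.

```python
# A minimal, hypothetical illustration of grounded factuality:
# the label depends on the supplied context, not on world knowledge.
examples = [
    {
        "context": "Nafplio served as the capital of Greece from 1827 to 1834, "
                   "before the capital was moved to Athens.",
        "claim": "Athens was the capital of Greece in 1828.",
        "grounded": False,  # contradicted by the given context
    },
    {
        "context": "Athens is the capital and largest city of Greece.",
        "claim": "Athens is the capital of Greece.",
        "grounded": True,  # directly supported by the given context
    },
]

for ex in examples:
    print(ex["claim"], "->", "grounded" if ex["grounded"] else "not grounded")
```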

At Bespoke Labs, we have built a small grounded factuality detector that tops the LLM-AggreFact leaderboard. The model detects within milliseconds whether a given claim is factually grounded in a given context and can be tested at playground.bespokelabs.ai (HuggingFace link).

The Bespoke-MiniCheck-7B model is lightweight and outperforms much larger foundation models, including GPT-4o and Mistral Large 2, on this specialized task. Bespoke-MiniCheck-7B is also available via API through our Console.

Importance of Grounded Factuality

Grounded factuality is particularly important for RAG systems, which rely on retrieved context to answer questions. RAG has become popular because it reduces hallucinations. However, deploying RAG does not by itself eliminate hallucinations. For example, a recent study found that in the legal setting, RAG-based AI research tools hallucinate 17% to 33% of the time, contrary to claims that such systems are “hallucination-free”.

For RAG, enforcing grounded factuality means that the outputs are faithful to the content of the retrieved source documents.

How LLM fact-checkers work

LLM fact-checkers take one or more documents, which serve as the context, and a claim to be checked. The claim could be a single sentence or a long response from an LLM. Checking the claim involves figuring out what to check and then checking each piece against the relevant documents.

Figuring out what to check is straightforward with Bespoke-MiniCheck: one can check each sentence of the response individually, which works well in practice. Other methods use a separate claim decomposition step (e.g., WiCE, FActScore, Factcheck-GPT, RAGAS faithfulness). Decomposition can be done with an extra call to an LLM, but this increases latency, and we haven’t found it to improve performance substantially.

Checking each piece of the claim is a textual entailment problem, which is framed as classification: a claim is either supported or unsupported by a context (note: some models differentiate “contradicted by” from “unsupported” as a third class). With such a classifier, we can obtain either hard decisions or soft “support scores” for each part of a claim based on the class probabilities. Bespoke-MiniCheck can output either of these, as sketched below.
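As a rough sketch of this pipeline, the snippet below splits a response into sentences and scores each one against the context. The check_claim callable is a hypothetical stand-in for whatever grounded fact-checker you plug in; the thresholding into hard labels mirrors the classification framing above.

```python
import re
from typing import Callable

def split_into_sentences(text: str) -> list[str]:
    # Naive sentence splitter; a library such as nltk or spaCy
    # would be more robust in practice.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def check_response(
    context: str,
    response: str,
    check_claim: Callable[[str, str], float],  # hypothetical: returns P(claim supported by context)
    threshold: float = 0.5,
) -> list[dict]:
    """Score each sentence of `response` against `context`."""
    results = []
    for sentence in split_into_sentences(response):
        prob = check_claim(context, sentence)  # soft "support score" in [0, 1]
        results.append({
            "sentence": sentence,
            "support_prob": prob,
            "supported": prob >= threshold,  # hard decision
        })
    return results

# Example with a dummy checker that favors sentences mentioning Nafplio.
dummy_checker = lambda ctx, claim: 0.9 if "Nafplio" in claim else 0.2
context = "Nafplio was the capital of Greece from 1827 to 1834."
response = "Nafplio was the capital in 1828. Athens was the capital in 1828."
for row in check_response(context, response, dummy_checker):
    print(row)
```

In practice you would swap the dummy checker for a real model call and use a more robust sentence splitter; the aggregation of per-sentence scores is discussed below.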

Using Bespoke-MiniCheck

Get your API key at console.bespokelabs.ai; usage is then straightforward through the FastAPI interface. You can quickly try an example with this colab (see this for additional documentation).
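As a very rough sketch, calling a hosted checker over HTTP might look like the following. Note that the endpoint URL, payload fields, and response shape here are assumptions for illustration only, not the documented Bespoke API; defer to the colab and Console documentation for the actual interface.

```python
import os
import requests

# NOTE: the endpoint URL, request fields, and response fields below are
# hypothetical placeholders, not the documented Bespoke API; consult the
# Console documentation / colab for the real interface.
API_KEY = os.environ.get("BESPOKE_API_KEY", "")

def check_factuality(claim: str, context: str) -> float:
    resp = requests.post(
        "https://api.bespokelabs.ai/v0/minicheck/factcheck",  # assumed URL
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"claim": claim, "context": context},  # assumed payload
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["support_prob"]  # assumed response field

score = check_factuality(
    claim="Nafplio was the capital of Greece in 1828.",
    context="Nafplio served as the capital of Greece from 1827 to 1834.",
)
print(f"support probability: {score:.2f}")
```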

By checking every claim in your LLM’s response and averaging the support probabilities, you can compute a single MiniCheck score that can be used like RAGAS’s faithfulness score. Alternatively, by counting the sentences whose predicted label is “unsupported”, you can estimate how many sentences in the LLM’s output are not factually grounded, as in the sketch below.
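For example, given per-sentence support probabilities from the checker, both aggregations are one-liners (the probability values below are made up):

```python
# Per-sentence support probabilities returned by the fact-checker
# (illustrative values).
support_probs = [0.97, 0.88, 0.12, 0.91]
threshold = 0.5

# A single response-level score, analogous to a faithfulness score:
# the mean of the per-sentence support probabilities.
minicheck_score = sum(support_probs) / len(support_probs)

# Count how many sentences fall below the support threshold,
# i.e., are judged not factually grounded in the context.
num_ungrounded = sum(1 for p in support_probs if p < threshold)

print(f"response-level score: {minicheck_score:.2f}")
print(f"ungrounded sentences: {num_ungrounded} of {len(support_probs)}")
```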

For additional questions, please contact us!