Does teaching an LLM to say "No" help with hallucinations?
Enter R-Tuning
For a long time, the standard (if overlooked) approach to training LLMs created a people-pleaser dynamic: models were trained to predict the most statistically likely next word and were heavily penalized for declining to answer or for being unhelpful. The flaw in this kind of training is that models learned that guessing was “safer” than admitting they did not know enough to answer a prompt, which led to hallucinated answers.
If a model were taught to say “I don’t know” when necessary, it would be reasonable to expect far fewer hallucinations.
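To make that incentive concrete, here is a minimal sketch under an assumed scoring rule. The rule is my own illustration, not something any particular lab is known to use: a correct answer scores 1, an "I don't know" scores 0, and a wrong answer costs `wrong_penalty` (a hypothetical parameter). When wrong answers cost nothing, guessing never scores worse than abstaining.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> dict:
    """Expected score of guessing vs. abstaining under a hypothetical grader.

    Assumed scoring: correct = +1, wrong = -wrong_penalty, abstain = 0.
    """
    guess = p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty
    abstain = 0.0
    return {"guess": guess, "abstain": abstain}


# No penalty for wrong answers: even a 10% chance of being right beats abstaining.
no_penalty = expected_score(p_correct=0.1)

# Wrong answers cost as much as right answers earn: the same guess now loses.
with_penalty = expected_score(p_correct=0.1, wrong_penalty=1.0)

print("no penalty:  ", no_penalty)
print("with penalty:", with_penalty)
```

Under this toy rule, the only way to make "I don't know" attractive is to make wrong answers cost something, which is roughly the lever that refusal-aware training pulls.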
§
A (bit) more on hallucinations
Before we look into how that works, or whether it works at all, it’d be useful to understand the above case of hallucinations in more depth.
Viewed from a technical standpoint, hallucinations in LLMs make a lot of sense: a model is prompted with something outside its knowledge (the facts it memorized during pre-training), but it has been instructed (or forced, in a way) to complete the sentence anyway. Lacking the actual data, the model stitches together what it does know and generates something plausible, which is essentially a confident fabrication.
§
Refusal-Aware Instruction Tuning (R-Tuning)
As it turns out, I was not the only one who thought teaching LLMs to say “I don’t know” would help with hallucinations.
About a year after ChatGPT was released, a team of researchers introduced R-Tuning, which stands for Refusal-Aware Instruction Tuning. It works in a rather thoughtful way: researchers first test the pre-trained model to map out what it actually knows versus what it doesn’t. Then, for the questions the model naturally gets wrong, the training data is modified to replace the target answer with a refusal, i.e., “I don’t know”.
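The data-construction step described above can be sketched roughly as follows. This is a simplified illustration, not the researchers' actual code: `build_refusal_aware_dataset`, `toy_model`, the exact-match check, and the refusal wording are all my own assumptions.

```python
from typing import Callable

# Placeholder refusal text; the exact wording is an assumption.
REFUSAL = "I don't know"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count as errors."""
    return " ".join(text.lower().split())


def build_refusal_aware_dataset(
    qa_pairs: list[tuple[str, str]],
    model_answer: Callable[[str], str],
) -> list[dict]:
    """Keep the gold answer where the model is already right; otherwise swap in a refusal."""
    dataset = []
    for question, gold in qa_pairs:
        prediction = model_answer(question)
        # Exact match is a crude stand-in for a real correctness check.
        known = normalize(prediction) == normalize(gold)
        dataset.append({
            "prompt": question,
            "target": gold if known else REFUSAL,
            "known": known,
        })
    return dataset


def toy_model(question: str) -> str:
    """Hypothetical stand-in for a pre-trained model: right about one fact, wrong about another."""
    answers = {
        "Capital of France?": "Paris",
        "Capital of Australia?": "Sydney",
    }
    return answers.get(question, "")


data = build_refusal_aware_dataset(
    [("Capital of France?", "Paris"), ("Capital of Australia?", "Canberra")],
    toy_model,
)
for row in data:
    print(row)
```

In this toy run, the France question keeps its gold answer as the training target, while the Australia question, which the stand-in model gets wrong, is relabelled with the refusal.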
By explicitly teaching the model to refrain from answering queries it doesn’t know about, R-Tuning improves a model’s reliability on known facts and reduces fabricated answers for unknown ones. On top of that, the researchers found that learning this kind of uncertainty acts as a “meta-skill” that helps the model estimate its own knowledge boundaries even on entirely new tasks, without researchers mapping them out for it.
Perhaps as expected, however, this comes with a caveat; otherwise, we would have put a stop to hallucinations long ago.
§
The risk of over-refusals
When we teach an LLM that it can, in fact, say “No” to prompts, we risk it refusing perfectly valid questions just to be safe.
Why this happens is also understandable: during training, an LLM’s knowledge can shift drastically. A model might learn a fact in a later stage of training that it didn’t know previously. But if it was already trained to say “I don’t know” on that specific topic, it might keep refusing to answer, even though it now possesses the knowledge to do so.
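One way to picture this stale-label problem is to re-check the refusal labels against a later checkpoint. The sketch below is hypothetical: `find_stale_refusals`, the dictionary layout, and the toy checkpoint are illustrative names of mine, and exact string matching stands in for a proper correctness check.

```python
from typing import Callable

# Placeholder refusal text; the exact wording is an assumption.
REFUSAL = "I don't know"


def find_stale_refusals(
    examples: list[dict],
    model_answer: Callable[[str], str],
) -> list[str]:
    """Return prompts that are labelled as refusals but that the current model answers correctly."""
    stale = []
    for ex in examples:
        if ex["target"] != REFUSAL:
            continue  # only refusal-labelled examples can go stale in this sense
        prediction = model_answer(ex["prompt"]).strip().lower()
        if prediction == ex["gold"].strip().lower():
            stale.append(ex["prompt"])
    return stale


# Labels created against an earlier checkpoint (hypothetical data).
examples = [
    {"prompt": "Capital of France?", "gold": "Paris", "target": "Paris"},
    {"prompt": "Capital of Australia?", "gold": "Canberra", "target": REFUSAL},
    {"prompt": "Capital of Bhutan?", "gold": "Thimphu", "target": REFUSAL},
]


def later_checkpoint(question: str) -> str:
    """Toy stand-in for a later checkpoint that has since picked up one new fact."""
    answers = {
        "Capital of France?": "Paris",
        "Capital of Australia?": "Canberra",
    }
    return answers.get(question, "")


stale = find_stale_refusals(examples, later_checkpoint)
print(stale)
```

Any prompt that lands in `stale` is one where continuing to train on the refusal would teach the model to decline a question it can now answer.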
Even if the above scenario were resolved through some technique (such as applying R-Tuning in the final stages of training), there are still other cases to worry about. The model may become overly cautious, generalizing its refusal reaction to topics it actually knows but isn’t fully certain about. Or, in cases like persona-tuning, it might learn that refusing a certain share of prompts is a matter of personality rather than knowledge.
§
Fixes and the answers
To reduce the risk of over-refusal, researchers have come up with a fix: instead of a simple “answer or refuse” rule, we can teach models to recognize different kinds of uncertainty by sorting their knowledge into three categories. These are things the model is absolutely certain about, things it has a vague idea about, and things it knows it doesn’t know.
By fine-tuning on these categories, the model learns to strike a balance. It learns to deliver confident, detailed answers when it has the cold, hard facts, and to gracefully decline, or ask for clarification, when it is genuinely out of its depth.
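A rough sketch of how the three-way split might be produced is below. It assumes certainty is estimated by sampling the model several times and counting how often it matches the reference answer. That estimation method, the thresholds, and every name in the block (`categorize`, `make_sampler`, `high`, `low`) are my own illustrative choices, not something taken from the research.

```python
from itertools import cycle
from typing import Callable


def categorize(
    question: str,
    gold: str,
    sample_answer: Callable[[str], str],
    n_samples: int = 10,
    high: float = 0.8,  # assumed cut-off for "certain"
    low: float = 0.2,   # assumed cut-off below which we call it "unknown"
) -> str:
    """Bucket a question by how often sampled answers match the reference."""
    hits = sum(
        sample_answer(question).strip().lower() == gold.strip().lower()
        for _ in range(n_samples)
    )
    rate = hits / n_samples
    if rate >= high:
        return "certain"
    if rate > low:
        return "vague"
    return "unknown"


def make_sampler(answers: list[str]) -> Callable[[str], str]:
    """Toy deterministic 'model' that cycles through a fixed list of answers."""
    it = cycle(answers)
    return lambda question: next(it)


labels = [
    categorize("Capital of France?", "Paris", make_sampler(["Paris"])),
    categorize("Capital of Australia?", "Canberra", make_sampler(["Canberra", "Sydney"])),
    categorize("Capital of Bhutan?", "Thimphu", make_sampler(["Kathmandu"])),
]
print(labels)
```

Each bucket could then get its own training target: a direct answer for "certain", a hedged answer or a clarifying question for "vague", and a refusal for "unknown".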
So, to answer the question in the title of this essay: yes, it certainly helps with minimizing hallucinations. But again, it’s not all rainbows and sunshine.
Does teaching an LLM to say “no” actually fix hallucinations, though? Not really. LLMs are not databases that retrieve stored, constant facts; they calculate and predict the most likely words, which is far from definitive, grounded logic. There is some potential to reach near-zero hallucinations, but at the cost of introducing complex machinery that can dynamically map out what a model knows or doesn’t at every stage of training and inference, and that is unlikely to be in the interest of generative AI companies. At least not given the current state of things.
