
Battling Hallucinations in Prod

Spent a few months building digital twins for creators, and it was fun trying to emulate the mannerisms and diversity of human responses. (Didn't quite stick as a business model; story for another post.)

One of the major issues we faced in building this product was the concern from the creators (or anyone who was being cloned) that the digital twin might say something they wouldn't have and cause trouble for them. So naturally, we had to build very strict guardrails and verification mechanisms to ensure the responses were grounded in the knowledge base and to minimize hallucination. Most of it was iterative, i.e., we fixed things as and when they broke. Interestingly, I stumbled upon a paper from OpenAI [1] on why LLMs hallucinate, and it was a nice validation of our efforts.

Why do LLMs hallucinate?

It is common knowledge now that, like students facing difficult exam questions, LLMs guess when uncertain instead of admitting "I don't know".

A brief and possibly over-simplified reason why LLMs hallucinate is how they are evaluated: leaving an answer blank scores zero, whereas guessing gives a shot at points. On top of that, the pre-training data only teaches the model to generate text that follows the rules of the language (sentence structure, grammar, etc.) and to reproduce globally well-replicated facts; it carries no correct/incorrect labels for low-frequency facts. More details on the "why" in the references.
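
To make the grading incentive concrete, here is a toy expected-score calculation (the numbers are mine, not from the paper): under binary grading, abstaining is worth exactly zero, while a guess is worth its probability of being right, so guessing always has at least as much expected value.

// Toy expected-score comparison under binary grading (illustrative numbers only)
function expectedScore(pCorrect: number, abstain: boolean): number {
  if (abstain) return 0;                    // "I don't know" scores nothing
  return pCorrect * 1 + (1 - pCorrect) * 0; // a guess scores only when it lands
}

console.log(expectedScore(0.2, true));  // 0   -> abstaining
console.log(expectedScore(0.2, false)); // 0.2 -> even a 20% guess beats abstaining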

So now that it has been sort of established that hallucinations are statistically hard to get rid of completely, the next question is how to mitigate them in production. After reviewing the landscape and testing different approaches, we experimented with three clusters of techniques:

1. RAG + Retrieval Quality

This is usually the most common cause of hallucinations: the model doesn't have the right information at inference time, so it defaults to making things up. RAG and the subsequent post-processing steps fix this by forcing it to quote reality instead of its priors.

Naive RAG isn't enough ofc:

// Naive RAG (what we started with)
const context = await vectorSearch(query);
const response = await llm.chat([
  { role: "system", content: context },
  { role: "user", content: query }
]);
// Problem:
// - Retrieves irrelevant docs (low precision)
// - Misses relevant docs (low recall)
// - Model gets "lost in the middle" of long context

The production version needed multiple layers:

// Production RAG Pipeline
class ProductionRAG {
  async retrieve(query: string) {
    // Step 1: Boost recall
    const expandedQueries = await this.expandQuery(query);
    const hydeDoc = await this.generateHypotheticalDoc(query);

    // Step 2: Sparse + dense search
    const bm25Results = await this.bm25Search(expandedQueries);
    const vectorResults = await this.vectorSearch([query, hydeDoc]);

    // Step 3: Boost precision
    const reranked = await this.rerank([...bm25Results, ...vectorResults]);
    const compressed = await this.contextualCompress(reranked);

    // Step 4: Handle position bias
    const structured = this.packContext(compressed); // Avoid "lost in middle"

    return {
      context: structured,
      citations: this.extractCitations(compressed)
    };
  }

  async expandQuery(query: string) {
    // "marathon training" → ["marathon training", "running preparation", "endurance training"]
    return await this.llm.expand(query);
  }

  async generateHypotheticalDoc(query: string) {
    // HyDE: generate what a good answer would look like, use it for search
    return await this.llm.complete(
      `Write a paragraph that would answer: ${query}`
    );
  }

  rerank(docs: Doc[]) {
    // Cross-encoder rescoring - much more accurate than pure vector similarity
    return this.crossEncoder.score(docs);
  }

  contextualCompress(docs: Doc[]) {
    // Remove irrelevant sentences, keep only what's needed
    return this.compressor.compress(docs);
  }
}

Each sub-component is supposed to address a different failure mode:

  • Multi-query + HyDE: Reduces "no relevant docs" cases
  • Hybrid search: catches both exact string matches (BM25) and semantic matches (dense vectors)
  • Reranking: cross-encoder rescoring pushes the genuinely relevant docs to the top
  • Compression: Removes noise that confuses the model
  • Context packing: Puts important info at start/end, since models pay less attention to the middle (see the sketch after this list)
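
Context packing is the least standard piece of that list, so here is a minimal sketch of the idea behind packContext; the interleaving scheme and the Doc shape are illustrative, not the exact implementation:

// Sketch of a "lost in the middle" mitigation: interleave ranked docs so the
// best ones land at the start and end of the prompt, weakest in the middle.
// (Illustrative only; the real packContext also handled token budgets.)
interface Doc { id: string; text: string; score: number; }

function packContext(docs: Doc[]): string {
  const ranked = [...docs].sort((a, b) => b.score - a.score);
  const front: Doc[] = [];
  const back: Doc[] = [];
  ranked.forEach((doc, i) => (i % 2 === 0 ? front.push(doc) : back.unshift(doc)));
  // front holds ranks 1, 3, 5, ... in order; back holds ranks 2, 4, 6, ... reversed,
  // so rank 1 opens the context and rank 2 closes it.
  return [...front, ...back].map(d => d.text).join("\n\n");
}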

All of this was iterative and was added as a paranoid measure for each failure mode. :)

2. Output Verification

Even with perfect RAG, models still make stuff up. The trick is to not trust the first answer.

// Chain-of-Verification (CoVe)
async function chainOfVerification(query: string, context: string) {
  // Pass 1: Generate answer
  const answer = await llm.chat([
    { role: "system", content: context },
    { role: "user", content: query }
  ]);

  // Pass 2: Generate verification questions for the key claims (one per line)
  const verificationQuestions = await llm.chat([
    { role: "system", content: "Generate verification questions for key claims, one per line" },
    { role: "user", content: `Answer: ${answer}\nContext: ${context}` }
  ]);
  const questions = verificationQuestions
    .split("\n")
    .map(q => q.trim())
    .filter(Boolean);

  // Pass 3: Answer each verification question independently against the context
  const verifications = await Promise.all(
    questions.map(q =>
      llm.chat([
        { role: "system", content: context },
        { role: "user", content: q }
      ])
    )
  );

  // Pass 4: Reconcile and correct
  const finalAnswer = await llm.chat([
    { role: "system", content: "Reconcile original answer with verifications" },
    { role: "user", content: JSON.stringify({ answer, verifications }) }
  ]);
  return finalAnswer;
}

This works because models are better at criticizing bad answers than avoiding them initially. It's easier to spot "wait, the context says 2024, not 2025" than to generate the right year in the first pass.

Alternative approach - self-consistency:

// Generate N answers, flag disagreements
async function selfConsistencyCheck(query: string, context: string, n = 5) {
  const answers = await Promise.all(
    Array(n).fill(null).map(() =>
      llm.chat([
        { role: "system", content: context },
        { role: "user", content: query }
      ], { temperature: 0.7 }) // Higher temp for diversity
    )
  );

  // Extract key facts from each answer (another LLM call - adds latency)
  const facts = await Promise.all(
    answers.map(a => extractFacts(a))
  );

  // Check agreement
  const agreement = calculateAgreement(facts);
  if (agreement < 0.7) {
    return {
      answer: null,
      confidence: "low",
      reason: "High disagreement across samples"
    };
  }

  // Return most common answer
  return {
    answer: getMostCommon(answers),
    confidence: "high",
    agreement
  };
}

If the model gives different answers with slight prompt variations, it's probably guessing. Inconsistency correlates strongly with hallucination.
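
For reference, a minimal sketch of the calculateAgreement and getMostCommon helpers used above; the pairwise 0.5 fact-overlap threshold is illustrative, and the real version matched facts fuzzily rather than by exact string:

// Fraction of sample pairs whose extracted facts mostly overlap (1.0 = full agreement)
function calculateAgreement(factSets: string[][]): number {
  const normalizedSets = factSets.map(set => new Set(set.map(f => f.toLowerCase().trim())));
  let agreeing = 0;
  let pairs = 0;
  for (let i = 0; i < normalizedSets.length; i++) {
    for (let j = i + 1; j < normalizedSets.length; j++) {
      pairs++;
      const shared = [...normalizedSets[i]].filter(f => normalizedSets[j].has(f)).length;
      const smaller = Math.min(normalizedSets[i].size, normalizedSets[j].size) || 1;
      if (shared / smaller >= 0.5) agreeing++; // two samples "agree" if they share half their facts
    }
  }
  return pairs === 0 ? 1 : agreeing / pairs;
}

// Majority vote over the raw sampled answers
function getMostCommon(answers: string[]): string {
  const counts = new Map<string, number>();
  for (const a of answers) counts.set(a, (counts.get(a) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0][0];
}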

Latency Overhead

It was easy to come up with the entire cohort of solutions, but as expected it added a lot of latency, which was not acceptable for our real-time streaming avatar use cases. Other than picking and choosing solutions for our use case, a few systemic optimizations also helped:

  • Parallelize - Most obvious: run dense and sparse search concurrently, and send verification questions in parallel rather than sequentially
  • Model Tiering - use GPT-4o-mini for query expansion and HyDE, and save expensive models for final answer generation (see the sketch after this list)
  • Optimistic Streaming - stream the answer immediately while running verification in the background, and add "Low Confidence" badges if verification fails (still on my TODO)
  • Semantic Caching - cache verified answers in Redis and serve them instantly for similar queries
  • Speculative RAG - generate from model knowledge while fetching documents, and inject corrections if contradictions are found
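
For the model-tiering item, a minimal sketch of the routing idea, assuming the same hypothetical llm.chat client from the earlier snippets accepts a model option (the tier assignments are illustrative):

// Route each pipeline task to an appropriately sized model.
interface Message { role: "system" | "user"; content: string; }
type Task = "query_expansion" | "hyde" | "fact_extraction" | "final_answer";

const MODEL_TIERS: Record<Task, string> = {
  query_expansion: "gpt-4o-mini", // cheap, runs on every request
  hyde: "gpt-4o-mini",            // cheap, recall matters more than polish
  fact_extraction: "gpt-4o-mini", // cheap, narrow structured task
  final_answer: "gpt-4o"          // expensive, user-facing
};

async function tieredChat(task: Task, messages: Message[]) {
  return llm.chat(messages, { model: MODEL_TIERS[task] });
}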

We used parallelization and model tiering in production. Streaming felt jarring for our use case (creative responses), and we didn't have enough traffic to justify caching infrastructure.

3. Guardrails & Structured Output

This doesn't improve correctness as much as #1 and #2, but it limits the blast radius. When the model does hallucinate, make sure it fails gracefully.

// Structured output with citations
const response = await llm.chat(messages, {
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "grounded_response",
      schema: {
        type: "object",
        properties: {
          answer: { type: "string" },
          confidence: {
            type: "string",
            enum: ["high", "medium", "low", "unknown"]
          },
          citations: {
            type: "array",
            items: {
              type: "object",
              properties: {
                claim: { type: "string" },
                source: { type: "string" },
                quote: { type: "string" }
              },
              required: ["claim", "source"]
            }
          }
        },
        required: ["answer", "confidence", "citations"]
      }
    }
  }
});

// Validate response
if (response.confidence === "low" || response.citations.length === 0) {
  return "I don't have enough information to answer that confidently.";
}

// Verify each citation exists in context
const verified = response.citations.every(c =>
  context.includes(c.quote)
);
if (!verified) {
  return "I cannot verify all claims against the provided context.";
}

The structured output forces the shape of the response:

  1. Explicitly state confidence
  2. Provide citations for each claim
  3. Quote specific text from context

Then you validate that those citations actually exist (a faithfulness-to-source check). If they don't, reject the response.
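
For illustration, a valid response under that schema might look like this; the claim, source, and quote are invented:

// Hypothetical grounded_response payload
const example = {
  answer: "My first marathon was Berlin, and I finished just under four hours.",
  confidence: "high",
  citations: [
    {
      claim: "First marathon was Berlin, finished under four hours",
      source: "interview_2023.md",
      quote: "my first marathon was Berlin, I barely broke four hours"
    }
  ]
};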

Bringing it all together:

async function robustLLMQuery(query: string) {
  // Layer 1: RAG (Primary defense)
  const { context, citations } = await productionRAG.retrieve(query);

  // Layer 2: Verification (Secondary defense)
  // Assumes the final CoVe call returns the structured grounded_response
  // shape from section 3 (answer text, confidence, citations)
  const answer = await chainOfVerification(query, context);

  // Layer 3: Guardrails (Tertiary defense)
  const validated = await validateWithCitations(answer, context);
  if (!validated.isValid) {
    return {
      answer: "I don't have enough verified information to answer that.",
      confidence: "none",
      citations: []
    };
  }
  return {
    answer: validated.answer,
    confidence: validated.confidence,
    citations: validated.citations
  };
}

async function validateWithCitations(answer: any, context: string) {
  // Check confidence
  if (answer.confidence === "low") {
    return { isValid: false };
  }

  // Verify citations exist in context using fuzzy matching
  const allCitationsValid = answer.citations.every(c => {
    // Simplified bag-of-words check. In prod, use N-gram overlap or sequence matching
    const quoteTokens = c.quote.toLowerCase().split(/\s+/);
    const contextTokens = context.toLowerCase().split(/\s+/);
    const overlap = quoteTokens.filter(t => contextTokens.includes(t)).length;
    return overlap / quoteTokens.length > 0.8; // 80% token overlap threshold
  });
  if (!allCitationsValid) {
    return { isValid: false };
  }

  // Check semantic similarity of answer to context
  const similarity = await semanticSimilarity(answer.text, context);
  if (similarity < 0.7) {
    return { isValid: false };
  }

  return {
    isValid: true,
    answer: answer.text,
    confidence: answer.confidence,
    citations: answer.citations
  };
}
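
The semanticSimilarity call above is just cosine similarity over embeddings. A minimal sketch, assuming an embeddings client whose embed(text) method returns a plain number[] vector (swap in your provider's embeddings endpoint):

// Cosine similarity between the answer text and the retrieved context
async function semanticSimilarity(a: string, b: string): Promise<number> {
  const [va, vb] = await Promise.all([embeddings.embed(a), embeddings.embed(b)]);
  const dot = va.reduce((sum, x, i) => sum + x * vb[i], 0);
  const normA = Math.sqrt(va.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(vb.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}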

This was enough for the domain we were targeting, creative output: even though perfect was still out of reach, it was good enough for the creators.

Next Steps (In an Alternate Timeline)

If we had continued on this product, these were on my list.

For systematic optimization, we'd need an evaluation dataset with labeled QA pairs and correct documents. This unlocks two capabilities:

Component-Level Metrics

  • Track retrieval recall (% of relevant docs retrieved) and precision (% of retrieved docs that are relevant); see the sketch after this list
  • Measure verification catch rate (% of hallucinations caught) and false positive rate
  • Monitor citation accuracy and confidence calibration
  • Identify exactly where the pipeline breaks
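
A minimal sketch of how the retrieval metrics could be computed against such an eval set; the EvalExample shape is an assumption, since we never got to build this:

// Hypothetical eval example: a question plus the doc IDs that actually answer it
interface EvalExample {
  question: string;
  relevantDocIds: string[];
}

function retrievalMetrics(example: EvalExample, retrievedDocIds: string[]) {
  const relevant = new Set(example.relevantDocIds);
  const hits = retrievedDocIds.filter(id => relevant.has(id)).length;
  return {
    recall: relevant.size === 0 ? 1 : hits / relevant.size,                     // % of relevant docs retrieved
    precision: retrievedDocIds.length === 0 ? 0 : hits / retrievedDocIds.length // % of retrieved docs that are relevant
  };
}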

DSPy

  • Replace manual prompt tweaking with algorithmic optimization
  • Systematically explore prompt variations
  • Optimize entire pipeline configurations without retraining the model

Both require ground truth to measure against. Building that eval set is the prerequisite work before either becomes useful.

Some of the code samples were generated to present a simplified version of the actual implementation.


References:

[1] OpenAI, "Why Language Models Hallucinate", Kalai et al., 2025.
