{"id":1190337,"date":"2026-05-11T17:17:41","date_gmt":"2026-05-11T16:17:41","guid":{"rendered":"https:\/\/staging-artefact.digitalseeder.com\/?post_type=blog&#038;p=1190337"},"modified":"2026-05-11T17:32:03","modified_gmt":"2026-05-11T16:32:03","slug":"detecting-hallucinations-in-llms-one-token-at-a-time","status":"publish","type":"blog","link":"https:\/\/staging-artefact.digitalseeder.com\/fr\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/","title":{"rendered":"Detecting Hallucinations in LLMs, One Token at a Time"},"content":{"rendered":"<p>How entropy-based scoring can tell you when your model is making things up \u2014 and where \u2014 wrapped up in artefactual, our Python package.<\/p>\n<p>Note: This article is a follow-up to <a href=\"https:\/\/medium.com\/ardian-data-science\/is-your-ai-lying-to-you-04e59ac61fff?postPublishedType=initial\">the article from our friends at Ardian<\/a>, detailing how crucial accountable AI is for financial institutions. Make sure to check it out!<\/p>\n<h2>The hallucination problem<\/h2>\n<p>Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have also become famous for making up facts with disconcerting confidence.<\/p>\n<p>In the Natural Language Processing (NLP) literature, a hallucination is any model-generated content that is factually incorrect, nonsensical, or unfaithful to a provided source, while appearing perfectly plausible. The consequences range from benign (a wrong trivia answer) to severe (a fabricated legal citation, an incorrect drug dosage). As organizations integrate LLMs into production systems, the question shifts from <em>\u201ccan this model generate useful text?\u201d<\/em> to <em>\u201ccan we trust what it just said?\u201d<\/em><\/p>\n<p>Consider a concrete example. 
You work at a financial institution, and you ask your local LLM:<\/p>\n<blockquote><p>\u201cWhat was Emerson Electric\u2019s net revenue in 2023?\u201d<\/p><\/blockquote>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190466\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1.png\" alt=\"\" width=\"700\" height=\"185\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27700%27%20height%3D%27185%27%20viewBox%3D%270%200%20700%20185%27%3E%3Crect%20width%3D%27700%27%20height%3D%27185%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-200x53.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-300x79.png 300w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-400x106.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-600x158.png 600w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-768x203.png 768w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-800x211.png 800w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1-1024x270.png 1024w, 
https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-1.png 1045w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>The model replies: <em>\u201cEmerson Electric reported a net revenue of approximately $15.2 billion for fiscal year 2023.\u201d<\/em> Sounds reasonable. But is it right? You don\u2019t have the annual report open. You don\u2019t have a ground truth to compare against. You just have the model\u2019s output \u2014 and doubt.<\/p>\n<p>This is the setting we work in. No oracle. No reference answer at inference time. Just an LLM response and the metadata it produces while generating it. The goal: quantify how likely this output is to be hallucinated, from a single generation pass.<\/p>\n<h2>Detecting hallucinations: it\u2019s harder than it sounds<\/h2>\n<h3>The brute-force approach<\/h3>\n<p>One natural idea is to ask the model the same question several times and check whether the answers agree. If five out of six runs say \u201c$15.2 billion\u201d and one says \u201c$18.7 billion\u201d, the consensus gives you some confidence. This is the principle behind methods like SelfCheckGPT, which measure consistency across multiple sampled outputs \u2014 a \u201cMonte Carlo-style\u201d approach to hallucination detection.<\/p>\n<p>It works. But it comes with two significant drawbacks:<\/p>\n<ol>\n<li><strong>Cost.<\/strong> Each additional generation multiplies your inference budget. For SelfCheckGPT with 10 samples, you pay roughly 10x the compute, plus the cost of a semantic similarity model on top. At scale, this is prohibitive.<\/li>\n<li><strong>Granularity.<\/strong> Multi-shot methods operate at the sequence level. They tell you \u201cthis answer seems unreliable\u201d, but not which part of the answer is problematic. A response might be 90% accurate with a single hallucinated figure buried in the middle. 
You\u2019d like to know where.<\/li>\n<\/ol>\n<p>These limitations motivated us to look for a different signal \u2014 one that is cheap, single-shot, and works at the token level (the individual pieces of words the LLM manipulates internally).<\/p>\n<h3>The signal is already there<\/h3>\n<p>When an LLM generates text, it doesn\u2019t just output tokens. At each step, it computes a probability distribution over its entire vocabulary: <em>\u201cgiven the prompt and everything I\u2019ve generated so far, how likely is each possible next token?\u201d<\/em>\u00a0The winning token gets sampled. The rest are discarded. But those probabilities (and more specifically, how spread out they are) carry information about the model\u2019s internal confidence.<\/p>\n<p>If the model is very sure, most of the probability mass concentrates on a single token. If the model hesitates, the probability spreads across many candidates. This spread is exactly what entropy measures.<\/p>\n<h3>Entropy: a quick detour<\/h3>\n<p>Entropy is an information-theoretic quantity that measures the uncertainty of a probability distribution. The intuition is straightforward. Imagine three boxes. One contains a cookie. 
You have to guess which one.<\/p>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190467\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2.png\" alt=\"\" width=\"800\" height=\"404\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27800%27%20height%3D%27404%27%20viewBox%3D%270%200%20800%20404%27%3E%3Crect%20width%3D%27800%27%20height%3D%27404%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2-200x101.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2-300x152.png 300w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2-400x202.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2-600x303.png 600w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2-768x388.png 768w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2-800x404.png 800w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-2.png 982w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/p>\n<ul>\n<li><strong>Scenario A:<\/strong> You know the cookie is in box 2. 
Your uncertainty is zero. Entropy = 0.<\/li>\n<li><strong>Scenario B:<\/strong> You have no idea. Each box has a 1\/3 chance. Your uncertainty is maximal. Entropy = log\u2082(3) \u2248 1.58 bits.<\/li>\n<\/ul>\n<p>Now replace boxes with tokens and the cookie with the \u201ccorrect\u201d next word. At every generation step, an LLM faces this exact choice \u2014 except instead of 3 boxes, it picks from a vocabulary of more than 100,000 tokens. When the model is confident, one token dominates and the entropy is low. When it hesitates, entropy goes up.<\/p>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190468\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3.png\" alt=\"\" width=\"800\" height=\"434\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27800%27%20height%3D%27434%27%20viewBox%3D%270%200%20800%20434%27%3E%3Crect%20width%3D%27800%27%20height%3D%27434%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3-200x109.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3-300x163.png 300w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3-400x217.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3-600x326.png 600w, 
https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3-768x417.png 768w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3-800x434.png 800w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-3.png 1008w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 800px) 100vw, 800px\" \/><\/p>\n<p style=\"text-align: center;\"><em>The probability distribution spread in two different cases.<\/em><\/p>\n<p>The key insight is that high entropy at a given token position correlates with a higher likelihood of error at that position. The model is telling you, through its probability distribution, that it isn\u2019t sure what comes next. We just need to listen.<\/p>\n<h2>From entropy to hallucination scores<\/h2>\n<h3>EPR: Entropy Production Rate<\/h3>\n<p>Our first metric, EPR (Entropy Production Rate), is direct. For each token in the generated sequence, we compute the entropy of the model\u2019s top-K predicted token probabilities. Then we average across the sequence. This gives a single number reflecting the model\u2019s average hesitation over the full response.<\/p>\n<p>This is an unsupervised metric: no labels required. In our experiments (published at ECIR 2026), EPR alone achieves ROC-AUC scores between 74 and 81 on TriviaQA across four different LLMs. Not bad for a metric that costs essentially nothing beyond a single generation pass.<\/p>\n<p>But we can do better.<\/p>\n<h3>WEPR: Weighted Entropy Production Rate<\/h3>\n<p>Raw entropy treats all token ranks equally. The entropy contributions of the 1st-ranked token (the most probable one) and of the 10th-ranked token are weighted the same. 
In practice, the way uncertainty is distributed across ranks carries discriminative information.<\/p>\n<p>WEPR (Weighted EPR) learns a set of weights to re-balance these contributions. It uses two signals:<\/p>\n<ul>\n<li>The <strong>mean<\/strong> weighted entropy across the sequence \u2014 capturing overall hesitation.<\/li>\n<li>The <strong>maximum<\/strong> entropy contribution per rank \u2014 capturing uncertainty spikes. A single moment of high hesitation can be the hallmark of a hallucination, even if the rest of the sequence was generated confidently.<\/li>\n<\/ul>\n<p>These features are fed into a logistic regression, trained on a labeled dataset. The output of the sigmoid is a calibrated probability:<\/p>\n<blockquote><p>\u201cThis response has an 86% probability of containing a hallucination.\u201d<\/p><\/blockquote>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190469\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4.png\" alt=\"\" width=\"700\" height=\"311\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27700%27%20height%3D%27311%27%20viewBox%3D%270%200%20700%20311%27%3E%3Crect%20width%3D%27700%27%20height%3D%27311%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4-200x89.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4-300x133.png 300w, 
https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4-400x178.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4-600x267.png 600w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4-768x342.png 768w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4-800x356.png 800w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-4.png 805w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>Beyond classification, WEPR also produces token-level scores. Each token in the generated sequence gets its own hallucination probability, allowing you to pinpoint exactly which parts of a response deserve scrutiny. 
This is computed in real time, token by token, as the model generates \u2014 no need to wait for the full output.<\/p>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190470\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5.png\" alt=\"\" width=\"700\" height=\"243\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27700%27%20height%3D%27243%27%20viewBox%3D%270%200%20700%20243%27%3E%3Crect%20width%3D%27700%27%20height%3D%27243%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-200x70.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-300x104.png 300w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-400x139.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-600x209.png 600w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-768x267.png 768w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-800x278.png 800w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5-1024x356.png 1024w, 
https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-5.png 1053w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<h3>What about labels?<\/h3>\n<p>A supervised method requires annotations. Labeling thousands of QA pairs by hand is slow. So we use an <strong>LLM-as-a-judge<\/strong> approach: a separate model compares each generated answer to the known ground truth and labels it as correct or incorrect.<\/p>\n<p>Is this reliable? We validated it against human annotators. A group of 15 researchers hand-labeled over 1,300 answer pairs. Agreement between the automated judge and human evaluators reached 95.7%, with a Cohen\u2019s Kappa of 0.90. The automated labels are a reliable proxy for human judgment and are robust enough to train a hallucination detector on.<\/p>\n<h2>Introducing artefactual: now it\u2019s your turn to play.<\/h2>\n<p>We packaged all of this into an open-source Python library: <a href=\"https:\/\/github.com\/artefactory\/artefactual\/\">artefactual<\/a>.<\/p>\n<p>The library ships with pre-computed calibration weights for several model families (Mistral-Small, Falcon-3, Phi-4, Ministral-8B), so you can start scoring outputs immediately without running any training pipeline. 
It parses outputs from vLLM, the OpenAI Chat Completions API, and the OpenAI Responses API out of the box.<\/p>\n<p>Here is the simplest possible usage:<\/p>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190471\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-6.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-6.png\" alt=\"\" width=\"700\" height=\"657\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27700%27%20height%3D%27657%27%20viewBox%3D%270%200%20700%20657%27%3E%3Crect%20width%3D%27700%27%20height%3D%27657%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-6-200x188.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-6-300x281.png 300w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-6-400x375.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-6.png 535w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<p>The token-level scores are particularly useful for visualization. Each token in the response gets its own hallucination probability, which you can render as a color gradient from green (confident) to red (uncertain). At a glance, you see exactly which parts of a response deserve scrutiny.<\/p>\n<h2>In a RAG pipeline<\/h2>\n<p>Where this gets practical is in Retrieval-Augmented Generation. 
Imagine a pipeline that retrieves documents from a knowledge base and feeds them as context to an LLM. If the retrieval fails (wrong documents, missing pages, incomplete context, etc.), the model will attempt to fill the gaps from its parametric memory, and that is where hallucinations creep in.<\/p>\n<p>With artefactual, you can add a gate:<\/p>\n<p><img decoding=\"async\" class=\"lazyload aligncenter wp-image-1190472\" src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-7.png\" data-orig-src=\"https:\/\/www.staging-artefact.digitalseeder.com\/\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-7.png\" alt=\"\" width=\"700\" height=\"187\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%27700%27%20height%3D%27187%27%20viewBox%3D%270%200%20700%20187%27%3E%3Crect%20width%3D%27700%27%20height%3D%27187%27%20fill-opacity%3D%220%22%2F%3E%3C%2Fsvg%3E\" data-srcset=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-7-200x53.png 200w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-7-300x80.png 300w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-7-400x107.png 400w, https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time-Image-7.png 454w\" data-sizes=\"auto\" data-orig-sizes=\"(max-width: 700px) 100vw, 700px\" \/><\/p>\n<h2>Our scientific article in a nutshell: what we found<\/h2>\n<p>We tested EPR and WEPR across four LLMs (Mistral-Small-24B, Falcon-3-10B, Phi-4, Ministral-8B) on three tasks: hallucination detection on TriviaQA, generalization to 
WebQuestions, and missing-context detection in a financial RAG setting.<\/p>\n<p>A few highlights:<\/p>\n<ul>\n<li><strong>WEPR consistently outperforms existing methods.<\/strong> It beats both SelfCheckGPT (a multi-shot method requiring 10x the compute) and HalluDetect (a single-shot competitor) across nearly all model-dataset combinations.<\/li>\n<li><strong>You don\u2019t need many log-probabilities.<\/strong> Performance plateaus around K = 8\u201310 accessible log-probabilities per token. Even with limited API access, the signal is there.<\/li>\n<li><strong>It generalizes.<\/strong> WEPR trained on TriviaQA transfers well to WebQuestions and even to a specialized financial corpus, detecting cases where a RAG system generated answers without sufficient context.<\/li>\n<li><strong>It\u2019s fast.<\/strong> Scoring takes roughly 80 microseconds per sequence. Compare that to more than 10 seconds for SelfCheckGPT.<\/li>\n<\/ul>\n<p>In our experiments on a financial RAG task (analyzing 10-K annual reports from the ArGiMi-Ardian dataset), WEPR reached up to 93.6 ROC-AUC in detecting responses generated without the right context. This is a strong signal for triggering a second retrieval pass.<\/p>\n<h3>A note on log-probability access<\/h3>\n<p>Everything described above relies on one thing: access to token-level log-probabilities from the model. This is what lets us compute entropy and, by extension, hallucination scores.<\/p>\n<p>Today, this access is not guaranteed. Anthropic does not expose log-probabilities through its API. OpenAI provides them for non-reasoning models: you can request top_logprobs with GPT-5.4 or GPT-5.4-mini, but only if you set the reasoning effort to none. Google, on the other hand, gives access to log-probabilities through its generate_content API.<\/p>\n<p>Open-weight models served through vLLM or similar inference engines give full access.<\/p>\n<p>This matters. Log-probabilities are a lightweight, information-rich signal. 
They cost nothing extra to produce (the model computes them anyway during generation) and they enable a whole class of uncertainty quantification methods \u2014 ours included. Restricting access to them pushes users toward either blind trust in model outputs or expensive multi-shot detection methods.<\/p>\n<p>If you work with LLMs in production and care about output reliability, the availability of log-probabilities should be part of your model selection criteria. And if you are a model provider, exposing log-probabilities is one of the cheapest ways to make your models more trustworthy.<\/p>","protected":false},"excerpt":{"rendered":"<p>Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have also become famous for making up facts with disconcerting confidence.<\/p>","protected":false},"featured_media":1190338,"parent":0,"template":"","meta":{"_acf_changed":false},"blog-category":[2995,21939],"blog-language":[2991],"class_list":["post-1190337","blog","type-blog","status-publish","has-post-thumbnail","hentry","blog-category-ai-technology","blog-category-medium","blog-language-en"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.6 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Detecting Hallucinations in LLMs, One Token at a Time - Artefact<\/title>\n<meta name=\"description\" content=\"Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). 
But unlike me, they have...\" \/>\n<meta name=\"robots\" content=\"noindex, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<meta property=\"og:locale\" content=\"fr_FR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Detecting Hallucinations in LLMs, One Token at a Time - Artefact\" \/>\n<meta property=\"og:description\" content=\"Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have...\" \/>\n<meta property=\"og:url\" content=\"https:\/\/staging-artefact.digitalseeder.com\/fr\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/\" \/>\n<meta property=\"og:site_name\" content=\"Artefact\" \/>\n<meta property=\"article:modified_time\" content=\"2026-05-11T16:32:03+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png\" \/>\n\t<meta property=\"og:image:width\" content=\"900\" \/>\n\t<meta property=\"og:image:height\" content=\"555\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@Artefact\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/blog\\\/detecting-hallucinations-in-llms-one-token-at-a-time\\\/\",\"url\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/blog\\\/detecting-hallucinations-in-llms-one-token-at-a-time\\\/\",\"name\":\"Detecting Hallucinations in LLMs, One Token at a Time - 
Artefact\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/blog\\\/detecting-hallucinations-in-llms-one-token-at-a-time\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/blog\\\/detecting-hallucinations-in-llms-one-token-at-a-time\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png\",\"datePublished\":\"2026-05-11T16:17:41+00:00\",\"dateModified\":\"2026-05-11T16:32:03+00:00\",\"description\":\"Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have...\",\"inLanguage\":\"fr-FR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/blog\\\/detecting-hallucinations-in-llms-one-token-at-a-time\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"fr-FR\",\"@id\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/blog\\\/detecting-hallucinations-in-llms-one-token-at-a-time\\\/#primaryimage\",\"url\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png\",\"contentUrl\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/wp-content\\\/uploads\\\/2026\\\/05\\\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png\",\"width\":900,\"height\":555},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/#website\",\"url\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/\",\"name\":\"Artefact\",\"description\":\"\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/staging-artefact.digitalseeder.com\\\/?s={search_term_string}\"},\"query-in
put\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"fr-FR\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Detecting Hallucinations in LLMs, One Token at a Time - Artefact","description":"Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have...","robots":{"index":"noindex","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"og_locale":"fr_FR","og_type":"article","og_title":"Detecting Hallucinations in LLMs, One Token at a Time - Artefact","og_description":"Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have...","og_url":"https:\/\/staging-artefact.digitalseeder.com\/fr\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/","og_site_name":"Artefact","article_modified_time":"2026-05-11T16:32:03+00:00","og_image":[{"width":900,"height":555,"url":"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png","type":"image\/png"}],"twitter_card":"summary_large_image","twitter_site":"@Artefact","schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/staging-artefact.digitalseeder.com\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/","url":"https:\/\/staging-artefact.digitalseeder.com\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/","name":"Detecting Hallucinations in LLMs, One Token at a Time - 
Artefact","isPartOf":{"@id":"https:\/\/staging-artefact.digitalseeder.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/staging-artefact.digitalseeder.com\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/#primaryimage"},"image":{"@id":"https:\/\/staging-artefact.digitalseeder.com\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/#primaryimage"},"thumbnailUrl":"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png","datePublished":"2026-05-11T16:17:41+00:00","dateModified":"2026-05-11T16:32:03+00:00","description":"Large Language Models are astonishingly capable. They summarize, translate, reason, and code (better than me). But unlike me, they have...","inLanguage":"fr-FR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/staging-artefact.digitalseeder.com\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/"]}]},{"@type":"ImageObject","inLanguage":"fr-FR","@id":"https:\/\/staging-artefact.digitalseeder.com\/blog\/detecting-hallucinations-in-llms-one-token-at-a-time\/#primaryimage","url":"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png","contentUrl":"https:\/\/staging-artefact.digitalseeder.com\/wp-content\/uploads\/2026\/05\/Detecting-Hallucinations-in-LLMs-One-Token-at-a-Time.png","width":900,"height":555},{"@type":"WebSite","@id":"https:\/\/staging-artefact.digitalseeder.com\/#website","url":"https:\/\/staging-artefact.digitalseeder.com\/","name":"Artefact","description":"","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/staging-artefact.digitalseeder.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"fr-FR"}]}},"_links":{"self":[{"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v
2\/blog\/1190337","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v2\/types\/blog"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v2\/media\/1190338"}],"wp:attachment":[{"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v2\/media?parent=1190337"}],"wp:term":[{"taxonomy":"blog-category","embeddable":true,"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v2\/blog-category?post=1190337"},{"taxonomy":"blog-language","embeddable":true,"href":"https:\/\/staging-artefact.digitalseeder.com\/fr\/wp-json\/wp\/v2\/blog-language?post=1190337"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}