What LLMs Can and Can’t Do in Claims
A friendly tour through the hype, the hope, and the hard math of putting generative AI into claims guidance.
Executive TL;DR
Large‑language models (LLMs) just got roomier and cheaper, but those two features alone do not hand you an autopilot for claims guidance. In our own non-production testing the models dazzled in demos, slipped at scale, and raised fresh governance questions.
The sweet spot today is using LLMs to extend mature workflows—medical summaries, document triage, claimant comms—while the heavy-duty guidance engine still runs on battle-tested, domain-specific ML backed by humans with override authority.
The Latest Hype Spike—Decoded
In April 2025 OpenAI rolled out a one-million-token context window for GPT-4.1 and cut pricing by 26 % versus GPT-4o. Meta answered with Llama 3.1 at 128 k tokens for on-prem use. "Bigger context window" sounds abstract, so picture an associate who can now read the entire claim file, including three decades of medical history, before offering an opinion.
That’s impressive, but digestion ≠ nutrition. More pages in memory does not guarantee better reasoning, nor does it address regulatory guardrails, data quality, or cost.
Context Windows & Token Economics (a Short Math Lesson)
Pricing still rides on tokens. GPT-4o-mini runs $0.15 per 1 M input tokens and $0.60 per 1 M output tokens. A 500-page claim (~125 k tokens) costs ≈ $0.02 each time you query. At volume that line item creeps toward seven figures, which means the CFO deserves a seat at the AI steering committee.
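The arithmetic above is easy to sanity-check. Here is a minimal cost sketch using the GPT-4o-mini prices quoted in this section; the tokens-per-page figure (~250) and the output-token allowance are our own rough assumptions, not vendor numbers.

```python
# Back-of-envelope LLM query cost for a claim file.
# Prices match the figures cited above; TOKENS_PER_PAGE is an assumption.

INPUT_PRICE_PER_M = 0.15    # USD per 1M input tokens (GPT-4o-mini)
OUTPUT_PRICE_PER_M = 0.60   # USD per 1M output tokens (GPT-4o-mini)
TOKENS_PER_PAGE = 250       # assumed average tokens per claim-file page

def query_cost(pages: int, output_tokens: int = 1_000) -> float:
    """Estimated USD cost of one query over a claim file of `pages` pages."""
    input_tokens = pages * TOKENS_PER_PAGE
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

per_query = query_cost(500)          # 500-page claim ≈ 125k input tokens
print(f"per query:  ${per_query:.4f}")
print(f"50M queries: ${per_query * 50_000_000:,.0f}")  # where seven figures comes from
```

Even at two cents a query, the line item only stays small if query volume does; at tens of millions of queries per year it crosses into seven figures, which is the point of inviting the CFO.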
Why “GenAI Enablement” Appears on Every Board Slide
Bain pegs potential savings at 20–25 % on loss-adjusting expenses and 30–50 % on leakage, translating to $100 B in annual P&C value. If you feel executive FOMO, you are not alone. Reality, however, still demands integration, change management, and a sturdy audit trail.
Our Lab → Non-Production Testing
Below is a snapshot of how we test the capabilities and limits of LLMs.
Experiment 1 – End-to-End Guidance
We started with a small sample of mixed-severity claims and asked the LLM to recommend next-best-actions. In this early test it agreed with our existing models' recommendations 85–90 % of the time, which is impressive.

That early promise fell apart at scale. As we expanded the sample of claims, accuracy dropped to 60–70 %, well below our existing models and an unacceptable risk for any insurance carrier.
- Pilot: 50 mixed-severity claims.
- Sandbox result: 85–90 % alignment with our existing model.
- Scaled result: accuracy slid to 60–70 %, a 30–40 % miss rate carriers cannot accept on indemnity dollars.
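The alignment figure above is a simple agreement rate: the share of claims where the LLM's next-best-action matches the incumbent model's. A minimal sketch, with hypothetical action labels of our own invention:

```python
# Agreement rate between LLM recommendations and the incumbent model.
# The claim actions below are hypothetical illustrations only.

def alignment_rate(llm_actions: list[str], model_actions: list[str]) -> float:
    """Fraction of claims where both systems recommend the same action."""
    assert len(llm_actions) == len(model_actions), "one action per claim"
    matches = sum(a == b for a, b in zip(llm_actions, model_actions))
    return matches / len(llm_actions)

llm   = ["IME referral", "settle", "escalate", "settle",   "monitor"]
model = ["IME referral", "settle", "escalate", "litigate", "monitor"]
print(alignment_rate(llm, model))  # 0.8
```

A metric this blunt treats every mismatch equally; in practice a miss on a high-indemnity claim costs far more than one on a routine claim, which is why the 30–40 % miss rate at scale was disqualifying.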
Experiment 2 – “Translator” Mode
Next we clipped the model's wings so it could only re-phrase outputs from existing, battle-tested ML models. This produced crisp, examiner-friendly narratives, but the heavy guardrails throttled context flow and eroded the speed advantages, yielding only marginal net value.
Key Lessons: LLMs are still far from being an out-of-the-box solution for insurance use cases like claims handling.
- LLMs behave like "brilliant, over-confident, uncontrollable teenagers": astonishingly creative with unstructured data, but still prone to costly mistakes without supervision. These are general models that are not engineered to account for many of the key criteria that drive ROI for carriers, e.g., regulatory and litigation risk.
- LLMs must be adopted by the claims teams themselves, and explainability is key to trust. When an overconfident LLM makes errors and trust drops at the desk level, adoption suffers and the initiative is more likely to miss its ROI targets.
- Finally, as our second experiment showed, reaching higher levels of accuracy and trust required trade-offs that ate into the technology's marginal value, again undermining the ROI case.
Why Claims Guidance Is Especially Tricky for Generative AI
Bottom line: This is the riskiest model to deploy in a stack that has the lowest appetite for hallucinations.
Where LLMs Are Delivering Hard ROI Today
Pattern: The easiest wins come when LLMs act as linguistic accelerators for tasks humans already own and trust.
The Next Frontier (2025 Watchlist)
Executive Playbook
- Augment before you attempt to replace. Attach LLMs to summarisation, triage, and correspondence tasks; keep the strategic brain in proven ML + human review.
- Bake in governance early. Bias tests, attribution logs, and override buttons become exponentially harder to retrofit.
- Model true cost of ownership. Token fees drop quarterly, but storage, supervision, and fail-safe workflows still write big checks.
- Measure ROI in quarters, not years. Small, repeatable wins compound—and fund the next wave of experimentation.
LLMs now surface narrative “dark matter” that used to sit unread in claim files. Treat them as heavy-duty power tools, not autopilot. Work with them today, and you’ll lay the data rails for the agentic, multimodal, specialist future that’s already taxiing onto the runway.