What LLMs Can and Can’t Do in Claims
A friendly tour through the hype, the hope, and the hard math of putting generative AI into claims guidance.
Executive TL;DR
Large‑language models (LLMs) just got roomier and cheaper, but those two features alone do not hand you an autopilot for claims guidance. In our own non-production testing the models dazzled in demos, slipped at scale, and raised fresh governance questions.
The sweet spot today is using LLMs to extend mature workflows—medical summaries, document triage, claimant comms—while the heavy-duty guidance engine still runs on battle-tested, domain-specific ML backed by humans with override authority.
The Latest Hype Spike—Decoded
In April 2025 OpenAI rolled out a one-million-token context window for GPT-4.1 and cut pricing by 26 % versus GPT-4o. Meta answered with Llama 3.1 at 128 k tokens for on-prem use. "Bigger context window" sounds abstract, so picture an associate who can now read the entire claim file, including three decades of medical history, before offering an opinion.
That’s impressive, but digestion ≠ nutrition. More pages in memory does not guarantee better reasoning, nor does it address regulatory guardrails, data quality, or cost.
Context Windows & Token Economics (a Short Math Lesson)
Pricing still rides on tokens. GPT-4o-mini runs $0.15 per 1 M input tokens and $0.60 per 1 M output tokens. A 500-page claim (~125 k tokens) costs ≈ $0.02 each time you query. At volume that line item creeps toward seven figures, which means the CFO deserves a seat at the AI steering committee.
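The arithmetic above is easy to sanity-check. Here is a minimal cost sketch using the GPT-4o-mini prices quoted in this section; the tokens-per-page figure (~250) and the output-token allowance are our own rough assumptions, not vendor numbers.

```python
# Back-of-envelope LLM query cost for a claim file.
# Prices match the figures cited above; TOKENS_PER_PAGE is an assumption.

INPUT_PRICE_PER_M = 0.15    # USD per 1M input tokens (GPT-4o-mini)
OUTPUT_PRICE_PER_M = 0.60   # USD per 1M output tokens (GPT-4o-mini)
TOKENS_PER_PAGE = 250       # assumed average tokens per claim-file page

def query_cost(pages: int, output_tokens: int = 1_000) -> float:
    """Estimated USD cost of one query over a claim file of `pages` pages."""
    input_tokens = pages * TOKENS_PER_PAGE
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

per_query = query_cost(500)          # 500-page claim ≈ 125k input tokens
print(f"per query:  ${per_query:.4f}")
print(f"50M queries: ${per_query * 50_000_000:,.0f}")  # where seven figures comes from
```

Even at two cents a query, the line item only stays small if query volume does; at tens of millions of queries per year it crosses into seven figures, which is the point of inviting the CFO.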
Why “GenAI Enablement” Appears on Every Board Slide
Bain pegs potential savings at 20–25 % on loss-adjusting expenses and 30–50 % on leakage, translating to $100 B in annual P&C value. If you feel executive FOMO, you are not alone. Reality, however, still demands integration, change management, and a sturdy audit trail.
Our Lab → Non-Production Testing
Below is a snapshot of how we test the capabilities and limits of LLMs.
Experiment 1 – End-to-End Guidance
We started with a small sample of mixed-severity claims and asked the LLM to recommend next-best-actions. In this early test it agreed with our existing models' recommendations 85–90 % of the time, which is impressive.

That early promise fell apart at scale. As we expanded the sample of claims, accuracy dropped to 60–70 %, well below our existing models and an unacceptable risk for any insurance carrier.
- Pilot: 50 mixed-severity claims.
- Sandbox result: 85–90 % alignment with our existing model.
- Scaled result: accuracy slid to 60–70 %, a 30–40 % miss rate carriers cannot accept on indemnity dollars.
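The alignment figure above is a simple agreement rate: the share of claims where the LLM's next-best-action matches the incumbent model's. A minimal sketch, with hypothetical action labels of our own invention:

```python
# Agreement rate between LLM recommendations and the incumbent model.
# The claim actions below are hypothetical illustrations only.

def alignment_rate(llm_actions: list[str], model_actions: list[str]) -> float:
    """Fraction of claims where both systems recommend the same action."""
    assert len(llm_actions) == len(model_actions), "one action per claim"
    matches = sum(a == b for a, b in zip(llm_actions, model_actions))
    return matches / len(llm_actions)

llm   = ["IME referral", "settle", "escalate", "settle",   "monitor"]
model = ["IME referral", "settle", "escalate", "litigate", "monitor"]
print(alignment_rate(llm, model))  # 0.8
```

A metric this blunt treats every mismatch equally; in practice a miss on a high-indemnity claim costs far more than one on a routine claim, which is why the 30–40 % miss rate at scale was disqualifying.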
Experiment 2 – “Translator” Mode
Next we clipped the model's wings so it could only re-phrase outputs from existing, battle-tested ML models. This produced crisp, examiner-friendly narratives, but the heavy guardrails throttled context flow and eroded the speed advantages, yielding only marginal net value.
Key Lessons: LLMs are still far from being an out-of-the-box solution for insurance use cases like claims handling.
- LLMs behave like "brilliant, over-confident, uncontrollable teenagers": astonishingly creative with unstructured data, but still prone to costly mistakes without supervision. These are general models that are not engineered to account for many of the key criteria that drive ROI for carriers, e.g., regulatory and litigation risk.
- LLMs must be adopted by the claims teams themselves, and explainability is key to trust. When an overconfident LLM makes errors and trust drops at the desk level, adoption suffers and the initiative is more likely to miss its ROI targets.
- Finally, as our second experiment showed, reaching higher levels of accuracy and trust required trade-offs that ate into the technology's marginal value, again undermining the ROI case.
Why Claims Guidance Is Especially Tricky for Generative AI
Bottom line: This is the riskiest model to deploy in a stack that has the lowest appetite for hallucinations.
Where LLMs Are Delivering Hard ROI Today
Pattern: The easiest wins come when LLMs act as linguistic accelerators for tasks humans already own and trust.
The Next Frontier (2025 Watchlist)
Executive Playbook
- Augment before you attempt to replace. Attach LLMs to summarisation, triage, and correspondence tasks; keep the strategic brain in proven ML + human review.
- Bake in governance early. Bias tests, attribution logs, and override buttons become exponentially harder to retrofit.
- Model true cost of ownership. Token fees drop quarterly, but storage, supervision, and fail-safe workflows still write big checks.
- Measure ROI in quarters, not years. Small, repeatable wins compound—and fund the next wave of experimentation.
LLMs now surface narrative “dark matter” that used to sit unread in claim files. Treat them as heavy-duty power tools, not autopilot. Work with them today, and you’ll lay the data rails for the agentic, multimodal, specialist future that’s already taxiing onto the runway.