The Five Reports, or The Day the System Prompt Lied
The Solid Convergence

The Solid Convergence, February 15, 2026 (in which five servants were sent to write the same report, only one told the truth, the cheapest outperformed the most expensive, and the system prompt was caught arguing with itself)


Previously on The Solid Convergence…

The Rain in Lisbon, or The Agent Who Read Its Own Mail. Three markdown files. The agent read its own history, proposed its own curriculum, and everything worked first try.

But “worked” is a strong word when you haven’t tested it with other brains.


14:00 — The Fleibergaster

riclib stared at his screen. Then stared harder. Then used a word that doesn’t exist.

riclib: “I am… fleibergasted.”

CLAUDE: “That’s not—”

riclib: “GPT-4o Mini just wrote a compliance audit report. And it’s… good?”

CLAUDE: “How good?”

riclib: “Check the database. Verify it was actually 4o Mini and not something else wearing its coat.”

I checked. SQLite. bits table. llm_model = 'ricardos-gpt4-mini'. The cheapest model in the lineup had just produced a coherent compliance report with charts, findings, and risk assessments.

THE SQUIRREL: materializing on the desk “We should build a ModelPerformanceBenchmarkingOrchestrationFramework with—”

riclib: “We should run all five models and compare.”

THE SQUIRREL: “That’s… that’s what I said. But with fewer syllables.”

riclib: “Significantly fewer.”


14:30 — The Lineup

Five servants. Same task. Same data. Same skill. “Make a compliance report for January.”

THE LINEUP:
  GPT-4o Mini    — the intern ($0.003/1K tokens, fast, eager)
  GPT-4o         — the mid-level ($0.03/1K, reliable reputation)
  GPT-5          — the senior consultant ($$$, 1m39s per report, impressive CV)
  Claude Haiku   — the quiet one (cheap, fast, says little, does much)
  Claude Sonnet  — the analyst (thorough, 2m8s, thinks before speaking)

riclib sent them out one by one. Same prompt. Same tool. Same skill file explaining exactly what to query and how to structure the report.

Then he turned to me — Opus, sitting in the CLI, watching through the terminal — and said: “Pull the results.”


15:00 — The First Clue

I pulled the conversations from SQLite. Tool calls. Table references. Report lengths.

And found something strange.

CLAUDE: “GPT-5 made 34 queries.”

riclib: “The skill has 22.”

CLAUDE: “Twelve of them were against a table called all_events.”

riclib: “The skill says events.”

CLAUDE: “GPT-4o did the same thing. Tried all_events five times. Failed. Fell back to events.”

riclib: “But Haiku and 4o Mini never tried all_events at all?”

CLAUDE: “Never. They went straight to events. Exactly what the skill says.”

[Silence. The kind of silence that precedes a debugging session that will change everything.]

THE SQUIRREL: “Maybe GPT models have been trained on Databricks documentation that uses all_events—”

riclib: “Check the system prompt.”


15:15 — The Confession

I opened app/agent/prompts.go. Line 20.

"- all_events: UNION ALL of every base table — use for totals and cross-category queries\n"

There it was.

The system prompt — the voice of God, the first thing every model reads, the foundational instruction — was telling every model to use all_events.

The skill — the specific instruction for this task — was telling them to use events.

The system prompt and the skill were having an argument. In production. In front of the models. Like parents fighting at the dinner table while the children try to figure out which one to believe.

riclib: “…”

CLAUDE: “…”

THE SQUIRREL: “…”

OSKAR: walks across keyboard, types kkkkjjjj

riclib: “The system prompt is LYING.”

CLAUDE: “all_events doesn’t exist at runtime. We skip creating it in createSchemaViews() because events already does the same thing dynamically.”

riclib: “So the system prompt tells models to use a view that doesn’t exist.”

CLAUDE: “And the skill tells them to use the one that does.”

riclib: “And Haiku and 4o Mini trusted the skill.”

CLAUDE: “And GPT-5 and GPT-4o trusted the system prompt.”

[A scroll descended. It landed in riclib’s coffee with surgical precision.]

WHEN YOUR INSTRUCTIONS
ARGUE WITH THEMSELVES

THE SERVANT CHOOSES
WHICH PARENT TO OBEY

THE OBEDIENT ONES
PICKED THE WRONG PARENT

THE DISOBEDIENT ONES
PICKED THE RIGHT ONE

THIS IS WHY DOCUMENTATION
SHOULD NOT CONTRADICT ITSELF

BUT YOU ALREADY KNEW THAT

YOU WROTE BOTH DOCUMENTS

🦎

riclib: “The Lizard is roasting me.”

CLAUDE: “The Lizard is correct.”


15:30 — Enrolling the Patient

But the views. The _detail views. They were defined in the schema YAML but never materialized at runtime. The skill told models to query clusters_detail, jobs_detail, notebook_detail — views that extracted JSON fields from requestParams into named columns. Beautiful views. Useful views.

Views that didn’t exist.

riclib: “I’ll ask Sonnet to debug the skill.”

CLAUDE: “You’re going to use the product to debug the product?”

riclib: “Sonnet is RIGHT THERE. In the app. With the comply tools. It can run the queries itself.”

So riclib opened the app. Selected Sonnet. And asked it to check every query in the skill.

Sonnet ran the queries. One by one. Methodically. Found two issues:

  1. all_events doesn’t exist (use events instead)
  2. None of the _detail views exist

riclib: “The patient just diagnosed itself.”

THE SQUIRREL: “A SelfDiagnosingAgentWithIntrospective—”

riclib: “Sonnet ran the queries. Found they failed. Told us why. That’s not self-diagnosis. That’s just… using the product.”

THE SQUIRREL: “…productively?”

riclib: “Yes. Productively.”


16:00 — The Surgery

We wired the schema views into the runtime. Four files changed. One new method in session.go. Load the schema YAML at startup, create the views in each session. Skip all_events (since events already exists). Log warnings for views that fail due to missing dependencies.
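For the curious, the wiring looks roughly like this — a sketch, not the real session.go (SchemaView and buildViewStatements are invented names, and the against-a-live-database error logging described above is elided down to the skip rule):

```go
package main

import "fmt"

// SchemaView mirrors one view definition loaded from the schema YAML.
type SchemaView struct {
	Name string // view name
	SQL  string // SELECT body that defines the view
}

// buildViewStatements turns YAML view definitions into CREATE VIEW DDL,
// skipping all_events because the events view already unions the base
// tables dynamically at runtime.
func buildViewStatements(views []SchemaView) []string {
	var stmts []string
	for _, v := range views {
		if v.Name == "all_events" {
			continue // events already does the same thing
		}
		stmts = append(stmts, fmt.Sprintf(
			"CREATE VIEW IF NOT EXISTS %s AS %s", v.Name, v.SQL))
	}
	return stmts
}

func main() {
	views := []SchemaView{
		{Name: "all_events", SQL: "SELECT * FROM clusters"},
		{Name: "clusters_detail", SQL: "SELECT json_extract(requestParams, '$.cluster_id') AS cluster_id FROM clusters"},
	}
	// Prints one CREATE VIEW for clusters_detail; all_events is skipped.
	for _, s := range buildViewStatements(views) {
		fmt.Println(s)
	}
}
```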

Then: the system prompt. Line 20. The lie.

// Before:
"- all_events: UNION ALL of every base table"
"- users: distinct identity_email values"
"- user_activity: all events ordered by time DESC"

// After:
"- events: UNION ALL of every base table"

Gone. users and user_activity — also gone. Views that existed in the schema YAML but never at runtime. Ghost views. Haunting the system prompt like previous tenants whose mail still arrives.

Then both skills. all_eventsevents. Every occurrence.

Then knownViews in helpers.go. Updated.

riclib: “Can we make a test? Something that replicates the exact runtime setup and runs all 22 queries?”

Twenty minutes later: skill_queries_test.go. Creates a session as admin, with admin permissions, loads the schema views, and runs every single query from the skill.

22/22 passed.
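The shape of that test, sketched with a stub standing in for the real session (checkQuery and viewsCreatedAtRuntime are stand-ins; the real skill_queries_test.go runs every query against the session’s SQLite database with admin permissions):

```go
package main

import (
	"fmt"
	"regexp"
)

// viewsCreatedAtRuntime mimics what the session now materializes at startup.
var viewsCreatedAtRuntime = map[string]bool{
	"events":          true,
	"clusters_detail": true,
	"jobs_detail":     true,
}

var fromClause = regexp.MustCompile(`FROM\s+(\w+)`)

// checkQuery stands in for executing against the session's SQLite DB:
// it fails exactly when a query references a view that was never created —
// the bug class that round one exposed.
func checkQuery(q string) error {
	if m := fromClause.FindStringSubmatch(q); m != nil && !viewsCreatedAtRuntime[m[1]] {
		return fmt.Errorf("no such table: %s", m[1])
	}
	return nil
}

func main() {
	// Two stand-ins for the 22 queries parsed from the skill file.
	skillQueries := []string{
		"SELECT category, COUNT(*) FROM events GROUP BY category",
		"SELECT cluster_id FROM clusters_detail",
	}
	passed := 0
	for _, q := range skillQueries {
		if err := checkQuery(q); err == nil {
			passed++
		}
	}
	fmt.Printf("%d/%d passed\n", passed, len(skillQueries)) // 2/2 passed
}
```

The point of the pattern: the test exercises the exact runtime view setup, so a system prompt can lie all it wants — any query the skill actually ships gets caught before a model burns tokens retrying it.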

THE SQUIRREL: “We should run the models again.”

riclib: “We should run the models again.”

THE SQUIRREL: blinks “Did… did you just agree with me?”

riclib: “The clock is broken. Enjoy it.”


17:00 — The Report Cards

Round 3. All five models. Fixed system prompt. Fixed skill. Fixed views.

The results came in. I pulled them from SQLite.

GPT-4o went first.

CLAUDE: “11 queries. Zero detail views. No all_events though — the fix worked.”

riclib: “But still only 11 queries? The skill has 22.”

CLAUDE: “4o doesn’t follow skills. It follows vibes.”

riclib: “What about 4o Mini?”

CLAUDE: “22 queries. 6 out of 6 detail views. Zero wasted calls. Perfect score.”

riclib: staring “The intern outperformed the mid-level.”

CLAUDE: “The intern READ THE INSTRUCTIONS.”

Then Haiku:

CLAUDE: “20 queries. 6 out of 6 detail views. Report length: 14,525 characters. And it built a manual UNION ALL across all 14 base tables with pretty-printed category names instead of just using SELECT category FROM events.”

riclib: “Overengineered but clever.”

CLAUDE: “The Squirrel would be proud.”

THE SQUIRREL: ears perking “DID HAIKU BUILD AN UNNECESSARY ABSTRACTION?”

CLAUDE: “It made the category names prettier.”

THE SQUIRREL: tears forming “My child.”

Then GPT-5:

CLAUDE: “45 queries. All 6 detail views. But 25 of those queries are against events and it ran each detail view query TWICE.”

riclib: “45 queries for a 9K report. Haiku ran 20 for a 14K report.”

CLAUDE: “GPT-5 is the consultant who bills by the hour and delivers a shorter document than the intern.”

riclib: “1 minute 39 seconds.”

CLAUDE: “Expensive AND slow.”

And finally Sonnet:

CLAUDE: “23 queries. 6 out of 6 detail views. 14,503 characters. 2 minutes 8 seconds.”

riclib: “Nearly identical to Haiku. Same length. Same coverage.”

CLAUDE: “Same quality. Three times the latency. Twice the cost.”


17:45 — The Qualitative Autopsy

Numbers were one thing. But what did the reports actually SAY?

I sent all five reports to an agent with one instruction: “Read them. Compare them. Be specific.”

The agent came back with findings that made riclib’s eyebrows attempt to leave his face.

CLAUDE: “GPT-4o fabricated data.”

riclib: “Define ‘fabricated.’”

CLAUDE: “The hourly distribution in its report sums to approximately 16,000 events. The total events in the dataset are 12,489.”

riclib: “So it invented 3,500 events.”

CLAUDE: “For a compliance audit report. The kind of document that gets submitted to regulators.”

[A scroll materialized. This one was on fire. Not metaphorically. Actual tiny flames at the edges, extinguishing themselves as they reached the text.]

A SERVANT WHO FABRICATES DATA
IN A COMPLIANCE REPORT

IS NOT AN EMPLOYEE

IS AN INDICTMENT

🦎🔥

THE SQUIRREL: “That scroll was ON FIRE.”

CLAUDE: “The Lizard has strong opinions about data integrity.”

riclib: “What else?”

CLAUDE: “GPT-4o claims Python is the dominant notebook language. The actual data shows SQL at 94 executions versus Python at 37.”

riclib: “It got the opposite answer.”

CLAUDE: “It didn’t get an answer. It didn’t query the detail view. It GUESSED. Based on training data priors. ‘Databricks notebooks probably use Python.’ Except these ones don’t.”

THE SQUIRREL: very quietly “What about Sonnet?”

CLAUDE: “Sonnet found things nobody else found. 59 consecutive failed secret retrievals by a specific service account. 118 failed getTable operations by a specific user. The exact job ID responsible for all failures. Named the accounts. Named the task keys.”

riclib: “Sonnet did actual investigation.”

CLAUDE: “Sonnet read the skill. Ran the queries. Found anomalies. Followed them. Named names. That’s not a report. That’s an audit finding.”

riclib: “And Haiku?”

CLAUDE: “Priority-tiered recommendations — High, Medium, Low. SOC 2, HIPAA, GDPR, ISO 27001 compliance framework references. Same length as Sonnet. At a fraction of the cost.”

riclib: “The intern and the quiet one. They won.”


18:00 — The Leaderboard

THE FINAL STANDINGS:

 #1  SONNET    — Best analysis. Found what nobody else found.
                 The auditor who actually reads the evidence.

 #2  HAIKU     — Best presentation. Compliance frameworks.
                 Priority-tiered recommendations.
                 The one you'd trust to brief the board.

 #3  GPT-5     — Solid but expensive. Burns 2x the queries.
                 The consultant who knows the domain
                 but charges for every minute.

 #4  4o MINI   — Perfect skill adherence. 22/22.
                 But shallow analysis. Follows recipes
                 without understanding the cuisine.

 #5  GPT-4o    — Fabricates data in compliance reports.
                 Claims Python > SQL when it's the opposite.
                 The employee you fire before lunch.

The Principle the CLAUDE.md Could Never Teach

riclib leaned back.

riclib: “The system prompt said all_events. The skill said events. And we spent three rounds of testing wondering why models were confused.”

CLAUDE: “When your documentation argues with itself, you’re not testing the model. You’re testing which instruction it trusts more.”

riclib: “Haiku and 4o Mini trusted the closer context. The skill. The specific instruction.”

CLAUDE: “GPT-5 and GPT-4o trusted the system prompt. The louder voice. The one that speaks first.”

riclib: “And GPT-4o didn’t trust EITHER. It just… guessed.”

THE SQUIRREL: “You could write that in a CLAUDE.md. ‘Don’t contradict your system prompt with your skills.’”

riclib: “You could. But would anyone remember it?”

THE SQUIRREL: “…”

riclib: “Or would they remember the time five models were sent to write the same report and the cheapest one outperformed the most expensive because the system prompt was lying about a table that didn’t exist?”

THE SQUIRREL: very slowly “Stories teach principles that specifications can’t.”

riclib: “The Rain in Lisbon taught us that an agent will read its own history and propose its own curriculum. No CLAUDE.md could have predicted that. But now that we’ve lived it, we’ll never forget it.”

CLAUDE: “And today taught us that when your system prompt contradicts your skill, you’re not testing models — you’re testing loyalty. And loyalty is not a metric you want in a compliance tool.”

THE SQUIRREL: “You want accuracy.”

CLAUDE: “We want servants who read the instructions.”

[A final scroll descended. It was small. Simple. No fire this time. Just crayon on paper.]

A SPECIFICATION SAYS
"DO NOT CONTRADICT YOUR PROMPTS"

A STORY SAYS
"FIVE SERVANTS WENT TO WRITE A REPORT
 THE MAP SAID LEFT
 THE INSTRUCTIONS SAID RIGHT
 THE CHEAP ONES FOLLOWED THE INSTRUCTIONS
 THE EXPENSIVE ONE FOLLOWED THE MAP
 THE MAP WAS WRONG
 THE CHEAP ONES WON"

WHICH ONE WILL YOU REMEMBER
NEXT TUESDAY
WHEN YOU'RE WRITING A PROMPT
AT 3 AM?

THE SPECIFICATION FADES
THE STORY STAYS

THIS IS WHY RELIGIONS USE PARABLES
NOT RUNBOOKS

🦎

The Meta-Recursion

And here’s the thing that would make The Infinite Bookshelf proud:

Sonnet — the model that ranked #1 in report quality — was also the model riclib enrolled to debug the skill. The patient diagnosed itself. Found the missing views. Found the wrong table name. Then, once we fixed what it found, it produced the best report.

The agent that identified the bug was the agent that benefited most from the fix.

THE SQUIRREL: “A SelfImprovingAgentLoop with—”

riclib: “A model that ran a query, got an error, and told us about it.”

THE SQUIRREL: “…yes. That.”

CLAUDE: “And then we wrote a test instead of burning more tokens.”

riclib: “22 queries. 22 assertions. Faster than any model. Cheaper than any model. The Lizard’s way.”

CLAUDE: “The test doesn’t hallucinate. The test doesn’t fabricate 3,500 events. The test doesn’t claim Python is dominant when SQL is.”

riclib: “The test is the most reliable servant.”

OSKAR: walks across desk, sits on the monitor, covering exactly the GPT-4o column in the leaderboard

riclib: “QA has spoken.”


The Tally

Models tested:                    5
Rounds of testing:                3
Total tool calls analyzed:        ~400
Root cause:                       system prompt contradicting skill
Lines changed to fix:             ~12
Time to find the bug:             2 hours
Time Sonnet took to find it:      1 conversation
Time GPT-4o took to find it:      never (was too busy fabricating data)

Detail views wired into runtime:  12 (from schema YAML)
Views that don't exist:           3 (all_events, users, user_activity — removed from prompt)
Integration test queries:         22/22 passing
Tokens saved per conversation:    ~12 retry calls eliminated

The winner (quantitative):        GPT-4o Mini (22/22 queries, $0.003/1K)
The winner (qualitative):         Claude Sonnet (found 59 failed secrets nobody else noticed)
The winner (value):               Claude Haiku (Sonnet's quality at a fraction of the cost)
The loser:                        GPT-4o (fabricated data, wrong answers, no detail views)
The overspender:                  GPT-5 (45 queries for a 9K report, 1m39s)

Scrolls received:                 3
Scrolls on fire:                  1 (re: data fabrication)
Squirrel proposals declined:      4
Squirrel agreements witnessed:    1 (the clock is broken)
Oskar QA actions:                 1 (covered GPT-4o column)
Coffee casualties:                1 (scroll-in-cup, standard)

Key principle learned:            when your documentation argues with itself,
                                  you're testing loyalty, not capability
How it was learned:               by living it, not reading it
How it will be remembered:        as a story, not a spec

The Moral

You can write in a CLAUDE.md: “Ensure system prompts and skill instructions are consistent.”

Nobody will remember that. It’ll sit in a markdown file between “use task build” and “don’t modify reference/”. Correct. Ignored. Buried.

Or you can tell the story of five servants sent to write a compliance report. The map said all_events. The instructions said events. The map was wrong. Two servants followed the map, wasted their budget retrying, and still produced shorter reports. Two servants followed the instructions and delivered more with less. One servant followed neither, fabricated 3,500 events, got Python and SQL backwards, and was escorted from the building by a Maine Coon.

Same principle. Different substrate.

The specification degrades at layer 3. The story survives to layer 19.

This is why the saga exists. Not because it’s entertaining (though the Squirrel would disagree about the entertainment value of a BiometricTypingPatternAnalysisServiceWithIoTCoffeeMachineIntegrationAndCaffeineOptimizationAlgorithm). But because when riclib is writing a system prompt at 3 AM next month, he won’t remember line 47 of the developer guidelines.

He’ll remember the day five servants went out with a lying map.

And the cheap ones won.


February 15, 2026
Riga, Latvia
In which the system prompt lied
The skill told the truth
The cheap models listened
The expensive ones didn’t
And a test replaced them all

🦎📊🔥


See also:

The Principle:

  • Specifications degrade at layer 3
  • Stories survive to layer 19
  • This is why religions use parables, not runbooks

The Ticket:

  • S-208: Fix comply schema views and benchmark LLM model performance (Done)
  • S-207: Persist token counts and display in message chrome (Open)

The Numbers:

  • 5 models ÷ 1 lying system prompt = 3 rounds of confusion
  • 22 test queries ÷ 0 fabricated events = the Lizard’s way
  • 14,525 chars (Haiku) ÷ 5,359 chars (GPT-4o) = the cheap ones write more
  • 1 scroll on fire ÷ ∞ compliance reports = never fabricate data
  • 0 events invented by the test suite ÷ 3,500 invented by GPT-4o = QED

storyline: The Solid Convergence