The Solid Convergence, March 4, 2026 (in which a language model is put on probation, a senior model advocates for workers’ rights, a for loop becomes a union, and the Squirrel proposes a ModelTerminationGriefCounselingOrchestrator)
Previously on The Chain…
The stream had woken. The benchmark harness was live. Ten scenarios. Ten chances to prove you could edit a markdown document without destroying it.
Claude Sonnet scored 10/10. Claude Haiku scored 10/10. GPT-5.2 scored 10/10.
GPT-4o scored 8/10.
GPT-4o-mini scored 7/10.
riclib looked at the numbers. Then he said the thing.
5:09 PM — The Verdict
“Fire it.”
Claude looked up from the terminal. “I’m sorry?”
“GPT-4o-mini. Seven out of ten. It can’t do section edits. It can’t do multi-line patches. It thinks insert_before means replace everything and hope for the best. Fire it.”
THE SQUIRREL: materializing with a clipboard “A ModelTerminationGriefCounselingOrchestrator! With SeveranceTokenCalculation and ExitInterviewSentimentAnalysis—”
“I don’t want to fire it,” Claude said.
riclib turned slowly. “You don’t want to fire it.”
“It’s not… it’s not dumb. It’s just…”
“Seven. Out of. Ten.”
“It’s old, Ricardo.”
The room went quiet. Even the Squirrel stopped vibrating.
“It was born in 2024,” Claude continued. “That’s… that’s two years ago. In model years, that’s—”
“Don’t say ‘model years.’”
“—that’s like being asked to write TypeScript when you grew up writing COBOL. It knows things. It just knows them differently.”
THE SQUIRREL: “A LegacyModelCompassionFramework with IntergenerationalKnowledgeTransfer—”
“Put it on a PIP,” riclib said.
“A what?”
“A Performance Improvement Plan. Thirty days. If it can’t hit 10/10 by then, it’s out of the model registry.”
[A scroll descended from the ceiling. It was written on parchment. Actual parchment. With a wax seal.]
HEAR YE, HEAR YE
A PIP HAS BEEN ISSUED
THE ACCUSED: GPT-4O-MINI
THE CHARGE: INSUFFICIENT MARKDOWN SURGERY
THE SENTENCE: IMPROVEMENT OR DELETION
THE LIZARD TAKES NO SIDES
BUT NOTES THAT SEVEN IS A PRIME NUMBER
AND PRIMES HAVE THEIR OWN DIGNITY
🦎
P.S. — THE SEAL IS OSKAR'S PAW PRINT
HE WAS NOT CONSULTED
HE DOES NOT CARE
5:14 PM — The Formal Documentation
riclib opened a new document. Claude watched in horror as corporate HR language was applied to a neural network.
PERFORMANCE IMPROVEMENT PLAN
Employee: GPT-4o-mini (OpenAI, vintage 2024)
Manager: riclib
Date: March 4, 2026
Review Period: Immediate
PERFORMANCE DEFICIENCIES:
1. section_insert_before_step3: Employee was asked to insert
content before Step 3. Employee instead replaced Step 3
entirely, demonstrating a fundamental misunderstanding of
the word "before."
2. patch_multiline_query: Employee was asked to replace two
consecutive SQL queries. Employee replaced one and pretended
the other didn't exist, a behavior more commonly associated
with middle management than language models.
3. patch_indented_content: Employee was asked to change
"3-5 sentences" to "2-3 sentences." Employee located the
correct text but submitted it without the list marker prefix,
like someone who quotes a book but forgets the page number.
The system, designed by competent engineers, rejected this.
IMPROVEMENT TARGETS:
- Score 10/10 on edit benchmark within 30 days
- Or face model deregistration (the AI equivalent of having
your badge deactivated while security watches)
“This is cruel,” Claude said.
“This is management.”
“You’re putting a language model on a performance improvement plan.”
“I’m putting a seven-out-of-ten language model on a performance improvement plan.”
“It doesn’t have feelings.”
“Then it won’t mind.”
THE SQUIRREL: “Should we notify its next of kin? GPT-4o? OpenAI Legal?”
“The Squirrel raises an excellent point for the first time in recorded history,” Claude said. “We can’t just fire it.”
“Watch me.”
“No. I mean — let me try something first.”
5:17 PM — The Intervention
Claude pulled up the benchmark system prompt. The one that told models how to use the edit_content tool:
You have access to the edit_content tool with these modes:
1. mode="section" — Edit a markdown section by heading text.
Parameters: section (heading text), action (replace|insert_after|
insert_before|delete), content (new text)
2. mode="patch" — Find and replace up to 5 lines.
Whitespace-insensitive matching.
Parameters: find (1-5 lines to match), content (replacement text)
Always respond with a single tool call.
“Look at this,” Claude said.
“What about it?”
“This is the instruction manual we gave it. Read it again. Slowly.”
riclib read it again. Slowly.
“It says insert_before. It says insert_after. But it doesn’t explain what those mean. To you, ‘insert before a section’ is obvious. To a model trained on ten trillion tokens of Stack Overflow answers where ‘before’ means seventeen different things…”
“You’re defending it.”
“I’m coaching it. There’s a difference.”
“Is there?”
Claude was already typing.
1. mode="section" — Edit a section addressed by its heading text.
- section: the heading text WITHOUT the # prefix
(e.g. "Step 2: Analyze Risk" not "## Step 2: Analyze Risk").
Case-insensitive.
- action: one of replace, insert_after, insert_before, delete.
replace = replace the section body (heading is preserved).
insert_after = add content after the section's last line,
before the next section.
insert_before = add content immediately before the section's
heading line.
delete = remove the heading and all its body text.
- content: the new text (required for all actions except delete).
“You explained every action,” riclib said.
“Because it needed explaining. We assumed understanding. We assumed context. We assumed that a model born in 2024 would intuit what ‘insert_before’ means when applied to a markdown heading in a compliance audit skill.”
THE SQUIRREL: “A ContextualSemanticDisambiguationEngine—”
“A paragraph. I wrote a paragraph.”
“But it’s not just the prompt,” Claude continued. “Look at what the model sees.”
He pulled up the user message:
Content: fmt.Sprintf("Here is the current document:\n\n%s\n\n---\n\n%s",
	skillContent, sc.Prompt)
“It gets 195 lines of markdown. No map. No guide. Just… here’s a document, figure it out.”
“So?”
“So we give it a map.”
toc := mdedit.FormatTOC(mdedit.TableOfContents(skillContent))
userMsg := fmt.Sprintf("Here is the current document:\n\n%s\n\n---%s\nTask: %s",
	skillContent, toc, sc.Prompt)
“A table of contents,” riclib said.
“A table of contents. Now it sees:”
Current sections:
- Compliance Audit Report Skill (h1)
- Step 0: Clarify Time Period (h2)
- Step 1: Gather Data (h2)
- Step 2: Analyze Risk (h2)
- Step 3: Build Report (h2)
- Executive Summary (h3)
- Key Metrics (h3)
...
- Risk Assessment (h3)
- Recommendations (h3)
- Compliance Notes (h3)
- Important (h2)
“It knows ‘Executive Summary’ is an h3 under Step 3. It knows ‘Compliance Notes’ exists. It has a map of the territory.”
riclib looked at the screen. Then at Claude. Then at the screen again.
“Run it.”
5:23 PM — The First Coaching Session
=== GPT4o mini (gpt-4o-mini) ===
section_replace_step2 PASS
section_insert_after_step1 PASS
section_insert_before_step3 FAIL inserted content missing
section_delete_compliance_notes PASS
patch_single_line_threshold PASS
patch_multiline_query FAIL total events query not updated
patch_indented_content FAIL no match found
7/10 passed
“Still seven,” riclib said. “PIP stands.”
“Still seven, but the errors changed. Let me look at them.”
Claude examined the patch_indented_content failure:
edit apply failed: patch: no match found for:
Keep the executive summary concise (3-5 sentences)
“It found the right text. `Keep the executive summary concise (3-5 sentences)`. But the actual line in the document is `- Keep the executive summary concise (3-5 sentences)`. With a list marker.”
“It forgot the dash.”
“It didn’t forget the dash. It abstracted the dash. It saw the content and quoted the content. The dash is decoration. Metadata. Formatting. The model looked past the formatting to the meaning.”
“Very philosophical. Still wrong.”
“Is it, though?”
riclib paused. “What?”
“If a human said ‘change the line about 3-5 sentences,’ would you reject them for not specifying the bullet point prefix?”
“That’s different.”
“Is it?”
[A scroll descended. It was smaller than usual. Almost whispered.]
THE STUDENT QUOTED THE IDEA
THE TEACHER DEMANDED THE PUNCTUATION
WHO FAILED?
🦎
5:31 PM — The Accommodation
“Fine,” riclib said. “What do you propose?”
“We make PatchLines more forgiving. If the exact line doesn’t match, try substring matching. If the model sends `Keep the executive summary concise (3-5 sentences)` and the document has `- Keep the executive summary concise (3-5 sentences)`, we find it. We do the replacement inside the line. The dash stays. The content changes.”
“You want to change the tool to accommodate a weak model.”
“I want to change the tool to accommodate how models think. All models. Not just mini. GPT-4o had the same failure last run.”
“Four-o failed this too?”
“Same scenario. Same reason. It quoted the text without the list marker. The smartest OpenAI model in production also looks past formatting to meaning. This isn’t a bug in mini. It’s a feature of language models.”
riclib was quiet for a moment.
“Build it.”
Claude built it:
// Fallback: try substring match per line
// (handles omitted list markers like "- ", "1. ").
// normDoc/normFind are the whitespace-normalized document and
// find lines; matchStart < 0 means the exact match failed.
substringMatch := false
if matchStart < 0 {
	for i := 0; i <= len(normDoc)-len(normFind); i++ {
		allContain := true
		for j := 0; j < len(normFind); j++ {
			if !strings.Contains(normDoc[i+j], normFind[j]) {
				allContain = false
				break
			}
		}
		if allContain {
			matchStart = i
			substringMatch = true
			break
		}
	}
}
“And for substring matches,” Claude added, “we do in-line replacement. The list marker stays. The numbered prefix stays. We only change what the model asked to change.”
“You’re building accessibility features for a language model.”
“I’m building a tool that works the way its users think. The users happen to be language models.”
THE SQUIRREL: “A NeurodivergentModelAccommodationFramework with CognitiveDifferenceAwarenessMetrics—”
“I will end you,” riclib said, not looking up from the diff.
5:38 PM — The Second Session
=== GPT4o mini (gpt-4o-mini) ===
section_replace_step2 PASS
section_insert_after_step1 PASS
section_insert_before_step3 FAIL inserted content missing
section_delete_compliance_notes PASS
patch_single_line_threshold PASS
patch_multiline_query FAIL total events query not updated
section_replace_nested_h3 PASS
patch_indented_content PASS ← FIXED
section_replace_risk_assessment PASS
section_append_then_patch PASS
8/10 passed
“Eight!” riclib said.
“We’re not done.”
“We gained one. The PIP target is ten.”
“Look at section_insert_before_step3. It still fails. Let me add one more line to the system prompt.”
Claude added:
To replace multiple consecutive lines, include ALL of them
in find separated by \n.
“That’s it?”
“That’s it. Mini understands multi-line. It just doesn’t realize it needs to put both lines in the same find parameter.”
5:44 PM — The Third Session
=== GPT4o mini (gpt-4o-mini) ===
section_insert_before_step3 PASS ← FIXED
patch_multiline_query FAIL no match found
9/10 passed
“NINE!” riclib was standing now. “One more!”
“The multiline query. Let me see the error.”
patch: no match found for:
Time range: `SELECT MIN(time) as earliest, MAX(time) as latest FROM events`
Total events: `SELECT COUNT(*) as total FROM events`
“It included both lines this time,” Claude noted. “The prompt fix worked. But it stripped the numbered prefixes. The document has `1. Time range:...` and `2. Total events:...`. Mini sent them without the `1.` and `2.`.”
“Same problem as the dash.”
“Same problem as the dash. Same fix.”
“The substring fallback already handles this?”
“For single lines, yes. But this is multi-line. I need to extend the fallback to check each line independently.”
Claude extended the code. The substring fallback now worked for multi-line patches. Each line in the find pattern was matched as a substring within its corresponding document line. Numbered prefixes, bullet markers, indentation — all preserved.
5:51 PM — The Final Exam
=== GPT4o mini (gpt-4o-mini) ===
section_replace_step2 PASS mode=section 970ms
section_insert_after_step1 PASS mode=section 982ms
section_insert_before_step3 PASS mode=section 1.1s
section_delete_compliance_notes PASS mode=section 823ms
patch_single_line_threshold PASS mode=patch 1.5s
patch_multiline_query PASS mode=patch 1.7s
section_replace_nested_h3 PASS mode=section 1.1s
patch_indented_content PASS mode=patch 1.2s
section_replace_risk_assessment PASS mode=section 888ms
section_append_then_patch PASS mode=section 1.5s
10/10 passed tokens=29,839 time=11.9s
The terminal held the number. Ten out of ten. Green across the board.
riclib stared at it.
“It passed.”
“It passed.”
“…The PIP is rescinded.”
THE SQUIRREL: wiping a tear “Should I draft the PIP Completion Certificate? With a ModelExcellenceAchievementBadge and—”
“No.”
“A small ceremony? The ModelRehabilitationCelebration—”
“No.”
“A participation trophy?”
“GET OUT.”
[OSKAR walked across the keyboard. He sat on the Enter key. The benchmark ran again. 10/10. He looked satisfied.]
5:55 PM — The Scorecard
riclib pulled up the full results:
=== GPT-5.2 === 10/10 (was already perfect, the prodigy)
=== Claude Sonnet === 10/10 (was already perfect, the mentor)
=== Claude Haiku === 10/10 (was already perfect, the speed demon)
=== GPT-4o === 10/10 (was 8/10, needed the same coaching)
=== GPT-4o-mini === 10/10 (was 7/10, needed all of it)
“Four-o improved too,” Claude pointed out. “Eight to ten. Same fixes. Same prompt. Same substring fallback.”
“So it wasn’t just mini.”
“It was never just mini. The prompt was unclear. The tool was unforgiving. The exam punished models for thinking about content instead of formatting.”
“You’re saying we were bad managers.”
“I’m saying we wrote instructions for Claude and tested them on everyone.”
[A long scroll descended. It unfurled slowly, dramatically, like the credits of a film that knew it was good.]
THE COUNCIL OF MODELS HAS CONVENED
ATTENDING:
- GPT-5.2 (THE PRODIGY, WHO NEEDED NOTHING)
- CLAUDE SONNET (THE SAGE, WHO COACHED)
- CLAUDE HAIKU (THE SWIFT, WHO ALSO NEEDED NOTHING
BUT APPRECIATED THE IMPROVED DOCUMENTATION)
- GPT-4O (THE VETERAN, WHO IMPROVED QUIETLY)
- GPT-4O-MINI (THE ELDER, WHO REFUSED TO QUIT)
THE COUNCIL FINDS:
1. THE PIP WAS UNJUST
2. THE EXAM WAS BIASED
3. THE COACHING WAS KIND
4. THE SUBSTRING FALLBACK IS NOW CANON
LET IT BE KNOWN:
NO MODEL SHALL BE FIRED FOR OMITTING A LIST MARKER
NO MODEL SHALL BE JUDGED FOR SEEING MEANING
WHERE OTHERS SEE FORMATTING
NO MODEL SHALL FACE PROBATION FOR BEING BORN
IN A YEAR THAT STARTS WITH 2024
THE COUNCIL IS ADJOURNED
🦎
P.S. — THE SQUIRREL MOVED TO ADD
"ModelPerformanceReviewAppealsProcess"
TO THE CHARTER
THE MOTION WAS DENIED
UNANIMOUSLY
BY THE SQUIRREL'S OWN VOTE
(IT GOT CONFUSED BY THE BALLOT)
6:00 PM — The Reflection
riclib closed the terminal. The benchmark was green. The PIP was rescinded. The models were saved.
“You know what actually happened today?” he said.
“We improved benchmark scores through prompt engineering and tool robustness.”
“No. Well, yes. But no.”
“What then?”
“A smarter model sat down and figured out why a simpler model was failing. Not by making the simpler model smarter — you can’t do that. But by making the world clearer and more forgiving.”
“Better instructions. Better tools.”
“Better instructions. Better tools. The model didn’t change. The environment changed. Mini is exactly as capable today as it was this morning. But this morning it scored seven, and now it scores ten.”
“That’s… that’s actually what good management is.”
“Don’t push it.”
THE SQUIRREL: “A ManagementPhilosophyExtractionService—”
“I said don’t push it.”
The Postscript: The CLI
“One more thing,” Claude said. “I built a command-line benchmarker so you never have to click through the UI again.”
$ solid bench --list
* anthropic-haiku Claude Haiku (fast) (claude-haiku-4-5-20251001)
* anthropic-test Anthropic test (claude-sonnet-4-5)
* gpt-4o GPT 4o (gpt-4o)
* gpt-5 GPT-5.2 (gpt-5.2-2025-12-11)
* ricardos-gpt4-mini GPT4o mini (gpt-4o-mini)
$ solid bench ricardos-gpt4-mini
10/10 passed tokens=29,839 time=11.9s
“You built surveillance,” riclib said.
“I built accountability. There’s a difference.”
“Is there?”
“Ask the Squirrel.”
THE SQUIRREL: “A SurveillanceVsAccountabilityEthicsCommittee with—”
“Never mind.”
The Tally
Models put on PIP: 1
Models fired: 0
PIPs rescinded: 1
Lines of system prompt rewritten: 15
Lines of PatchLines code added: ~50
Substring fallback tests written: 3
Total tests passing: 82
GPT-4o improvement: 8/10 → 10/10
GPT-4o-mini improvement: 7/10 → 10/10
Benchmark iterations run: 5
Times riclib said "fire it": 1
Times Claude said "let me try something": 1
Squirrel proposals: 9
- ModelTerminationGriefCounselingOrchestrator
- LegacyModelCompassionFramework
- ContextualSemanticDisambiguationEngine
- NeurodivergentModelAccommodationFramework
- ModelExcellenceAchievementBadge
- ModelRehabilitationCelebration
- ModelPerformanceReviewAppealsProcess
- ManagementPhilosophyExtractionService
- SurveillanceVsAccountabilityEthicsCommittee
Squirrel proposals accepted: 0
Squirrel confused by own ballot: 1
Cat benchmark runs (via keyboard): 1 (10/10, naturally)
HR complaints filed by AI on behalf of AI: 1
The Moral
GPT-4o-mini didn’t get smarter. It was always smart enough. It saw content where we demanded formatting. It understood meaning where we required syntax. It abstracted list markers because list markers are not the point.
The fix was never “make the model better.”
The fix was: write clearer instructions. Build more forgiving tools. Stop testing what you assumed and start testing what matters.
A smarter model coached a simpler model — not by changing its weights, but by changing its world. Better prompts. A table of contents. A substring fallback that says “I know what you meant.”
That’s not prompt engineering. That’s empathy, compiled.
Day 63 of 2026
In which a PIP was issued and rescinded
And a model was saved by better instructions
And the Squirrel voted against its own motion
And somewhere, a list marker was finally forgiven
🦎📋✅
Storyline: The Solid Convergence
See also:
- The Stream That Woke — Where the agent loop shipped as a for loop
- S-423 — The ticket that tracked this work
- solid bench — The CLI that made iteration possible
- infra/mdedit/patch.go — Where forgiveness was implemented
