A Good SKILL.md Is the Cheapest Reliability Upgrade You'll Make

SkillSpector tells you whether a Claude Code skill is safe. Nothing told you whether it was any good, or how much its prose was costing you. So I built the quality layer and benchmarked it: across Haiku, Sonnet, and Opus, a strict schema and one worked example moved structured-output correctness from 0% to 100%.

You write a SKILL.md, point Claude Code at a pile of customer orders, and ask for JSON. It reads your instructions. It extracts every field correctly. Then it hands back caller_name where your database column is name, and your pipeline throws on the insert.

The model didn't fail at reasoning. It failed at the contract. And the fix was sitting in the SKILL.md the whole time.

I benchmarked this properly. A strict SKILL.md is the cheapest reliability upgrade you can make to an agent: across three Claude tiers, the right skill file took structured-output correctness from 0% to 100%.

TL;DR:

  • A high-quality SKILL.md (strict schema plus one worked example) scored 100% correctness on Haiku, Sonnet, and Opus. A vague one scored 0%.
  • A vague SKILL.md is worse than no SKILL.md. With no skill file at all, Sonnet and Opus passed 2 of 8 scenarios; the chatty skill passed 0 on every model and burned extra tokens doing it.
  • Bad skills get more expensive on smarter models: 7.6x wasted output tokens on Haiku, 12x on Opus.
  • Most "the AI got it wrong" is a contract failure (field names), not a reasoning failure. A schema fixes it.
  • A high quality score is a craft linter, not proof the skill is correct. Still test it.

The gap I was trying to fill

SkillSpector answers one question about a Claude Code skill: is it safe? It returns a risk score where 0 is harmless and 100 is dangerous. Useful, and incomplete. A skill can be perfectly safe and still be bloated, vague, and expensive to run on every single call.

Nothing scored the craft. So I built skillspector-quality, a standalone layer that imports SkillSpector read-only and adds a deterministic 0 to 100 quality score next to the security one. It rewards information density, readability, topic coverage, and structural coherence, and it stays length-neutral so more genuinely useful content never hurts you. The dimensions are grounded in published research on agent context files (Gloaguen et al., 2026), which found that redundant prose inflates token cost by 20 to 23% with no gain in task success.

A score is a claim, though. I wanted proof that the things it rewards actually change outcomes. So I ran a benchmark.

The experiment

Three versions of the same skill, eight extraction tasks, ten repeats each. That is 240 LLM calls per model at temperature 0.3, every call stamped with a unique nonce so provider-side caching couldn't skew the numbers.

The three arms:

  • Baseline: a generic "extract this as JSON" prompt, no SKILL.md at all.
  • LowQuality: a conversational SKILL.md. No schema, no example, just friendly prose.
  • HighQuality: the same skill rewritten to follow this library's recommendations. Strict schema, exact field names, one worked example.
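
To make the run count concrete, here is a minimal Python sketch of the call matrix for one model. The placeholder scenario labels, the `build_runs` helper, and the use of `uuid4` for the nonce are assumptions for illustration; the actual harness may generate them differently.

```python
import itertools
import uuid

ARMS = ["Baseline", "LowQuality", "HighQuality"]
# Placeholder labels for the eight extraction scenarios (hypothetical names).
SCENARIOS = [f"scenario_{i}" for i in range(1, 9)]
REPEATS = 10
TEMPERATURE = 0.3


def build_runs(model: str) -> list[dict]:
    """Enumerate every (arm, scenario, repeat) call for one model."""
    runs = []
    for arm, scenario, repeat in itertools.product(ARMS, SCENARIOS, range(REPEATS)):
        runs.append({
            "model": model,
            "arm": arm,
            "scenario": scenario,
            "repeat": repeat,
            "temperature": TEMPERATURE,
            # A fresh random token per call. Including it in the prompt makes
            # every request unique, so a cached response cannot be reused.
            "nonce": uuid.uuid4().hex,
        })
    return runs


runs = build_runs("claude-haiku-4-5")
print(len(runs))  # 240
```

Three arms times eight scenarios times ten repeats gives the 240 calls per model, each carrying its own nonce.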

The eight scenarios were ordinary back-office work: customer orders, meeting notes, bug triage, product reviews, incident reports, support email classification, job postings, and one deliberately ambiguous ticket. Correctness meant valid JSON with the right shape and the right field names, checked programmatically. No human grading, no partial credit.
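
A programmatic check of this kind might look like the sketch below. `ORDER_SHAPE` and `is_correct` are hypothetical names, and the real harness may inspect nested structure more thoroughly. The point is that a renamed key or any wrapper prose fails outright, with no partial credit.

```python
import json

# Hypothetical expected shape for the order scenario: key -> accepted Python type(s).
ORDER_SHAPE = {"name": str, "items": list, "total": (int, float)}


def is_correct(raw_output: str, shape: dict) -> bool:
    """True only for valid JSON whose keys and value types match the shape exactly."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # prose around the JSON, or malformed JSON
    if not isinstance(data, dict) or set(data) != set(shape):
        return False  # missing, extra, or renamed keys
    for key, expected in shape.items():
        value = data[key]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False
    return True


good = '{"name": "Jane Doe", "items": [{"sku": "SKU-12", "qty": 2}], "total": 48.0}'
renamed = '{"caller_name": "Jane Doe", "items": [], "total": 48.0}'
chatty = 'Sure! Here is the JSON you asked for: {"name": "Jane Doe"}'

print(is_correct(good, ORDER_SHAPE))     # True
print(is_correct(renamed, ORDER_SHAPE))  # False
print(is_correct(chatty, ORDER_SHAPE))   # False
```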

Finding 1: 0% to 100%, on every model

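| Model | Baseline | LowQuality | HighQuality | LowQuality token waste (output tokens, LowQuality vs HighQuality) |
| --- | --- | --- | --- | --- |
| claude-haiku-4-5 | 4% | 0% | 100% | 7.6x (359 vs 47) |
| claude-sonnet-4-6 | 25% | 0% | 100% | 10.4x (428 vs 41) |
| claude-opus-4-8 | 25% | 0% | 100% | 12.0x (778 vs 63) |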

The HighQuality skill hit 100% on all three tiers. Same prose, same schema, same example, whether the model underneath was the cheap one or the expensive one.

That is the headline for me. Reliability stopped depending on the model. A SKILL.md with a strict schema and a worked example gives you the same correct output on Haiku that you get on Opus. So you can run the cheap tier in production and stop paying for capability you don't need.

A vague SKILL.md is worse than none

Look at the LowQuality column again: 0% everywhere. Baseline, with no skill file at all, passed 2 of 8 scenarios on Sonnet and Opus and still scraped 4% on Haiku. The conversational skill passed zero, and it added hundreds of tokens to every call to get there.

That is the part worth sitting with. A friendly, helpful-sounding SKILL.md with no schema is not a mild improvement over nothing. It is a regression you pay rent on. If your skill file reads like a Slack message, you would get more reliable output by deleting it.

Finding 2: smarter models punish bad skills harder

Token waste on the LowQuality arm climbed with capability: 7.6x on Haiku, 10.4x on Sonnet, 12.0x on Opus. Opus burned 778 output tokens of reasoning prose for 0% correctness, against 63 tokens on the clean skill.

The mechanism is simple. The more capable the model, the more faithfully it follows your chain-of-thought instructions, so a rambling SKILL.md gets elaborated on instead of ignored. A sloppy skill gets more expensive as you move up the model ladder, not less. Your savings from fixing it are biggest exactly where tokens cost the most.
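
The waste multiplier is simply the LowQuality output-token count divided by the HighQuality one. The sketch below recomputes it for the Haiku and Sonnet rows and adds a rough cost estimate. Both function names and the price in the example are hypothetical placeholders, not real pricing.

```python
def waste_ratio(low_quality_tokens: float, high_quality_tokens: float) -> float:
    """How many times more output tokens the vague skill produced."""
    return low_quality_tokens / high_quality_tokens


def wasted_output_cost(
    low_quality_tokens: float,
    high_quality_tokens: float,
    calls: int,
    price_per_million_tokens: float,  # placeholder value, look up your own rate
) -> float:
    """Extra spend on output tokens that bought no correctness."""
    extra_tokens = (low_quality_tokens - high_quality_tokens) * calls
    return extra_tokens / 1_000_000 * price_per_million_tokens


print(round(waste_ratio(359, 47), 1))  # 7.6  (Haiku row)
print(round(waste_ratio(428, 41), 1))  # 10.4 (Sonnet row)

# One million calls at a made-up price of 1.0 per million output tokens.
print(wasted_output_cost(359, 47, calls=1_000_000, price_per_million_tokens=1.0))  # 312.0
```

Because the extra tokens are multiplied by the per-token price, the same sloppy skill costs more on whichever tier charges more for output.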

Finding 3: the failures are contracts, not reasoning

The Baseline runs are the tell. The model extracted the correct data every single time. It just named things its own way: caller_name instead of name, tasks instead of action_items, status: "Resolved" instead of resolved: true.

That is a contract failure, and a contract is exactly what a schema is. Haiku guessed conventional key names 4% of the time; Sonnet and Opus reached 25% because they lean toward common naming. None of that is something you want to leave to a coin flip in production. Pin the field names and the guessing stops.
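
One rough way to separate the two failure types in your own logs is sketched below. The `classify` helper and its labels are assumptions for illustration, not part of the benchmark harness: it calls a result a contract failure when the values match the expected ones but arrive under different keys.

```python
def classify(expected: dict, actual: dict) -> str:
    """Label a result as pass, contract_failure, or content_failure."""
    if actual == expected:
        return "pass"
    if set(actual) == set(expected):
        # Right field names, wrong values: the extraction itself went wrong.
        return "content_failure"
    # Compare the values alone, ignoring what the keys are called.
    same_values = sorted(map(repr, expected.values())) == sorted(map(repr, actual.values()))
    return "contract_failure" if same_values else "content_failure"


expected = {"name": "Jane Doe", "total": 48.0}

print(classify(expected, {"name": "Jane Doe", "total": 48.0}))         # pass
print(classify(expected, {"caller_name": "Jane Doe", "total": 48.0}))  # contract_failure
print(classify(expected, {"name": "John Doe", "total": 48.0}))         # content_failure
```

This heuristic is deliberately crude. A case like status: "Resolved" versus resolved: true changes the value's representation as well as the key, so it would land in content_failure here and need a scenario-specific comparison.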

What the good skill actually looked like

The whole jump came from three changes. Here is the shape of it, trimmed down.

Before, scored 0%:

---
name: order-extractor
description: Helps with customer orders
---
 
Pull the important details out of the customer's order text
and give them back as JSON. Be thorough and helpful.

After, scored 100%:

---
name: order-extractor
description: Use when extracting customer order data from raw text into JSON. Do not use for general summarization.
---
 
Extract each order into this exact JSON shape. Use these field names verbatim.
 
{
  "name": string,
  "items": [{ "sku": string, "qty": number }],
  "total": number
}
 
## Example
 
Input:  "Jane Doe ordered 2x SKU-12 and 1x SKU-99, total 48.00"
Output: {"name":"Jane Doe","items":[{"sku":"SKU-12","qty":2},{"sku":"SKU-99","qty":1}],"total":48.00}

A description that says exactly when to use it and when not to, an exact schema, one example. That is the entire diff. No clever prompt engineering, no model-specific tuning.

The honest part: a high score is not a green light

I built the scorer, and I will still tell you what it can't do. It measures form, not truth. That is the high-score illusion: a skill can earn 95 out of 100 and still command the model to do something impossible or wrong, because deterministic grading reads structure, not correctness.

Concretely, Topic Coverage checks that your description and body share vocabulary, not that either is accurate. Code Maintainability measures how clean a script looks, not whether it runs. Readability rewards prose that is easy to parse, and confident nonsense parses just fine.

So read the number as a linter for craft, the cheap first pass that catches vague descriptions and missing schemas. Then test the thing against real inputs. The score gets you to step one. It never replaces step two.
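
To see why shared vocabulary says nothing about accuracy, consider this toy overlap measure. It uses Jaccard similarity over word sets, which is an assumption chosen for illustration and not necessarily how skillspector-quality computes Topic Coverage.

```python
import re


def vocabulary(text: str) -> set[str]:
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap(description: str, body: str) -> float:
    """Jaccard similarity between the two word sets, from 0.0 to 1.0."""
    a, b = vocabulary(description), vocabulary(body)
    return len(a & b) / len(a | b) if a | b else 0.0


description = "Extract customer order totals into JSON"
sensible_body = "Extract the customer order and report totals in dollars as JSON"
nonsense_body = "Extract the customer order and report totals in furlongs as JSON"

# Both bodies share exactly the same words with the description,
# so a vocabulary-based measure cannot tell them apart.
print(overlap(description, sensible_body) == overlap(description, nonsense_body))  # True
```

Any measure of this family gives the sensible and the nonsensical body the same score, which is exactly why a test against real inputs is still required.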

Try it on your worst skill

Pick the SKILL.md you trust least. Give it a description that names exactly when to use it and when not to. Add a strict output schema with the field names your downstream code actually expects. Drop in one worked input/output example. Re-run your task.

If your results look anything like mine, that is the gap between 0% and 100%, and you bought it for the price of a schema.
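
As a quick pre-flight before you re-run anything, the three changes can be approximated with plain string checks. This is a hypothetical helper based on simple heuristics (phrases like "use when", a brace followed by a quoted key, "Input:" and "Output:" markers). It is not the skillspector-quality scorer and will miss skills that phrase things differently.

```python
import re


def skill_checklist(skill_md: str) -> dict[str, bool]:
    """Crude string checks for a trigger, an exclusion, a schema, and an example."""
    lower = skill_md.lower()
    match = re.search(r"^description:(.*)$", skill_md, flags=re.MULTILINE | re.IGNORECASE)
    description = match.group(1).lower() if match else ""
    return {
        "description_says_when": "use when" in description,
        "description_says_when_not": "do not use" in description or "don't use" in description,
        # An opening brace followed by a quoted key and a colon.
        "has_schema_block": bool(re.search(r'\{[^{}]*"\w+"\s*:', skill_md)),
        "has_worked_example": "input:" in lower and "output:" in lower,
    }


VAGUE = """---
name: order-extractor
description: Helps with customer orders
---
Pull the important details out of the customer's order text
and give them back as JSON. Be thorough and helpful.
"""

STRICT = """---
name: order-extractor
description: Use when extracting customer order data from raw text into JSON. Do not use for general summarization.
---
Extract each order into this exact JSON shape. Use these field names verbatim.

{
  "name": string,
  "items": [{ "sku": string, "qty": number }],
  "total": number
}

## Example

Input:  "Jane Doe ordered 2x SKU-12 and 1x SKU-99, total 48.00"
Output: {"name":"Jane Doe","items":[{"sku":"SKU-12","qty":2},{"sku":"SKU-99","qty":1}],"total":48.00}
"""

print(any(skill_checklist(VAGUE).values()))   # False
print(all(skill_checklist(STRICT).values()))  # True
```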

skillspector-quality is open source under MIT, and the benchmark harness ships with it, so you can run these numbers on your own scenarios. The context-engineering habits behind this, keeping skills cheap and deterministic, live in my Claude Code best practices post.

Which of your skills would survive this test? Tell me on LinkedIn.

Common questions

What is a SKILL.md file?
A SKILL.md is the instruction file Claude Code loads to teach an agent a specific capability, like extracting orders into JSON. It holds frontmatter (name, description, when to use) and a body with the rules, schema, and examples the agent should follow.

Does a better SKILL.md improve accuracy across different models?
Yes. In a benchmark of 8 extraction scenarios at N=10, a high-quality SKILL.md with a strict schema and one worked example produced 100% structured-output correctness on claude-haiku-4-5, claude-sonnet-4-6, and claude-opus-4-8. The same task with a vague SKILL.md scored 0% on all three.

Is a vague SKILL.md worse than no SKILL.md at all?
In this benchmark, yes. With no skill file, the model passed 2 of 8 scenarios by guessing conventional field names. A conversational SKILL.md with no schema passed 0 of 8 and added hundreds of tokens per call, so it cost more and delivered less.

Why does a bad SKILL.md cost more tokens on more capable models?
More capable models follow chain-of-thought prose more faithfully, so they spend more tokens elaborating on a poorly written skill. The low-quality skill wasted 7.6x the output tokens on Haiku, 10.4x on Sonnet, and 12.0x on Opus, all for 0% correctness.

Does a high skill quality score mean the skill actually works?
No. skillspector-quality scores how well-formed a skill is: structure, density, readability, topic coverage. It cannot verify that the instructions are correct. A structurally perfect skill can still be logically broken, so treat the score as a craft linter and still test against real inputs.

What makes a SKILL.md high quality?
A specific description with clear trigger and exclusion conditions, a strict output schema with exact field names, and at least one worked input/output example. Those three changes alone account for the jump from 0% to 100% correctness in the benchmark.

Lars Roettig

Senior Technical Architect writing about AI, engineering, and building things that last.