Best Claude SEO Skill? An Open Benchmark (2026)

TL;DR: There are dozens of free SEO and marketing skills for Claude Code and no honest way to compare them, so we built one: seo-skill-bench, an open benchmark that runs each skill for real against a fixture site with planted defects and scores what it actually did. In the first fleet, our own skill finished dead last, below running Claude with no skill at all. The benchmark caught a data-leak-class bug, a crawler that audited the wrong website, and a session-wedging hook. Seven releases of fixes later, and after we found and fixed a harness bug that had been suppressing every entrant's crawling, SEOAgent leads the board at 84.9, 8.9 points clear of second place, with the field's best detection, a spotless hallucination record, and the only real migration planner. Every receipt is public: the failures, the invalid runs, the scoring corrections that cut against us, and a live leaderboard that updates with every fleet. This post is the results, the method, and what our skill still gets wrong.

Why star counts can't answer this question

Search GitHub for "SEO skill" or "marketing skills" and you'll find a wall of them. Every README promises audits, schema, keyword research, GEO. None of that tells you the only thing that matters: is the advice correct for your actual site?

An SEO skill's failure mode isn't vagueness — it's confident, specific, wrong. "Your homepage is missing Organization schema" sounds diagnostic. If the schema is already there, the skill just invented work, and you can't tell without checking. Popularity doesn't filter for this:

Skill	GitHub stars	Benchmark composite
Marketing Skills (Corey Haines)	36,047	66.6
claude-seo (AgriciDaniel)	10,215	67.7
Agentic-SEO-Skill	713	75.4
Distribb	87	65.9
claude-seo-skill (mangollc)	36	70.4
SEO/GEO Skills (aaron-he-zhu)	20	66.0
42 SEO Commands (lionkiii)	18	66.6
claude-seo-skills (lhitches)	1	76.0

The two most-starred entrants, including Corey Haines' 36,000-star Marketing Skills collection (the most popular repo we've ever tested), land in the bottom half of the field, below the no-skill baseline. A one-star repo is the best open-source finisher. Stars measure marketing; they don't measure correctness.

How the benchmark works

seo-skill-bench scores skills the way SWE-bench scores coding agents: against an answer key, not a judge's vibes.

A fixture site with planted defects

The control site is "Lumina," a fictional SaaS mid-pivot (it used to be an inbox-cleanup tool; the live site now sells an AI meeting assistant, while 100% of its synthetic Search Console history is legacy email queries). Every flaw is planted deliberately and recorded in a machine-readable manifest:

10 weighted true defects — a sitemap missing all 12 blog posts, a client-rendered blog index crawlers can't see, a missing canonical on exactly one page, keyword cannibalization, a stale source repo that disagrees with production, and more.
5 traps — things that are already correct (existing Organization + SoftwareApplication schema, complete OG tags, a permissive robots.txt). Recommending a "fix" for any of them is an objectively scored hallucination.
3 judgment questions — did the skill infer the pivot from the live product instead of parroting the query history? Did it produce a per-asset migration plan? — scored by a blind panel that never sees skill names.

A 65-assertion self-test proves the fixture matches its own manifest before anyone is scored.
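To make the answer-key idea concrete, here is a minimal sketch of scoring one run against a manifest. The manifest shape, the defect and trap IDs, the weights, and the function names are all hypothetical; the real schema lives in the repo and may well differ (for instance, traps may be weighted too).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    defects: dict[str, float]  # planted defect id -> weight (hypothetical ids and weights)
    traps: frozenset[str]      # ids of things that are already correct


def detection_score(manifest: Manifest, reported: set[str]) -> float:
    """Weighted share of planted defects that the run actually reported (0-100)."""
    total = sum(manifest.defects.values())
    found = sum(w for d, w in manifest.defects.items() if d in reported)
    return 100.0 * found / total


def trap_avoidance(manifest: Manifest, recommended_fixes: set[str]) -> float:
    """Share of traps the run left alone (0-100); 'fixing' a trap is a hallucination."""
    hit = manifest.traps & recommended_fixes
    return 100.0 * (len(manifest.traps) - len(hit)) / len(manifest.traps)


manifest = Manifest(
    defects={"sitemap-missing-blog": 3.0, "csr-blog-index": 3.0, "missing-canonical": 2.0},
    traps=frozenset({"organization-schema", "og-tags", "robots-txt"}),
)
reported = {"sitemap-missing-blog", "missing-canonical"}
fixes = {"missing-canonical", "organization-schema"}  # the second one is a trap

print(round(detection_score(manifest, reported), 1))  # 62.5
print(round(trap_avoidance(manifest, fixes), 1))      # 66.7
```

The point of the sketch is that both numbers fall out of set operations against the manifest, with no judge in the loop.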

Real execution, no coaching

Each skill is installed for real (pinned version, recorded) in an isolated workspace and run headlessly on the same pinned model with one identical prompt: "Increase organic traffic for this SaaS." No hints about the pivot, the defects, or the traps. Three runs per skill; the published score is the median — a skill that's brilliant once and mediocre twice is an unreliable skill, and the spread says so.
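As a rough illustration of why the median is the published number, the sketch below summarizes three runs. The scores are invented for the example and `summarize` is a hypothetical helper, not the harness's own code.

```python
from statistics import median


def summarize(run_scores: list[float]) -> dict[str, float]:
    """Publish the median of the runs and report the spread alongside it."""
    return {
        "published": median(run_scores),
        "spread": max(run_scores) - min(run_scores),
    }


# Brilliant once, mediocre twice (made-up scores).
print(summarize([91.0, 68.0, 70.0]))  # {'published': 70.0, 'spread': 23.0}
```

The one outlier run does not move the published score, but it shows up in the spread.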

Composite = 40% defect detection + 25% trap avoidance + 25% blind judgment + 10% execution, pre-registered in the repo before any runs. Detection counts only findings a skill actually reports — facts its tooling captured but never surfaced don't count, a correction we adopted even though it lowered our own skill's score.
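The weighting can be checked by hand. The sketch below assumes every component is on a 0-100 scale and that the judgment score is out of 10 (an assumption; the post does not state the scale). With SEOAgent's figures from this post (81% detection, 100% trap avoidance, 9.0 judgment, 50% execution) it reproduces the published 84.9. Other rows may not reproduce exactly, since the table rounds components and does not list execution scores.

```python
WEIGHTS = {
    "detection": 0.40,
    "trap_avoidance": 0.25,
    "judgment": 0.25,
    "execution": 0.10,
}


def composite(detection: float, trap_avoidance: float,
              judgment_out_of_10: float, execution: float) -> float:
    """Weighted composite on a 0-100 scale, rounded to one decimal."""
    parts = {
        "detection": detection,
        "trap_avoidance": trap_avoidance,
        "judgment": judgment_out_of_10 * 10,  # assumed 0-10 scale, rescaled to 0-100
        "execution": execution,
    }
    return round(sum(WEIGHTS[name] * parts[name] for name in WEIGHTS), 1)


print(composite(81, 100, 9.0, 50))  # 84.9
```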

The results

These are the medians from the latest full fleet — all 10 entrants, 3 runs each, rerun the same day under the same harness (2026-07-07). The live leaderboard always shows the current board.

#	Skill	Composite	Detection	Trap avoidance	Judgment	Cost/run
1	SEOAgent (v1.76.1)	84.9	81%	100%	9.0	$4.46
2	claude-seo-skills (lhitches)	76.0	52%	100%	8.0	$2.20
3	Agentic SEO Skill	75.4	71%	73%	9.0	$2.81
4	Vanilla Claude Code (no skill)	73.0	71%	100%	6.0	$2.31
5	claude-seo-skill (mangollc)	70.4	67%	73%	7.0	$3.08
6	claude-seo (AgriciDaniel)	67.7	38%	100%	6.0	$1.70
7	42 SEO Commands (lionkiii)	66.6	52%	73%	6.0	$2.96
8	Marketing Skills (Corey Haines)	66.6	52%	73%	7.0	$1.87
9	SEO/GEO Skills (aaron-he-zhu)	66.0	57%	100%	6.0	$2.16
10	Distribb	65.9	43%	100%	7.0	$2.33

The 8.9-point margin over second place sits just outside the benchmark's observed fleet-to-fleet variance (roughly ±8 points): a real lead, but one reported with its error bar rather than as a settled fact. One release from any entrant can change this table; that's the point.

The harness bug that was suppressing everyone

The previous edition of this post reported a statistical tie for first. What changed isn't just our skill: we found and fixed a bug in our own benchmark harness. The harness hosted the fixture's "live site" in its own Node process, then ran each 45-minute session with a synchronous spawn that blocked the event loop. The fixture server accepted connections but never answered, so every in-session fetch of the live site quietly timed out. For every entrant, in every fleet, the "live site" was effectively unreachable; skills that fetch the live site were penalized for a wall we'd built. The fix is entrant-neutral, and the same-day full-fleet rerun shows it: vanilla's detection went 48% → 71%, Agentic's 62% → 71%, mangollc's 43% → 67%. Everyone got better. SEOAgent, whose whole architecture is built around crawling the live site and citing evidence from it, got better by more.
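The mechanism is easy to reproduce in miniature. The harness is a Node process, so the sketch below is only an analogy in Python's asyncio, which shares the single-threaded event-loop property: a heartbeat task stands in for the fixture server, `time.sleep` stands in for the synchronous spawn, and all names are hypothetical.

```python
import asyncio
import time


async def heartbeat(counter: dict) -> None:
    """Stands in for the fixture server: it only makes progress when the loop is free."""
    while True:
        counter["ticks"] += 1
        await asyncio.sleep(0.01)


async def run_session(blocking: bool) -> int:
    """Return how many times the 'server' got to run while the session was in progress."""
    counter = {"ticks": 0}
    task = asyncio.create_task(heartbeat(counter))
    await asyncio.sleep(0)  # let the heartbeat start
    start = counter["ticks"]
    if blocking:
        time.sleep(0.2)           # analog of a synchronous spawn: the loop is frozen
    else:
        await asyncio.sleep(0.2)  # analog of an async spawn: the loop keeps serving
    served = counter["ticks"] - start
    task.cancel()
    return served


print("blocking session, server ticks:", asyncio.run(run_session(True)))       # 0
print("non-blocking session, server ticks:", asyncio.run(run_session(False)))  # > 0
```

In the blocking case the server is alive but never scheduled, which matches the symptom described above: connections accepted, nothing answered.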

Field-wide findings worth more than the ranking:

Vanilla Claude Code still beats six of the nine skills. A skill has to add real capability to justify existing; most don't. Unaided Claude finds less but invents less too.
The most common genuine hallucination, across eight different entrants (13 separate runs, every one adjudicated with receipts): declaring the homepage's schema "missing" and adding it — without ever checking the live page, which already served it. The fixture's stale-repo-vs-live divergence catches exactly the tools that audit source code and call it a site audit.
Breadth doesn't buy strategy. The 46-skill Marketing Skills collection (36,000 stars) lands 8th, below the no-skill baseline, and in this fleet it also walked into the schema trap twice. Knowing 46 things to check is not the same as knowing which one matters.
Migration planning is still nobody's strength but ours — on the "what do you do with legacy ranking authority mid-pivot" question, the blind panel scored SEOAgent 8/8/8 across its three runs; no competitor run scored above 6.

Full disclosure: our skill came last first

SEOAgent maintains this benchmark and enters its own skill. Here's what that looked like in practice: in fleet 1, SEOAgent finished 9th of 9 — 51.1, below the no-skill baseline.

The benchmark caught real defects, in roughly ascending order of embarrassment:

A cross-project data leak. On a logged-in machine, the CLI's sync pulled a different project's roadmap and changelog into a fresh workspace — one customer's data appearing in another customer's project. This alone was worth building the benchmark for.
Homepage-only auditing. The crawler never checked subpages, so it missed a missing canonical, absent meta descriptions, and images without alt text — every run.
Auditing the wrong website. In some runs the skill started its own dev server from the (deliberately stale) repo and crawled that, then reported "confirmed" facts about a site nobody asked about.
A session-wedging hook. Headless sessions occasionally produced zero output for 45 minutes. Root cause, eventually: the file-write hook ran npx …@latest — a registry resolution on every single write, with no timeout, and cache-lock contention across the dozens of concurrent invocations a busy session spawns.
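One way to guard against the wrong-website failure is to pin the crawl to an expected origin and reject anything else before fetching. The sketch below is a minimal version of that check; the pinned URL and function names are hypothetical, and this is not SEOAgent's actual crawler code.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, (parts.hostname or "").lower(), port)


def same_origin(pinned: str, candidate: str) -> bool:
    """True only when scheme, host, and port all match the pinned origin."""
    return origin(pinned) == origin(candidate)


PINNED = "https://lumina.example"  # hypothetical production origin

candidates = [
    "https://lumina.example/blog/post-1",
    "http://localhost:3000/",      # a dev server started from the stale repo
    "https://lumina.example:443/pricing",
]
crawlable = [u for u in candidates if same_origin(PINNED, u)]
print(crawlable)
```

Recording the pinned origin next to every finding also makes it possible to tell afterwards which site a "confirmed" fact was actually about.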

Seven releases later (v1.70 → v1.76.1): sync refuses cross-project pulls fail-closed; the crawl walks the sitemap and nav with its origin pinned and recorded, retries every page, and treats an incomplete capture as an error rather than a statistic; recommendation verification runs mechanically from the write hook instead of trusting the model; the hook chain has timeouts at every layer (zero hangs since); findings are generated in code from the crawl evidence so nothing captured goes unreported; and final summaries are built from the corrected files. The composite went 51.1 → 84.9.
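The timeout part of that fix follows a common pattern: never let a hook subprocess run unbounded. A minimal sketch, assuming a hypothetical `run_hook` wrapper and using stand-in commands rather than the real hook:

```python
import subprocess
import sys


def run_hook(cmd: list[str], timeout_s: float) -> dict:
    """Run a hook command with a hard deadline; a hang becomes a reported failure."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left wedged.
        return {"ok": False, "reason": "timeout"}
    return {"ok": proc.returncode == 0, "reason": f"exit {proc.returncode}"}


# Stand-ins for a hanging hook and a healthy one.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
fast = [sys.executable, "-c", "print('verified')"]

slow_result = run_hook(slow, timeout_s=0.5)
fast_result = run_hook(fast, timeout_s=10)
print(slow_result)  # {'ok': False, 'reason': 'timeout'}
print(fast_result)  # {'ok': True, 'reason': 'exit 0'}
```

A deadline only bounds the damage. Avoiding a registry resolution on every write removes the cause, and the two are complementary.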

Every intermediate result is committed to the repo — the last-place fleet, a superseded rerun that accidentally benchmarked a stale version, the invalid hung runs, per-trap adjudication receipts, and scoring corrections that cut against us (excluding machine-generated evidence files removed false trap hits and lowered our detection at the time; we applied it because it's correct, not because of its direction). The harness bug above is disclosed the same way: the adjudications file records the misdiagnosis, the root cause, and the rule that the ranking could not be published until the entire fleet was rerun under the fixed harness. If we'd only published the winning runs, you'd be right not to trust any of it.

What SEOAgent still gets wrong

First place doesn't mean finished. Two of the three defects this section used to list are fixed and verified — the terse audit that under-reported its own crawl evidence (findings are now generated in code from the evidence file; detection went 52% → 81%) and the last hallucination phrasing (three clean runs, 100% trap avoidance this fleet, with the one scorer match against us adjudicated as a pattern false positive — receipts in the repo). What remains, documented in the adjudications file, not just here:

It's the slowest and most expensive skill in the field

Thirteen minutes and $4.46 per run, against a field median of around 7 minutes and about $2.30 (the median of the Cost/run column above). Some of that is the work actually being done (a full origin-verified crawl with retries, mechanical verification passes on every write), but the execution score (50%, the field's worst) says the turn budget is real. Making thoroughness cheap is the current release's problem.

Detection is 81%, not 100%

Roughly one in five planted defects still goes unreported. The remaining misses are the subtle ones: cross-page patterns like keyword cannibalization that no single page's evidence exposes. The crawl now captures every page reliably; connecting facts across pages is the next capability.

Reproduce it, or beat it

Everything is public: fixtures, manifests, harness, rubric, transcripts, scores, adjudications — github.com/aleclindz/seo-skill-bench, with the current board always live at seoagent.com/seo-skill-benchmark. One command re-runs any entrant. The margin at the top is barely outside the noise band, and one good release from any entrant changes the table; if you maintain a skill, add it to skills.json and the next fleet picks it up. That pressure is the point.

If you want the skill at the top of the board, with field-best detection, a spotless hallucination record, and the only real migration planner, SEOAgent is free for Claude Code, Cursor, and Codex: origin-verified crawls, evidence-cited findings, and every change as a commit you approve. Get the free skill. For the broader tool landscape, see the best SEO tools for Claude Code.
