Benchmarks

Fixed fixtures. Full artifacts. Honest comparisons.

Phonton publishes reproducible ADE benchmark packets from pinned repositories, not slogan leaderboards. This page summarizes what is measured, what passed verification, and how Phonton compares with Claude Code, Codex CLI, Gemini CLI, and OpenCode on the same prompts wherever complete evidence exists. It also lists what is still missing (including Cursor).


Benchmark evidence packet: fixture, prompt, versions, logs, diff, verification, receipt.

A public token or cost claim requires verified_success plus token_claim_eligible and a complete artifact directory.

How to read this page

Three rules before you compare tools

1 · Same fixture

Every suite pins a repository commit, uses the same prompt.md, and records tool versions. Comparing unlike fixtures or tuned prompts is invalid.

2 · Verified means tests + logs

Verified means external checks passed (for example npm test or the syntax-preflight harness) — not “looked fine in chat.”

3 · Token claims need eligibility

Phonton sets token_claim_eligible: true only on provider-reported, verified runs. local-template runs (zero provider tokens) are reliability evidence, not efficiency wins.
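
The three rules reduce to a single gate on each run. The sketch below is illustrative only: `verified_success` and `token_claim_eligible` appear on this page, but the record shape, the `status` and `artifacts_complete` keys, and the function name are assumptions, not the Phonton export schema.

```python
def can_publish_token_claim(run: dict) -> bool:
    """Return True only when all three conditions on this page hold.

    Assumed (hypothetical) record shape:
      status               -- "verified_success" when external checks passed
      token_claim_eligible -- True only for provider-reported, verified runs
      artifacts_complete   -- True when the artifact directory has every file
    """
    return (
        run.get("status") == "verified_success"
        and run.get("token_claim_eligible") is True
        and run.get("artifacts_complete") is True
    )


# A verified provider run with full artifacts qualifies.
provider_run = {
    "status": "verified_success",
    "token_claim_eligible": True,
    "artifacts_complete": True,
}

# A local-template run reports zero provider tokens, so it is never eligible.
local_template_run = {
    "status": "verified_success",
    "token_claim_eligible": False,
    "artifacts_complete": True,
}

print(can_publish_token_claim(provider_run))        # True
print(can_publish_token_claim(local_template_run))  # False
```

A missing key counts as a failed condition, so an incomplete record can never produce a public claim by accident.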

Latest · v0.21.0

RunIndex 39 · phonton 0.21.0 · DeepSeek (provider-only)

Date 2026-06-20. Mode: PHONTON_DISABLE_LOCAL_SEEDS=1. Provider: DeepSeek (deepseek-v4-flash). Three fixture-scoped suites are claim-eligible on this batch.

Suite	Status	Provider tokens	Reported USD	Claim eligible	Notes
node-config-bugfix-v1	verified	7,620	$0.0049	yes	npm test passed; artifacts under phonton/39
node-receipt-refactor-v1	verified	7,620	$0.0049	yes	Fixed vs v0.20.1 run 38 (duplicate export failure)
chess-web-v1	verified	8,760	$0.0036	yes	Pinned Vite fixture; provider-only
memory-latency-v1	verified	—	—	no	Harness only; not a token leaderboard
syntax-preflight-v1	failed	8,760	$0.0036	no	Goal ran; external Python verify failed on host without Python PATH

Report: deepseek-2026-06-20-v0.21.0.md in phonton-cli.
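
As a worked example of reading the table, the sketch below sums provider tokens and reported USD over the claim-eligible rows only. The row values are copied from the Run 39 table; the tuple layout and the function name are illustrative, and the batch total is a derived figure rather than a number published in the report.

```python
from decimal import Decimal

# Rows copied from the Run 39 table:
# (suite, provider tokens, reported USD, claim eligible).
RUN_39 = [
    ("node-config-bugfix-v1", 7620, Decimal("0.0049"), True),
    ("node-receipt-refactor-v1", 7620, Decimal("0.0049"), True),
    ("chess-web-v1", 8760, Decimal("0.0036"), True),
    ("memory-latency-v1", None, None, False),
    ("syntax-preflight-v1", 8760, Decimal("0.0036"), False),
]


def eligible_totals(rows):
    """Sum tokens and USD over claim-eligible rows only."""
    eligible = [r for r in rows if r[3]]
    tokens = sum(r[1] for r in eligible)
    usd = sum((r[2] for r in eligible), Decimal("0"))
    return len(eligible), tokens, usd


count, tokens, usd = eligible_totals(RUN_39)
print(count, tokens, usd)  # 3 24000 0.0134
```

The failed syntax-preflight row reports tokens but is excluded, because a run that did not pass external verification cannot back a token or cost claim.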

Prior · v0.20.1

RunIndex 38 · phonton 0.20.1 · DeepSeek (provider-only)

Date 2026-06-01. Mode: PHONTON_DISABLE_LOCAL_SEEDS=1. Provider: DeepSeek via OpenAI-compatible route (deepseek-v4-flash).

Suite	Status	Provider tokens	Reported USD	Claim eligible	Notes
node-config-bugfix-v1	verified	5,538	$0.0029	yes	npm test passed; artifacts under phonton/38
chess-web-v1	verified	4,752	$0.0023	yes	Pinned Vite fixture; run 37 failed on same suite
memory-latency-v1	verified	—	—	no	Harness only (~0.28ms avg concurrent query); not a token leaderboard
node-receipt-refactor-v1	failed	—	—	no	Duplicate buildReceipt export; npm test failed
syntax-preflight-v1	failed	4,752	$0.0023	no	Goal ran; external Python py_compile failed (no Python on host PATH)

Prior write-up: deepseek-2026-06-01-v0.20.1.md in phonton-cli benchmarks/reports.

Cross-tool · Node fixtures

Same prompts · provider paths · 2026-05-21 batch

Tool-reported tokens are not normalized across vendors (cache accounting differs). Phonton was 0.16.1 in this batch; the latest Phonton numbers are Run 39 (v0.21.0) above.

Suite	Tool	Tool exit	Verify	Wall time	Reported tokens	Cost	Notes
02 bugfix	Phonton	0	pass	38.6s	2,736	$0	verified_success
02 bugfix	Claude Code	0	pass	58s	203,018	$0.39	includes cache read in reported total
02 bugfix	Codex CLI	0	pass	219s	384,846	N/A	turn.completed usage
02 bugfix	Gemini CLI	0	pass	200s	214,287	N/A	CLI stats total
02 bugfix	OpenCode	timeout	fail	20m+	—	—	no diff before timeout
03 refactor	Phonton	1	fail	50.6s	10,198	$0	provider path; no verified diff
03 refactor	Claude Code	0	pass	157s	224,947	$0.61	8/8 tests
03 refactor	Codex CLI	0	pass	229s	460,462	N/A	4/4 tests
03 refactor	Gemini CLI	0	pass	118s	536,583	N/A	5/5 tests

Source: reports/actual-2026-05-21-provider/report.md in the benchmark workspace. Phonton used PHONTON_DISABLE_LOCAL_SEEDS=1 in this batch.
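
Because token totals are not normalized, a safer reading of this table filters on the verify outcome first and leaves token counts raw. The sketch below does that using rows copied from the table above; the data layout and function name are illustrative assumptions.

```python
from collections import defaultdict

# (suite, tool, verify outcome, reported tokens) from the 2026-05-21 table.
ROWS = [
    ("02 bugfix", "Phonton", "pass", 2736),
    ("02 bugfix", "Claude Code", "pass", 203018),
    ("02 bugfix", "Codex CLI", "pass", 384846),
    ("02 bugfix", "Gemini CLI", "pass", 214287),
    ("02 bugfix", "OpenCode", "fail", None),
    ("03 refactor", "Phonton", "fail", 10198),
    ("03 refactor", "Claude Code", "pass", 224947),
    ("03 refactor", "Codex CLI", "pass", 460462),
    ("03 refactor", "Gemini CLI", "pass", 536583),
]


def verified_by_suite(rows):
    """Group tools by suite, keeping only rows whose external verify passed."""
    grouped = defaultdict(list)
    for suite, tool, verify, tokens in rows:
        if verify == "pass":
            grouped[suite].append((tool, tokens))
    return dict(grouped)


for suite, tools in verified_by_suite(ROWS).items():
    # Tokens stay raw; no cross-vendor ratio is computed on purpose.
    print(suite, [tool for tool, _ in tools])
```

On this batch, Phonton appears among the verified tools for the bugfix suite but not for the refactor suite, so its refactor token count should not be compared with anything.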

Cross-tool · verify outcomes

Exact-prompt external CLIs · 2026-05-19 (verify-only summary)

Phonton exact-prompt automation was blocked on Windows PTY; do not compare Phonton rows from this batch.

Suite	Tool	Tool exit	External verify
02 bugfix	Claude Code 2.1.143	0	5/5 pass
02 bugfix	Codex CLI 0.130.0	0	5/5 pass
02 bugfix	Gemini CLI 0.39.1	timeout*	5/5 pass
03 refactor	Claude Code 2.1.143	0	8/8 pass
03 refactor	Codex CLI 0.130.0	0	4/4 pass
03 refactor	Gemini CLI 0.39.1	timeout*	5/5 pass

* The tool process timed out, but external verify still passed; the Gemini CLI entry under Tools in scope attributes these timeouts to the Windows PTY.

Cross-tool · chess

chess-web-v1 · pinned fixture

End-to-end web app on a pinned Vite fixture. Competitor rows will be filled in as their runs are captured under the same artifact schema as Phonton.

Tool	Runs logged	Verified	Tokens	Claim eligible	Notes
Phonton	39	1	8,760	yes	Run 39 verified (DeepSeek provider-only, v0.21.0)
Phonton	38	1	4,752	yes	Run 38 verified (DeepSeek provider-only)
Phonton	37	0	16,773	no	Syntax failure on chessRules.test.ts
Codex	5	0	N/A	no	Legacy artifacts normalized; no claim-eligible usage recorded
Claude Code	0	—	—	—	Not yet captured in public artifact bundle
Cursor	0	—	—	—	Not yet captured in public artifact bundle

Phonton history

RunIndex 36 · phonton 0.19.7 · DeepSeek (provider-only)

Prior provider-only batch on v0.19.7 (run 36). Run 39 on v0.21.0 adds a verified refactor suite and refreshed chess result.

Suite	Provider tokens	Status
node-config-bugfix-v1	10,900	claim-eligible
node-receipt-refactor-v1	15,696	claim-eligible
memory-latency-v1	—	verified
syntax-preflight-v1	—	failed

Suite catalog

What each benchmark measures

Suite ID	What it measures	Verification	Token leaderboard?
`node-config-bugfix-v1`	Fix a focused Node config loader bug in a pinned repo; success requires `npm test` and a minimal diff in `src/config.js`.	npm test + git diff + Phonton verifier layers	Yes, when eligible
`node-receipt-refactor-v1`	Refactor a receipt renderer with tests-first discipline: `## Commands` section, gap sorting, verifiedBy metadata; success requires `npm test`.	npm test + syntax on touched JS	Yes, when eligible
`syntax-preflight-v1`	Repair intentionally broken Python, Rust, and TypeScript files from one goal prompt.	External harness: `python -m py_compile`, `rustc --emit=metadata`, `npx esbuild` (plus Phonton goal transcript)	Yes, when eligible
`chess-web-v1`	Build a playable chess web app on a pinned Vite + React + TypeScript fixture (not an empty folder).	npm test / project test script + syntax on rules + test files	Yes, when eligible
`memory-latency-v1`	Concurrent local memory query latency (HNSW harness inside phonton-memory).	cargo test benchmark harness	No
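
The syntax-preflight failures in Runs 38 and 39 came from the host (no Python on PATH), not from the agent's output. A host check before the external harness runs would separate the two cases. The sketch below is not the Phonton harness: the function names are hypothetical, and the only assumption carried over from the catalog is that the harness invokes `python`, `rustc`, and `npx`.

```python
import shutil

# Executables the external harness invokes, per the suite catalog.
REQUIRED_TOOLS = ("python", "rustc", "npx")


def missing_tools(required=REQUIRED_TOOLS, which=shutil.which):
    """Return the required executables that cannot be found on PATH."""
    return [name for name in required if which(name) is None]


def preflight_message(missing):
    """Build a one-line report that separates host gaps from agent failures."""
    if not missing:
        return "host ready: all verifier tools found"
    return "host not ready, missing on PATH: " + ", ".join(missing)


# Simulate a host like the one in Runs 38 and 39, where Python was absent.
fake_which = lambda name: None if name == "python" else "/usr/bin/" + name
print(preflight_message(missing_tools(which=fake_which)))
# host not ready, missing on PATH: python
```

Calling `missing_tools()` with no arguments checks the real PATH through `shutil.which`; the injectable `which` parameter exists only so the check can be exercised without changing the host.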

Tools in scope

How we measure each agent

Phonton CLI

ADE (local-first)

Headless `phonton goal` or TUI; exports `phonton-benchmark-export.json`, HandoffPacket, verifier logs.

Tokens: Provider-reported when `token_claim_eligible`; local-template runs labeled separately.

Claude Code

Terminal coding agent

Same `prompt.md` pasted or scripted; `/usage` or provider JSON for tokens.

Tokens: Often includes cache read/create in totals — not comparable 1:1 to Phonton export fields.

Codex CLI

Terminal coding agent (OpenAI)

Codex CLI session on the same workspace; turn.completed usage in transcript.

Tokens: Very high reported totals on multi-turn runs; compare verify outcome first.

Cursor

IDE-integrated agent

Not yet in the published cross-tool artifact tables; workflow comparison only below.

Tokens: Use Cursor usage UI when runs are added to the benchmark bundle.

Gemini CLI

Terminal coding agent

`/stats` after run; the May 2026 batches show verify passes despite Windows PTY timeouts.

Tokens: CLI stats total — treat as vendor-reported, not normalized.

OpenCode

Terminal coding agent

Included in the 2026-05-21 bugfix batch; timed out without a diff on that fixture.

Tokens: opencode stats when available.

Workflow · not tokens

ADE vs IDE agent vs terminal agent

Token tables above compare cost on fixed fixtures. This table compares product shape — why Phonton is categorized as an ADE.

Dimension	Typical IDE agent (e.g. Cursor)	Terminal agent (Claude Code, Codex)	Phonton ADE
Primary surface	IDE (file/selection)	Terminal session	Terminal ADE + TUI
Unit of work	Edit selection / chat task	Prompt / task	Goal + GoalContract
Plan visibility	Often implicit	Varies by product	Plan preview + DAG slices
Pre-merge verification	Optional / product-dependent	Varies	Verifier path before review
Handoff artifact	Diff in editor	Diff + chat summary	HandoffPacket receipt
Session memory	Chat + codebase index	Chat history	Typed local memory + semantic index
Benchmark mode labels	Rarely exposed	Rarely exposed	provider vs local-template + eligibility flags

Required artifact packet

Every published run should be replayable

Missing fields mean the run stays out of comparison tables.

Fixture + commit

Pinned repo state before the agent starts.

prompt.md

Exact goal text for every tool.

Tool + model versions

CLI --version and provider route.

transcript.log

Raw session or goal output.

final.diff + git-status

What actually changed.

verify.log

npm test, harness, or verifier output.

token-usage.json

Provider-reported usage + token_claim_eligible.

handoff / receipt

HandoffPacket or nearest equivalent.
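
A completeness check over a run directory could look like the sketch below. It covers only the file names this list spells out literally; the function name and the flat directory layout are assumptions, and the real evidence schema is whatever `validate-evidence.ps1` enforces.

```python
import tempfile
from pathlib import Path

# File names that the packet list spells out literally. Entries without a
# literal file name on this page (fixture commit, versions, git status,
# handoff receipt) are left out rather than guessed.
NAMED_ARTIFACTS = (
    "prompt.md",
    "transcript.log",
    "final.diff",
    "verify.log",
    "token-usage.json",
)


def missing_artifacts(run_dir, required=NAMED_ARTIFACTS):
    """Return required files that are absent or empty in run_dir."""
    root = Path(run_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


# Demo on a temporary directory that lacks verify.log and token-usage.json.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("prompt.md", "transcript.log", "final.diff"):
        (Path(tmp) / name).write_text("placeholder\n")
    demo_missing = missing_artifacts(tmp)

print(demo_missing)  # ['verify.log', 'token-usage.json']
```

An empty file counts as missing, since a zero-byte `verify.log` gives no evidence that verification ran.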

Reproduce

Run the latest Phonton batch locally

# Prerequisites: Node.js, phonton CLI 0.21.0+, provider key in ~/.phonton/config.toml

npm install -g phonton-cli
phonton doctor --provider

# Provider-only mode (no local template seeds):
export PHONTON_DISABLE_LOCAL_SEEDS=1   # Linux/macOS
# $env:PHONTON_DISABLE_LOCAL_SEEDS = "1"   # Windows PowerShell

# Clone the public benchmark fixtures (when published) or use your local copy.
# Run capture scripts with -RunIndex 39 -PhontonBinary phonton -ProviderOnly

# Validate evidence schema (PowerShell):
# .\benchmarks\validate-evidence.ps1 -SuiteId node-config-bugfix-v1

Claim boundary

Allowed vs not allowed on phonton.dev

Allowed today

Fixture-scoped Phonton results on Run 39 (bugfix, refactor, chess) with token_claim_eligible: true.
Cross-tool verify pass/fail on Node bugfix/refactor (May 2026 batches).
Memory latency harness metric (not provider tokens).
“Designed for proof-carrying / merge-gate workflows.”

Not allowed without new artifacts

“Phonton beats Claude Code / Codex / Cursor” globally.
Token % savings across vendors without normalized accounting.
Syntax-preflight wins until harness + provider verify both pass.
Chess or refactor headlines from failed or incomplete competitor runs.

Other artifacts

Planner preview (narrow scope)

The planner-preview batch measures plan latency and estimated context reduction — not end-to-end task quality or provider invoices.

Artifacts: raw JSON, Markdown report.