
Benchmarks

Confidence comes from artifacts.

Phonton is designed for context efficiency and proof-carrying development. Public claims that compare it with Cursor, Claude Code, Codex, HermesAgent, BridgeSpace, or other ADEs require reproducible benchmark packets.

Figure: the Phonton benchmark evidence packet, comprising fixtures, prompts, tool versions, raw logs, final diff, verification, review artifacts, token records, and claim rules.
Benchmark claims require reproducible packets, not screenshots or marketing claims.

Comparison protocol

An ADE benchmark needs the whole run.

Before

Pin repo, commit, prompt, tool versions, model route, and allowed capabilities.

During

Capture raw logs, provider usage when available, tool calls, retries, and verifier output.

After

Publish final diff, review artifact, quality notes, rollback path, and cost summary.
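The "Before" step can be sketched as a small manifest that is frozen before the run and fingerprinted, so a later replay can show it started from the same inputs. This is a minimal illustration, not Phonton's actual format: the `RunManifest` class, its field names, and the use of SHA-256 over canonical JSON are all assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of everything pinned before a benchmark run."""

    repo_url: str
    commit: str
    prompt: str
    tool_versions: dict          # e.g. {"phonton": "<version>", "comparator": "<version>"}
    model_route: str
    allowed_capabilities: tuple  # e.g. ("read_files", "run_tests")

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that equal
        # manifests always produce the same digest.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Recording the fingerprint alongside the raw logs and the final diff gives a reviewer one value to compare when replaying the run. Any change to the commit, prompt, versions, route, or capabilities produces a different digest.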

Required packet

Every comparison should be replayable.

Fixture repo

Pinned repository and commit before the run starts.

Prompt

Exact goal text, including file and MCP mentions.

Tool versions

Phonton version, model/provider route, and comparator versions.

Raw logs

Provider usage, command output, tool calls, retries, and failures.

Final diff

The produced patch and changed-file summary.

Verification

Syntax, build, test, runtime, or failure diagnostics.

Review artifact

HandoffPacket or nearest equivalent completion summary.

Quality review

Human or automated quality notes with reproducible criteria.
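One way to make the packet requirement checkable is a small completeness test run before anything is published. The sketch below assumes a packet is a directory with one entry per component listed above. The file and directory names in `REQUIRED_COMPONENTS` are hypothetical placeholders, not a layout that Phonton defines.

```python
from pathlib import Path

# Hypothetical mapping from packet component to its path inside the packet
# directory. The component names follow the list above; the file names are
# assumptions made for this example.
REQUIRED_COMPONENTS = {
    "fixture_repo": "fixture.json",
    "prompt": "prompt.txt",
    "tool_versions": "tool_versions.json",
    "raw_logs": "raw_logs",
    "final_diff": "final.diff",
    "verification": "verification.json",
    "review_artifact": "review_artifact.json",
    "quality_review": "quality_review.md",
}


def missing_components(packet_dir):
    """Return the names of components that are absent or empty."""
    root = Path(packet_dir)
    missing = []
    for name, relative_path in REQUIRED_COMPONENTS.items():
        path = root / relative_path
        if not path.exists():
            missing.append(name)
        elif path.is_file() and path.stat().st_size == 0:
            # An empty file carries no evidence.
            missing.append(name)
        elif path.is_dir() and not any(path.iterdir()):
            # An empty directory carries no evidence either.
            missing.append(name)
    return missing
```

A packet would count as replayable only when `missing_components` returns an empty list. The check covers presence, not content, so it complements the quality review rather than replacing it.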

Current public artifact

The existing benchmark is intentionally narrow.

The planner-preview report is useful release evidence, but it is not a provider invoice, not a cached-token measurement, not an end-to-end quality score, and not a competitor comparison.

Planner preview batch

Measures plan preview time and estimated context reduction only.

Provider tokens

Not measured by the current public artifact.

Competitor comparison

Not claimed without fixed fixtures and raw evidence from every tool.

Claim rule

Say what is proven, and keep it separate from what is only designed.

Allowed:
Phonton is designed for context efficiency and visible proof.

Not allowed without artifacts:
Phonton uses 90% fewer tokens than another ADE.
Phonton beats Cursor, Claude Code, Codex, HermesAgent, or BridgeSpace.
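The claim rule can be expressed as a simple gate. The sketch below is illustrative only: the `Claim` type, the two claim kinds, and `is_publishable` are hypothetical names introduced here. It encodes two ideas from this page: design statements are always allowed, and comparative statements need a packet from every tool they mention.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """Hypothetical description of a public statement under review."""

    text: str
    kind: str                  # "design" or "comparative" (assumed categories)
    tools: tuple = ()          # every tool named in a comparative claim
    packets: dict = field(default_factory=dict)  # tool name -> packet location


def is_publishable(claim):
    """Apply the claim rule: design claims pass, comparisons need evidence."""
    if claim.kind == "design":
        return True
    if claim.kind == "comparative":
        # Require a non-empty packet reference for every tool in the claim.
        return bool(claim.tools) and all(
            claim.packets.get(tool) for tool in claim.tools
        )
    # Unknown kinds are rejected rather than guessed at.
    return False
```

Under this sketch, "Phonton is designed for context efficiency and visible proof" passes as a design claim. A statement that Phonton uses fewer tokens than another ADE is rejected until both tools have a packet attached.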