Measurement

Claims should be reproducible or absent.

Phonton benchmarks should read like audit logs: fixed tasks, pinned commits, provider disclosure, verification outcomes, and correction burden.

Benchmark plan

Measure the full loop, not only token volume.

Task class

Rust repo change

Small real changes with tests, cargo checks, and reviewable diffs.

Primary result

Verified completion

A task counts only when the verification gate passes or a failure is correctly escalated.
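The counting rule can be sketched in a few lines. This is a minimal illustration, not Phonton's actual scoring code: the `Outcome` labels, the `TaskRun` record, and `verified_completion_rate` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    # Hypothetical outcome labels; the source only names the first two cases.
    GATE_PASSED = "gate_passed"          # verification gate passed
    ESCALATED = "escalated"              # failure was correctly escalated
    FALSE_READY = "false_ready"          # marked ready, but the gate failed
    SILENT_FAILURE = "silent_failure"    # failed without escalating


@dataclass(frozen=True)
class TaskRun:
    task_id: str
    outcome: Outcome


# Only these two outcomes count toward the primary result.
COUNTED = {Outcome.GATE_PASSED, Outcome.ESCALATED}


def verified_completion_rate(runs: list[TaskRun]) -> float:
    """Fraction of runs that count as verified completions."""
    if not runs:
        raise ValueError("no runs to score")
    counted = sum(1 for run in runs if run.outcome in COUNTED)
    return counted / len(runs)
```

Under this sketch, a run that claims success while the gate fails lowers the rate exactly as much as a run that fails silently, so misses cannot be hidden behind a "ready" label.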

Cost signal

Provider spend

Track model tier, retries, tokens, and configured provider pricing separately.
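One way to keep these signals separate is to store raw token counts per model tier and derive cost from configured pricing only at report time. The sketch below assumes per-million-token pricing and counts every attempt after the first as a retry. The `Attempt` and `Pricing` records, the field names, and any tier labels are hypothetical, not Phonton's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    model_tier: str      # hypothetical tier label, e.g. "small" or "large"
    input_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Pricing:
    # Configured provider pricing in USD per million tokens (assumed unit).
    input_usd_per_mtok: float
    output_usd_per_mtok: float


def spend_report(attempts: list[Attempt], pricing: dict[str, Pricing]) -> dict:
    """Aggregate tokens and cost per tier; report retries as a separate number."""
    by_tier: dict[str, dict] = {}
    for attempt in attempts:
        price = pricing[attempt.model_tier]  # KeyError if a tier has no configured price
        row = by_tier.setdefault(
            attempt.model_tier,
            {"attempts": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0},
        )
        row["attempts"] += 1
        row["input_tokens"] += attempt.input_tokens
        row["output_tokens"] += attempt.output_tokens
        row["cost_usd"] += (
            attempt.input_tokens / 1_000_000 * price.input_usd_per_mtok
            + attempt.output_tokens / 1_000_000 * price.output_usd_per_mtok
        )
    # Every attempt after the first is treated as a retry (an assumption of this sketch).
    return {"retries": max(len(attempts) - 1, 0), "by_tier": by_tier}
```

Because token counts are stored alongside the derived cost, a reader can recompute spend if provider pricing changes after the run.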

Quality signal

Review burden

Record how much human correction remains after Phonton marks work ready.

Disclosure

No inflated launch claims.

Commit: pinned. Every run links to the exact repo state.
Provider: declared. Model and routing settings are part of the result.
Checks: reported. Syntax, workspace, and test outcomes are shown.

Run format

A benchmark result should be easy to audit.

01

Pin the task

Pin the repo commit, task prompt, provider config, and expected verification command.

02

Run the loop

Capture plan, retries, checks, tokens, cost, and final review payload.

03

Publish the result

Show verified completion, failure mode, or human correction burden without hiding misses.
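The three steps above could be enforced by a small record builder that refuses to publish an incomplete run. This is a sketch under stated assumptions: the field names are hypothetical, the output format (JSON) is a choice made for this example, and the commit check assumes 40-character SHA-1 hashes.

```python
import json
import re

# Hypothetical field names mirroring steps 01 and 02.
PINNED_FIELDS = ("repo_commit", "task_prompt", "provider_config", "verify_command")
CAPTURED_FIELDS = ("plan", "retries", "checks", "tokens", "cost_usd", "review_payload")

# Assumes SHA-1 object names; branch names and short hashes are rejected.
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def build_result(pinned: dict, captured: dict) -> str:
    """Return a publishable JSON record, or raise if anything is missing."""
    missing = [f for f in PINNED_FIELDS if f not in pinned]
    missing += [f for f in CAPTURED_FIELDS if f not in captured]
    if missing:
        raise ValueError(f"incomplete run record, missing: {', '.join(missing)}")
    if not FULL_SHA.match(pinned["repo_commit"]):
        raise ValueError("repo_commit must be a full 40-character commit hash")
    record = {"pinned": pinned, "captured": captured}
    # Sorted keys keep published records stable and easy to diff.
    return json.dumps(record, indent=2, sort_keys=True)
```

Rejecting branch names and short hashes keeps a published result tied to one exact repo state, so a later reader can check out the same commit and rerun the same verification command.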