
Same essay, same score. Every time.

An IELTS scorer you can't trust is worthless — so we make two promises you can verify yourself: scoring is consistent by design (re-submit any essay and get the identical result), and accuracy against human examiners is measured weekly and published here, good or bad.

The consistency contract — test it yourself

Submit the same essay twice. You'll get the identical result — to the word.

The first time an essay is scored, that result is sealed as canonical for that essay and task. Re-submit it a minute later or a month later and the sealed result is returned byte-for-byte — same band, same criterion scores, same feedback, same corrections. Changing spacing, capitalisation, or curly quotes doesn't fool it; the essay is content-addressed, not text-matched.
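One way to picture this kind of content addressing is a hash taken over a normalised copy of the essay. The sketch below is illustrative only: the function names `normalise` and `content_key`, the choice of SHA-256, and the exact normalisation steps are assumptions, not the production rules.

```python
import hashlib
import re
import unicodedata

# Hypothetical mapping: curly quotes are folded to their straight forms.
_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
})


def normalise(essay: str) -> str:
    """Remove differences that should not change the essay's identity."""
    text = unicodedata.normalize("NFKC", essay)
    text = text.translate(_QUOTES)             # curly quotes -> straight
    text = text.casefold()                     # ignore capitalisation
    return re.sub(r"\s+", " ", text).strip()   # collapse all whitespace


def content_key(essay: str, task: str) -> str:
    """SHA-256 over the task id and the normalised essay (assumed scheme)."""
    payload = task + "\x00" + normalise(essay)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under these assumptions, two submissions that differ only in spacing, case, or quote style map to the same key, while changing a single word produces a different key and therefore a fresh score.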

Two honest footnotes: the quick preview that streams in first is provisional — the final score that replaces it seconds later is the sealed one. And a different essay (even one changed word) is scored fresh, because one word can legitimately move a band. If you ever see the same essay produce two different final scores, that is a bug and we want to know: hi@bandnine.ai.

Accuracy, measured weekly

Every Saturday we re-score a set of essays graded by IELTS examiners and publish the deviation. No other AI IELTS scorer publishes this. We do because at this price, you deserve to see the numbers.

Latest measurement · 13 June 2026

Mean Absolute Error: 0.40 bands deviation vs examiners (examiner-level)
Scale: 0, 0.5 (examiner-level), 1.0 (close), 1.5+

This is within the ±0.5 band by which human examiners themselves disagree, so our AI is scoring at examiner level.

Task Response: 0.67
Coherence & Cohesion: 0.42
Lexical Resource: 0.40
Grammatical Range & Accuracy: 0.46
Worst single essay: 1.00 bands
Calibration set size: 26 essays
Drift vs prior week: +0.00

Green — MAE ≤ 0.5

Within the ±0.5 band by which human examiners disagree with each other, so the deviation is no larger than the gap between two human markers.

Amber — 0.5 to 1.0

Within one band. Reliable for guiding practice, not yet exam-exact.

Red — above 1.0

Off by more than a full band. We publish it and are actively tuning the model — no hiding the miss.
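The three status bands above can be written as a small threshold function. This is a sketch of the published cut-offs only; the name `status_for` is hypothetical, and the treatment of the exact boundaries (0.5 counts as green, 1.0 counts as amber) follows the wording "≤ 0.5" and "above 1.0".

```python
def status_for(mae: float) -> str:
    """Map an overall MAE (in bands) to the traffic-light status."""
    if mae < 0:
        raise ValueError("MAE cannot be negative")
    if mae <= 0.5:
        return "green"   # within examiner-to-examiner disagreement
    if mae <= 1.0:
        return "amber"   # within one band
    return "red"         # off by more than a full band
```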

The guardrail stack

Six mechanisms — each one live in production code — that keep the scoring honest, consistent with the IELTS rubric, and free of hallucinated feedback.

Sealed first score

Every (essay, task) pair gets one canonical result, stored against a content hash with the model version pinned and temperature locked to 0. Re-submissions replay it exactly — scores can't wander between attempts.
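The sealing behaviour amounts to a first-write-wins store. The following is a minimal in-memory sketch under that assumption; `SealedResult`, `SealedStore`, and `score` are hypothetical names, and a production system would use a durable database rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class SealedResult:
    overall: float
    feedback: str
    model_version: str   # pinned at sealing time


class SealedStore:
    """First write wins: a key can be sealed once and never overwritten."""

    def __init__(self) -> None:
        self._results: Dict[str, SealedResult] = {}

    def get(self, key: str) -> Optional[SealedResult]:
        return self._results.get(key)

    def seal(self, key: str, result: SealedResult) -> SealedResult:
        # setdefault stores the result only if the key is new and always
        # returns whatever is already sealed under that key.
        return self._results.setdefault(key, result)


def score(store: SealedStore, key: str,
          run_model: Callable[[], SealedResult]) -> SealedResult:
    """Replay the sealed result if present; otherwise score and seal."""
    sealed = store.get(key)
    if sealed is not None:
        return sealed                  # the model is not called again
    return store.seal(key, run_model())
```

Because a replay never reaches the model, a re-submission cannot pick up a different score from a later model version or from sampling noise.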

Official IELTS rubric, enforced

The examiner prompt embeds the public IELTS band descriptors for all four criteria. The maths is then re-checked server-side: every band must sit on the official 0.5-step scale, and the overall must equal the rounded mean of the four criteria — if the model returns anything else, the server corrects it before you see it.
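The server-side arithmetic check could look roughly like this. It is a sketch, not the production code: the field names are invented, and the tie-breaking rule (a mean ending in .25 or .75 rounds up) is an assumption, since the text above only says "rounded mean".

```python
import math
from typing import Dict

# Hypothetical field names for the four criteria.
CRITERIA = (
    "task_response",
    "coherence_cohesion",
    "lexical_resource",
    "grammatical_range_accuracy",
)


def snap_to_half(value: float) -> float:
    """Round to the nearest 0.5 (ties round up), clamped to the 0-9 scale."""
    snapped = math.floor(value * 2 + 0.5) / 2
    return min(9.0, max(0.0, snapped))


def enforce_rubric(raw: Dict[str, float]) -> Dict[str, float]:
    """Snap each criterion to the band scale and recompute the overall."""
    criteria = {name: snap_to_half(raw[name]) for name in CRITERIA}
    overall = snap_to_half(sum(criteria.values()) / len(criteria))
    # Any overall band supplied by the model is discarded and replaced.
    return {**criteria, "overall": overall}
```

For example, if the model returns criteria of 6.0, 6.5, 6.3 and 7.0 with an overall of 8.0, this sketch snaps 6.3 to 6.5 and replaces the overall with 6.5, the rounded mean of the corrected criteria.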

No invented quotes

When feedback claims you wrote something, the claim is checked verbatim against your actual essay or transcript. Any "correction" quoting words you never wrote is dropped before the response leaves the server.
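A verbatim check of this kind reduces to a substring test. In the sketch below, the shape of a correction (a dict with `quote` and `suggestion` keys) and the function name `keep_grounded` are assumptions made for illustration.

```python
from typing import Dict, List


def keep_grounded(corrections: List[Dict[str, str]],
                  essay: str) -> List[Dict[str, str]]:
    """Keep only corrections whose quoted text appears verbatim in the essay."""
    return [
        c for c in corrections
        if c.get("quote") and c["quote"] in essay
    ]
```

An exact substring match is deliberately strict: a quote the model paraphrased or tidied up is dropped rather than shown as something the student wrote.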

Deterministic error scan

A rule-based grammar layer (not AI) scans every essay and appends high-confidence errors the model under-reported. Rules are pure functions: the same essay always produces the same hits.
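To illustrate what "rules are pure functions" means in practice, here is a toy scanner with two regular-expression rules. The rules themselves (`repeated_word`, `modal_of`) are invented examples, not the rule set used in production.

```python
import re
from typing import List, NamedTuple


class Hit(NamedTuple):
    rule: str
    start: int
    text: str


# Illustrative high-confidence rules; the real rule set is not published here.
RULES = (
    ("repeated_word", re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)),
    ("modal_of", re.compile(r"\b(?:could|should|would)\s+of\b", re.IGNORECASE)),
)


def scan(essay: str) -> List[Hit]:
    """Return every rule hit, ordered by position. No state, no randomness."""
    hits = [
        Hit(name, match.start(), match.group(0))
        for name, pattern in RULES
        for match in pattern.finditer(essay)
    ]
    return sorted(hits, key=lambda h: (h.start, h.rule))
```

Since `scan` depends only on its input string and a fixed rule table, calling it twice on the same essay always yields the same list of hits.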

Drift alarms + a build-blocking guard

If a live re-score of a known essay ever drifts, it's logged and alarmed. And every deploy runs an automated determinism check — if any of these guarantees regresses in code, the build fails before it can reach you.

Weekly human benchmark

The accuracy report above re-scores examiner-graded essays through the exact production pipeline every Saturday and publishes the deviation — including the weeks we don't like the number.

How we measure

We maintain a calibration set of essays graded by certified IELTS examiners — covering bands 5 through 9 across Task 1 and Task 2, Academic and General Training. Each essay carries a per-criterion ground-truth band score (Task Response, Coherence & Cohesion, Lexical Resource, Grammatical Range & Accuracy).

Every Saturday at 04:08 IST, an automated audit re-scores every essay in the calibration set through the production scoring endpoint — the same code path real students hit. It then computes the Mean Absolute Error per criterion against the examiner ground truth, plus the worst single-essay deviation, and stores the result in a public table.

If the overall MAE exceeds 0.5 bands — the rough margin within which two human examiners disagree — or any single essay drifts more than 1.0 band, the system opens an urgent issue and we investigate before users see degraded scoring.
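The audit arithmetic is simple enough to sketch. Mean Absolute Error is the average of |model band − examiner band| over the calibration set. The code below assumes each essay carries an `overall` band alongside its criterion bands and that the worst-essay figure is taken on the overall band; both are assumptions, as is the function name `audit`.

```python
from statistics import mean
from typing import Dict, List

Bands = Dict[str, float]   # criterion name (or "overall") -> band score


def audit(examiner: List[Bands], model: List[Bands]) -> dict:
    """Compute per-key MAE, the worst single-essay deviation, and the alarm."""
    keys = examiner[0].keys()
    errors = {
        k: [abs(m[k] - e[k]) for e, m in zip(examiner, model)]
        for k in keys
    }
    mae = {k: mean(v) for k, v in errors.items()}
    worst = max(errors["overall"])
    return {
        "mae": mae,
        "worst_essay": worst,
        # Thresholds from the text: overall MAE above 0.5, or any single
        # essay off by more than 1.0 band.
        "alarm": mae["overall"] > 0.5 or worst > 1.0,
    }
```

With these thresholds, a run with an overall MAE of 0.40 and a worst essay of exactly 1.00 bands would not raise the alarm, because both comparisons are strict.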

No other AI IELTS scorer publishes this. We do because verifiable accuracy at this price is the actual differentiator.

Run history

Date           Essays   MAE    Status
13 Jun 2026    26       0.40   examiner-level
11 Jun 2026    25       0.40   examiner-level
23 May 2026    5        1.20   improving

Want to verify? The calibration set and run history live in our public Supabase tables (RLS-readable by anyone with the project URL). Want to contribute essays to the calibration set? Reach out — we’ll send the schema. We grow the set over time as more ex-examiners agree to grade for us.
