AI Model Performance
NAPLAN Writing AI Comparable to Human Markers
Validated against the official NAPLAN writing rubric, our AI achieves a Quadratic Weighted Kappa of 0.86 — comparable to the level of agreement between trained human markers.
Quadratic Weighted Kappa
Within 1 Mark Accuracy
Exact Match Accuracy
Model Version: 20 February 2026
This model applies to all NAPLAN writing assessments created from 20 February 2026 onwards.
NAPLAN Rubric Only
These metrics apply specifically to the official NAPLAN writing rubric. Performance on custom rubrics uploaded by schools has not been independently validated.
What Do These Numbers Mean?
Understanding the metrics that matter for automated essay scoring
QWK = 0.86
Quadratic Weighted Kappa measures agreement between AI and human graders while accounting for chance. It penalises larger discrepancies more heavily than small ones.
A QWK of 0.86 falls in the highest ("almost perfect") band of the Landis & Koch scale (0.81–1.00), comparable to the level of agreement typically seen between trained human markers.
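To illustrate how the headline metric works, QWK can be computed from the joint distribution of human and AI scores. The sketch below is a minimal pure-Python illustration with invented toy scores, not our production evaluation code:

```python
from collections import Counter

def quadratic_weighted_kappa(human, ai, min_score=0, max_score=6):
    """Chance-corrected agreement between two raters, penalising
    larger disagreements quadratically (Cohen's kappa, quadratic weights)."""
    n = max_score - min_score + 1
    # Observed joint counts of (human, ai) score pairs.
    obs = [[0.0] * n for _ in range(n)]
    for h, a in zip(human, ai):
        obs[h - min_score][a - min_score] += 1
    total = len(human)
    # Marginal score counts for each rater.
    h_marg = Counter(h - min_score for h in human)
    a_marg = Counter(a - min_score for a in ai)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            num += w * obs[i][j]
            # Expected counts if the two raters scored independently.
            den += w * h_marg[i] * a_marg[j] / total
    return 1.0 - num / den

# Toy example: 10 essays, the AI always within one mark of the human score.
human = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai    = [3, 4, 3, 5, 3, 3, 1, 4, 4, 2]
print(round(quadratic_weighted_kappa(human, ai), 2))  # → 0.89
```

Note how three one-mark disagreements still yield a high kappa: small misses carry a small quadratic penalty, while the chance-correction in the denominator rewards agreement well above what random scoring would produce.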
94% Within 1 Mark
In 94% of cases, our AI score is within one mark of the human expert score, meeting the industry-standard threshold for acceptable agreement in essay assessment.
Human-to-human agreement on NAPLAN writing typically falls in the same range, meaning our AI matches human marker reliability.
50% Exact Match
Our AI exactly matches the human score half the time. NAPLAN criteria use scales ranging from 0–2 to 0–6, so this represents strong performance across varying levels of granularity.
Combined with 94% within-1 accuracy, this shows the AI consistently lands in the right scoring neighbourhood.
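Unlike QWK, these two accuracy figures are simple counting metrics. A hypothetical sketch with invented toy scores shows how they are derived:

```python
def agreement_rates(human, ai):
    """Fraction of essays where the AI score matches the human score
    exactly, and where it lands within one mark."""
    pairs = list(zip(human, ai))
    exact = sum(h == a for h, a in pairs) / len(pairs)
    within_one = sum(abs(h - a) <= 1 for h, a in pairs) / len(pairs)
    return exact, within_one

# Invented toy scores for 10 essays.
human = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai    = [3, 4, 3, 5, 3, 3, 1, 4, 4, 2]
exact, within_one = agreement_rates(human, ai)
print(f"exact = {exact:.0%}, within 1 = {within_one:.0%}")  # exact = 70%, within 1 = 100%
```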
Where Our Model Sits on the QWK Scale
The standard metric for evaluating automated essay scoring systems
Industry Benchmark
A QWK of 0.61–0.80 indicates substantial agreement; 0.81–1.00 is the highest ("almost perfect") band of the scale. Our model, at 0.86, demonstrates agreement comparable to that seen between trained human NAPLAN markers.
Performance by Writing Skill
How our AI performs across 9 of the 10 NAPLAN writing criteria, for both text types. Character & Setting validation is in progress.
Paragraphing (0–2)
Ideas (0–5)
Sentence Structure (0–6)
Cohesion (0–4)
Spelling (0–6)
Vocabulary (0–5)
Punctuation (0–5)
Text Structure (0–4)
Audience (0–6)
Our Evaluation Methodology
How we validated our NAPLAN writing assessment model
NAPLAN Exemplar Essays
We evaluated against 43 annotated exemplar essays (22 narrative, 21 persuasive) with expert scores published in the official NAPLAN marking guides, across 9 of the 10 writing criteria.
Both Text Types
The model was evaluated separately on narrative and persuasive writing, each with its own skill criteria, including persuasive devices for persuasive texts.
Blind Evaluation
The AI model scored each essay without access to the human scores, using only the student writing and the official NAPLAN writing rubric. Each criterion uses its own scale (ranging from 0–2 to 0–6).
Multi-Metric Evaluation
We used QWK, exact accuracy, within-1 accuracy, MAE, RMSE, and bias analysis to provide a comprehensive view of model performance.
Narrative Writing
Persuasive Writing
Minimal Scoring Bias
Our model maintains near-zero bias across the scoring range
Average Bias Score (across both text types)
Slightly Conservative
The small negative bias means the AI occasionally scores marginally lower than human markers — a safer behaviour for formative assessment.
Practically Negligible
A bias of −0.07 across all criteria is negligible, confirming the model doesn't systematically over- or under-score students.
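Bias here is the mean signed difference between the AI score and the human score, so a negative value means the AI tends to score slightly lower. A minimal sketch with invented toy scores (not our evaluation pipeline):

```python
def mean_bias(human, ai):
    """Mean signed error (AI minus human); negative = conservative AI."""
    return sum(a - h for h, a in zip(human, ai)) / len(human)

# Toy scores: the AI under-scores two essays by one mark, over-scores one.
human = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai    = [3, 4, 3, 5, 3, 3, 1, 4, 4, 2]
print(mean_bias(human, ai))  # → -0.1
```

Because positive and negative errors cancel in this average, bias is read alongside MAE and RMSE, which capture the typical size of the errors regardless of direction.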
Frequently Asked Questions
Common questions about our AI scoring model and how we validate it.
Paid Opportunity
Get Paid to Validate Our AI
We're recruiting experienced teachers and NAPLAN markers to join our paid validation team. Score student essays using the NAPLAN rubric — your expertise directly improves the AI for thousands of Australian classrooms.
Paid per essay
Competitive rates for your professional marking time
Flexible schedule
Work on your own time — mark as many or as few essays as you like
Real impact
Every essay you mark makes AI assessment more reliable for Australian schools