AI Model Performance
NAPLAN Writing AI Comparable to Human Markers
Validated against the official NAPLAN writing rubric, our AI achieves a Quadratic Weighted Kappa of 0.86 — comparable to the level of agreement between trained human markers.
Quadratic Weighted Kappa
Within 1 Mark Accuracy
Exact Match Accuracy
Model Version: 20 February 2026
This model applies to all NAPLAN writing assessments created from 20 February 2026 onwards.
NAPLAN Rubric Only
These metrics apply specifically to the official NAPLAN writing rubric. Performance on custom rubrics uploaded by schools has not been independently validated.
What Do These Numbers Mean?
Understanding the metrics that matter for automated essay scoring
QWK = 0.86
Quadratic Weighted Kappa measures agreement between AI and human graders while accounting for chance. It penalises larger discrepancies more heavily than small ones.
A QWK of 0.86 falls in the highest ("almost perfect") band of the Landis & Koch scale (0.81–1.00), comparable to the level of agreement typically seen between trained human markers.
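To illustrate how the headline metric works, QWK can be computed from the joint distribution of human and AI scores. The sketch below is a minimal pure-Python illustration with invented toy scores, not our production evaluation code:

```python
from collections import Counter

def quadratic_weighted_kappa(human, ai, min_score=0, max_score=6):
    """Chance-corrected agreement between two raters, penalising
    larger disagreements quadratically (Cohen's kappa, quadratic weights)."""
    n = max_score - min_score + 1
    # Observed joint counts of (human, ai) score pairs.
    obs = [[0.0] * n for _ in range(n)]
    for h, a in zip(human, ai):
        obs[h - min_score][a - min_score] += 1
    total = len(human)
    # Marginal score counts for each rater.
    h_marg = Counter(h - min_score for h in human)
    a_marg = Counter(a - min_score for a in ai)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            num += w * obs[i][j]
            # Expected counts if the two raters scored independently.
            den += w * h_marg[i] * a_marg[j] / total
    return 1.0 - num / den

# Toy example: 10 essays, the AI always within one mark of the human score.
human = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai    = [3, 4, 3, 5, 3, 3, 1, 4, 4, 2]
print(round(quadratic_weighted_kappa(human, ai), 2))  # → 0.89
```

Note how three one-mark disagreements still yield a high kappa: small misses carry a small quadratic penalty, while the chance-correction in the denominator rewards agreement well above what random scoring would produce.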
94% Within 1 Mark
In 94% of cases, our AI score is within one mark of the human expert score, meeting the industry-standard threshold for acceptable agreement in essay assessment.
Human-to-human agreement on NAPLAN writing typically falls in the same range, meaning our AI matches human marker reliability.
50% Exact Match
Our AI exactly matches the human score half the time. NAPLAN criteria use scales ranging from 0–2 to 0–6, so this represents strong performance across varying levels of granularity.
Combined with 94% within-1 accuracy, this shows the AI consistently lands in the right scoring neighbourhood.
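Unlike QWK, these two accuracy figures are simple counting metrics. A hypothetical sketch with invented toy scores shows how they are derived:

```python
def agreement_rates(human, ai):
    """Fraction of essays where the AI score matches the human score
    exactly, and where it lands within one mark."""
    pairs = list(zip(human, ai))
    exact = sum(h == a for h, a in pairs) / len(pairs)
    within_one = sum(abs(h - a) <= 1 for h, a in pairs) / len(pairs)
    return exact, within_one

# Invented toy scores for 10 essays.
human = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai    = [3, 4, 3, 5, 3, 3, 1, 4, 4, 2]
exact, within_one = agreement_rates(human, ai)
print(f"exact = {exact:.0%}, within 1 = {within_one:.0%}")  # exact = 70%, within 1 = 100%
```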
Where Our Model Sits on the QWK Scale
The standard metric for evaluating automated essay scoring systems
Industry Benchmark
A QWK of 0.61–0.80 indicates substantial agreement; 0.81–1.00 is the highest ("almost perfect") band of the scale. Our model, at 0.86, demonstrates agreement comparable to that seen between trained human NAPLAN markers.
Performance by Writing Skill
How our AI performs across 9 of the 10 NAPLAN writing criteria, for both text types. Character & Setting validation is in progress.
Paragraphing (0–2)
Ideas (0–5)
Sentence Structure (0–6)
Cohesion (0–4)
Spelling (0–6)
Vocabulary (0–5)
Punctuation (0–5)
Text Structure (0–4)
Audience (0–6)
Our Evaluation Methodology
How we validated our NAPLAN writing assessment model
NAPLAN Exemplar Essays
We evaluated against 43 annotated exemplar essays (22 narrative, 21 persuasive) with expert scores published in the official NAPLAN marking guides, across 9 of the 10 writing criteria.
Both Text Types
The model was evaluated separately on narrative and persuasive writing, each with its own skill criteria, including persuasive devices for persuasive texts.
Blind Evaluation
The AI model scored each essay without access to the human scores, using only the student writing and the official NAPLAN writing rubric. Each criterion uses its own scale (ranging from 0–2 to 0–6).
Multi-Metric Evaluation
We used QWK, exact accuracy, within-1 accuracy, MAE, RMSE, and bias analysis to provide a comprehensive view of model performance.
Narrative Writing
Persuasive Writing
Minimal Scoring Bias
Our model maintains near-zero bias across the scoring range
Average Bias Score (across both text types)
Slightly Conservative
The small negative bias means the AI occasionally scores marginally lower than human markers — a safer behaviour for formative assessment.
Practically Negligible
A bias of −0.07 across all criteria is negligible, confirming the model doesn't systematically over- or under-score students.
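Bias here is the mean signed difference between the AI score and the human score, so a negative value means the AI tends to score slightly lower. A minimal sketch with invented toy scores (not our evaluation pipeline):

```python
def mean_bias(human, ai):
    """Mean signed error (AI minus human); negative = conservative AI."""
    return sum(a - h for h, a in zip(human, ai)) / len(human)

# Toy scores: the AI under-scores two essays by one mark, over-scores one.
human = [3, 4, 2, 5, 4, 3, 1, 4, 5, 2]
ai    = [3, 4, 3, 5, 3, 3, 1, 4, 4, 2]
print(mean_bias(human, ai))  # → -0.1
```

Because positive and negative errors cancel in this average, bias is read alongside MAE and RMSE, which capture the typical size of the errors regardless of direction.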
Frequently Asked Questions
Common questions about our AI scoring model and how we validate it.
Paid Opportunity
Get Paid to Validate Our AI
We're recruiting experienced teachers and NAPLAN markers to join our paid validation team. Score student essays using the NAPLAN rubric — your expertise directly improves the AI for thousands of Australian classrooms.
Paid per essay
Competitive rates for your professional marking time
Flexible schedule
Work on your own time — mark as many or as few essays as you like
Real impact
Every essay you mark makes AI assessment more reliable for Australian schools