Calibration done well

Quality · ~7 minute read

Without calibration, QA is theatre

Calibration is what turns a QA programme from a collection of individual opinions into a reliable operational signal. Without it, every evaluator is scoring to a private standard, the same contact gets different scores from different evaluators, and the agents quickly notice. With it, scores mean the same thing across evaluators, teams, and time — and only then can the score be taken seriously as an input to coaching, performance management, and planning. This article walks through what calibration actually means, what good calibration practice looks like, the warning signs of a programme that’s lost it, and how to rebuild calibration discipline when it’s drifted.

What calibration actually is

Calibration is the practice of evaluators scoring the same set of contacts independently, then comparing scores, surfacing disagreements, and agreeing the “true” score — not the average, but the consensus that the team takes forward. Done regularly, it does three things: it keeps individual evaluator drift in check, it reveals form items that aren’t scorable consistently, and it builds shared understanding of the standard.

The four parts of a real calibration programme

1. Frequency. Fortnightly is the sweet spot for most operations. Weekly is heavy; monthly is too slow to catch drift before it lands in agent conversations. Quarterly is barely calibration at all.

2. Sample selection. The sample for calibration is not random. It should include contacts that look easy (so evaluators agree quickly and build confidence), contacts that look contested (which surface disagreement and matter most), and contacts the form is known to score inconsistently on. The QA lead curates the sample deliberately.

3. Structure. Each evaluator scores independently before discussion. The independent scores are compared. Disagreements are surfaced and discussed. The team agrees a consensus. The consensus and the reasoning are documented. The form may be updated if a particular item proves impossible to score consistently; a minimal sketch of the comparison step follows this list.

4. Follow-up. What changed about scoring practice as a result of the session? If the honest answer is “nothing,” the session wasn’t calibration — it was a meeting. The discipline of documenting what changed and applying it to subsequent scoring is what separates calibration from a discussion forum.
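
To make the comparison step concrete, here is a minimal sketch of how the independent scores might be lined up before the discussion. The data shape, the pass/fail scoring, and the "any split goes on the agenda" rule are assumptions for illustration, not a prescribed format.

```python
# Minimal sketch: flag form items where independent calibration scores split.
# The data shape and the flagging rule are illustrative assumptions.

scores = {
    # evaluator -> {form item -> 1 (pass) or 0 (fail)} for one contact
    "evaluator_a": {"greeting": 1, "diagnosis": 1, "resolution": 0, "tone": 1},
    "evaluator_b": {"greeting": 1, "diagnosis": 0, "resolution": 0, "tone": 1},
    "evaluator_c": {"greeting": 1, "diagnosis": 0, "resolution": 1, "tone": 1},
}

for item in next(iter(scores.values())):
    marks = [s[item] for s in scores.values()]
    agreement = max(marks.count(0), marks.count(1)) / len(marks)
    if agreement < 1.0:  # any disagreement goes on the discussion agenda
        print(f"{item}: marks {marks}, agreement {agreement:.0%} -> discuss")
```

Items that land on this agenda session after session are the candidates for the form update in the last step above.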

Three signs the programme has lost calibration

1. Drift between evaluators. One evaluator consistently scores higher than another on similar contacts. Either the form is unclear, the evaluators are diverging, or both; either way, the score has lost reliability.

2. Drift over time. The average score creeps up (or down) without any visible change in operational performance. The team has gradually softened (or hardened) without intending to. Calibration sessions catch this; the absence of them lets it run.

3. Drift across teams. One team’s scores systematically beat another team’s even when the operational metrics don’t justify it. Often the cause is a single evaluator’s pattern; sometimes it is a real difference in management style. Calibration surfaces which; the sketch after this list shows how all three drifts can be spotted in the score data.
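
All three drifts can be read from the scoring log before a calibration session confirms them. The sketch below is illustrative rather than prescriptive: the column names, the inline sample data, and the use of pandas are all assumptions, and a real monitoring job would pull from the QA platform's export instead.

```python
# Illustrative sketch: surfacing the three drift patterns from a log of
# QA scores. Column names and data are assumed for the example.
import pandas as pd

df = pd.DataFrame({
    "evaluator": ["ana", "ana", "ben", "ben", "ana", "ben"],
    "team":      ["red", "red", "blue", "blue", "blue", "red"],
    "score":     [92, 88, 78, 74, 90, 76],
    "scored_at": pd.to_datetime([
        "2024-01-10", "2024-02-12", "2024-01-15",
        "2024-02-20", "2024-03-05", "2024-03-08",
    ]),
})

# 1. Drift between evaluators: each evaluator's mean against the overall mean.
evaluator_delta = df.groupby("evaluator")["score"].mean() - df["score"].mean()
print("Evaluator deltas:", evaluator_delta.to_dict())

# 2. Drift over time: the monthly average; a steady creep alongside flat
#    operational metrics is the warning sign.
monthly = df.groupby(df["scored_at"].dt.to_period("M"))["score"].mean()
print("Monthly averages:", monthly.to_dict())

# 3. Drift across teams: per-team means, to set against the operational
#    metrics that would have to justify any gap.
print("Team averages:", df.groupby("team")["score"].mean().to_dict())
```

None of these numbers says which side is right; they say where calibration attention should go first.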

Rebuilding calibration discipline

If your programme has drifted, three moves rebuild it.

1. Reset the standard with a high-intensity calibration burst — weekly for six weeks, on a structured sample — to re-establish what good looks like.

2. Document the standard in a calibration guide with worked examples for every form item.

3. Maintain the cadence after the reset — fortnightly, structured, owned by the QA lead. Without the maintenance, drift returns within months.

Calibration in an AI-led QA context

AI-led QA platforms need calibration too. The calibration is different in shape — reviewing samples the AI scored, tuning the model on contested cases, adjusting thresholds — but the discipline is the same. Operations that set AI-led QA up once and walk away find that scoring quality drifts within a year. See AI-led vs human QA for the wider treatment.

Conclusion

Calibration is the discipline that gives QA scores meaning. Without it, the programme is theatre. With it, the programme produces a signal the operation can act on. The cadence, the sample selection, the structure, and the follow-up are all learnable; the hard part is the discipline of maintaining them. Operations that take calibration seriously develop a QA programme that lifts the operation; operations that don’t end up with scores that nobody trusts.

Pair this with designing a meaningful QA programme, what to actually score on a quality form, and the QA vendor directory.