Calibrating the scorers

Deep-dive lesson · about 10 minutes · short quiz at the end

ccPlanning academy · quality · deep dive

If two assessors give the same call different scores, the number is noise.

The problem

Scoring is a judgement, and judgements vary.

Hand the same contact to three evaluators and you will often get three different scores. One is strict on tone, another lenient; one reads “resolved” generously, another doesn’t. Without correction, an agent’s score depends as much on who assessed them as on how they performed.

Agents know this immediately, and it destroys their trust in the programme.

The definition

Calibration aligns the scorers.

Calibration is the regular, structured practice of getting evaluators to score the same contacts and then reconciling the differences — until they converge on a shared standard. It is to QA what the two-pass forecast review is to planning: the discipline that makes the output trustworthy.

How it works

Score together, surface the gaps, agree the standard.

In a calibration session, everyone scores the same contact independently; the scores are then revealed and the gaps discussed. Where assessors disagree, the group works out what the right answer is and why, then updates the scoring guide so the next person doesn’t have to re-litigate it.

The output isn’t just aligned scorers; it’s a sharper, less ambiguous form.

What it reveals

Disagreement is usually a form problem.

When two careful assessors score a call differently, the cause is often not carelessness but an ambiguous item — “showed empathy” means different things to different people. Calibration surfaces those soft, undefined items and forces you to define them or drop them.

Persistent disagreement on an item is the item telling you it’s unscoreable as written.
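One way to spot such an item is to count, across several calibration sessions, how often each item's scores spread wider than an agreed tolerance. The sketch below is illustrative only: the function name `flag_ambiguous_items`, the 1-point tolerance, the 50% threshold, the 1-to-5 scale, and the sample scores are all assumptions, not part of any particular QA tool.

```python
from collections import defaultdict

def flag_ambiguous_items(sessions, tolerance=1, max_disagreement=0.5):
    """Return the items whose assessors disagreed in too many sessions.

    `sessions` is a list of dicts; each dict maps an item name to the
    scores that independent assessors gave that item on one shared contact.
    An item counts as "disagreed" in a session when the gap between its
    highest and lowest score exceeds `tolerance` (an assumed rule).
    """
    disagreements = defaultdict(int)
    appearances = defaultdict(int)
    for session in sessions:
        for item, scores in session.items():
            appearances[item] += 1
            if max(scores) - min(scores) > tolerance:
                disagreements[item] += 1
    # Flag items that disagreed in more than `max_disagreement` of sessions.
    return sorted(
        item for item in appearances
        if disagreements[item] / appearances[item] > max_disagreement
    )

# Hypothetical scores from three sessions, three assessors, 1-5 scale.
sessions = [
    {"greeting": [5, 5, 4], "showed empathy": [1, 5, 3], "resolution": [4, 4, 5]},
    {"greeting": [5, 4, 5], "showed empathy": [2, 5, 4], "resolution": [3, 5, 4]},
    {"greeting": [4, 4, 4], "showed empathy": [5, 2, 3], "resolution": [4, 4, 4]},
]
flagged = flag_ambiguous_items(sessions)
print(flagged)
```

With this made-up data only "showed empathy" is flagged: its scores spread beyond the tolerance in every session, while "resolution" does so once in three. A flagged item is a candidate for a tighter definition or for removal, which is a decision for the group rather than the script.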

Measuring it

Track inter-rater agreement.

You can measure how aligned your scorers are: the share of items where independent assessors land within a tolerance of each other. A healthy programme watches this number and reacts when it drifts — a calibration metric for the quality metric.

If agreement is low, fix that before you trust a single agent score.
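As a rough illustration of that number, the sketch below computes the share of items on one shared contact where every assessor landed within a tolerance of the others. The name `agreement_rate`, the 1-point tolerance, the 1-to-5 scale, and the sample scores are assumptions made for the example; a real programme would pick its own tolerance and scale.

```python
def agreement_rate(scores_by_item, tolerance=1):
    """Share of items where all assessors land within `tolerance` of each other.

    `scores_by_item` maps an item name to the scores independent assessors
    gave that item on the same contact. An item counts as agreed when the
    gap between its highest and lowest score is at most `tolerance`.
    """
    if not scores_by_item:
        raise ValueError("need at least one scored item")
    agreed = sum(
        1 for scores in scores_by_item.values()
        if max(scores) - min(scores) <= tolerance
    )
    return agreed / len(scores_by_item)

# Hypothetical calibration session: three assessors, 1-5 scale.
session = {
    "greeting": [5, 5, 4],
    "resolution": [4, 2, 5],
    "showed empathy": [1, 5, 3],
    "compliance statement": [5, 5, 5],
}
rate = agreement_rate(session, tolerance=1)
print(f"Inter-rater agreement: {rate:.0%}")  # prints "Inter-rater agreement: 50%"
```

Here two of the four items agree within one point, so agreement is 50%. Recording the same figure after each session gives a trend line, which makes drift visible before it shows up in agent scores.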

Cadence

It’s a habit, not a launch event.

Calibration done once at programme launch starts to decay as soon as new contact types appear, scorers drift, and the form evolves. It has to be a standing rhythm: regular sessions, new joiners calibrated before they score live, and a fresh calibration round whenever the form changes.

The takeaway

Calibrate, or your scores are noise.

Score the same contacts, surface and reconcile the gaps, define the ambiguous items, track inter-rater agreement, and keep it a habit. Without calibration the rest of the programme is built on sand.

Slides done? Here’s the same idea in a bit more depth — the part worth keeping.

In depth: why calibration is non-negotiable

Quality scoring is an act of human judgement, and human judgements vary — on tone, on what counts as “resolved,” on how generously to read empathy. Give the same contact to several evaluators and you will routinely get several scores. Left uncorrected, that variation means an agent’s result depends as much on which assessor happened to pick up their call as on how they actually performed, and agents detect this instantly. Nothing erodes trust in a quality programme faster than the sense that the score is a lottery.

The practice and what it produces

Calibration is the cure: a standing discipline where evaluators independently score the same contacts, then reveal and reconcile their differences until they share a standard. Its most valuable by-product is a better form — because persistent disagreement on an item is usually the item’s fault, a soft phrase like “showed empathy” that means different things to different people. Calibration forces you to define such items precisely or remove them. You can measure how well it’s working through inter-rater agreement — the share of items where independent scorers land within tolerance — and a serious programme watches that number, reacts when it drifts, and treats calibration as an ongoing rhythm rather than a launch-day event. New scorers are calibrated before they score live work, and the whole group re-grips whenever the form changes. Without this, every downstream activity — coaching, reporting, feeding the plan — rests on numbers that don’t mean the same thing twice.

The principle to remember: if two assessors score the same call differently, the number is noise. Calibrate regularly, define the ambiguous items, track inter-rater agreement, and keep it a habit — the credibility of everything else depends on it.