The system-outage playbook — operating when the technology doesn’t

Every contact centre runs on three systems it cannot operate without: the telephony that carries the work, the CRM that carries the customer, and the WFM that carries the plan. Each fails differently, each needs its own degraded mode, and none of them can be improvised at the moment of failure. The outage playbook is the difference between a bad hour and a lost day.

Three outages, three different problems

Telephony down means customers cannot reach you at all — the problem is invisible from the floor (eerily quiet) and entirely visible to the customer. CRM down means customers reach agents who cannot see them — AHT climbs across every queue at once, error rates spike, and agents improvise with whatever they can remember. WFM down means the operation flies blind — the work and the customers are fine, but nobody knows who should be where, real-time visibility is gone, and adherence quietly dissolves. Treating these as one generic IT issue is the first failure: the correct response to each is almost opposite.

The playbook therefore holds three entries, not one. Each names the detection signal (and who confirms scope — one queue or all, one site or all), the degraded mode to switch to, the message to publish, and the escalation route with severity pre-agreed with IT. The scope check matters most: a partial outage handled as total wastes capacity; a total outage handled as partial burns an hour you do not have.
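
One way to keep the three entries honest is to hold them as structured data rather than free text, so that a missing field is obvious. The sketch below is illustrative only: the field names, the role in `scope_confirmed_by`, the message identifiers, the severity labels and the `confirm_scope` helper are all assumptions, not taken from any particular tool.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    PARTIAL = "partial"  # some queues or some sites affected
    TOTAL = "total"      # every queue at every site affected


@dataclass(frozen=True)
class PlaybookEntry:
    system: str
    detection_signal: str
    scope_confirmed_by: str   # hypothetical role name
    degraded_mode: str
    customer_message: str     # hypothetical id of a pre-approved message
    escalation_severity: str  # hypothetical label, pre-agreed with IT


PLAYBOOK = {
    "telephony": PlaybookEntry(
        system="telephony",
        detection_signal="floor goes quiet while customers cannot get through",
        scope_confirmed_by="duty manager",
        degraded_mode="pivot to chat, email and compliant outbound on mobiles",
        customer_message="telephony_outage_v1",
        escalation_severity="SEV1",
    ),
    "crm": PlaybookEntry(
        system="crm",
        detection_signal="AHT climbs across every queue at once",
        scope_confirmed_by="duty manager",
        degraded_mode="offline scripts, structured capture form, AHT targets lifted",
        customer_message="crm_outage_v1",
        escalation_severity="SEV2",
    ),
    "wfm": PlaybookEntry(
        system="wfm",
        detection_signal="real-time visibility and adherence data disappear",
        scope_confirmed_by="real-time analyst",
        degraded_mode="printed schedule, manual tally kept by one coordinator",
        customer_message="none_required",
        escalation_severity="SEV3",
    ),
}


def confirm_scope(affected_queues, all_queues, affected_sites, all_sites):
    """Return TOTAL only if every queue and every site is affected."""
    every_queue = set(all_queues) <= set(affected_queues)
    every_site = set(all_sites) <= set(affected_sites)
    return Scope.TOTAL if every_queue and every_site else Scope.PARTIAL
```

Treating anything short of "every queue, every site" as partial is a deliberate simplification here; a real playbook might want a middle category such as "one site fully down".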

Degraded-mode operating — decided in advance

A degraded mode is a pre-defined way of running, not a euphemism for chaos. CRM down: agents work from a published offline script set, capture customer details on a structured form for later entry, and are explicitly relieved of normal AHT and quality expectations. That relief is announced, not assumed, because agents who think they are still being measured to normal standards will rush, err, and burn out inside an hour. Telephony down: the operation pivots to the channels still standing, with outbound calls on mobiles where compliance allows, chat and email queues worked hard, and the website and the carrier-level IVR message updated within minutes.

WFM down has the most tempting failure mode: carry on and hope. Better is the paper plan: the last published schedule, exported daily precisely for this moment, team leaders running breaks from a printout, and a single coordinator keeping a manual tally of agent state. None of these modes is good. All of them are better than improvisation, and every one of them only works if it was built, written down, and walked through before the day it was needed.

Queue messaging — honesty buys patience

During an outage the queue message is doing the work your agents cannot. The discipline is honesty with direction: we have a technical problem, here is what still works, here is what to do instead, here is when to try again. Customers forgive failure far more readily than they forgive pretence: a cheerful “your call is important to us” on a forty-minute degraded queue manufactures the anger that lands on the first agent who finally answers.

The messages should already exist: drafted in calm conditions, approved by whoever needs to approve them, and loaded so they can be switched on in minutes — not composed by committee while the queue builds. The same applies to the website banner, the IVR announcement, and the social-channel holding line. The failure pattern is well known: the outage is forty minutes old, the customer-facing estate still says everything is fine, and social media is doing your incident communication for you, with commentary.

Manual fallbacks — unglamorous and decisive

The fallbacks that save an outage are embarrassingly low-tech. A printed daily schedule. A structured capture form — paper or a simple offline document — so that work done during the outage can be entered, actioned and counted afterwards rather than lost. A laminated card of the top-ten process scripts agents need when the knowledge base is unreachable. A callback list, captured cleanly with consent, so that demand displaced by the outage is recovered deliberately instead of returning as an angry surge. A printed phone tree for the moment the outage takes your internal comms down with it.
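
If the capture form lives in a simple offline document rather than on paper, its structure might look something like the sketch below. The field names and the two helpers are hypothetical; the point is only that every record carries the same fields, and that the callback list is built solely from customers who gave consent.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class CaptureRecord:
    captured_at: str        # ISO 8601 local time, e.g. "2024-01-01T10:05:00"
    agent: str
    customer_name: str
    contact_number: str
    reason: str
    callback_consent: bool  # True only if the customer explicitly agreed


def to_csv(records):
    """Serialise records so they can be entered and counted after recovery."""
    buffer = io.StringIO()
    names = [f.name for f in fields(CaptureRecord)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


def callback_list(records):
    """Keep only consenting customers, oldest contact first.

    Sorting the strings works because every timestamp uses the same
    ISO 8601 format and the same time zone (an assumption of this sketch).
    """
    consented = [r for r in records if r.callback_consent]
    return sorted(consented, key=lambda r: r.captured_at)
```

Working the callback list oldest-first is one reasonable policy, not the only one; an operation might instead prioritise by contact reason.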

Two disciplines make fallbacks real rather than theoretical. First, they are maintained — a fallback pack last updated two restructures ago is worse than none, because people trust it. Second, they are rehearsed: once or twice a year, run an hour in degraded mode and watch what actually breaks. The rehearsal always finds something — the export nobody runs any more, the form that does not capture the field the recovery team needs — and finding it on a calm Tuesday is the whole point.
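
The maintenance discipline can be backed by a trivial staleness check. The sketch below assumes someone records the date each fallback item was last reviewed; the 180-day limit and the item names in the usage note are illustrative assumptions, not recommendations from any standard.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)  # assumed review interval; choose your own


def stale_items(last_reviewed, today, max_age=MAX_AGE):
    """Return the names of items last reviewed more than max_age ago."""
    return sorted(
        name
        for name, reviewed in last_reviewed.items()
        if today - reviewed > max_age
    )
```

Called with a mapping such as `{"printed schedule": date(2024, 6, 1), "phone tree": date(2023, 1, 15)}` and `today=date(2024, 7, 1)`, it would flag only the phone tree.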

The post-incident review — where the playbook gets better

Within days, while memory is honest, run the review: blameless about people, ruthless about process. Reconstruct the timeline: when did it start, when did we notice, when did we confirm scope, when did degraded mode actually switch on, when did the customer messaging change? The gap between “started” and “responded” is the number that matters, because it is the number the playbook exists to shrink. Then the harder questions: which fallbacks worked, which existed only on paper, and what did the displaced demand do to the following week?
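
The timeline arithmetic is simple enough to sketch. The milestone names below mirror the review questions; treating "responded" as the later of degraded mode switching on and customer messaging changing is an assumption of this sketch, and a team may reasonably define it differently.

```python
from datetime import datetime, timedelta

MILESTONES = [
    "started",
    "noticed",
    "scope_confirmed",
    "degraded_mode_on",
    "messaging_changed",
]


def response_gaps(timeline):
    """Return the delay from 'started' to each later milestone."""
    missing = [m for m in MILESTONES if m not in timeline]
    if missing:
        raise ValueError(f"timeline is missing: {', '.join(missing)}")
    started = timeline["started"]
    return {m: timeline[m] - started for m in MILESTONES[1:]}


def time_to_respond(timeline):
    """Assumed definition: degraded mode AND new messaging are both live."""
    gaps = response_gaps(timeline)
    return max(gaps["degraded_mode_on"], gaps["messaging_changed"])
```

Recording the same five timestamps for every incident is what makes the comparison between a first and a second outage possible.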

The output is changes, each with an owner and a date: a message re-drafted, a threshold moved, a fallback pack rebuilt, an escalation route renegotiated with IT. Feed the demand displacement into the planning loop so next time the recovery tail is forecast, not a surprise. Operations that review like this find their second outage runs visibly better than their first. Operations that skip it have the same first outage repeatedly, each time with fresh astonishment.

The closing principle

You cannot prevent the outage; you can only decide in advance how you will operate inside it. Three systems, three degraded modes, messages already written, fallbacks already rehearsed — and a review that measures the gap between failure and response, then shrinks it.