Data quality for forecasting: garbage in


The error that isn’t in the model

Forecasters love to argue about method — Holt-Winters versus regression, this smoothing factor versus that. It’s the interesting part, and it’s rarely where the biggest error lives. Most bad forecasts come not from the wrong model but from the wrong history: dirty, unannotated data that the model faithfully learns and confidently projects forward. A clever method trained on contaminated history is worse than a simple one trained on clean history, because the clever method is better at memorising the noise. Forecasting is mostly data preparation, and the teams that quietly forecast well are the ones that clean before they model.

What contaminates the history

The usual suspects are mundane and everywhere. One-off spikes: a system outage that dumped a day of calls into the next, a product recall, a single marketing blast — real events that will never recur on that date, but that a naive model treats as seasonal truth. Gaps and zeros: closed days, bank holidays, a logging failure that recorded nothing — left as zeros, they drag the average down; filled blindly, they invent demand. Level shifts: a process change, a new product, a migrated phone number that permanently moved the baseline, so last year’s level is simply the wrong starting point. And quiet corruption: timezone and daylight-saving glitches that smear an interval, or contact-reason codes that changed meaning halfway through the year. None of these is exotic; all of them, left in, teach the model the wrong lesson.
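
As a rough illustration of how two of these suspects (one-off spikes and gaps) might be surfaced for review, here is a minimal sketch that compares each day with the median of the same weekday in other weeks. The function name `flag_suspect_days`, the `spike_ratio` threshold of 2.0 and the sample numbers are all assumptions made for this example, not a recommended standard. A flag is a prompt to go and find out what happened that day, not a verdict.

```python
from statistics import median

def flag_suspect_days(series, spike_ratio=2.0):
    """Return {day_index: reason} for days that look like gaps or spikes.

    series: daily volumes, oldest first, with None for missing days.
    Assumes a gap-free daily cadence, so index % 7 stands in for weekday.
    spike_ratio is an arbitrary illustrative threshold.
    """
    flags = {}
    for i, value in enumerate(series):
        # Same weekday in other weeks, ignoring missing and zero days.
        peers = [v for j, v in enumerate(series)
                 if j % 7 == i % 7 and j != i and v]
        if not peers:
            continue  # nothing to compare against, so make no claim
        typical = median(peers)
        if value is None or value == 0:
            flags[i] = "gap"
        elif value > spike_ratio * typical:
            flags[i] = "spike"
    return flags

# Made-up four-week history: flat 100 a day with three planted problems.
history = [100] * 28
history[9] = 300    # e.g. an outage backlog landing on one day
history[15] = None  # e.g. a logging failure
history[20] = 0     # e.g. a closed day recorded as zero
flags = flag_suspect_days(history)
```

Level shifts and quiet corruption such as changed reason codes will not show up in a like-day comparison of this kind; those usually need a record of what changed and when.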

[Figure: Clean the history before the model sees it. A raw history, with an outage spike and a logging gap, shown against the cleaned and annotated history. Adjust the known one-offs, fill the gaps deliberately, mark the level shifts.]
The model can only be as good as the series it learns. A flagged, adjusted, gap-filled history is worth more than any change of method.

Cleaning it without scrubbing the signal

The discipline is to remove the noise without erasing the truth. Keep an events log — a simple record of outages, campaigns, launches and process changes — so you can explain every anomaly rather than guess at it; this is the single highest-return habit in forecasting. Adjust known one-offs to a sensible “what it would have been” rather than deleting the day. Fill gaps with intent (a like-day, a seasonal estimate), never with a zero you didn’t mean. Detect level shifts and start the relevant history after them, instead of averaging across a change that makes the old data irrelevant. And resist the urge to smooth away genuine variation — real demand is noisy, and a series that’s too clean is one you’ve over-fitted by hand. Get the history honest and a simple model will beat a sophisticated one fed on garbage every time.
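
The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production routine: the events log is a plain dictionary keyed by day index, the event kinds (`one_off`, `gap`, `level_shift`) and the function name `clean_history` are invented for the example, and "like-day" is taken to mean the median of the same weekday among unflagged days after the most recent level shift.

```python
from statistics import median

def clean_history(series, events):
    """Return (cleaned_series, notes) using a hypothetical events log.

    series: daily volumes, oldest first, with None for missing days.
    events: {day_index: (kind, note)}, kind being one of
            "one_off", "gap" or "level_shift" (illustrative labels).
    """
    # Start the usable history at the most recent level shift.
    shifts = [d for d, (kind, _) in events.items() if kind == "level_shift"]
    start = max(shifts, default=0)

    # Days to adjust: logged one-offs and gaps, plus any missing values.
    suspect = {d for d, (kind, _) in events.items() if kind in ("one_off", "gap")}
    suspect |= {i for i, v in enumerate(series) if v is None}

    cleaned, notes = [], {}
    for i in range(start, len(series)):
        if i not in suspect:
            cleaned.append(series[i])  # genuine variation is left alone
            continue
        like_days = [series[j] for j in range(start, len(series))
                     if j % 7 == i % 7 and j not in suspect]
        reason = events.get(i, ("gap", "missing value"))[1]
        if like_days:
            estimate = median(like_days)
            cleaned.append(estimate)
            notes[i] = (series[i], estimate, reason)  # keep the audit trail
        else:
            cleaned.append(series[i])  # no like-day available: do not invent one
            notes[i] = (series[i], series[i], reason + " (not adjusted)")
    return cleaned, notes

# Made-up five-week history: baseline moves from 100 to 150 on day 14.
raw = [100] * 14 + [150] * 21
raw[23] = 400   # logged one-off
raw[29] = None  # unlogged missing day
log = {14: ("level_shift", "new product launched"),
       23: ("one_off", "outage backlog")}
cleaned, notes = clean_history(raw, log)
```

Two design choices are worth noting. Adjusted days are replaced rather than deleted, so the series keeps its daily cadence, and every change is recorded in `notes` with the original value so the adjustment can be explained or reversed later. Untouched days pass through exactly as recorded, which keeps the cleaning from drifting into smoothing.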

Pair this with avoiding forecast complacency, Poisson and natural noise, and forecast accuracy metrics.