Most breakdowns don’t come out of nowhere. We just stop paying attention. True instant failures do happen – but they’re the exception.
In practice, failure is usually preceded by signals well in advance: temperature creep, vibration, minor leaks, repeat adjustments, nuisance alarms, and minor quality escapes. Individually, none of these looks critical. Collectively, they’re telling a story.
The challenge is that these early signs rarely demand immediate action. The asset is still running. Output is being met. Nothing has crossed a hard stop threshold. And so attention moves on.
That’s how failures start – quietly, and early.
An operations reality
This pattern is well understood by people responsible for keeping assets running under real production pressure.
It’s something Jason Kennard, director of operations at MASPRO, sees repeatedly across site and workshop environments – early warning signs that are visible, understood, and often deprioritised simply because the system is still operating and targets are being met.
In those conditions, risk doesn’t disappear. It just gets carried forward.
When “not stopping production” becomes the priority
Early indicators are most often deprioritised when they don’t interrupt production. If a machine is still operating and targets are being met, issues that sit in the background are easy to push down the list. Time pressure, access constraints, and the desire to avoid unplanned downtime all play a role, but production pressure is usually the dominant driver.
Stopping now feels disruptive and costly. Stopping later feels abstract.
As a result, action is deferred. The immediate, visible cost of intervention outweighs the perceived risk of waiting. That trade-off makes sense in the moment, but it quietly shifts where risk is being carried.
When temporary conditions become “how it runs”
Many risk conditions don’t start as permanent decisions. They start as temporary measures.
- A workaround.
- A manual check.
- A parameter nudged slightly outside its original range.
- A noise everyone recognises.
Over time, these conditions stop being exceptions and start becoming normal. The machine still runs. People adapt, and monitoring replaces fixing.
The risk point isn’t when the issue first appears; it’s when “we’ll keep an eye on it” becomes an ongoing strategy rather than a short-term bridge.
At that point, the system may still be functional, but it’s no longer healthy.
Living in the grey space between uptime and intervention
Most operational decisions don’t sit at the extremes. They live in the grey space.
- Run until the next shutdown.
- Delay a repair until parts arrive.
- Accept reduced redundancy.
Each decision, on its own, feels reasonable. But together, they narrow the margin for error.
When failure finally occurs, it’s rarely caused by a single missed action. It’s the accumulation of small deferrals, each one logical at the time, that leaves the system with nowhere to go when conditions shift.
Why failures are called “sudden” after the fact
After a breakdown, the language is familiar: sudden, unexpected, bad luck. Occasionally, that’s true. More often, it reflects how earlier warning signs were framed at the time – as manageable, acceptable, or not urgent enough to justify stopping.
In hindsight, the signals were there. They just never crossed a threshold that forced action until it was too late.
The failure wasn’t invisible. It just wasn’t prioritised.
The hardest moments to act
Most operations teams can recognise situations where something doesn’t feel right, but work continues anyway. Those moments are the hardest to act on early. The evidence isn’t definitive. Stopping feels disruptive. Accountability is shared. And intervening while the system is still running can feel harder than responding once it’s failed.
This is where reliability is actually decided – not at the moment of breakdown, but during the long periods when something didn’t look quite right, yet didn’t look wrong enough to stop.
A different way to think about reliability
If failures are rarely sudden, then reliability isn’t about how fast we respond once something breaks.
It’s about how early we’re willing to pay attention, how seriously we treat weak signals, and how much risk we’re prepared to carry simply because the system is still running.
This reframes reliability as an active, ongoing discipline – not a reaction to failure.