Root cause analysis as a programme, not an event

Most asset-intensive operators have a folder full of completed root cause analyses. A serious failure happens, a team is convened, a report is produced, recommendations are signed off, and the document is filed. Six or twelve months later, the same failure mode reappears on the next pump, the next cable joint, the next motor-operated valve. The RCA was done. The reliability outcome was not achieved. The gap is rarely about technique. It is about the absence of a programme around the technique.

This piece is about how to treat root cause analysis as a managed asset management programme rather than as a sequence of one-off events, what changes when you do, and what the result looks like a year in.

What an RCA programme is, and is not

A single RCA is an analytical exercise. It investigates one failure, identifies causes, and recommends actions. A programme is the operating model that decides which failures get analysed, with what method, by whom, what happens to the actions, and how the whole loop is measured.

IEC 62740:2015 sets out what a credible RCA process should include: defined triggers, structured analysis, identification of root and contributing causes, recommendations, implementation tracking and follow-up. The standard is technique-agnostic. It does not mandate fishbone, fault tree, Apollo or 5-Whys. It does require that an organisation know which one it is using and why.

A working RCA programme has five visible features: a written rule for when an RCA is triggered, a small library of methods matched to classes of problem, a single action register with owners and verification steps, a feedback loop to the people who can act, and programme-level measurement that is not just a count of completed reports.

Why most RCA practice never becomes a programme

Programmes that fail to mature usually fail in a recognisable set of places.

Triggers are reactive and inconsistent

The most common trigger is “an executive noticed”. Major safety incidents and high-cost outages almost always get an RCA. Repeat failures, near-misses, chronic low-grade losses and quality deviations rarely do, because the threshold is set by visibility rather than by impact. The result is that an operation analyses what hurt the headline and ignores what is quietly eroding the margin.

One method is used for every problem

An RCA team trained in one technique tends to apply it to everything. Fishbone is excellent for brainstorming contributing factors and poor for distinguishing causation from correlation. Fault tree is rigorous for safety-critical failures and disproportionate for a stuck damper. 5-Whys is fast and useful for simple human-error chains and weak for systemic, multi-causal failures. When a single hammer is used on every nail, the programme produces a mixture of over-engineered and under-engineered reports, and credibility decays.

The action register has no owner

This is the most damaging failure mode. Recommendations are written into the report, distributed to several departments, and tracked nowhere as a single list. Six months later, nobody can answer the question “how many RCA actions are open, on what assets, due when”. Without that answer, the programme has no idea whether it is actually changing anything.
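As an illustration of what answering that question takes, here is a minimal sketch of a single register held as one list. The field names (action_id, asset_tag, owner, due, closed_on) and the summarise helper are hypothetical, not the schema of any particular CMMS; the only point is that one list with an owner and a due date per action makes the question answerable in a few lines.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class RcaAction:
    # Hypothetical fields; a real register would mirror the CMMS schema.
    action_id: str
    asset_tag: str
    owner: str
    due: date
    closed_on: Optional[date] = None  # None means the action is still open


def open_actions(register):
    """Return open actions, earliest due date first."""
    return sorted((a for a in register if a.closed_on is None),
                  key=lambda a: a.due)


def summarise(register, today):
    """Answer: how many actions are open, on what assets, and how many are late."""
    opened = open_actions(register)
    return {
        "open": len(opened),
        "overdue": sum(1 for a in opened if a.due < today),
        "open_by_asset": dict(Counter(a.asset_tag for a in opened)),
    }


register = [
    RcaAction("A-1", "P-101", "planner", date(2024, 3, 1)),
    RcaAction("A-2", "P-101", "stores", date(2024, 6, 1)),
    RcaAction("A-3", "MOV-7", "engineer", date(2024, 2, 1),
              closed_on=date(2024, 1, 20)),
]
summary = summarise(register, today=date(2024, 4, 1))
```

With the illustrative data above, the summary reports two open actions, both on P-101, one of them overdue.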

Findings never reach the people who could act

A reliability engineer writes a report. It is reviewed in a management forum. It is filed. The maintenance planner who builds the PM, the technician who closes work orders against this equipment class, and the buyer who sets the spares policy never see it. The same failure mode is, in effect, rediscovered every time it happens.

The programme itself is never measured

Operators measure the count of RCAs completed. That number says nothing about whether failures are recurring less often, whether the cost of unreliability is falling, or whether the time from failure to corrective action is shortening. Without programme-level measures, RCA looks like activity rather than outcome, and budget conversations get harder.
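Two of these programme-level measures can be sketched directly from failure and RCA records. The record shapes below are assumptions, and so is the definition of recurrence used here: an analysed failure mode counts as recurred if any failure with the same code occurs after its RCA closed. An operator would normally add a rolling window and an equipment-class key.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Rca:
    failure_mode: str   # hypothetical coded failure mode
    failure_date: date
    started: date
    closed: date


@dataclass(frozen=True)
class Failure:
    failure_mode: str
    occurred: date


def median_days_to_start(rcas):
    """Median number of days between the failure and the start of its RCA."""
    return median((r.started - r.failure_date).days for r in rcas)


def recurrence_rate(rcas, failures):
    """Share of analysed failure modes that occurred again after RCA close."""
    if not rcas:
        return 0.0
    recurred = sum(
        1 for r in rcas
        if any(f.failure_mode == r.failure_mode and f.occurred > r.closed
               for f in failures)
    )
    return recurred / len(rcas)


rcas = [
    Rca("seal-leak", date(2024, 1, 1), date(2024, 1, 5), date(2024, 2, 1)),
    Rca("bearing-wear", date(2024, 1, 10), date(2024, 1, 20), date(2024, 3, 1)),
]
failures = [
    Failure("seal-leak", date(2024, 5, 1)),     # after RCA close: a recurrence
    Failure("bearing-wear", date(2024, 2, 1)),  # before RCA close: not counted
]
rate = recurrence_rate(rcas, failures)
days = median_days_to_start(rcas)
```

Tracked quarter on quarter, a falling recurrence rate and a shrinking time to start say more about the programme than the number of reports filed.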

Building an RCA programme that compounds

The pattern in operators who get sustained value is not exotic. It is methodical and supported by a small number of structural decisions.

Write the triggers down. Define, in one page, which events require an RCA. Reasonable triggers include any safety incident at or above a defined severity, any unplanned outage above a cost or duration threshold on a high-criticality asset, any failure mode that has recurred a defined number of times in a rolling window, and any quality or environmental excursion above a threshold. Tie each threshold to the asset's criticality band so that the trigger reflects the asset criticality framework rather than the day’s mood.
Match the method to the problem. Maintain a short library: 5-Whys for simple, single-thread issues; fishbone for contributing-factor brainstorming as an input to deeper analysis; Apollo or causal tree for chronic and multi-causal failures; fault tree for safety-critical and low-frequency, high-consequence events. Publish which method is used for which class of trigger so the analyst is not picking on instinct.
Hold a single action register. Every RCA recommendation lands on one register, in the CMMS or in the asset performance system, with an owner, a due date, a verification owner and a closure criterion. Status is reviewed in the same forum every month. An RCA without a tracked action is not a finished RCA.
Close the loop with the people who do the work. Feed structured findings back into the work management system. The corrective action becomes a PM change, a job plan update, a BoM correction, a spares stocking change or a training item, and it is visible against the equipment record. Without this step, lessons stay with the people who attended the analysis and the rest of the organisation continues to discover them through experience.
Measure the programme. Track time from failure to RCA start, time from RCA close to action verification, recurrence rate of analysed failure modes, and the share of unplanned outage hours covered by RCA actions in the last twelve months. These are the numbers that say whether the programme is working, not the count of completed reports.
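The first two steps can be sketched as a small rule table. Everything numeric here is a hypothetical placeholder: the criticality bands A to C, the outage-hour and repeat thresholds, and the severity scale are assumptions, and quality or environmental excursions are left out for brevity. The mapping from trigger class to method follows the library described above, with 5-Whys assumed for a one-off outage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                   # "safety" or "outage" in this sketch
    criticality: str            # hypothetical bands: "A" (highest), "B", "C"
    severity: int = 0           # hypothetical safety severity scale, 1 to 5
    outage_hours: float = 0.0
    repeats_in_window: int = 0  # occurrences of this failure mode in the window


# Hypothetical thresholds, tightened as criticality rises.
OUTAGE_HOURS_BY_BAND = {"A": 4.0, "B": 12.0, "C": 48.0}
REPEATS_BY_BAND = {"A": 2, "B": 3, "C": 4}
SAFETY_SEVERITY_MIN = 3

METHOD_BY_CLASS = {
    "safety_critical": "fault tree",
    "chronic": "Apollo or causal tree",
    "single_event": "5-Whys",
}


def trigger_class(event: Event) -> Optional[str]:
    """Return the trigger class for an event, or None if no RCA is required."""
    if event.kind == "safety" and event.severity >= SAFETY_SEVERITY_MIN:
        return "safety_critical"
    if event.repeats_in_window >= REPEATS_BY_BAND[event.criticality]:
        return "chronic"
    if (event.kind == "outage"
            and event.outage_hours >= OUTAGE_HOURS_BY_BAND[event.criticality]):
        return "single_event"
    return None


def select_method(event: Event) -> Optional[str]:
    """Look up the published method for the event's trigger class."""
    cls = trigger_class(event)
    return METHOD_BY_CLASS[cls] if cls else None
```

The same six-hour outage triggers an RCA on a band A asset and not on a band C asset, which is the point of writing the thresholds against the criticality framework rather than deciding case by case.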

These steps depend on the same operational ownership described in the wider piece on asset data governance. An RCA programme without named stewards behaves like every other programme without named stewards.

Where the programme pays back at portfolio level

Once RCA is operating as a programme, three portfolio-level conversations get easier.

Bad-actor analysis becomes credible. A short list of equipment tags accounts for a disproportionate share of unplanned outage hours on most plants. RCA actions traced against those tags show whether the bad actor is being managed down or simply being lived with. ISO 14224-coded failure data makes the ranking robust enough to defend in a capital review.
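A bad-actor ranking of this kind is a simple Pareto cut over outage hours per tag. The sketch below assumes outage records arrive as (tag, hours) pairs and uses an 80 per cent cut-off; both the record shape and the cut-off are illustrative choices, not a standard.

```python
from collections import defaultdict


def bad_actors(outages, share=0.8):
    """Smallest set of tags, ranked by hours, that covers `share` of the total."""
    hours = defaultdict(float)
    for tag, h in outages:
        hours[tag] += h
    total = sum(hours.values())
    ranked = sorted(hours.items(), key=lambda kv: kv[1], reverse=True)

    result, running = [], 0.0
    for tag, h in ranked:
        # Stop once the tags already selected cover the requested share.
        if total == 0 or running >= share * total:
            break
        result.append((tag, h))
        running += h
    return result


# Illustrative data only: unplanned outage hours by equipment tag.
outages = [("P-101", 40), ("P-101", 20), ("MOV-7", 25), ("F-3", 10), ("C-2", 5)]
ranking = bad_actors(outages)
```

Joining this ranking to the action register by tag then shows which bad actors have verified actions against them and which are simply being lived with.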

PM optimisation has somewhere to land its conclusions. RCA findings against a class of failure, joined with PM effectiveness data, justify changing intervals or task content from local evidence rather than from the manufacturer manual. The change is documented against the equipment record and survives a personnel handover.

Predictive and condition-based programmes get cleaner training data. Models trained on inconsistently coded failures produce inconsistent predictions. RCA discipline upstream is what makes Maximo Health, Maximo Predict and equivalent platforms behave reliably. Operators that skip this step usually rebuild it eighteen months later under a different name.

What good looks like a year in

A programme that has taken RCA seriously for a year tends to look the same across sectors. Written triggers are in the asset management manual. Method selection is documented per trigger class. The action register lives in one place and is reviewed monthly. Recurrence rates on analysed failure modes are falling on the equipment classes that matter. Reliability engineering publishes a quarterly note that names the bad actors and the actions closed against them.

If RCA actions are changing how work is planned, what is held in the storeroom and what is presented to the capital review, the programme is working. If the binder is growing and the failures are not falling, RCA is still an event.