Simplifying Root Cause Analysis Part One: Overview

Apr 8, 2019 9:47:23 AM

How to simplify your RCA effort for maximum results.


On most sites that Bluefield visits, reducing unscheduled downtime by addressing breakdowns is the first step to improving maintenance. Most sites tackle breakdowns through a Root Cause Analysis (RCA) failure investigation. Unfortunately, many people now think it’s the reliability engineer’s job to fix every reliability problem, and leave them to conduct an RCA with little or no assistance from the maintenance execution team. (See our article on The Top 9 Mistakes Companies Make With Reliability Engineers.)

The result is that RCAs are often overly complicated and process-driven, and consume excessive man-hours from the reliability engineer. By the time they are completed, several other issues have arisen. (At one site we visited recently, there was a 6-month backlog of RCAs to complete.) Worse, the RCAs often don’t arrive at a genuine root cause that enables a defect elimination project to prevent the breakdown from recurring.

To simplify RCA, we have learned that it is essential to apply the following principles:


The failure investigation and defect elimination processes must be owned and led by the maintenance execution team, with the reliability engineers providing support only when necessary.

Ask Simple Questions

Bluefield has used a very simple model (below) to undertake RCA failure investigations.  Rather than generating a large tree of causes, our model is based on the principle that there are only a limited number of indirect and systemic (root) causes that result in unexpected breakdowns.  Therefore, our investigation process asks a series of yes/no questions designed to work out which of the causes are present.  It’s also important to remember that there’s almost never a single cause for a breakdown; the trick is to work your way backwards through the chain of causes, being just thorough enough to find problems you can fix without getting bogged down in excessive detail.  We’ll go through the model in the following sections.

Start with Direct Cause(s)

RCAs must start with the direct cause, which simply means identifying the failure mode. A failure occurs when an asset fails to perform its required function(s). Therefore, retaining and analysing failed parts to determine the technical failure mode(s) is essential. (We’ve seen far too many instances of failed parts being thrown out before an investigation begins.) If necessary, you can engage the reliability engineers to assist with forensic failure analysis.

Move to Indirect Causes

Once you are clear on the direct cause (failure modes), you should then look for the actions that were performed (or not), and the conditions that existed (or were absent), that allowed the failure mode to occur.  We’ve found that there are only four main indirect causes for a breakdown.  The process of an RCA, once we’ve found the failure mode, is to determine which one (or more) of these is a contributing factor:

  • Inadequate strategy/plan: Was there an appropriate task or tactic to manage the failure mode?
  • Inadequate maintenance execution: Was the task or tactic properly executed?
  • Improper operation: Was the equipment operated outside its design limits or intended operating context?
  • Technical defect: Was there a poor-quality component or spare, or was the equipment designed or supplied with a defect or fault that prevents it from reaching its expected life? (Only consider this once the first three causes are eliminated.)
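
For teams that record their investigations electronically, the four questions above can be sketched as a small checklist. This is only an illustration, not Bluefield’s tool: the class name, field names and function name are hypothetical, and the rule that a technical defect is reported only when the first three causes are eliminated is our reading of the note on the last bullet.

```python
from dataclasses import dataclass


@dataclass
class IndirectAnswers:
    """Yes/no answers to the four indirect-cause questions (hypothetical names)."""
    strategy_in_place: bool        # Was there an appropriate task or tactic?
    task_executed_properly: bool   # Was the task or tactic properly executed?
    operated_outside_limits: bool  # Was the equipment operated outside its limits?
    defect_suspected: bool         # Poor-quality part, or a design/supply defect?


def indirect_causes(a: IndirectAnswers) -> list[str]:
    """Return every indirect cause indicated by the answers (possibly several)."""
    causes = []
    if not a.strategy_in_place:
        causes.append("Inadequate strategy/plan")
    if not a.task_executed_properly:
        causes.append("Inadequate maintenance execution")
    if a.operated_outside_limits:
        causes.append("Improper operation")
    # Assumption: a technical defect is only considered once the
    # first three causes have been eliminated.
    if not causes and a.defect_suspected:
        causes.append("Technical defect")
    return causes


# Example: a task existed, but it was done poorly and the machine was overloaded.
print(indirect_causes(IndirectAnswers(True, False, True, True)))
```

Note that the function returns a list rather than a single value, reflecting the point that there is almost never just one cause for a breakdown.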

Finish by Confronting the Systemic (Root) Causes

Indirect causes are only addressed by identifying and fixing systemic (root) causes.  Teams must be prepared to go through the (often confronting) process of looking for the following issues in the way they work:

  • Inadequate individual capabilities: Did the team have the knowledge and skills to perform the tasks required of them?
  • Inadequate organisational capabilities: Does the team have issues with leadership, culture and/or alignment that are allowing tasks to be performed poorly or missed completely?
  • Inadequate systems of work: Are the team’s processes and systems contributing to poor or missing tasks and tactics?
  • Inadequate resources: Is the team lacking the tools, equipment, technology, time or other resources to execute the required strategies and plans? (Be careful not to use this as a cop-out.)
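
The systemic questions above can be captured in the same yes/no style. The sketch below is an assumption-laden illustration rather than a prescribed method: the dictionary and function names are hypothetical, and each question is paired with the answer that would indicate the cause is present, because a "yes" means a problem for some questions and a "no" means a problem for others.

```python
# Each systemic cause maps to (question, answer that indicates the cause is present).
# The structure and names here are hypothetical, for illustration only.
SYSTEMIC_QUESTIONS = {
    "Inadequate individual capabilities": (
        "Did the team have the knowledge and skills to perform the tasks required of them?",
        False,
    ),
    "Inadequate organisational capabilities": (
        "Does the team have issues with leadership, culture and/or alignment?",
        True,
    ),
    "Inadequate systems of work": (
        "Are the team's processes and systems contributing to poor or missing tasks?",
        True,
    ),
    "Inadequate resources": (
        "Is the team lacking the tools, equipment, technology, time or other resources?",
        True,
    ),
}


def systemic_causes(answers: dict[str, bool]) -> list[str]:
    """Return the systemic causes indicated by a full set of yes/no answers."""
    unanswered = [cause for cause in SYSTEMIC_QUESTIONS if cause not in answers]
    if unanswered:
        # Force the team to confront every question rather than skip one.
        raise ValueError(f"Unanswered questions: {unanswered}")
    return [
        cause
        for cause, (_question, indicates_cause) in SYSTEMIC_QUESTIONS.items()
        if answers[cause] == indicates_cause
    ]


# Example: skilled team, no leadership issues, but the work process let a task slip.
print(systemic_causes({
    "Inadequate individual capabilities": True,
    "Inadequate organisational capabilities": False,
    "Inadequate systems of work": True,
    "Inadequate resources": False,
}))
```

Requiring an answer to every question is a deliberate choice in this sketch: it makes it harder to jump straight to "inadequate resources" without first ruling the other causes in or out.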

Over the next few weeks, we’ll release a series of articles explaining each of these principles, and go through some case studies to demonstrate how they can be applied.


By Matthew Grant