Simplifying Root Cause Analysis Part Two: When to do an RCA

Apr 15, 2019 10:33:31 AM

Don’t waste your reliability engineers’ time on pointlessly complicated RCAs.  Let your maintenance execution team drive your breakdown response.


In our previous article, we outlined an approach to simplifying Root Cause Analysis (RCA).  We observed that RCAs are often left to the reliability engineers. Reliability engineers can add significant value to an organisation (we’ve written previously about them here). However, they can be used in many ways that make it hard for an organisation to improve the reliability of its assets. Chief among these mistakes is having a reliability engineer undertake a complicated RCA when it isn’t really needed.

In our experience, the vast majority of failures don’t require a reliability engineer to complete an RCA.  Instead, they can be investigated faster, and with a simpler technique, by the maintenance execution team.  There are a few reasons for this:

  • Most failures have relatively straightforward causes and don’t need complex investigations
  • Maintenance execution teams are generally better placed to capture and analyse the information needed to determine a failure mode and investigate a breakdown (reliability engineers tend to take longer and complicate the process!)
  • The quality of maintenance execution tends to have a greater impact on equipment reliability than the strategy. Maintenance execution quality improves when a culture of equipment ownership and accountability exists in the execution team.

Bluefield have developed a model (below), which we use for our Bluefield Transformation Projects, that helps sites implement a breakdown response and failure investigation process led by the maintenance execution team.  We’ve found it greatly simplifies the RCA process and limits the number of times that reliability engineers are needed.


[Figure: breakdown response model]


We’ll go through the model to explain the key principles.

Trigger Investigations Based on Failure Mode, not Downtime

Most sites we visit have a threshold – for example, four hours – where all downtime events above this level require an RCA. Of course, major failures need to be investigated and prevented from recurring.  However, the aim should also be to produce a culture of equipment ownership and accountability.  To do this, we need to understand and address all instances of a failure mode.

In our experience, the failure investigation process should begin with a list of the top five failure modes that the team needs to manage.  Reliability engineers can add a lot of value here; the aim is to conduct a Pareto analysis of which failure modes (not components or assets) are contributing the most downtime.  Once you have this priority list, any breakdown involving one of these top failure modes, no matter how brief, triggers an investigation.  You keep this focus until a failure mode drops off the list, then move on to the next failure mode.
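As a rough illustration of the idea above, the short Python sketch below ranks failure modes by total downtime and then checks whether a new breakdown should trigger an investigation.  The function names, the failure modes and the downtime figures are all hypothetical; a real site would pull this history from its own maintenance system.

```python
from collections import Counter


def top_failure_modes(events, n=5):
    """Return the n failure modes with the most total downtime (a simple Pareto ranking).

    events is an iterable of (failure_mode, downtime_hours) pairs.
    """
    totals = Counter()
    for failure_mode, downtime_hours in events:
        totals[failure_mode] += downtime_hours
    return [mode for mode, _ in totals.most_common(n)]


def triggers_investigation(failure_mode, priority_modes):
    """A breakdown triggers an investigation if its failure mode is on the
    priority list, regardless of how long the downtime was."""
    return failure_mode in priority_modes


# Hypothetical downtime history: (failure mode, downtime in hours)
history = [
    ("hose abrasion", 1.5),
    ("bearing seizure", 6.0),
    ("hose abrasion", 2.0),
    ("bolt loosening", 0.5),
    ("hose abrasion", 1.0),
    ("connector corrosion", 3.0),
]

# A top-two list is used here only to keep the example small.
priority = top_failure_modes(history, n=2)
print(priority)
print(triggers_investigation("hose abrasion", priority))
```

Note that in this made-up history no single hose abrasion event would cross a four-hour threshold, yet together they add up to 4.5 hours, which is why the ranking is done by failure mode rather than by event.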

Maintenance Execution Team Asks Simple Questions

The maintenance execution team should start (and ideally finish) the investigation process themselves without involving the reliability engineers.  The execution team should start by examining the failed part to determine the failure mode (we’ll talk more about this in our next article).  They can then ask three simple questions:

  • From the failed part, do we know the failure mode?
  • Is there already a maintenance task to manage the failure mode?
  • Are we certain that the task has been executed adequately?

In the majority of failures, these simple questions uncover a problem that can be fixed by the maintenance execution team without involving a reliability engineer.  If, however, the team can’t determine the failure mode without performing a technical failure analysis, or if they answer “No” to questions 2 and 3, then a reliability engineer can be brought into the investigation to help perform a formal RCA.
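The decision described above can be sketched as a small Python function.  This is only one reading of the text: it assumes that “No” to questions 2 and 3 means “No” to both, and the class and field names are hypothetical.  If your site escalates differently, the rule is a single line to change.

```python
from dataclasses import dataclass


@dataclass
class TriageAnswers:
    """The execution team's answers to the three questions (hypothetical names)."""
    failure_mode_known: bool         # Q1: from the failed part, do we know the failure mode?
    task_exists: bool                # Q2: is there already a maintenance task for it?
    task_executed_adequately: bool   # Q3: are we certain the task was executed adequately?


def needs_reliability_engineer(answers):
    """Return True if the investigation should be escalated to a formal RCA.

    Assumption: escalate when the failure mode cannot be determined, or when
    the answer to both question 2 and question 3 is "No".
    """
    if not answers.failure_mode_known:
        return True
    return not answers.task_exists and not answers.task_executed_adequately


# Known failure mode, task exists, but it was not executed adequately:
# under this reading the execution team handles it themselves.
print(needs_reliability_engineer(TriageAnswers(True, True, False)))

# Failure mode cannot be determined from the failed part: escalate.
print(needs_reliability_engineer(TriageAnswers(False, True, True)))
```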

Communication and Learning are Key

The process of building a culture of equipment ownership and accountability is never-ending.  It requires the entire team to communicate and learn from every breakdown.  We’ve found that an effective way to do this is to embed the discussion in the team’s pre-start meetings.  To start, have a failed parts bin next to the pre-start board, where all failed parts are placed for examination.  The pre-start meeting should also include a conversation about all priority breakdowns, and these should be tracked until the investigation is complete and the causes are identified and corrected.

It is important not only to share the learnings from all breakdown investigations, but also to keep the team focused on maintenance execution quality at all times.  Some of our clients have a quality section on their pre-start boards, where they can (anonymously) show good and bad examples of maintenance task execution.

Also, it’s important to turbo-charge this learning by asking “what else are we missing?” If a team didn’t perform an inspection thoroughly, or didn’t perform certain tasks at all, are these the only examples, or are there others that need to be fixed too?

By using this process to have your maintenance execution team lead failure investigations, you can keep your reliability engineers free to focus on the more complex investigations and defect elimination projects, whilst also continually building your maintenance execution quality.  We’ve written a case study to illustrate this point.

In our next article, we’ll look at the importance of identifying the direct causes (failure modes) of a breakdown.

Got a question about when you should do an RCA? Click here and ask the Bluefield Community via webRE and we'll get back to you free of charge.

By Matthew Grant