Simplifying Root Cause Analysis Part Three: Determining Failure Modes

Apr 15, 2019 11:09:22 AM

Get maximum value from your defect elimination program by sticking to the facts and avoiding rumour and innuendo.


In our previous articles, we’ve outlined what we’ve learned about simplifying Root Cause Analysis (RCA), as well as the important role of the maintenance execution team in the failure investigation process.  In this article, we’ll explore the front end of the process in more detail to show that a clear technical understanding of how the failure occurred is essential to getting to the root cause and preventing a recurrence.

In all too many cases, when Bluefield works with clients to help them improve their asset reliability, we find that their failure investigations don’t include an evidence-based, factual account of how the failure occurred.  Sometimes, there is such a rush to get the equipment going again that the failed parts are thrown out without any sort of analysis.  Sometimes, the investigation is thrown to the reliability engineer to take care of, but without all the relevant information from the maintainers, leading to an incomplete failure analysis.

Without an analysis of the failed part, the investigation quickly moves away from being fact-based and into the realm of rumour and innuendo.  Members of the team who have a pet theory or a particular barrow to push can do so without being challenged, and the investigation may end up coming to the wrong conclusion and trying to solve a non-existent problem, leaving the true problems unfixed.

We’ve introduced our RCA process model previously (below), and have expanded it to focus on the critical step of establishing the Direct Cause(s).  There are a few key principles to apply:


[Figure: root cause analysis process model]


Focus on the Failure Mode, not the Component

RCA investigations start with a functional failure; that is, an asset has failed to perform one or more of its required functions.  Normally, we’re talking about breakdowns, but there are other forms of loss, including throughput reduction and quality issues.  Functional failures are caused by one or more technical failure modes: components that have suffered a fault (eg cracked, seized).

In too many cases, the failure is only recorded down to the component level (eg bearing), not the failure mode itself.  (We’ve written about this previously).  You need to be specific, because only when you’ve identified the failure mode can you move to the next part of the investigation process: checking whether the maintenance strategy is adequate and whether it was executed properly.  Without understanding the failure mode, it’s impossible to answer these questions.

Keep the Failed Parts

We’ve written previously about the importance of a failed parts bin being part of your failure investigation process; get into the habit of keeping the failed parts and investigating every failure.  In most cases, the failure mode is obvious.  In some cases, however, determining the component defect may require a forensic analysis.  On these occasions, it’s appropriate to engage the reliability engineering team to assist.  We’ve written a case study to illustrate this point.

Close the Loop to the Defect Elimination Process

Defect elimination projects don’t just focus on the big failures; many of them start with a reliability analysis such as a Pareto to find recurring failures.  It’s important to close the loop between your failure investigation and defect elimination processes by ensuring that enough meaningful data is captured to allow your reliability engineers to identify recurring failures.  Knowing that you had 15 bearing failures in a month is not really all that useful – did they seize from lack of lubricant, were they misaligned when they were installed, or were they brinelled from improper storage?
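
To illustrate the point, here is a minimal Python sketch of a Pareto count run twice over the same records: once by component and once by failure mode.  The records, the split of the 15 bearing failures and the `pareto` helper are all invented for illustration and are not taken from any real site or system.

```python
from collections import Counter

# Hypothetical failure records for one month: (component, failure mode).
# The 9 / 4 / 2 split is invented purely for illustration.
records = (
    [("bearing", "seized from lack of lubricant")] * 9
    + [("bearing", "misaligned at installation")] * 4
    + [("bearing", "brinelled from improper storage")] * 2
)


def pareto(items):
    """Return (item, count, cumulative %) rows, largest count first."""
    counts = Counter(items)
    total = sum(counts.values())
    rows = []
    running = 0
    for item, count in counts.most_common():
        running += count
        rows.append((item, count, round(100 * running / total, 1)))
    return rows


# Component-level Pareto: a single bar that says nothing about what to fix.
by_component = pareto(component for component, _ in records)

# Failure-mode Pareto: shows which defect to target first.
by_failure_mode = pareto(records)

for (component, mode), count, cumulative in by_failure_mode:
    print(f"{component:8} {mode:32} {count:3} {cumulative:6.1f}%")
```

With only the component recorded, the analysis collapses to one line of 15 bearing failures.  With the failure mode recorded, the same data points to lubrication as the first candidate for a defect elimination project.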

One technique you can use is to capture this information in each work order using the 3C’s approach.  Ensure that the information is entered against each functional failure work order, and you’ll make it much easier to improve the reliability of your assets.
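
As a rough sketch of what this could look like in data terms, the Python below models a work order with the three C's as required fields.  It assumes the common expansion of 3C's as Complaint (or Concern), Cause and Correction, which this article does not spell out; the class name, field names and example values are hypothetical and do not reflect any particular maintenance management system.

```python
from dataclasses import dataclass


@dataclass
class WorkOrder3C:
    """Hypothetical functional failure work order with 3C's fields."""

    work_order_id: str
    component: str
    complaint: str   # what was observed (the functional failure)
    cause: str       # the failure mode found on the failed part
    correction: str  # what was done to restore function

    def missing_fields(self):
        """Return the names of any 3C's fields that were left blank."""
        fields = {
            "complaint": self.complaint,
            "cause": self.cause,
            "correction": self.correction,
        }
        return [name for name, value in fields.items() if not value.strip()]


# Example values are invented for illustration.
complete = WorkOrder3C(
    work_order_id="WO-0001",
    component="bearing",
    complaint="Conveyor drive stopped",
    cause="Bearing seized from lack of lubricant",
    correction="Replaced bearing and restored lubrication",
)
incomplete = WorkOrder3C(
    work_order_id="WO-0002",
    component="bearing",
    complaint="Conveyor drive stopped",
    cause="",
    correction="Replaced bearing",
)

print(complete.missing_fields())
print(incomplete.missing_fields())
```

A check like `missing_fields` could be run before a work order is closed, so that a record with no cause is sent back to the maintainer while the failed part is still available to inspect.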


In our next article, we’ll look at how to identify and classify the indirect causes behind each failure mode.


By Matthew Grant