Don’t over-complicate your RCA: there are only so many causes of failure.
In our previous article, we talked about the importance of identifying the failure modes that cause breakdowns. Behind each failure mode is an indirect cause: either an action was performed (or omitted), or a condition was present (or absent), that allowed the failure mode to develop.
We’ve learned that there’s no need to over-complicate the process of finding indirect causes, because when you drill down, there are generally only four causes to look for (see our model below). We’ll briefly explain each.
Inadequate Strategy or Plans
The first cause is related to an inadequate (or missing) maintenance strategy, tactic or task to manage the failure mode. Identifying these causes is generally straightforward; look for one of the following:
The maintenance strategy for the asset does not address the failure mode in question (there’s no inspection, condition monitoring or planned replacement task).
There is a task to address the failure mode, but it’s inadequate. An example that we often see is an interval between inspections that is longer than the P-F interval for the failure mode, so the defect can progress from detectable to failed between two inspections.
The maintenance task isn’t specific about what acceptable and unacceptable conditions look like. How many inspection sheets have you seen that simply say “check for wear” without supplying specifications or measurement limits for when to act? In these cases, you’re inviting the maintainer to apply their own subjective judgement instead of an objective standard, and the result is inconsistent performance.
The scheduled downtime strategy for the machine doesn’t allow enough lead time to fix a defect once it’s discovered. On a site we visited recently, they had just moved to a 13-week shutdown interval. The change put them in the position where they were finding defects on their screen decks and pumps, but the longer interval meant they couldn’t nurse them through until the next shutdown.
The task instructions are technically incorrect or deficient. (See our case study for an example of this type of cause.)
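The interval checks in the list above come down to simple arithmetic, which can be sketched as follows. This is a minimal illustration, not a standard method: the `InspectionTask` fields, the example numbers and the worst-case rule (a defect becomes detectable just after an inspection, so it is found one full interval later) are our assumptions.

```python
from dataclasses import dataclass


@dataclass
class InspectionTask:
    """Hypothetical inspection task for one failure mode (illustrative fields)."""
    failure_mode: str
    inspection_interval_days: float  # time between inspections
    pf_interval_days: float          # estimated P-F interval for the failure mode
    repair_lead_time_days: float     # time needed to plan and carry out the repair


def interval_is_adequate(task: InspectionTask) -> bool:
    """Return True if a defect can be found and repaired before functional failure.

    Assumed worst case: the defect becomes detectable just after an inspection,
    so it is only found one full inspection interval later. Whatever remains of
    the P-F interval must still cover the repair lead time.
    """
    time_left_after_detection = task.pf_interval_days - task.inspection_interval_days
    return time_left_after_detection >= task.repair_lead_time_days


# Illustrative numbers only: a 28-day inspection against a 21-day P-F interval.
too_sparse = InspectionTask("screen deck panel wear", 28, 21, 7)
weekly = InspectionTask("screen deck panel wear", 7, 21, 7)
print(interval_is_adequate(too_sparse), interval_is_adequate(weekly))
```

The same comparison applies to the shutdown example: if the time left after detection is shorter than the wait until the next scheduled shutdown, the defect cannot be nursed through.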
Inadequate Maintenance Execution
In some cases, we find that there is an adequate strategy or tactic to manage the failure mode, but it’s been executed poorly (or not at all). When we talk about maintenance execution, we mean the entire maintenance work management process, not just the time on the tools. You should look for the following factors (talk to the team members and check the CMMS):
The task was not planned or scheduled in line with the strategy. The reason could be a planning or scheduling error, or the task could have been dropped because the maintenance schedule was full.
Defects were not identified by the maintainer. We see this all the time through the so-called “tick-and-flick” culture, but it could also be that the maintainer genuinely did not recognise a defect. There are multiple possible reasons, so you always need to dig further to establish which one applies.
Defects don’t make it into the system. All too often, we find defects are picked up by maintainers, but subsequent defect notices aren’t raised, so they’re not scheduled for repairs.
Work is performed to poor quality standards. Technical craftsmanship is vital, be it welding, torquing a bolt, aligning a drive train, or many other maintenance tasks.
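When checking the CMMS for the factors above, a quick pass over the work order history can show where the process broke down. The sketch below assumes a hypothetical work order export; the `WorkOrder` fields and the gap names are our own illustration and will not match any particular CMMS.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class WorkOrder:
    """Hypothetical CMMS work order record; field names are assumptions."""
    order_id: str
    due: date
    completed: Optional[date]   # None if the task was never completed
    defect_found: bool          # a defect was recorded on the service sheet
    notification_raised: bool   # a follow-up defect notice exists in the system


def execution_gaps(orders: List[WorkOrder], failure_date: date) -> Dict[str, List[str]]:
    """Group work order IDs by the execution gap they suggest."""
    gaps: Dict[str, List[str]] = {"not_done": [], "done_late": [], "defect_not_raised": []}
    for wo in orders:
        if wo.due > failure_date:
            continue  # not yet due when the failure occurred
        if wo.completed is None or wo.completed > failure_date:
            gaps["not_done"].append(wo.order_id)
        elif wo.completed > wo.due:
            gaps["done_late"].append(wo.order_id)
        if wo.defect_found and not wo.notification_raised:
            gaps["defect_not_raised"].append(wo.order_id)
    return gaps


# Made-up records for illustration.
history = [
    WorkOrder("WO-1", date(2024, 1, 10), date(2024, 1, 9), True, False),
    WorkOrder("WO-2", date(2024, 2, 10), None, False, False),
    WorkOrder("WO-3", date(2024, 3, 10), date(2024, 3, 20), False, False),
]
print(execution_gaps(history, failure_date=date(2024, 4, 1)))
```

A check like this only tells you where to ask questions. Missed defects and poor workmanship won’t show up in the records, so you still need to talk to the team.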
Operation Outside the Expected Context
Maintenance plans are developed to maintain equipment based on an expected operating context. If the equipment operates outside this context or the design limits, the failure modes will occur at a frequency that was not anticipated. There are two main factors to look for:
The asset is exposed to an environment outside its design limits. This can include the basics like temperature, moisture and dust, but in the heavy industrial environment you should also look at things like the feed stock (e.g. ore size, pH).
The asset is operated outside its expected operating context. We frequently think of overloading and accident damage, but other examples are incorrect gear selection, over-revving the engine, working grade or slope, and tramming distance, all of which degrade the machine over time.
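If operating data is logged, comparing it against the design envelope is a straightforward way to test both factors above. The sketch below is illustrative only: the parameter names and limits in `DESIGN_LIMITS` are made up, and real limits would come from the equipment specification.

```python
from typing import Dict, List, Tuple

# Hypothetical design envelope: parameter -> (minimum, maximum).
# All names and values are assumptions for illustration.
DESIGN_LIMITS: Dict[str, Tuple[float, float]] = {
    "ambient_temp_c": (-10.0, 45.0),
    "feed_ph": (6.0, 9.0),
    "payload_t": (0.0, 220.0),
}


def fraction_outside_limits(readings: Dict[str, List[float]]) -> Dict[str, float]:
    """For each parameter, return the fraction of readings outside the design envelope."""
    result: Dict[str, float] = {}
    for name, values in readings.items():
        if name not in DESIGN_LIMITS or not values:
            continue  # no limit defined or no data, so nothing to conclude
        low, high = DESIGN_LIMITS[name]
        outside = sum(1 for v in values if v < low or v > high)
        result[name] = outside / len(values)
    return result


# Made-up readings: half of the payload readings exceed the assumed 220 t limit.
logged = {"payload_t": [200.0, 230.0, 240.0, 210.0], "feed_ph": [7.0, 7.5]}
print(fraction_outside_limits(logged))
```

A high fraction for a parameter is evidence that the maintenance plan was built for a different operating context than the one the asset actually sees.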
Technical Defects
In general, we only look for technical defects after we’ve exhausted the first few categories, because it’s easy to blame something we see as outside our control. However, you should look for the following:
A spare or rotable component has been supplied with a defect (or it has degraded due to inappropriate storage).
A genuine design error has been made – either the failure mode wasn’t considered, or the design was inadequate to meet its functional specification.
Using the Model for Failure Investigations
In our experience, the best way to use this model is as a checklist to ensure you’re thorough in considering all possible causes, but without getting bogged down in too much detail. Two principles are important:
Base your conclusions on evidence. In addition to examining the failed part, you should be talking to the maintainers and reviewing the work orders in the CMMS, condition monitoring reports and completed service sheets. Don’t allow people to make claims or push theories that they can’t prove. At the same time, don’t waste time going down a rabbit hole. If the service sheet or work order is missing from the system, it’s almost certain the maintenance process has broken down at that point.
Don’t get caught up in trying to build a nice-looking RCA tree. In most cases, the specific sequence of cause-and-effect matters less than whether you’ve found a problem that’s worth solving.
We’ve written another case study to show some examples of applying these principles. In our final article, we’ll talk about how to move from indirect causes to the systemic (root) causes, which are the true problems you need to solve to improve asset reliability.