No matter how hard we work, and how effective our asset management practices are, we’ll occasionally have an asset fail on us. Sometimes, we’ll have massive failures. Serious asset failures can highlight not only opportunities to improve our strategies and maintenance execution, but also the way we prepare for and respond to these events.
Many of us at Bluefield have gone through these experiences, so we asked some of our team to share their learnings by answering the following question:
From a downtime or cost perspective, what's the worst asset failure you've ever been a part of? What happened, and how could (should) your team have been better-prepared to respond to the failure?
From a downtime perspective, the worst failure we had to deal with was when I was at a one-dragline coal mine. The site relied on the dragline production to fit into the schedule of coal extraction, and timing was very important. During a routine X-ray of the boom suspension ropes at around half the expected rope life (5 years), we found that a significant portion of the wires inside the socket were broken. The count was past the acceptable limits, and the number broken had increased significantly since the previous X-ray, completed 12 months prior.
I had to make the call to ask the site manager to shut down the dragline. Of course, I first asked the condition monitoring company to double and triple check the results, which they confirmed were accurate. This was a big deal for the site because of how important the dragline was to production. The manager shut down the machine and, as always, it was a Friday when we got the final results back.
This meant we started planning the response over the weekend. The problem was that there were no spare ropes and there was only one facility in Australia that could manufacture them. We started the manufacture process as well as the project to lower the boom and prepare the machine for them to be replaced. We tried to get the ropes manufactured while the boom was being lowered but of course the manufacture took longer. Luckily though we were able to get the machine back to work in two weeks.
The big learning was the fact that we did not have any spares, nor did we have a backup plan identified in advance of the failure. We could have foreseen something like this if we had prepared ourselves and identified these critical spares.
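One way to make that kind of screening concrete is to rank components by how exposed the site is if they fail with no spare on the shelf. The sketch below is illustrative only: the component names, lead times and costs are made-up placeholders rather than figures from the site, and the scoring rule (replacement lead time multiplied by the cost of a day of lost production) is an assumption, not an established method.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    lead_time_days: float         # time to source a replacement with no spare held
    downtime_cost_per_day: float  # production loss while the asset is stopped
    spares_on_hand: int


def exposure(component):
    """Cost of waiting for a replacement if the part fails with no spare held."""
    if component.spares_on_hand > 0:
        return 0.0
    return component.lead_time_days * component.downtime_cost_per_day


def rank_critical_spares(components):
    """Return the components with no spare held, highest exposure first."""
    exposed = [c for c in components if exposure(c) > 0]
    return sorted(exposed, key=exposure, reverse=True)


# Placeholder data for illustration only.
components = [
    Component("boom suspension ropes", 14, 500_000, 0),
    Component("hoist rope", 3, 500_000, 1),
    Component("swing motor", 10, 100_000, 0),
]

ranked = rank_critical_spares(components)
for c in ranked:
    print(f"{c.name}: {exposure(c):,.0f}")
```

A list like this is only a starting point for the conversation about which spares to hold and which backup plans to write down in advance.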
Gerard previously facilitated a round table series on Becoming a Maintenance Manager – read it here.
Not specific to one piece of plant, but two occasions when working for an OEM dealer.
The first scenario was that we had a truck engine failure at 525 hrs. After this failure we were advised by the manufacturer that all engines of that model on site had the shot peen process missed from factory. This included all engines on the three dig units.
We had to put together a plan for how we were going to remove every sump on every engine and, using a needle gun, shot peen all main bearing caps. No planning, just crisis management. This involved sourcing labour and a tight schedule to complete the work.
Second scenario: we were advised by the OEM that all our engines had to be removed to replace the head gaskets. The gaskets had been updated pre-assembly and the updated version had seen failures elsewhere in the network, so the plan was to replace them with the original, pre-update gasket.
As the bloke in charge of the site budget, I was negotiating an outcome with the OEM to ensure that all costs were covered, including stripping the engines, machining the blocks, updating aftercoolers, and spare engine cores so we could keep the fleet going during the period of the engine changes.
I was electrical maintenance planner at a coal mine when a fire started on the tail end of the conveyor taking coal from our Coal Preparation Plant Load Out bin to the Power Station. In those days we only ran Monday to Friday and the fire started after our operators had shut down the plant for the weekend on Friday night.
The conveyor to the power station was shut down shortly after, but a hot bearing on the tail pulley, which was covered in grease and coal dust, started the fire. Because there were no cameras or people in the Load Out area, the fire was only noticed by our security gate person when a glow and smoke were seen later in the night.
By the time the fire was extinguished it had burnt a portion of the conveyor belt, melted the stringers (C channel framework) supporting the conveyor idlers and belt, and damaged the Syntron feeder below the bin and a lot of electrical cables and controls in the area. The amount of structural damage due to the heat was a real eye opener for me. Luckily, the coal in the bin had not ignited or it would have been a major catastrophe.
The conveyor was maintained by the Power Station but we had the Feeder and Bin above the Conveyor, so it was a combined effort to replace structure, the Feeder, cabling and controls and splice a new section into the belt and then recommission everything. I can't remember exactly how long the plant was down, but it was a week or so working around the clock, and contingency plans were drawn up if it became a longer event.
Main learnings were:
- Monitoring of pulley bearing temperature and vibration (operator and maintainer checks)
- Keep tail end areas of conveyors clean
- Carry out walk around (or drive around) inspection after plant shut down
- We should have had a contingency plan in place prior to this event just in case
- Fire is a real hazard when coal dust and combustible hydrocarbons are present with a heat source
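The first learning, monitoring pulley bearing temperature, can be sketched as a simple check over readings taken on operator or maintainer rounds. This is a minimal illustration, not the site's actual system: the function name, the bearing labels and both thresholds (rise above ambient and jump since the previous reading) are assumptions chosen for the example and would need to come from the bearing supplier or site standards in practice.

```python
def bearing_alerts(readings, ambient_c, max_rise_c=40.0, max_step_c=10.0):
    """Flag bearings that are running hot or heating up quickly.

    readings: dict mapping bearing name -> list of temperatures in deg C,
              oldest first. Thresholds are illustrative assumptions.
    """
    alerts = []
    for name, temps in readings.items():
        if not temps:
            continue  # no readings recorded for this bearing
        latest = temps[-1]
        if latest - ambient_c > max_rise_c:
            alerts.append((name, "high rise above ambient"))
        elif len(temps) >= 2 and latest - temps[-2] > max_step_c:
            alerts.append((name, "rapid increase since last check"))
    return alerts


# Placeholder readings for illustration only.
readings = {
    "tail pulley DE": [45, 48, 82],
    "tail pulley NDE": [44, 45, 46],
    "head pulley DE": [50, 52, 64],
}

alerts = bearing_alerts(readings, ambient_c=30)
for name, reason in alerts:
    print(f"{name}: {reason}")
```

Checking the trend as well as the absolute value matters here, because a bearing that jumps between rounds may still be under the absolute limit at the last reading before a weekend shutdown.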
Watch Tom talk about his career here.
Stefan Van Der Linde
A few failures spring to mind for me: the first is around poor handling of the breakdown itself, the second was a large failure but a good response, and the third is around how previous breakdowns can precede future breakdowns (the importance of good asset management!).
One - A failed shoe knuckle pin on a Marion dragline resulted in almost six days of downtime for a job that should have been accomplished in one. While walking, the machine was emitting a loud banging noise from the Propel on one side and the shoe was deflecting outwards, indicative of a broken shaft.
One of the largest contributors to the extended downtime was a lack of a plan of attack (crew had little experience) as well as having no segregated breakdown crew, but only four crews that were all capacity-loaded with planned work, so they had no spare time to deal with larger breakdowns effectively.
A suggestion I made was that any large breakdown should be managed by an external crew (with the incumbent crew helping/watching/learning) as the incumbent crew does not have the capacity to manage these larger breakdowns nor the experience.
Other delays included: spare parts weren't available immediately; the shaft wasn't installed in the correct orientation (possibly a contributor to, or root cause of, the failure) and was rotated several times during the repair (it was matched to the other side, which was also wrong); a knuckle with no defects was changed out; blasting and weather delays; low stock of acetylene; other priorities for the crew (higher-priority breakdowns, planned work); low availability of liquid nitrogen; and other critical spares listed as stock couldn't be found.
The fact that the crews did not stop to think about the job in detail before starting to understand the scope of work required (i.e. only change shaft not knuckle) led to an extended repair timeline. This highlighted to me the importance of stopping to think about the job for a period of time to understand exactly what is required and how to do so safely.
Two - During bucket work on a dragline a fitter saw a broken lacing on the boom. The machine was immediately stood down, walked to a shutdown pad, boom lowered, five lacings replaced and walked back to the pit all within three weeks (five days of walking).
There were two major root causes of the failure. First, the boom drawings were incorrect and hence the lacing that failed was not 125% RSL-compliant. Second, the lacing was constantly covered in grease from the deflection tower above and was never cleaned during the NDT inspections.
Key learnings are the importance of keeping accurate drawings, having good document/change management and ensuring that any inspections that can’t be completed (due to access/cleanliness) are raised to supervisors on the day for remediation (often these issues are buried in the report when it’s too late).
Three - A dual failure happened on a large hydraulic excavator with a fire in one of the engine bays and a hydraulic system failure leading to 11 days of downtime and a large repair bill. One of the contributing factors for the hydraulic system failure was aeration in the system which was attributed to previous hydraulic hose failures.
A learning for me was that replacing a lot of these hydraulic hoses in breakdowns introduced aeration, which contributed to future breakdowns. Breakdowns often lead to future breakdowns, so it is best to cut the cycle early with preventative maintenance.
Stefan has previously written about the importance of project readiness – read it here.
A site I worked on suffered a catastrophic failure in the main substation, leading to a complete power outage of both the site and the camp (including the mobile phone tower after the battery backup ran out). There was some redundancy, but it took several days to get everything restored so we could restart the plant. To make things worse, it was in the middle of January and the hottest day of the year: 48 degrees on the first day of the outage! (Imagine trying to manage a 700-room camp under those conditions.)
What worked well in the response was the way the site’s crisis team was stood up and operating effectively in short order. The electrical team recognised that the priority was restoring essential services, and they were prepared with the necessary plans and drawings to identify where we could connect generators to power up the camp, offices, workshops and supporting infrastructure. They were well-supported by the supply team, who managed to get something like 18 generators of all sizes onto site and installed on the same day.
Throughout the process, the crisis team was methodically working through the issues based on priority, with clearly defined roles, short interval control (updates every 1-2 hours), and regular communication. Although the response wasn’t perfect – we identified lots of improvements in the debrief – it was effective enough that it prevented a much worse outcome.
For me, the main lesson was the importance of defining and practising your crisis management (a different concept to emergency management). It’s not possible to forecast and prepare for every scenario, but having a clear framework with defined roles and processes that you practise regularly means you avoid much of the “fog of war” that can occur at the start of a crisis.