The idea of finding the root cause of a problem before going into “action” mode to fix its symptoms is a tried and true methodology. In the calm, when there’s no particular problem to address and no sense of impending doom caused by the pressure to actually get something done, the rationality of the approach makes perfect sense. The logic of being clear on what problem is being solved, identifying the cause, then (and only then) moving to action is compelling.
This helps avoid the self-inflicted wounds that inevitably come with “ready, fire, aim!” events.
I was reminded of this recently when suffering an intermittent Wi-Fi outage. Some machines seemed to be impacted while others were not. A phased rebooting of equipment and a lengthy experience with the Telco eventually restored service … which then immediately failed. The real root cause? A Wi-Fi on / off switch that was hung on the tipping point of turning itself “off”. Once discovered, the problem was quickly and easily solved, although I will never get the lost time back.
A far more serious event occurred in one of my data centers a few years ago when a SAN controller suddenly went into a freeze state, bringing all core processing to an immediate and agonizing halt. An immediate thought: let’s reboot / IPL everything and see what happens! Cooler heads prevailed and we did the hardest thing possible … nothing! … until we really understood the issues.
Once the issue was understood, we learned that the controller had done exactly what it was supposed to do. A fault in the original configuration (made years earlier) had finally hit just the right set of conditions, and the controller stopped processing. Armed with that insight, we engineered a proper recovery which lost no data along the way. If we had followed a more action-oriented approach, perhaps rebooting the device, we would have lost all the data that was in flight at that moment. The only word to describe that would have been “ugly”. By taking the more planful and thoughtful path, we ended up in a better business state than would have been possible otherwise.
Taking the time to do this was the smart thing to do, but it certainly ran counter to a human emotion that said “just do something!”.
Of course, we see that “do something” mantra at many insurance carriers on a range of issues. Possibly because they are so close (perhaps too close?) to their own issues, IT teams frequently misunderstand the real underlying causes of systems and operational issues. As a result, they may go after treating symptoms, which can mask the underlying problems and create a bigger problem with more significant, and unfortunate, long-term consequences. Taking the time to do the root cause analysis, which may include bringing fresh eyes in to look at old issues, can ultimately be a faster and cheaper approach to real resolution. As counter to an action orientation as this may be, getting a plan framed and communicated can avoid all the collateral damage that can go with poorly conceived plans or a lack of proper context.
Setting context and framing issues can also be a powerful way to get resources and organizational commitment to go after big issues. It allows for proper education on issues and options, and ultimately for going after the right issues in the right way. This isn’t, however, something that is universally appreciated. In recent times, I’ve even seen some carriers shop for technology solutions to problems they don’t have … clearly not an optimal investment of assets. This can happen to organizations that are too quick to jump to conclusions or that somehow fail to frame issues in the correct, holistic context for their specific organization.
Taking the time to plan things through early can avoid turning a modest problem into a nightmare scenario.