Too many IT people step over the line from cutting edge to bleeding edge, or wind up plummeting off the edge entirely into an endless free-fall of bad consequences for all end users, all for the lack of a little forethought and well-considered conservatism. I’ve seen it one too many times.
Whether the IT person is in-house talent or outside service personnel, the result is too often the same. An hour after they leave, or sometimes moments after they arrive, the system comes screeching or grinding to a halt over a careless action. And they are either unavailable or powerless to fix it, leaving the end users holding the bag.
Example? How about the IT person who decided to just shut down a network, without prior notice or query, to implement a “3-minute” power supply replacement, only to find the system would not reboot with the new supply, and the old one officially died upon removal. Oh, did I mention that a lawyer was in the conference room in the middle of a 4+ hour complicated closing at the time? You would have thought it was the Fourth of July based on the fireworks.
Example? How about the IT person who heard a funny noise coming from the PBX phone switch and decided to reboot it on the chance it would take care of the problem. Want to take a guess? Yep, you got it. It was an impending hard drive controller failure, and the hard drive failed when it stopped spinning. The firm was without phone service for over 24 hours, and lost all its voicemail messages, current and saved.
Example? How about the IT department at Research in Motion making a minor software update that was supposed to optimize system cache memory, without adequately testing it first? Want to take a guess? Well, if you know any Blackberry users, you probably already know most needed a padded cell and/or methadone to deal with the withdrawal symptoms when the system failed and left literally millions of users jonesing to use their thumbs from late Tuesday, April 17th, into Wednesday the 18th. Complicating matters was the fact that the company’s backup system also “performed poorly.”
“The system routine was expected to be non-impacting with respect to the real-time operation of the BlackBerry infrastructure, but the pre-testing of the system routine proved to be insufficient,” Research in Motion said in a statement.
I don’t care whether you outsource your IT, have your own in-house talent, or literally do it yourself. Just always keep in mind that you should never mess around with anything, no matter how “simple” or “straightforward” or “non-impacting” you think it will be, without first asking yourself what you can do to undo it if the worst happens. Do you have the ability to “go back” software-wise short of a full restore? If you have to do a restore, do you have a current backup? Have you done a restore recently, or even tested your backup to make sure it will work as desired when needed?
If the system is down, even for a few minutes, who will be affected? Have you checked with them to see if they can deal with it without adversely impacting client deadlines?
How will you fix it if it turns into a boo-boo? How long will it take?
Always test everything you can in advance. Always make sure you have a full software backup before installing any software, patch, or update. Never, ever power down unless you are pretty darn sure you can power back up, or have backup hardware available. One thing I’ve learned time and time again — if you suspect a hardware problem, the worst thing you can do is power down before a technician with replacement parts arrives or is en route. Ok, we’ll make an exception when there’s smoke coming out of the exhaust vent. 🙂
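For the technically inclined, the “backup first, then verify the backup” habit above can be reduced to a few commands run before any change. This is only a minimal sketch using standard Unix tools; the directory names are hypothetical placeholders, not any particular system’s layout.

```shell
#!/bin/sh
# Pre-change safety sketch: back up, verify the archive reads back,
# and do a trial restore BEFORE touching anything. Paths are examples.
set -e  # stop immediately if any step fails

APP_DIR=$(mktemp -d)                 # stand-in for the real application directory
echo "config v1" > "$APP_DIR/settings.conf"

BACKUP=/tmp/pre-change-backup.tar.gz

# 1. Take a fresh backup of the current, known-good state.
tar -czf "$BACKUP" -C "$APP_DIR" .

# 2. Verify the archive is actually readable (a backup you can't open
#    is no backup at all).
tar -tzf "$BACKUP" > /dev/null

# 3. Do a trial restore into a scratch directory and compare a file,
#    so you know the restore path works before you need it in anger.
RESTORE_DIR=$(mktemp -d)
tar -xzf "$BACKUP" -C "$RESTORE_DIR"
cmp "$APP_DIR/settings.conf" "$RESTORE_DIR/settings.conf"

echo "backup verified; safe to proceed with the change"
```

Only after every one of those steps succeeds would you go ahead and install the patch — and if it goes wrong, the tested archive gives you a known way back.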
If we learn nothing else from RIM’s recent outage, it should be to remind us that there is virtually nothing we can do to our computers which is truly “non impacting,” so always take a conservative approach.