ITSM Problem Management

The Dog That Didn’t Bark

The Millennium Dog

Some time ago (well, about 20 years ago, if you’re counting), before I joined Fox IT®, I was managing an Application Management team at Cap Gemini, and started doing some Year 2000 work on a system that had been written in the mid-1980s on a DEC (Digital Equipment Corporation) VAX (Virtual Address eXtension) computer. I can feel other geeks cheering me on for being fortunate enough to work on that system. By the way, I remember we once had a salesman from that company come to give us a presentation, and he opened with “Please stop me if I use too many TLAs”. We all smiled and said nothing as, ironically, none of us actually knew he meant Three Letter Acronyms; so he carried on regardless.

The example below is of purely pro-active problem management: a problem is identified without being triggered by an incident, and action is taken to avoid the incident ever happening.

The dates in this system were encoded as ASCII text characters YYMMDD, so 31st December 1994 would be stored as 941231, and the system often had to calculate expiry dates up to five years ahead. No one working on it in the 1980s had foreseen its longevity, or considered this particular date-related constraint. Of course, there’s also a possibility they had considered it, but didn’t think they would still be involved with it, hence it would be “someone else’s problem”. It was indeed now someone’s problem, namely mine, and I applied various problem management techniques to discover the impact and some potential solutions, all of which would require a serious change.
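For readers who like to see the failure rather than take my word for it, here is a minimal sketch in Python. It is not the original VAX code, which isn't reproduced here, and the function name add_years_yymmdd is purely hypothetical. It assumes a routine that adds years while keeping only two year digits, which is the constraint described above.

```python
def add_years_yymmdd(date: str, years: int) -> str:
    """Add whole years to a YYMMDD string, keeping only two year digits.

    Hypothetical illustration of the constraint, not the original routine.
    """
    yy = (int(date[:2]) + years) % 100  # the century is silently lost here
    return f"{yy:02d}{date[2:]}"


issued = "950103"                     # 3rd January 1995
expiry = add_years_yymmdd(issued, 5)  # intended: 3rd January 2000

print(expiry)           # 000103
print(expiry > issued)  # False: the expiry appears to precede the issue date
```

Any comparison or sort on the stored text then treats a five-year expiry as already long past.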

Thus, in late 1993 I told my bosses “You do realise this system is going to fall over shortly after 1st January 1995, don’t you?” So we did some conversion work and patched it up, fully expecting everyone to be off that system before the turn of the century, or for it to be “someone else’s problem”.

Roll on four years and we started hearing about the dreaded Millennium Bug, and we once again considered our system’s limitations. I ‘came up with’ (but probably wasn’t the first to consider) the idea of using different characters for the year, so we would use XXMMDD, where XX represented the years 1998 as 98, 1999 as 99, 2000 as A0, 2001 as A1, 2010 as B0, 2099 as J9, and eventually 2259 as Z9. Whatever else happened, we were pretty sure we wouldn’t be supporting that system through 31st December 2259!

This preserved the sorting sequence (A0 was ‘later’ than 99) and meant we didn’t have to convert all the date information currently held, just extend it. We altered the date-calculation routines to deal with this, and all went swimmingly.
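The scheme is easier to follow as code, so here is a small sketch in Python. Again, this is an illustration and not the routines we actually altered: the names DECADES, encode_year, decode_year and encode_date are all hypothetical, and the base year of 1900 is an assumption inferred from 98 meaning 1998.

```python
import string

# Decade symbols in ASCII order: '0'..'9' cover 1900-1999, 'A'..'Z' cover 2000-2259.
DECADES = string.digits + string.ascii_uppercase


def encode_year(year: int) -> str:
    """Encode a year as two characters: decade symbol plus final digit."""
    if not 1900 <= year <= 2259:
        raise ValueError(f"year {year} is outside the range 1900-2259")
    decade, unit = divmod(year - 1900, 10)
    return DECADES[decade] + str(unit)


def decode_year(code: str) -> int:
    """Inverse of encode_year, e.g. 'A0' gives 2000."""
    return 1900 + DECADES.index(code[0]) * 10 + int(code[1])


def encode_date(year: int, month: int, day: int) -> str:
    """Build the six-character XXMMDD form."""
    return f"{encode_year(year)}{month:02d}{day:02d}"


print(encode_date(1999, 12, 31))  # 991231
print(encode_date(2000, 1, 1))    # A00101
# Plain text comparison still gives chronological order in ASCII.
print(encode_date(1999, 12, 31) < encode_date(2000, 1, 1))  # True
```

Because every year before 2000 encodes to exactly the two digits already stored, existing records need no conversion; only the routines that read and calculate dates have to understand the letters.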

Come 1st January 2000, my colleague and I were there on standby (and on overtime) just in case anything was needed in a hurry, but it all worked rather well.

A month or two later we started to hear about the “Millennium Fraud” and scare-mongering, to the effect that the entire thing must have been made up by the IT industry. For our part, we knew our bit did need updating (plus endless regression testing), and if anyone had wanted to check the old system, we could have proved it was faulty. Imagine the calamity and news storm if various critical systems had malfunctioned!

The Incident Dog

That same year (2000) we started a new service for another customer, and much effort was put into dealing with many incidents. At the end of the year the customer thanked us for our efforts because the incidents were visible, and so were our actions.

Following ITIL guidance, we started to investigate and later fix many of the underlying problems causing all these recurring incidents. We gradually reduced them, and duly reported all the incident trends to the customer throughout the second year of service.

This second example is a combination of reactive problem management (triggered into action because of the operational incidents) and pro-active problem management (looking into trends to identify and reduce the underlying causes).

Imagine our surprise (not to say consternation) when the customer said “The number of incidents has been coming down and down, so we think we should be paying you less”. Fortunately, we had the information which demonstrated this reduction was largely due to our efforts, not some fluke of nature.

I always remember that situation when teaching people about managing incidents and problems, and particularly the critical success factors for problem management, one of which can be measured by the absence of certain incidents. Indeed, when doing root cause analysis, sometimes the factors which don’t vary are as important as the ones that do.

 

Alan Nixon
Director of Training
Fox IT
