What do you do when you inherit a mess at work?

  • Ed Wagner (12/11/2016)


    ...nothing was done because it was deemed to be too much work...

    Of course, that's always the reason given. The people who give such a reason are usually managers who are oblivious to the work being done to maintain existing systems. They refuse to consider the additional infrastructure and support staff necessary to keep inefficient systems running, although they are very quick to deny funding for such things. They refuse to recognize the tentative, dangerous, and time-consuming efforts necessary to modify existing systems, although they are very quick to demand changes. They refuse to acknowledge the costly pain and expensive suffering for the organization when existing systems fail to deliver as anticipated, although they are very quick to point out how "cost prohibitive" and "hard" it would be to build something effective (as if the building of existing systems had been done by some kind of magic).

    Some existing systems are effective. Some existing systems are crappy. The status quo is never desirable for a crappy system. I will put more faith into the incremental improvement approach when I see any organization actually make it happen. They may claim that's their approach, but when they spend 99% of their time just trying to keep the lights on they are simply accepting the status quo. They may claim it's practical to replace the foundation of a crappy system while it's live in production, but I think that's harder than rebuilding from the ground up. Further, trying to replace the foundation of a crappy system while it's live in production inevitably leads to self-defeating technical compromises. So, yes, when I inherit a crappy system I lean toward owning up to the technical debt, learning from past mistakes, and getting off to a fresh start.

    Creator of SQLFacts, a free suite of tools for SQL Server database professionals.

  • Wingenious (12/11/2016)


    Ed Wagner (12/11/2016)


    ...nothing was done because it was deemed to be too much work...

    Of course, that's always the reason given. The people who give such a reason are usually managers who are oblivious to the work being done to maintain existing systems. They refuse to consider the additional infrastructure and support staff necessary to keep inefficient systems running, although they are very quick to deny funding for such things.

    Well, you hit the nail on the head there. 😉

  • Ed Wagner (12/11/2016)


    Wingenious (12/11/2016)


    Ed Wagner (12/11/2016)


    ...nothing was done because it was deemed to be too much work...

    Of course, that's always the reason given. The people who give such a reason are usually managers who are oblivious to the work being done to maintain existing systems. They refuse to consider the additional infrastructure and support staff necessary to keep inefficient systems running, although they are very quick to deny funding for such things.

    Well, you hit the nail on the head there. 😉

    Yep.

    I inherited a system hosted on, not kidding, SQL 7! I didn't even know about the server until it showed up on a security vulnerability scan. The app had to be rewritten and no one wanted to touch it. My argument of 'it's our code, we can fix it if it breaks' finally won out and it got ported to something modern.

    It took us about 2 weeks to get all the functionality back, but in the end we had a faster, more reliable, and consistently backed-up product.

  • I think all three approaches are needed; more importantly, for some messes it is essential to use two of the three, and for some extreme cases all three. But the nuke option is a disaster when not preceded by careful analysis.

    A little over 41 years ago I was doing time in an ivory tower and wanted to get back to the real world. My boss's boss (who had been my direct boss until he was promoted to run all the company's software development in Scotland, and who had taught me everything I knew about management) suggested I look at taking over management of mainframe data communications development for the company's latest server range. The senior manager responsible for a large chunk of server software wanted someone who could come in and mend a disaster - the development had been through too many managers who had failed too quickly, and the current one was doing no better. So the next time I was in England I talked to that senior manager about the position, we agreed it looked like a good bet, and I took it on.

    It involved me moving from Scotland to NW England, as that was where the development was going to be done. But the development was currently based in the Thames valley, and it seemed unlikely that many of the people there would be prepared to move (everyone had said they wouldn't, but I hoped to change some minds), so I was going to have to recruit people, get them to spend time in the south learning the stuff, and plan on moving them to the NW after a few months. So there were complications other than technical ones.

    The software had been released to customers, and it was terrible - full of bugs (there was an enormous backlog of customer-generated bug reports that no-one had even looked at) and it tended to crash in a way that forced the server to reboot about once a day. If comms equipment reported a problem and the software managed not to crash the server as a result, mending that equipment still didn't let you use it until you rebooted the server (so maybe it would have been better if it had crashed the server, as most servers were only accessible - other than from the control console in the server room - through comms equipment).

    The existing development people had enormous morale problems caused by the way they had been managed - managers had paid no attention to what their people told them, promised the moon to higher management despite being told it was unachievable, demanded people work shifts without shift allowance, demanded overtime without pay, and hid away in an office and made themselves inaccessible to their people.

    The software documentation was not at all helpful - it described an architecture which mostly had not been used when designing the software; it described design for about 25% of the software (the other 75% had no design documentation) and of that 25%, about half had actually been implemented using that design (which was perhaps unfortunate, as it was not really compatible with the other half).

    I attacked the mess in three stages.

    The first stage was essentially passive. I hunted for short-term commitments that could be thrown away (a new release with lots of new features was due shortly - most of the new features hadn't been designed, let alone implemented) and managed to eliminate a lot of pointless bloat. Having gone through all the documentation and looked at the code I was pretty horrified, so I spent some time going through the bug-report backlog, examining server core dumps and working out what had caused crashes. That let me learn quite a lot about the software and why it was so bug-ridden, and gave me some ideas about how to bring in some error management. Someone from a customer support role wanted to come into the project and develop some decent diagnostics (traces, state maps, and so on), and I welcomed him with open arms.

    Fixing morale seemed crucial, so I told people that working shifts and failing to claim the allowance was a sackable offence, and so was working overtime and not claiming either pay for it or time off in lieu. I also talked to individual developers and sat in the open plan with no office walls or dragon-like secretary between me and the people I was managing. I told divisional management, every week at the review meeting for progress on the server OS release, that they would not get all they had been promised - in fact most of it wouldn't happen - and I made sure my people knew I was saying that.

    It did improve morale, and that resulted in a big improvement in the rate of getting things to work. It also meant that people were happy to talk to me about technical problems, so we could work out what to do about some of them, fix a lot of bugs, and work out which things were simply not addressable on the timescale of the imminent operating system release (which meant I could give senior management fairly reliable predictions instead of unfounded guesses).

    That worked - we made the release (having eliminated unnecessary features and impossible features from its definition) and it wasn't anywhere near as bad as its predecessor. So passive worked in the short term.

    The second stage was essentially analytic. We (the new team, which included one section-leader level designer and one senior programmer who came from the southern group - the rest were new recruits, some brought in from other areas of the company but most recruited from outside) were scheduled to push out another release in a few months. It was supposed to have rather a lot of new features in our area, but it was specified to be available only to certain customers - a very small number, but very influential. It was also using a new development package (improved compiler, improved source control, improved build control, improved linking, improved GUI, and an appallingly slow system for taking on build requests for components and executing them).

    Once again I went round the loop with product managers and customer support people (much easier this time, given the restriction to specific customers) and managed to scrap a lot of the new features. One area I added to instead of cutting back was error management (someone had actually suggested doing some sensible stuff in that area and had managed to get some of it through; I added bits to it). Once we had a fixed release content, we analysed every component of the software, whether or not it might be impacted by the new features, and worked out what we thought should be done to it to make it sufficiently reliable that it wouldn't create a support problem. Then we did those things.

    That release was extremely successful. Quite soon after it went out we stopped getting bug reports on the previous release even from customers who were not supposed to get the new one; someone in customer support had used some common sense and, knowing that this release was much more reliable than the previous one, quietly made it available to anyone who had problems, whether they were on the list of who could have it or not.

    So analytic worked too - but from having done the analysis it was absolutely clear that the current structure had to be thrown away.

    The third stage was the nuke option. We would never get decent performance without throwing away the crazy parts of the current architecture, so over the next year and a half we were doing two things: we supported what we had released, and we spent a lot of time on architecture and design for a complete replacement of the current code. So we did the design. Then we implemented it, and tested it. Then we released it. The customer reaction was terrific - an "<xyz company> achieves communications excellence" headline from an outfit which had been our worst critic.

    So the nuke option also worked.

    And we used all three.

    Many years later (end of 1999), I joined an outfit that was intending to map the web in such a way that it could track when a page moved. This was nearly 17 years ago, so it was not then as obviously insane as it maybe is now. Anyway, I went in as director of research, which meant I had to check underlying assumptions. In my first week or two, I found a mess and fixed it - but I'm not sure whether this counts as (i) passive or (iii) nuclear or even, perhaps, (ii) analytic.

    There was this proposal, and it claimed that the database needed to do that would fit on a PC. So I did some calculations, and looked up some numbers. Then I had to worry about numbers that didn't seem to be written down anywhere, like how many links there are on an average page, or how quickly things change. It was easy enough to trawl through some (small) subset of the web and see what a sensible amount of storage would be. Assuming I had to cover a couple of years into the future, the answer I came up with (based on using SQL Server 2000) was 40 Tbytes; and to keep it up to date within about 3 weeks would require 42 top-of-the-range servers.
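    The shape of that back-of-envelope arithmetic looks something like the sketch below. The figures used here (page count, links per page, bytes per link, growth factor) are hypothetical placeholders rather than the actual inputs, which the post doesn't give; only the pattern of the calculation is the point.

    ```sql
    -- Back-of-envelope storage estimate for a web link map.
    -- All figures below are hypothetical placeholders, not the original inputs.
    DECLARE @pages        bigint = 2000000000;  -- assumed number of pages to cover
    DECLARE @linksPerPage int    = 50;          -- assumed average outbound links per page
    DECLARE @bytesPerLink int    = 200;         -- assumed storage per link row (URLs, keys, indexes)
    DECLARE @growthFactor float  = 2.0;         -- headroom for a couple of years of growth

    SELECT CAST(@pages AS float) * @linksPerPage * @bytesPerLink * @growthFactor
           / POWER(CAST(1024 AS float), 4) AS EstimatedTerabytes;
    ```

    With these placeholder numbers the estimate lands in the tens of terabytes, which is the sort of result that rules out "it will fit on a PC" straight away.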

    We didn't have to scrap anything, because there wasn't anything to scrap. So not a nuke. We didn't say "right, we can start with that PC and then grow as required" so probably not passive. So maybe this is just analytic. Except that there was no software or database implementation to analyse?

    So I believe that in the early days, when the mess is just a very small mess that hasn't yet grown into a disaster, the analytic approach is best.

    In 2002 I joined a company where a lot of stuff needed nuking. One major component was already in the process of being nuked, but the problem was not that component, but the architecture: as long as the system architecture was unchanged, that component would be a performance-destroying disaster. This is a beautiful example of the nuke option not working unless you do enough analysis to discover which system components have to be nuked before you start rewriting. And also a beautiful example of what happens when you build a pointless bottleneck into your architecture (the architecture specified that all communication between system components would be routed through that particular app, although it contributed exactly nothing to 99% of such communication).

    Tom

  • The final response I have seen is nuclear...

    When I read that, I initially thought it meant pulling your hair out, ranting, and threatening to leave for a proper company, etc., etc. :w00t:

    And yes, I have seen that happen.
