A Christmas Bug

  • Comments posted to this topic are about the item A Christmas Bug

  • Ryanair had a similar issue earlier in the year (pilots and crew taking outstanding leave at the same time), with, I think, 18,000 cancellations. Perhaps someone should look at the applications, software providers, and their respective databases to see if the two have anything in common. And yes, whether they do or not, there should have been more testing.

    Alternatively perhaps they should look at the way staff arrange their holidays and whether the glut of holiday requests was actually created by an initial shortage of pilots which prevented staff taking leave in due time, just a thought.......

    Hopefully, Steve, you will not be affected by this latest exposé of system failure, whether it be software/database or managerial...

    ...

  • I wouldn't call this a development bug, more like a requirements oversight. I see this type of thing a good bit, where a business user complains about some aspect of the software not behaving the way they wanted, but they never actually said they wanted that, lol. I'm sure "check to see vacation threshold isn't exceeded" will be a line item for a future software release. 😉

  • So you resisted the temptation to call this a humbug.

    One approach to this is to have a completely separate 'sanity check' program which includes definitions of behavior that don't make sense and can override the primary application. It should be agnostic to the operation of the primary application and not share any code with it. For example, traffic lights have an override program that will throw the system into emergency mode if, for whatever reason, the main switching system tries to light greens in both directions.
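    A minimal sketch of that kind of independent override, in Python. The names (`IntersectionState`, `is_sane`, `supervise`) and the choice of "all red" as the emergency mode are assumptions made for illustration, not how any real traffic controller is built. The point is only that the check knows nothing about how the main program arrived at its proposal; it just refuses states that make no sense.

```python
from dataclasses import dataclass

# Hypothetical signal states; illustrative names only.
GREEN, AMBER, RED = "green", "amber", "red"


@dataclass(frozen=True)
class IntersectionState:
    north_south: str
    east_west: str


def is_sane(state: IntersectionState) -> bool:
    """Reject the one state that must never occur: green in both directions."""
    return not (state.north_south == GREEN and state.east_west == GREEN)


def supervise(proposed: IntersectionState) -> IntersectionState:
    """Pass a sane proposal through unchanged; otherwise force a fail-safe.

    Assumption: the fail-safe here is all red. A real system may use a
    different emergency mode.
    """
    if is_sane(proposed):
        return proposed
    return IntersectionState(north_south=RED, east_west=RED)
```

    For the scheduling case, the equivalent rule might be "never publish a roster that leaves a flight without a pilot", checked by something that shares no code with the scheduler itself.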

    An approach like that would have prevented this kind of meltdown:

    https://www.bloomberg.com/news/articles/2012-08-02/knight-shows-how-to-lose-440-million-in-30-minutes 

    Oops I see that was behind a paywall... here's a free one

    https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/

    ...

    -- FORTRAN manual for Xerox Computers --

  • Good article, Steve. I've been thinking about testing software, unit testing and so forth. I've mentioned before how I believe learning unit testing got me my current job, so I don't take this topic lightly. However, now that I've done it for a couple of years, I've come to realize that it's too easy to test for the wrong thing or waste a lot of time writing ineffective testing code. For example, even though having taught myself unit testing was important in landing my job, I've discovered that they really don't do much unit testing here. (My group does, but I think we're alone in that.) Within 6 months of my being hired we started trying to achieve 100% code coverage. Trying to do that means you write a ton of unit tests, but wind up writing test code that should never have been written. I wrote a lot of unit tests that verified that assigning a value to a property would return that value. Well, of course it does that! But we were on the mistaken quest of achieving 100% code coverage. That can get you to having tested nearly every line of code (we found it impossible to get completely to 100%), and you can still not have tested the code wisely.

    I'm coming to the conclusion that there must be a more intelligent, or let's say thoughtful, way of testing your code. This involves testing extremes, like huge numbers or strings, large data sets and so on. At this point I consider it more art than science. I've taken a good Pluralsight course on it, but I pay for these courses myself; my employer doesn't provide them and, at least at this point, my fellow developers and DBAs don't pay for them on their own. Even so, I don't think this course covers it all. There's room for improvement on my part and others'.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • I hate code coverage. I don't like it as a metric, other than as a rough gauge of whether we are doing any testing. If we have 5% today, then when we add code next week, I'd like to have at least 5% still.

    Good testing is hard. You really have to think about the ways people might misuse or misconfigure software. In addition, you have to think about unlikely scenarios and add items to check those situations.

    Did you like that Pluralsight course?

  • I think the analysis and requirements stage is crucially important. On one medium/large project I was involved in a couple of decades ago, five of us spent over a week honing the requirements, as the instruction was right-first-time. The product stood the test of time, with the only real changes being made to improve performance. I was allowed to completely rewrite another developer's module (he had left by then), reducing run time by over 90%! 🙂 Subsequent to this I worked on a project where requirements were changed on a whim: one complex module was rewritten three times as the requirements changed. Agreeing the requirements first would have saved approximately 120 man-days!

    The lack of clear requirements shows through with my GP (doctor) surgery's online booking. It is now on its third iteration and needs different logins to, say, a) book a blood test requested by your GP and b) see the results. Joined up it is not! 🙁

    Testing is also crucial. When I was asked to test some call logging software I made a call from 23:55 to 00:05. Instead of being a ten minute call, it was calculated as having lasted for 23h 50m! This totally messed up all utilisation and costing calculations. Other areas I tested were change of time band during the call, change of month, leap years, etc. Testing boundary conditions highlights issues far more than 'simple' testing, which is what the developers had done. One guy actually accused me of trying to break his code!
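    To make the midnight case concrete, here is a small Python sketch. The original call logging code isn't shown in this thread, so `buggy_duration` is only one plausible way of arriving at 23h 50m (comparing times of day without dates); both function names are made up for the example.

```python
from datetime import datetime, timedelta


def buggy_duration(start_hhmm: str, end_hhmm: str) -> timedelta:
    """Guess at the bug: only the time of day is compared, so a call that
    crosses midnight looks like it ran backwards, and abs() hides the sign."""
    fmt = "%H:%M"
    return abs(datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt))


def call_duration(start: datetime, end: datetime) -> timedelta:
    """Use full timestamps so day, month, and leap-day boundaries just work."""
    if end < start:
        raise ValueError("call cannot end before it starts")
    return end - start


# The 23:55 to 00:05 call, computed both ways.
wrong = buggy_duration("23:55", "00:05")
right = call_duration(datetime(2017, 12, 4, 23, 55), datetime(2017, 12, 5, 0, 5))
```

    Here `wrong` comes out as 23 hours 50 minutes and `right` as 10 minutes. The same `call_duration` can then be exercised across a month end (30 November into 1 December) and a leap day (28 into 29 February 2016), which are exactly the boundaries a 'simple' daytime test call never touches.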

  • Steve Jones - SSC Editor - Monday, December 4, 2017 11:27 AM

    I hate code coverage. I don't like it as a metric, other than as a rough gauge of whether we are doing any testing. If we have 5% today, then when we add code next week, I'd like to have at least 5% still.

    Good testing is hard. You really have to think about the ways people might misuse or misconfigure software. In addition, you have to think about unlikely scenarios and add items to check those situations.

    Did you like that Pluralsight course?

    Yes, I did like that Pluralsight course. It is one I intend to repeat.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • This is more a business process failure than a software issue. Surely after the Ryanair fiasco a few months back all airlines should have been checking that their processes would not allow this to happen again.
    Every holiday system I have seen has required managerial approval for leave, and yes, if a manager approves everyone to take their leave at the same time there will be issues running the business. The most important thing for a manager to check is that there is sufficient cover available; if not, the leave request is not approved. Sickness can make a mess of that, though, so managers may then need to recall staff, bring in contract labour (ensuring they are rated for the correct aircraft), or contract the entire task out to other suppliers (in this case, that would mean transferring passengers to other carriers).
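    The cover check described above is simple enough to sketch in Python. Everything here is an assumption for illustration (the function names, a flat headcount, a single `min_cover` figure per day); a real crew system would also have to account for aircraft ratings, locations, and flying-time limits.

```python
from collections import Counter
from datetime import date, timedelta


def days_in(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def can_approve(start: date, end: date, already_off: Counter,
                staff_total: int, min_cover: int) -> bool:
    """Approve only if every requested day still leaves at least min_cover staff.

    already_off maps a date to the number of people whose leave is approved
    for that date; a Counter returns 0 for dates with no leave booked.
    """
    return all(
        staff_total - (already_off[day] + 1) >= min_cover
        for day in days_in(start, end)
    )


# Hypothetical numbers: 10 pilots, at least 8 needed each day,
# and 2 already approved off on 25 December.
already_off = Counter({date(2017, 12, 25): 2})
over_christmas = can_approve(date(2017, 12, 24), date(2017, 12, 26), already_off, 10, 8)
after_christmas = can_approve(date(2017, 12, 26), date(2017, 12, 27), already_off, 10, 8)
```

    With those numbers the request spanning the 25th is refused (it would leave 7 on duty) and the later one is approved. Sickness is the part this does not cover: it reduces `staff_total` after the approvals have already been given.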

  • It is all too easy to say, "This is definitely a development mistake, and one that should have been caught in testing." My first question is: how big was the testing budget? The problem domain is complex even before the scale is considered. And then one must consider that the scale of the problem might in fact make the complexity even worse. Thus, this could well be as much a business process problem as a software development issue. For instance, did they budget enough resources to test something at scale? Did they add flights after they granted vacation? Did they get hit with an unplanned sickness pattern?

    The thing is that their scheduling is way more complex than a typical vacation scheduling algorithm. And when a flight is canceled it will have a huge ripple effect on the system. A given plane might fly five or more flights in a day. Miss an early one and, even if you have the pilot, the later ones get canceled. Add in the effect of the plane not being in the right place the next day and you cancel more flights. Throw in this complexity: if one of those canceled flights was to carry a pilot to another location, you are now canceling even more flights.

    Now scale the whole thing out to 6700 flights per day and account for complex scheduling rules. Also add in budget for resources to test this thing at that kind of scale. Don't forget the lead times needed when a change is needed based on the number of testers you budgeted for. This is to say nothing of business assumptions that may have been working fine for many years suddenly not being the case this year.

  • crmitchell - Wednesday, December 6, 2017 5:43 AM

    This is more a business process failure than a software issue. Surely after the Ryanair fiasco a few months back all airlines should have been checking that their processes would not allow this to happen again.
    Every holiday system I have seen has required managerial approval for leave, and yes, if a manager approves everyone to take their leave at the same time there will be issues running the business. The most important thing for a manager to check is that there is sufficient cover available; if not, the leave request is not approved. Sickness can make a mess of that, though, so managers may then need to recall staff, bring in contract labour (ensuring they are rated for the correct aircraft), or contract the entire task out to other suppliers (in this case, that would mean transferring passengers to other carriers).

    +1 on the business process failure. However, to throw in another variable: we are all thinking about staffing levels, and the responses suggest that is a factor, but this is air travel, so even if there are sufficient pilots they may not be in the location required for a particular flight (ten pilots in LA, and a plane lands at Dublin at the end of the pilot's flying time with no one available). So how would the algorithm for that look? It still looks like a business process failure. Can that be done with SQL? I think it needs management coordination. What do you think?

    ...

  • HappyGeek - Wednesday, December 6, 2017 8:11 AM

    +1 on the business process failure, however to throw in another variable: we are all thinking staffing levels, responses seem to suggest that is a factor, but we are talking air travel, so if there are sufficient pilots they may not be in the location required for a particular flight...

    No problem, just bounce a few passengers.....

    I might also comment that this is not a new problem; it was first addressed 150 years ago by the railroads.

    ...

    -- FORTRAN manual for Xerox Computers --

  • An inside programmer for this AA problem tells me it was a user error. The user who had been using the program retired, and the new one made the mistake.

  • kiwood - Wednesday, December 6, 2017 7:33 AM

    It is all too easy to say, "This is definitely a development mistake, and one that should have been caught in testing." My first question is: how big was the testing budget? The problem domain is complex even before the scale is considered. And then one must consider that the scale of the problem might in fact make the complexity even worse. Thus, this could well be as much a business process problem as a software development issue. For instance, did they budget enough resources to test something at scale? Did they add flights after they granted vacation? Did they get hit with an unplanned sickness pattern?

    Development includes testing. It includes architecture, planning, specs, code, testing, and verifying deployment.

  • HappyGeek - Wednesday, December 6, 2017 8:11 AM

    +1 on the business process failure. However, to throw in another variable: we are all thinking about staffing levels, and the responses suggest that is a factor, but this is air travel, so even if there are sufficient pilots they may not be in the location required for a particular flight (ten pilots in LA, and a plane lands at Dublin at the end of the pilot's flying time with no one available). So how would the algorithm for that look? It still looks like a business process failure. Can that be done with SQL? I think it needs management coordination. What do you think?

    Perhaps, but we as humans have come to depend on the systems to show us errors or potential issues. As pointed out, the scheduling problem is far too vast for a human to comprehend. Also consider that there are likely other managers, so any individual would assume the others haven't approved vacation for too many pilots.

    There are business issues, but the fact that this could happen certainly points to a software issue somewhere in development.
