Mission Critical Deployments

Question

Steve Jones - SSC Editor

SSC Guru

Points: 734449
August 15, 2015 at 11:29 am

#299774

Comments posted to this topic are about the item Mission Critical Deployments

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 1

I'm always amazed that people don't pay attention until someone gets hurt or dies. This appears to be the same mentality that prevails in everyday code. They want it real bad and that's the way they get it. Gotta meet that damned schedule that was overpromised in an elevator or on a golf course to begin with. The old adage of "no one will die if we don't get it right" apparently bled over into a place where people actually do die if you don't get it right, and I don't believe this incident will prevent the problem from spreading. To be honest, I'm both sickened and disgusted that this happened and wish that people actually and always wrote code as if someone's life depended on it because, in one shape or another, people's lives do depend on our code. If we make a mistake, no one might die, but it sure could hurt a company or individuals (unencrypted SSNs being my "favorite" subject there).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

David.Poole SSC Guru Points: 75898 More actions · Answer 2

Until someone of suitable authority ring fences quality control and says "short cut and descope what you like but thorough testing is non-negotiable" this will continue to happen.

In theory testers should be brought in early to make sure that suitable test coverage is built in from day one. My observations are that testers are regarded as a lower caste and have much of their work classified as nice-to-have. Towards the end of the project test time is treated as project contingency time.

LinkedIn Profile

cjb110 SSC Enthusiast Points: 160 More actions · Answer 3

Wiped sounds unlikely...more likely is the installation process created the default file, overwriting what was there.

It makes me think that they're using more general purpose computers rather than specific 'machines'/'chips' (not sure of the best term here!). During development of the latter it's more likely that this issue (installing over the top) would have been addressed.

Plus, more importantly, with specific 'machines' the number of different states is far reduced. With general purpose computers the state space is ginormous, and writing code to handle each state is extremely difficult.

Eric M Russell SSC Guru Points: 125522 More actions · Answer 4

I hope that mission critical software for things like airplanes and nuclear reactors doesn't run on Windows or Linux. I'd expect something like an embedded chip that contains both the operating system and the application. A "deployment" would simply involve pulling the old chip, plugging in the new one, and running a diagnostic. In a worst case scenario, if something goes wrong post-deployment while in flight, then the pilot can just hit a reset button and at least have the system functional enough for an emergency landing.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

roger.plowman SSChampion Points: 10265 More actions · Answer 5

Part of the problem is the rapid release cadence mentality the industry has been seduced/coerced into. Win 10 updates come to mind. It isn't just planes crashing that cause problems, although death always trumps discomfort!

Continuous integration is another aspect that severely impacts testing time. Somebody's going to have to tell the collective C-Level PHBs the old saw about "Good, fast, and cheap, pick any two".

Looks like they're choosing "fast and cheap". Good? They've heard of it... :angry:

eric.notheisen SSCommitted Points: 1643 More actions · Answer 6

The first launch of Commercial Titan by Martin Marietta Astronautics Group in the late 80's early 90's was a failure because a software parameter said the satellite to be released was in the lower bay when in fact it was in the upper bay of the missile. Software testing is important in all cases; in this case the testing should have included the requirement of satellite location.

jckfla SSChasing Mays Points: 632 More actions · Answer 7

Sounds like a typical case of where "risk mitigation" failed and cost lives.

Also sounds to me like not only a faulty push mechanism in placing the new software/init files/parm files on the system, but also a poor initialization routine in the hardware to check that all necessary files are there to a) initialize the system, and b) initialize the setup for the new software.
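
As a rough illustration of the kind of initialization check described above, here is a minimal Python sketch. The file names and function names are hypothetical and are not taken from any real avionics system; the point is only that start-up refuses to proceed when a required file is absent or empty.

```python
from pathlib import Path

# Hypothetical file names; a real system would define its own list.
REQUIRED_FILES = ["engine_params.cfg", "calibration.dat"]


def missing_required_files(install_dir, required=REQUIRED_FILES):
    """Return the required files that are absent or empty in install_dir."""
    root = Path(install_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def initialize(install_dir):
    """Refuse to start if any required file is missing or empty."""
    missing = missing_required_files(install_dir)
    if missing:
        raise RuntimeError(
            "Cannot initialize, missing files: " + ", ".join(missing)
        )
    return "initialized"
```

Treating an empty file the same as a missing one is a design choice in this sketch: an installer that recreates a blank default file would otherwise pass the check.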

Makes me sad that tech has been used more predominantly as a profiteering tool, rather than a mechanism to allow more checks and safety to be implemented in the same amount of time/cost used prior to its use.

Oh well. Hopefully this will help to institute a change.

Steve Jones - SSC Editor SSC Guru Points: 734449 More actions · Answer 8

roger.plowman (8/17/2015)
Continuous integration is another aspect that severely impacts testing time. Somebody's going to have to tell the collective C-Level PHBs the old saw about "Good, fast, and cheap, pick any two".
Looks like they're choosing "fast and cheap". Good? They've heard of it... :angry:

I'm not sure that's true. CI ensures that tests are run every time, on every commit. If testing isn't being done, it's not a problem with CI, it's the way management is looking at the process.

There are plenty of things humans find that automated testing does not. However there are plenty of benefits that automated testing has over humans, mainly reliable repeatability. We stink at those things.

jay-h SSCoach Points: 18816 More actions · Answer 9

Eric M Russell (8/17/2015)
I hope that mission critical software for things like airplanes and nuclear reactors doesn't run on Windows or Linux. I'd expect something like an embedded chip that contains both the operating system and the application. ...

I'm not sure that a dedicated OS would improve reliability. Certainly in the Airbus crash it does not seem to be the issue. Linux (and Windows for that matter) has millions of 'testers', i.e. flaws come to light (and many more potential developers are available). A hard boot chip would be good, but software still requires lots of non-volatile memory (equipment on this plane, service history, observed historic behavior parameters). I'm not sure that hot-swapping during flight, especially between different release numbers, is a boon to safety.

Before the mid 80s, most military equipment ran on custom, government designed chips. The result was that military equipment was constantly using older technology, not because their engineers were stupid, but because the huge customer base and engineering pool that outfits like Intel had naturally produced faster progress. (Though the rigidity of government specs might have been a factor: an Army engineer working in procurement once told me that for years leather seats in helicopters had to be treated with a horse urine process. This was a spec carried over from the cavalry days, so that new saddles would not spook horses. This was quite a few years ago; hopefully things have been fixed up a bit since.)

At the same time, both software and internet connectivity are being used by manufacturers as a kind of engineering crutch. The recent demonstrations of hacking internet connected automobiles (I'm a bit of a gear-head and have some strong opinions on this) raise questions as to why mechanical systems are available to off site connection at all. If they wished to provide a LAN for the user's phone or SatNav, there is no reason at all for that to be connected to anything else.

But the engineering problems are deeper. Systems are not siloed. The CAN bus (which controls body systems), the engine management, transmission management, and brake management are all interconnected, and it's not uncommon for a failure in a relatively minor part (power window, radio) to crash the entire system. Traditional brakes and steering were fully operational even with engine or electrical failure, and completely independent of anything else, but we're moving away from that. More and more controls are just computer inputs and, as has been recently demonstrated, these are computer inputs that can be hacked.

Dependence on software patching is another area where sometimes engineering is done to 'get the product out the door' and it will be 'fixed later'. Not to pick on these companies, they're far from alone, but it's an area where I have some background knowledge. ZF 9 speed transmissions (most common in the US on Chrysler products like the Cherokee, but also present in other US and foreign models) have had multiple patches since they came out (the first was before the products were even shipped). Each patch is supposed to improve the shifting, but they often seem to introduce new problems, and another patch comes down the pike. Now these machines are very complex, 9 speeds shifted with dog clutches is a very difficult thing to do well. Perhaps even impossible, but in previous times, this would not have been attempted on mass market vehicles (though in fairness, pressures from CAFE standards are pushing a lot of manufacturers to produce design 'features' that are destined for failure down the road.)

And then we have the much ballyhooed 'Internet of Things'. What could go wrong?

...

-- FORTRAN manual for Xerox Computers --

Kim Crosser SSCommitted Points: 1763 More actions · Answer 10

I have the following Tenets regarding Software (or System) Deployments that I try to share with developers.

It looks like Airbus missed several of these (highlighted): the update failed to preserve configuration parameters, there was no validation that configuration parameters were correct (or at least in a possible valid range), and there was no Installation Verification Procedure (or it was certainly missing some key steps).

•IT IS NOT POSSIBLE FOR DEPLOYMENT TO BE “TOO” AUTOMATED

It is possible for an automated deployment process to be too restrictive and fail to allow for site-specific requirements. However, any deployment process steps that “can” be automated should be automated.

Every step that requires people to manually apply changes introduces opportunities for unexpected problems.

•ONE HOUR OF AN INSTALLER’S TIME IS WORTH AT LEAST TWO WEEKS OF A PROGRAMMER’S TIME

Any manual step or series of steps performed by a person during an install/update that takes about an hour is worth at least two weeks of a programmer's time to automate (“Kim’s Rule of Thumb”). This is regardless of whether the installer is on-site or remote.

Time expended by a programmer is a one-time expense. Time expended by an installer is repetitive. Thus, even though 80 hours of programmer time may seem expensive versus 1 hour of an installer, unless we are planning to deploy fewer than 80 total systems, it will eventually pay a return on investment.

Further, each hour of an installer performing manual steps introduces multiple opportunities to create a problem, and even one or two of these can easily consume that 80 hours diagnosing and recovering from the problem.
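
The break-even arithmetic behind this rule of thumb can be written out as a small Python sketch. The function name is made up for illustration, and the calculation deliberately ignores differences in hourly cost and the time lost to recovering from manual mistakes, both of which would only move the break-even point earlier.

```python
import math


def break_even_installs(programmer_hours, installer_hours_saved_per_install):
    """Number of installs after which automation has repaid its one-time cost."""
    if installer_hours_saved_per_install <= 0:
        raise ValueError("automation must save some installer time")
    return math.ceil(programmer_hours / installer_hours_saved_per_install)


# Two weeks of programmer time (80 hours) versus one hour saved per install.
print(break_even_installs(80, 1))  # 80
```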

•ALL CONFIGURATION SETTINGS MUST BE PRESERVED DURING UPDATES

Any configuration settings that could possibly be different from one site to another must be preserved unchanged through any system updates. This includes settings that are only accessed by installer or engineering personnel as well as all settings that may be altered by customer personnel.

The only exceptions are settings that explicitly must be changed as part of the update itself. If any of these settings are customer-accessible, these must be documented in the PRD and disclosed to the customer prior to the update.

It is completely unacceptable to require installers to take manual steps to preserve and restore configuration settings.
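
One way to picture this tenet is a merge in which site values always win over the new release's defaults, and only explicitly documented changes override them. The sketch below is a hypothetical illustration in Python; the function and parameter names are assumptions, not part of any particular installer.

```python
def merge_settings(site_settings, new_defaults, forced_changes=None):
    """Build the post-update settings without losing site-specific values.

    site_settings:  values currently in effect at this site
    new_defaults:   defaults shipped with the new release
    forced_changes: settings the update must explicitly change (documented)
    """
    merged = dict(new_defaults)          # start from the new release's defaults
    merged.update(site_settings)         # existing site values win over defaults
    merged.update(forced_changes or {})  # documented changes win over both
    return merged
```

For example, a site running with `{"units": "metric", "timeout": 30}` that receives defaults of `{"units": "imperial", "timeout": 10, "retries": 3}` and a documented forced change of `{"timeout": 60}` ends up with metric units preserved, the new `retries` default added, and only `timeout` changed.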

•UPDATES MUST HANDLE LESS-THAN-PERFECT ENVIRONMENTS

If an update requires some preconditions to exist, the update process must explicitly test for those preconditions and either correct/install the preconditions automatically, or terminate with appropriate error messages.

Updates must never proceed to completion with any unexpected errors in any intermediate steps. Any error that cannot be resolved automatically by the update process must result in unambiguous error messages and termination of the process.
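
A minimal sketch of this behavior, assuming a hypothetical update runner written in Python: every precondition is tested before anything is changed, and the first failing step terminates the run with an unambiguous message. This sketch only terminates; automatically correcting a missing precondition is left out.

```python
class UpdateAborted(Exception):
    """Raised when the update cannot safely continue."""


def run_update(preconditions, steps):
    """Check every precondition, then run steps; stop at the first failure.

    preconditions: list of (description, check) where check() returns bool
    steps:         list of (description, action) where action() may raise
    """
    failed = [desc for desc, check in preconditions if not check()]
    if failed:
        raise UpdateAborted("Preconditions not met: " + "; ".join(failed))

    completed = []
    for desc, action in steps:
        try:
            action()
        except Exception as exc:
            # Never continue past an unexpected error in an intermediate step.
            raise UpdateAborted(f"Step '{desc}' failed: {exc}") from exc
        completed.append(desc)
    return completed
```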

•ERRORS AND WARNINGS MUST BE CLEARLY VISIBLE

Any errors and/or warnings must be clearly presented to the person running the update. Installers must not be required to examine a log file or database table to determine if anything unexpected occurred.

If any errors or warnings occur during a series of automated steps, then the update process must be able to collect all the errors and warnings and present them clearly at the end. It is particularly unacceptable to require installers to inspect multiple log files and/or database tables to determine that everything worked (or didn’t work).
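
As an illustration only, the collection of errors and warnings across automated steps could look like the following Python sketch. The class and method names are hypothetical; the idea is that each step reports into one object and the installer sees a single summary at the end instead of hunting through log files.

```python
class UpdateReport:
    """Collects warnings and errors from every step for one final summary."""

    def __init__(self):
        self.errors = []
        self.warnings = []

    def warn(self, step, message):
        self.warnings.append((step, message))

    def error(self, step, message):
        self.errors.append((step, message))

    def summary(self):
        """Return the text shown to the installer when the update ends."""
        if not self.errors and not self.warnings:
            return "Update completed with no errors or warnings."
        lines = [f"{len(self.errors)} error(s), {len(self.warnings)} warning(s):"]
        lines += [f"ERROR   [{step}] {msg}" for step, msg in self.errors]
        lines += [f"WARNING [{step}] {msg}" for step, msg in self.warnings]
        return "\n".join(lines)
```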

•CONFIGURATION PARAMETERS WITH A LIMITED SET OF OPTIONS SHOULD BE ENUMERATED LISTS

If there is only a limited set of possible values for a configuration parameter, the installer or customer should be able to set the parameter from an enumerated list, rather than a free-format text field.
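
A short sketch of the idea using Python's standard `enum` module. The `BayPosition` setting and its two values are invented for illustration; any parameter with a small fixed set of legal values could be modeled the same way.

```python
from enum import Enum


class BayPosition(Enum):
    """Hypothetical setting with only two legal values."""
    UPPER = "upper"
    LOWER = "lower"


def parse_bay_position(raw):
    """Convert installer input to a BayPosition, rejecting anything else."""
    try:
        return BayPosition(raw.strip().lower())
    except ValueError:
        allowed = ", ".join(p.value for p in BayPosition)
        raise ValueError(
            f"'{raw}' is not valid; choose one of: {allowed}"
        ) from None
```

In a user interface the same enumeration would normally drive a drop-down list, so that an invalid value cannot be typed in the first place.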

•CONFIGURATION PARAMETERS MUST BE VALIDATED

Any configuration parameters which have a known set of acceptable values, or at least a known range of values, must be validated by the update/install process.

Failing to validate configuration parameters and expecting the application to detect the configuration errors during normal operation (or even start-up) is unacceptable. By the end of an install/update, the system should be capable of operating without encountering errors due to incorrect configuration settings.
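
For parameters with a known range, the validation an installer runs might be sketched as follows in Python. The parameter names and limits are hypothetical, chosen only to show the shape of the check.

```python
# Hypothetical parameters and limits, for illustration only.
RULES = {
    "max_connections": (1, 500),
    "timeout_seconds": (5, 120),
}


def validate_config(config, rules=RULES):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for name, (low, high) in rules.items():
        if name not in config:
            problems.append(f"{name}: missing")
            continue
        value = config[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: expected a number, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{name}: {value} is outside {low}..{high}")
    return problems
```

Returning every problem at once, rather than stopping at the first, lets the install process report all configuration errors in a single pass.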

•CONFIGURATION PARAMETERS MUST BE CENTRALIZED

Any configuration parameter that is not unique to a specific server must be centralized and not repeated across multiple servers.

Ideally, all such configuration parameters should be stored in a system database, from which they can be retrieved and used by all servers in a system.

Allowing configuration parameters to be duplicated in multiple servers creates the possibility that one or more of the servers will have a different configuration value, potentially resulting in problems that can be exceedingly difficult to diagnose and resolve.
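
A minimal sketch of a central settings table, using Python's built-in `sqlite3` module as a stand-in for whatever system database is actually in use. The table and function names are assumptions made for this example.

```python
import sqlite3


def create_settings_store(conn):
    """Create the single shared settings table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings ("
        "name TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )


def set_setting(conn, name, value):
    """Store one value per setting name, replacing any earlier value."""
    conn.execute(
        "INSERT OR REPLACE INTO settings (name, value) VALUES (?, ?)",
        (name, value),
    )
    conn.commit()


def get_setting(conn, name):
    """Fetch a setting; every server reads the same row."""
    row = conn.execute(
        "SELECT value FROM settings WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(f"setting '{name}' is not defined")
    return row[0]
```

Because the setting name is the primary key, there is exactly one stored value per parameter, so two servers cannot disagree about it.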

•AN INSTALL/UPDATE SHOULD INCLUDE AN INSTALLATION VERIFICATION PROCEDURE/PROCESS

Ideally, an automated “Installation Verification Procedure” (IVP) should be run at the end of an install or update process to verify that all install/update steps were completed successfully and correctly.

An “Installation Verification Procedure” document can be used where an automated procedure is unavailable for whatever reason, but the “one hour = two weeks” rule above applies here as well. If the steps to conduct the IVP manually take more than a few minutes, the process should be automated.
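
One small piece of an automated IVP could be a check that every installed file exists and matches an expected checksum. The Python sketch below assumes a hypothetical manifest that maps file names to SHA-256 digests; a real IVP would also cover configuration values, services, and database state.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_installation(install_dir, manifest):
    """Compare installed files against a manifest of expected checksums.

    manifest: dict mapping relative file name -> expected SHA-256 hex digest
    Returns a list of failures; an empty list means verification passed.
    """
    failures = []
    for name, expected in manifest.items():
        path = Path(install_dir) / name
        if not path.is_file():
            failures.append(f"{name}: missing")
        elif sha256_of(path) != expected:
            failures.append(f"{name}: checksum mismatch")
    return failures
```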

akljfhnlaflkj SSC Guru Points: 76202 More actions · Answer 11

Wow. Just got back from a flight halfway across the country. So glad they didn't have a glitch in their software.

Eric M Russell SSC Guru Points: 125522 More actions · Answer 12

Iwas Bornready (8/18/2015)
Wow. Just got back from a flight halfway across the country. So glad they didn't have a glitch in their software.

Yeah, nothing like hearing the jet engines suddenly go dead silent while en route somewhere over the Atlantic, and then a flight stewardess runs into coach yelling "Do we have a computer programmer on board?". :w00t:

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

webrunner SSC-Dedicated Points: 31718 More actions · Answer 13

I'm a little late to this discussion, but having read "The Checklist Manifesto" by Atul Gawande, I think a whole set of software checklists needs to be run the same way the other airplane pre-flight checklists are run. Checklists have been extremely effective for airplane safety, and it seems like tragedies like this one might be avoided if checklists included things like making sure essential files are present in the airplane software. I'm sure they have checklists similar to this already, but if those were modified so that they are run just before the airplane takes off (the point of no return in most cases), the crew would have a chance to ground the flight if some piece of code is missing or has been messed up.

- webrunner

-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

Eric M Russell SSC Guru Points: 125522 More actions · Answer 14

It seems to me that, even in the event of a catastrophic computer system failure, an airplane should still be functional enough to make an emergency landing under manual control.

"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho