Invisible Downtime

  • Comments posted to this topic are about the item Invisible Downtime

  • I must have worked in some very strange places. This issue of observability has been buzzing around since at least 2000.

    Back then I was at BMC Software, which sold and used Site Angel to monitor the customer's experience of the company website.

    Later, at Totaljobs (2007), there was automated measurement of Production performance multiple times per hour. The TJ process exercised just about all parts of the company website, and the only performance stats we referred to were those from the Production monitoring system. Just about everything that could be measured was measured, from the time needed to load each site component to the time taken for web transfer, application code and database response. All this was done for each stage of every user journey the monitoring system performed. TJ also correlated any changes in performance back to code releases, and was not shy about declaring new but slow code as technical debt to be fixed as a priority.

    So the main surprise to me is that not everybody is doing this. How can any organisation feel happy about performance unless it is measured from a customer perspective? It is likely that AI can help in analysing performance, but there is a lot that can be done without it.

    • This reply was modified 1 month ago by  EdVassie.

    Original author: 1-click install and best practice configuration of SQL Server 2019, 2017 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara
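
    The per-stage, per-journey timing described above can be sketched in a few lines. This is a minimal illustration, not TJ's actual system: the stage names, the baseline figures, and the 1.5x regression threshold are all assumptions made up for the example.

    ```python
    import time

    def time_journey(stages):
        """Run each stage of a synthetic user journey and record its duration.

        stages: list of (name, callable) pairs, where each callable performs
        one step (e.g. load the search page, run a search, open a job ad).
        Returns a dict of {stage_name: seconds}.
        """
        timings = {}
        for name, step in stages:
            start = time.perf_counter()
            step()
            timings[name] = time.perf_counter() - start
        return timings

    def flag_regressions(timings, baseline, tolerance=1.5):
        """Return the stages that ran slower than `tolerance` times their
        baseline - candidates to correlate back to a code release."""
        return [name for name, secs in timings.items()
                if name in baseline and secs > baseline[name] * tolerance]
    ```

    Running this on a schedule and keeping the per-stage history is what lets you tie a slowdown to a specific release, rather than just noticing "the site feels slow".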

  • I turn on Query Store by default. Not only will it capture performance statistics over time for duration, CPU, reads, etc. - but it also counts executions and timeouts, which can provide clues that the application is down or experiencing issues completing its calls.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • We're experiencing something like what you described, Steve, this week, although it isn't invisible. Several servers were down on Monday - database and application servers. I don't see any correlation between what's down. Maybe they're all on the same massive server hosting several VMs? One of the strangest things about this is that when a server reboots, the services that depend upon a service account have "lost" the password for that account. We're all having to re-enter passwords for these services, sometimes multiple times a day. I'd love to know what's causing Windows Server services to forget service account passwords.

    Kindest Regards, Rod Connect with me on LinkedIn.

  • I had recent experience with this. A complex data orchestration was doing exactly what it was designed to do; all dashboards and logs said that everything was fine, but data was not flowing through the pipes. No errors, no servers/services down or emitting errors, just no data.

    What it turned out to be was that a YAML config file was missing a list of paths within an S3 bucket that were supposed to identify where data should be uploaded from. It took an embarrassingly long time to pin down this error, as not reacting to invalid paths is a legitimate use case.

    In hindsight I would have had the application emit telemetry for the number of matched paths and unmatched paths. I'd definitely do both, as Jurassic Park had a profound influence on me: "Stop counting when we have accounted for the number of dinosaurs we expect" vs "Have we got fewer or more dinosaurs than we expect?"

    If there was a minimum number of configured S3 bucket paths, I would have had the application log an error message if the config file had fewer than that minimum number.

    If the number of S3 bucket paths that should be in the config file was small enough, then I would have the application log a warning message whenever a file not in those paths was evaluated.

    It's a design issue: thinking about what the application needs to emit to confirm everything is OK, from both a business and a technical perspective, as well as to show what is wrong.
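
    Both checks fit in a few lines. A minimal sketch of the idea, assuming nothing about the real pipeline - the minimum-path figure, path prefixes, and function names are all invented for the example:

    ```python
    def check_config_paths(configured_paths, min_expected=4):
        """Fail fast at startup if the config lists fewer upload paths than
        the business expects ("have we got fewer dinosaurs than we expect?")."""
        if len(configured_paths) < min_expected:
            raise ValueError(
                f"config lists {len(configured_paths)} S3 paths, "
                f"expected at least {min_expected}")
        return configured_paths

    def classify_keys(keys, configured_paths):
        """Split object keys into matched vs unmatched against the configured
        path prefixes, so the pipeline can emit both counts as telemetry and
        warn about files it is silently ignoring."""
        matched = [k for k in keys
                   if any(k.startswith(p) for p in configured_paths)]
        unmatched = [k for k in keys if k not in matched]
        return matched, unmatched
    ```

    Emitting both counts means "zero matched, many unmatched" shows up as a signal instead of as silence.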

  • The trick to solving this problem is to make getting into your software EASY and getting out impossible. One cloud CRM tool we had bought into years back helped us migrate from our on-premises solution to the cloud solution, and the migration was trivial. A little bit of downtime, but everything went great. We later wanted to migrate to a different platform and were hoping we could just export all the data and import it into the new tool, but it was impossible. The cloud solution had no export option, and the vendor was unwilling to give us the backup files. They told us that if we wanted to export the data, we would need to buy a 3rd-party tool to do the export. We did that, as it was the only option, but had a huge lesson learned there - make sure you know the process for both onboarding and leaving a tool before investing your time and effort into it.

    The main reason we were leaving that CRM is that the cost/benefit ratio was no good: it had a lot of features that justified the cost on paper, but we didn't use them and didn't need them.

    I do dislike cloud-based apps as a whole, though. SOME are mostly good (free email clients and forums like SSC, for example), but complex apps (VM hosts, ERP, reporting tools, CRM, etc.) I find just have too many issues that end users and in-house support can't help with, AND external support is often very spotty. End users getting 500/502/503 errors in a web app means something is wrong server side, but getting support to look into it is harder than herding kittens. BUT if your tool is hard to get out of (changing systems), OR if the tool has no real competition (Atlassian products), you learn to live with it and be annoyed, but you don't change.

    At least with on-premises systems I can review logs, I can test performance, I can reboot the server; I am in control. Well, my company is. That would fall on IT, not DBA work, but you get the idea.

    BUT I did have this today - a Windows service was up and running that monitors a folder for incoming files; it grabs each file, parses it, and automatically prints out what is needed. It worked great two days ago (it isn't used daily), but today it just wouldn't print. The service was running, but I restarted it anyway just in case - no dice, still not printing. So I renamed the file it was looking at, and the thing printed. Then I renamed the file back to the original name and it still printed. So in the end, I did stuff but really did nothing, and it just started working.

    On the plus side, when it fails, it fails at the trigger, which an end user notices and notifies me about so I can investigate; I know the software is a bit flaky and out of date, but upgrading it is a big process that I just can't justify at this time. The more annoying ones are the "doesn't work for you but works fine for me" problems. Those drive me nuts, as it ends up being a BIG task to find the issue - sometimes it is user error, sometimes Windows updates, sometimes solar flares, sometimes you just need to reboot, or sometimes it works fine right after a support request comes in.
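
    "Service is running" clearly wasn't the same as "service is working" here. One common pattern for that gap is a heartbeat: the service touches a file after each successful poll, and an external monitor checks the file's age. A minimal sketch under stated assumptions - the heartbeat path and the 5-minute threshold are hypothetical, and the real fix would depend on the flaky software itself:

    ```python
    import os
    import time

    HEARTBEAT = "printer_service.heartbeat"  # hypothetical path

    def beat(path=HEARTBEAT):
        """Called by the service after each successful poll of the folder."""
        with open(path, "w") as f:
            f.write(str(time.time()))

    def is_stale(path=HEARTBEAT, max_age_seconds=300):
        """Called by an external monitor: True if the service has not checked
        in recently, even though its process still shows as running."""
        try:
            age = time.time() - os.path.getmtime(path)
        except OSError:  # no heartbeat file yet
            return True
        return age > max_age_seconds
    ```

    The point is that the liveness check lives outside the service, so "running but not doing anything" finally looks different from "working".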

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
