We have a solution called Data Architecture Review. With this solution, we conduct a complete review of the customer’s data environment. We collect a lot of data from the servers, analyze it, and produce a list of recommendations.
In most cases, this is the entry point to our engagement with a new customer. We usually begin with the review. When the customer sees the results of the review, they become a happy customer. From that point, we develop a long-term relationship with the customer, and make sure that they stay happy.
But it wasn’t always like that. Before we developed the Data Architecture Review solution, we did things in a different way. And we didn’t always manage to develop long-term relationships or keep our customers happy. We have come a long way since then, and we are really proud of where we are today. So I would like to share with you a story. A story about our journey and about why I believe in the Data Architecture Review…
Once Upon a Time…
In many cases, customers reach out to us when they have performance problems with their data platform. Sometimes they are very specific, like: “Procedure X runs for 6 seconds; it should be less than one second”. In other cases, they say something very vague, like: “Everything is slow, we need to improve the overall performance”. Some customers complain about availability issues, usually after they have experienced a disaster or two. Yet other customers just want us to make sure that the data platform is properly configured.
These patterns of customer requests haven’t changed much over the years. Our reaction did. In the early days, things were quite simple. We would charge the customer by the hour, and then one of our data professionals would work with the customer on whatever they wanted. So if the customer wanted to do performance tuning on stored procedure X, that’s what we would do. We acted like contractors. You tell us what you want us to do, and we’ll do it.
It turns out that in most cases the customer doesn’t know what the real problems with their data platform are. They may know that stored procedure X is slow, but they haven’t got a clue about the root cause of the problem. This is why they turned to us in the first place. If they had the skills to identify the root cause and solve all their data-related problems, then they wouldn’t need us, would they?
What they should have said was: “We need your help to understand what the main data problems are, what the root causes are, and how to solve them”. Instead, they usually say something like: “We need you to tune stored procedure X”. And instead of telling the customer what they really should do, we used to tell them “OK”.
So here is a common scenario we used to have with our customers. The customer says “please tune stored procedure X”. We say “OK”, and do things like rewriting the code or adding indexes. We manage to reduce execution time from 6 seconds to 2.5 seconds. The customer appreciates it, but they are still not happy, because they actually want sub-second performance. End of story. Oh, and a week later they have an outage, and they lose 2 hours of data, because their DR solution is not properly configured and was never actually tested. But they didn’t ask us to look at their DR solution, so it’s not our fault, right?
The New Story
A few weeks ago, a new customer turned to us and said: “We need you to improve stored procedure Y”. Since we’ve learned a few things over the years, we replied: “Maybe, but let’s look at the system as a whole, and look for the root cause of your performance problems”. We explained the importance and the benefits of a complete Data Architecture Review. Fortunately, the customer understood and asked us to begin the review right away.
We found that their single database was running a mix of OLTP and reporting workloads. It was quite a busy database, and because of that mix, some database configuration settings were tuned for OLTP, while others were tuned for reporting. There were also lots of indexes (more than 20 on the main tables), which were pretty useful for the reports, but incurred a lot of overhead on the OLTP workload.
The stored procedure that the customer complained about (“Y”) was responsible for executing some OLTP process. But due to the way they configured the database, and also due to the number of indexes, it was quite slow. It was also affected by locks held by other processes (other OLTP processes and also reports). It was executed 6 times per second during peak hours with an average execution time of 4.5 seconds. Not good.
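As a side note, on SQL Server this kind of execution statistic can be pulled straight from the engine. Here is a minimal sketch (not the customer’s actual diagnostics) using the `sys.dm_exec_procedure_stats` DMV; note that its time columns are reported in microseconds, and the counters reset when the server restarts:

```sql
-- Sketch: find the busiest stored procedures in the current database,
-- with execution counts and average elapsed/CPU time in milliseconds.
SELECT TOP (10)
    OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name,
    ps.execution_count,
    ps.total_elapsed_time / ps.execution_count / 1000.0 AS avg_elapsed_ms,
    ps.total_worker_time  / ps.execution_count / 1000.0 AS avg_cpu_ms
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.database_id = DB_ID()
ORDER BY ps.total_elapsed_time DESC;
```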
So we offered the customer 3 solutions – one for the short term (a quick fix), one for the medium term (cost-effective) and one for the long term (the best solution). In the short term, we analyzed the stored procedure, and applied some magic to it. First, we rewrote some of the code. We removed some unused indexes, and added one missing index. Then, we added some locking hints and better transaction handling in order to reduce the amount of blocking. We did some other things, and we managed to reduce the average execution time from 4.5 seconds to around 1.5 seconds. It was good, but not good enough. We also did the same for several other stored procedures that we identified as having significant performance issues.
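To illustrate the index cleanup described above, here is a sketch of how unused indexes can be spotted with `sys.dm_db_index_usage_stats` (reads vs. writes per index since the last restart). The table and index names in the comments are illustrative, not from the customer’s system, and candidates for removal should always be verified against a full business cycle first:

```sql
-- Sketch: nonclustered indexes in the current database, ordered by
-- write overhead. Indexes with many writes and no reads are candidates
-- for removal.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates                               AS writes
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
WHERE s.database_id = DB_ID()
  AND i.type_desc = N'NONCLUSTERED'
ORDER BY s.user_updates DESC;

-- Illustrative examples of the other short-term fixes:
-- CREATE NONCLUSTERED INDEX IX_Orders_CustomerId
--     ON dbo.Orders (CustomerId) INCLUDE (Status, CreatedAt);
-- ALTER DATABASE CURRENT SET READ_COMMITTED_SNAPSHOT ON;  -- readers no longer block writers
```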
The medium term solution was to set up another database on the same server, replicate all the relevant tables to it (without modifying the schema), and move all the reports to that database. This way we split the reporting workload from the OLTP workload without too much effort. This solution removed the locking problem almost completely. Also, we could now adjust the indexes in both databases to match the workload in each one. So we had only the most critical indexes in the OLTP database, and a lot of other indexes in the reporting database.
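For readers unfamiliar with the mechanics, the medium-term setup can be done with SQL Server transactional replication. This is only a hypothetical sketch – `SalesDB`, `ReportingPub`, `ReportingDB` and `Orders` are illustrative names, and a real deployment also requires configuring the distributor and the snapshot/distribution agents:

```sql
-- Sketch: publish the OLTP tables and subscribe from a reporting
-- database on the same server.
EXEC sp_replicationdboption @dbname = N'SalesDB',
     @optname = N'publish', @value = N'true';

EXEC sp_addpublication @publication = N'ReportingPub',
     @repl_freq = N'continuous', @status = N'active';

EXEC sp_addarticle @publication = N'ReportingPub', @article = N'Orders',
     @source_owner = N'dbo', @source_object = N'Orders';

EXEC sp_addsubscription @publication = N'ReportingPub',
     @subscriber = @@SERVERNAME, @destination_db = N'ReportingDB',
     @subscription_type = N'Push';
```

Because the subscriber keeps the same schema, reporting indexes can then be added on the `ReportingDB` side only, leaving the OLTP database lean.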
We made a significant performance improvement, but it was still limited for two reasons. The first is that both workloads still run on the same server, so they share the same resources, such as CPU and memory. The second is that the database schema is designed for OLTP rather than for reporting, so some of the reports still take a lot of time, because they need to perform complex joins and aggregations.
The long term solution is, of course, to design a new reporting database on another server, with an optimized schema for reporting, and also develop an ETL process to move the data from the OLTP database to the new reporting database. This solution will maximize the performance and scalability of the system, but it requires much more effort. So now we are discussing the pros and cons with the customer.
Oh, and we also found some security holes in the database configuration. For example, there were several database users with sysadmin privileges and really weak passwords. We fixed all of these problems, of course.
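Checks like these can be scripted. Here is a sketch of two queries we might run as part of such a review – one listing members of the `sysadmin` server role, and one using the built-in `PWDCOMPARE` function to flag SQL logins with blank passwords or passwords equal to the login name:

```sql
-- Sketch: who holds sysadmin?
SELECT m.name AS member_name
FROM sys.server_role_members AS rm
JOIN sys.server_principals  AS r ON r.principal_id = rm.role_principal_id
JOIN sys.server_principals  AS m ON m.principal_id = rm.member_principal_id
WHERE r.name = N'sysadmin';

-- Sketch: SQL logins with trivially weak passwords.
SELECT name
FROM sys.sql_logins
WHERE PWDCOMPARE(N'', password_hash) = 1      -- blank password
   OR PWDCOMPARE(name, password_hash) = 1;    -- password equals login name
```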
The important thing here is that by performing a comprehensive Data Architecture Review, we were able to give the customer a complete view of what’s really going on in their data platform. We identified the root causes of their performance problems. We reviewed the state of the system in terms of availability. In addition, we identified security issues. And we offered durable solutions to all the issues we found.
Indeed, the customer is very happy, and I have no doubt that this is a beginning of another long-term relationship…
If you want to learn more about our Data Architecture Review solution, then click here…