When I first learned how to write code, I was taught that code should be measured by three main attributes: Accuracy, Readability, and Performance. Accuracy means your code does what you need it to do. Performance means making the best use of system resources. Readability is how well someone can read your code and understand what it’s doing. I’ve learned the hard way that you will never be perfect in all three areas.
Accuracy: It does or it doesn’t
One area where code does have to be perfect is accuracy. If your code doesn’t do what it needs to do, it doesn’t matter how fast it ran or how well others can understand it. Could you imagine being at a code review and hearing, “you can clearly see my code is useless!”? What about complaining to a developer that the results of a report are wrong only to hear, “but it now runs ten times as fast!”? Code accuracy is one of the few black-and-white areas of development. The code we write either meets requirements, or it doesn’t.
Performance: Being nicer to the machine
Performance also makes sense to developers. Especially with SQL, your code will likely be running on a shared system that has to manage CPU, memory, and disk. Using these resources as efficiently as possible is key to maintaining a healthy relationship with your server team, and getting fast results will keep your customers happy.
Readability: Being nicer to the people
Readability can be a harder sell, but it’s also important to consider. In software, we are seeing less and less the idea of “done” code. Changing data, changing users, and evolving requirements mean that the code we write will be revised and enhanced. Maybe not even by us. Whether it’s days or months later, readable code means we can debug, fix, and improve it faster. Making your code readable also means that a team member can easily modify a single step of a complicated process. Just as performance can be thought of as helping the system deliver results faster, readability helps us deliver fixes and enhancements faster.
As I mentioned earlier, a developer cannot compromise on accuracy. That leaves performance and readability as two traits that can come into conflict. Making code more readable could mean more table reads/writes or temp tables. Making code more performant could also make it more confusing to the reader. When these areas conflict, it’s important to understand how to make the trade.
How do you find the balance?
The answer, of course, is that it depends on the context. Take a look at the script below. The requirements are to generate a report of all active employees along with their salary information and retirement level.
/********************* Readability-leaning *********************/ --1) Grab our initial employee list INSERT INTO EmployeeReview(EmployeeID, FirstName, LastName, HireDate, Salary, Retirement) SELECT ID AS EmployeeID ,[First] AS FirstName ,[Last] AS LastName ,StartDate AS HireDate ,Pay AS Salary ,'Basic' AS Retirement FROM Employees WHERE Active=1 --2) Calculate max salary for retirement purposes UPDATE E SET E.MaxSalary=MS.MaxSalary FROM EmployeeReview E JOIN (SELECT ID, MAX(Pay) AS MaxSalary FROM EmployeePayHistory GROUP BY ID) MS ON MS.ID=E.EmployeeID --3) An employee is eligible for executive retirement if they have been employed for 10 years or more -- AND they have once made over 100,000 UPDATE EmployeeReview SET Retirement='Executive' WHERE DATEDIFF(YEAR, HireDate, GETDATE()) >= 10 AND MaxSalary > 100000
The readable section of code turns this job into three distinct steps, each designed to be as independent as possible. A developer can quickly see that three actions take place, and can see what each action is supposed to accomplish. It also could use some improvements. The three updates on the report table mean disk is written to three times. The subquery for calculating max salary considers all employees, not just the active ones specified in step one.
The next code section was written to improve performance.
/********************* Performance-leaning *********************/ INSERT INTO EmployeeReview(EmployeeID, FirstName, LastName, HireDate, Salary, MaxSalary, Retirement) SELECT ID AS EmployeeID ,[First] AS FirstName ,[Last] AS LastName ,StartDate AS HireDate ,Pay AS Salary ,MAX(Pay) AS MaxSalary ,CASE WHEN DATEDIFF(YEAR, HireDate, GETDATE()) >= 10 AND MAX(Pay) > 100000 THEN 'Executive' ELSE 'Basic' END AS Retirement ,SUM(EPC.Amount) AS PaidToDate FROM Employees E JOIN EmployeePayHistory EPH --If I can't find a way to make this join 1:1, I'm in trouble ON EPH.ID=E.ID JOIN EmployeePayChecks EPC ON EPC.ID=E.ID WHERE Active=1 AND EPH.MaxSal=1 --In order to add paid-to-date, I also need to understand the pay history table GROUP BY ID ,[First] ,[Last] ,StartDate ,Pay
By doing everything in a single statement, we eliminate the two performance complaints mentioned above. How long did it take to understand this statement? Even with proper commenting, it would take longer for a developer to understand how this statement meets the requirements. Making even a simple change, such as changing how max salary is obtained or adding new columns would require the entire job to be modified and retested.
In both cases, the desired results are returned to the user, so how do we decide which is “better”?
Take all the freebies
Just as there is no room for compromising accuracy, finding the right balance between readability and performance is never an excuse to write lazy or poor performing code. Take any free gains you can. The readable section can have its step 2 filter for active employees, with a brief comment explaining why. The performant section can have a block of comments explaining the requirements and logic. Both sections can benefit from appropriate use of indexes. And of course, follow recommended practices where they make sense.
What do you have a lot of, and what do you not have much of?
What is the time window that this code has to complete? Imagine you’re running this at night during a window of 4 hours, and all jobs including this report take less than 45 minutes to complete. Then performance doesn’t seem so critical. Instead, imagine that this report is part of a live dashboard and is refreshed every 5 minutes during the busy day. Now every second counts. If this report throws an error when you’re out-of-office, then a teammate being able to fix it on their own means you don’t get a phone call. If this report runs as part of a sales demo, you may want to show the customer how “lightning fast” it is.
How likely is the business logic to change?
Pretend this report has been running for over a month and you’ve moved on to another project. Then the user tells you they want to see total paid to date, which sums a value from a paycheck table. For the readable code, you could simply add a step 4. For the performant code, you would have to modify the whole statement, rewriting your join to pay history to ensure one-to-one cardinality.
/********************* Readability-leaning Modification*********************/ --Steps 1-3 are unchanged --4) Add total paid to date UPDATE E SET E.PaidToDate=EPC.PaidToDate FROM EmployeeReview E JOIN (SELECT ID, SUM(Amount) AS PaidToDate FROM EmployeePayChecks GROUP BY ID) EPC ON EPC.ID=E.EmployeeID /********************* Performance-leaning Modification*********************/ INSERT INTO EmployeeReview(EmployeeID, FirstName, LastName, HireDate, Salary, MaxSalary, Retirement) SELECT ID AS EmployeeID ,[First] AS FirstName ,[Last] AS LastName ,StartDate AS HireDate ,Pay AS Salary ,MAX(Pay) AS MaxSalary ,CASE WHEN DATEDIFF(YEAR, HireDate, GETDATE()) >= 10 AND MAX(Pay) > 100000 THEN 'Executive' ELSE 'Basic' END AS Retirement ,SUM(EPC.Amount) AS PaidToDate FROM Employees E JOIN EmployeePayHistory EPH --If I can't find a way to make this join 1:1, I'm in trouble ON EPH.ID=E.ID JOIN EmployeePayChecks EPC ON EPC.ID=E.ID WHERE Active=1 AND EPH.MaxSal=1 --In order to add paid-to-date, I also need to understand the pay history table GROUP BY ID ,[First] ,[Last] ,StartDate ,Pay
When we write code that is easy to understand, we write code that can be easily modified. When we write code that is optimized for performance but harder to understand, we commit extra time to the next modification, by way of deciphering and regression testing. An extra day or two is acceptable to the user for code that will only be modified once a year. For code that needs to be modified every couple weeks, even a few hours is painful.
How complex is the business logic?
In the example, there are only three distinct steps to achieve the results, which can be optimized to less than 20 lines in a single select statement. Many tasks in SQL Server, such as Data warehousing and self-serve report tables could easily grow to 20-30 steps each. I’ve seen performance optimization attempts for these jobs lead to developers asking for comically-oversized monitors just so they could see one command in its entirety. However, simple tasks like displaying a menu or showing recommended indexes only read from pre-built tables and views.
Finding the right fit
I encourage all of us to avoid seeing working code as good or bad, but instead as a component to a system. That system includes its objective, the machines on which it executes, and the people who work to maintain it. Then, developing code means fitting into this system as well as possible. How do you balance code between readability and performance? Please comment and let me know what you think.