Part 1 of a discussion about replacing cursors with SQL statements for significant speed improvements.
In a previous post I discussed where it is best to put logic – SQL or code. In my view, SQL is great for processing sets of data and poorly structured for row-based processing. To put it bluntly, I try to avoid cursors in most circumstances. Is this rule absolute? Not necessarily. Sometimes the operation cannot be done in a set-based way, and a cursor can be developed faster than a program. Sometimes it is known in advance the looping will only involve a few hundred records and so won’t take long to run. Sometimes you’re working with an existing cursor and just need to make a few minor changes – modifying the cursor in this scenario will be faster than rewriting it from scratch. But in general, I try to avoid using cursors when possible [and I'm always looking for alternative ways to code the logic].
I’ve also removed a number of cursors during my time at BPS, some more complicated than others. A cursor in particular that sticks out is one that was used to determine transportation eligibility for students. The cursor used the student’s information and updated the transportation eligibility code for each student – one student at a time. The logic did the following:
- Fetch the student’s grade level [Elementary, Middle, or High school] into a variable.
- Clear the transportation eligibility code if the student is within the walking distance to the school [walking distance is based, in part, on the students' grade level].
- If the student isn’t in a particular category [Special Ed, etc] and isn’t in a particular set of schools, set the eligibility code to ineligible if the student is out of zone [the code used is different than the one in #2]. Note: Boston is divided into 3 school assignment zones, and we generally do not provide transportation to students who attend an out of zone school, although there are exceptions.
- If the student is in a half-day kindergarten program, update the eligibility code to highlight the fact that the transportation will be for a half-day only.
- Update the transportation eligibility code if the student didn’t meet any of the other criteria [such a student is eligible for transportation].
The logic used in the cursor was a tangled web of update statements fit between many nested if statements with complicated logic. To make matters worse, for 56,000 students this process took over 20 minutes to run.
A couple years ago I decided enough was enough and spent some time replacing the cursor. After reviewing the cursor logic and getting to the heart of what the logic was doing, I was able to replace it with a sequence of 2 queries used to populate table variables and a final query used to update the transportation eligibility codes. All 3 queries are set-based and handle all students simultaneously. In general, the cursor was composed of many repetitive tasks done in a certain order. I kept the same ordering in the queries by using a case statement.
What was the end result of replacing the cursor with the set based SQL statements? The required runtime went down to just 10 or so seconds. From 20 minutes to 10 seconds? That’s over a 100 fold speed improvement!
I’m going to explore the SQL used [for both the cursor and the replacement] in more detail in my next post.