Fun with prime numbers and T-SQL
I do not get a chance to read much popular literature, but my daughter recently gave me the book The Curious Incident of the Dog in the Night-Time by Mark Haddon. Once I start reading a novel, I get so involved that I just keep reading until the book is finished. Luckily, this book contained just over 200 pages, and I completed the entire book one Sunday.
The interesting thing about this book is that the narrator and protagonist of the story is a 15 year old autistic boy. He is extremely gifted in math and physics but is unable to understand most human emotions or connect with the people close to him in the ways one would normally expect. The premise of the story is that the young hero is writing the book himself about solving the murder of a neighbor's dog.
Because the teen is so obsessed with math, he decides to number the chapters with prime numbers. A prime number is only evenly divisible by itself and one. He brings up prime numbers and the search for new prime numbers several times throughout the story. Since I am so passionate about SQL Server, I wondered how difficult it would be to create a list of prime numbers using T-SQL and how efficient the algorithm would be.
How to find prime numbers
The author described the search for primes as listing all possible numbers in order and then eliminating the numbers divisible by two, then by three, then by four, then by five, then by six, etc. I knew that I would only have to check to see if a prime number candidate was divisible by other prime numbers. For example, why check to see if a number is divisible by six? If a number is not divisible by two or three, then it will not be divisible by six, either.
I also needed to determine the maximum divisor to check for each prime number candidate to avoid doing extra work. My first thought was that I could stop testing divisors at one-third the value of the prime number candidate. My reasoning was that I would only consider odd numbers, and the first divisor to consider was three. When dividing by three I was also effectively testing one-third of the prime number candidate. After doing a bit of research about how to determine small prime numbers, I found that I could stop at the square root of the prime number candidate. This made sense to me when I realized that as the divisor increased, the quotient decreased. The two numbers would meet at the square root of the dividend. Consider the example in Listing 1.
Listing 1: Division operations to determine if 149 is prime
|149 / 2||74 r 1|
|149 / 3||49 r 2|
|149 / 5||29 r 4|
|149 / 7||21 r 6|
|149 / 11||13 r 6|
|149 / 13||11 r 6|
|149 / 17||8 r 13|
Notice that the quotient decreases in value as the divisor increases. At the point that the divisor reaches 13, the quotients are now in the range of divisors that have already been checked. The square root of 149 is between 12 and 13 so it is unnecessary to test additional divisors past 13. Since none of the remainders are zero, 149 is prime.
The modulo operator (%) returns the remainder from a division operation. When a zero is returned from a modulo operation, the dividend is evenly divided by the divisor. Conversely, if a non-zero value is returned from a modulo operation, the dividend is not evenly divided by the divisor. For a number to be prime, all modulo operations must return zero. If a non-zero value is returned by each and every modulo operation for a particular number, that number is prime.
The T-SQL Solution
The first step in my solution is to create a table (Listing 2) to hold the prime numbers and populate it with the first two primes: two and three.
CREATE TABLE [dbo].[Primes](
[Prime] [int] NOT NULL,
CONSTRAINT [PK_Prime] PRIMARY KEY CLUSTERED
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
Insert into Primes (Prime) values (2)
Insert into Primes (Prime) values (3)
|Listing 2: The Prime table|
The script, a stored procedure called usp_FindPrimes, determines the first number to check: two more than the maximum prime number in the table. Another variable is used to control how many prime numbers to locate each time the proc runs.
Because SQL Server works best with sets of data, I believed it would be more efficient to write a query comparing the current prime number candidate against the list of prime numbers already discovered. Each time through the loop, the prime candidate is incremented by two so that only odd numbers are considered. If all of the modulo operations return 0, the number is inserted into the prime table.
Listing 3 shows the code to create the stored procedure.
create proc [dbo].[usp_FindPrimes] as
-- Proc: usp_FindPrimes
-- Author: Kathi Kellenberger
declare @NextInt int--This is the next number to check
declare @Count int--Count how many primes to find
declare @BaseCount int--Used to initially check the table
--if the table is empty, add the two rows
Select @BaseCount = count(prime) from primes where prime in (2,3)
If @BaseCount <> 2 begin
Delete from primes where prime in (2,3)
insert into primes (prime) values (2)
insert into primes (prime) values (3)
--Always start with two more than the last
--prime number found
select @NextInt = max(Prime) + 2
set @Count = 0
--Search for the next 1000 prime numbers
while @Count < 1000 begin
--If no modulo operations result in zero,
--add the number to the list
--We only need to check if the number is
--less or greater than the square root the prime number candidate
if not exists(
where @NextInt % Prime = 0
and sqrt(@NextInt) >= Prime)
insert into Primes(Prime) select @NextInt
set @Count = @Count + 1
--Find the next odd number
set @NextInt = @NextInt + 2
select Prime from Primes
|Listing 3: The usp_FindPrimes stored procedure|
I was surprised at how quickly the proc ran, in about two seconds, on my laptop. Because the script always begins with a number two more than the maximum prime number in the table, running the proc additional times continues to build the prime number list. The duration increased very slowly as the table grew in size.
Is the solution accurate?
Even though I had an efficient stored procedure that seemed to work, I wondered if it was really accurate.
By importing the list of the smallest 1000 known prime numbers and comparing them to my
first set of results,
I was able to verify that my result set matched the first 1000 known prime numbers.
My table contained 1002 rows after running the procedure once (remember it was seeded with two and three) and 1000 rows were returned when I joined
on the table of known primes. The KnownPrime table is included in the downloadable file. Listing 4 shows the query which returned 1000 rows.
select Prime, KnownPrime
from Primes join KnownPrimes
on Prime = KnownPrime
|Listing 4: Checking the results|
What about a cursor?
I was curious to see if there would be a big loss of performance by using a cursor based approach. To test the idea, I created a second stored procedure, usp_FindPrimes_Cursor. This procedure uses a cursor to perform each modulo operation individually instead of all of them at once for each prime number candidate. Maybe I should not have been surprised when the cursor method took 15 seconds the first time it ran on my laptop. Subsequent trials took even longer as more rows were added to the Primes table and the cursor had more rows to loop through. Once again, the set based approach consistently performed much better than a cursor. The code for usp_FindPrimes_Cursor is also included in the download file.
T-SQL is a very rich language that can be used to solve many interesting problems besides retrieving and manipulating data. While this example will not help discover any previously unknown prime numbers, the largest known prime numbers are thousands of digits long and some mathematicians devote their careers to the task, it demonstrates a creative approach to solving a problem that would be difficult to solve manually. It also demonstrates that SQL Server typically performs best when operating on sets of data versus the use of a cursor.