Introduction
On a particularly idle evening while randomly surfing Wikipedia I stumbled on a page about a mathematical problem called the Collatz conjecture. Now fear not Dear Reader, this is not an article about mathematics. It is an example case study in how I converted a plain English problem into T-SQL.
The wiki article describes the vast amount of study into this problem with the inference being that at any one time there are numerous guru mathematicians armed with a blistering arsenal of supercomputers hard at work trying to solve it. But the actual process is very simple and can be described in just a few sentences. This got me to thinking what it would take to build a T-SQL procedure as a proof of concept to test this particular problem, especially given that (1) it is in a field that I know little about, ie mathematics and (2) I don't even know if it is doable to use T-SQL for this. Thus I will set myself the task of writing a T-SQL script that tests the Collatz conjecture
Description of the Collatz conjecture
First let's look at a plain English description of the instructions for testing the Collatz conjecture. For reference, we'll call this "the sequence":
- think of a number (a positive integer)
- if the number is even, divide it by two
- if the number is odd, triple it and add one
- repeat using the result instead of thinking of a number
Here's an example:
- I'll start with 10 as my starting number
- (10 is even so I divide by 2) 10 / 2 = 5
- (the result (5) is odd so I triple it and add 1) 5 * 3 + 1 = 16
- (the result (16) is even so I divide it by 2) 16 / 2 = 8
- (the result (8) is even so I divide it by 2) 8 / 2 = 4
- (the result (4) is even so I divide it by 2) 4 / 2 = 2
- (the result (2) is even so I divide it by 2) 2 / 2 = 1
Once I get to a result of 1 I stop because the Collatz conjecture states that:
"This process will eventually reach a result of 1, regardless of which positive integer is chosen initially."
To prove the Collatz conjecture, one must find a way to show that all starting numbers will eventually reach a result of 1. To disprove it, one must find a value that never reaches a result of 1.
Interpreting the task
So we begin the task of converting the plain English description of the Collatz conjecture into tsql. Typically we are encouraged to put in a fair bit of planning before we write any code. Sketch out architecture, define variables, prepare test data, that sort of thing. These are usually good things to do but given that I do not know anything about mathematics and I do not know if I can even do this in T-SQL, I do not believe that I can perform these planning tasks with any degree of usefulness. Which brings me to the point of this article:
"sometimes it's ok to start coding code right away"
Contentious I know but hear me out and we'll see how we go. Having decided that I will not do the usual planning, I still need to figure out how I am going to approach this task. And I find that for a task like this, the best approach is:
"start from the middle and work outwards"
Sounds a bit simplistic? Perhaps yes but I find that with most tasks, it is helpful to ask myself 'what is at the heart of this task?'. That in turn can be pondered by considering questions like:
- What is the central operation of the task?
- What operation is most important to get right?
- What operation is the most difficult?
- What operation has the biggest impact if it is wrong?
Looking at the wording of the Collatz conjecture, I deduce that these are the highlights of what this task is about:
- a decision about a number being odd or even
- a calculation
- a repeat of the process
For this task, I make an assessment and I decide that it is the calculation that I should start with first. I will then enhance the code in iterations, adding the other operations as I progress. Each iteration will be runnable so I can see if I am on the right track.
The code
So my first piece of code is to perform a calculation on a number. I'll start with the number 10. Being even, the calculation I perform on it is to divide it by 2.
--Collatz conjecture test. iteration 1 declare @NumberToTest bigint = 10 set @NumberToTest = @NumberToTest / 2 select @NumberToTest
If I run that I get a result of 5. Being an odd number I then need to take the number 5, triple it and add 1:
--Collatz conjecture test. iteration 2 declare @NumberToTest bigint = 5 set @NumberToTest = (@NumberToTest * 3) + 1 select @NumberToTest
So far, so good. Now I'll look at the decision. To start, I'll make sql server determine if the number is even:
--Collatz conjecture test. iteration 3 declare @NumberToTest bigint = 10 if @NumberToTest%2 = 0 --evaluates true if @NumberToTest is even set @NumberToTest = @NumberToTest / 2 select @NumberToTest
or odd:
--Collatz conjecture test. iteration 4 declare @NumberToTest bigint = 5 if @NumberToTest%2 = 1 --evaluates true if @NumberToTest is odd set @NumberToTest = (@NumberToTest * 3) + 1 select @NumberToTest
Now I'll combine the two operations:
--Collatz conjecture test. iteration 5 declare @NumberToTest bigint = 10 if @NumberToTest%2 = 0 --even set @NumberToTest = @NumberToTest / 2 else if @NumberToTest%2 = 1 --odd set @NumberToTest = (@NumberToTest * 3) + 1 select @NumberToTest
Now for the repeat operation. I'll put in a loop that ends when a result of 1 is reached:
--Collatz conjecture test. iteration 6 declare @NumberToTest bigint = 10 while @NumberToTest <> 1 begin if @NumberToTest%2 = 0 --even set @NumberToTest = @NumberToTest / 2 else if @NumberToTest%2 = 1 --odd set @NumberToTest = (@NumberToTest * 3) + 1 end select @NumberToTest
There is not much output to tell me what has happened so I'll output the value each time the sequence is repeated. That means I can see the running value:
--Collatz conjecture test. iteration 7 set nocount on declare @NumberToTest bigint = 10 select 'starting number: ' + cast(@NumberToTest as varchar(50)) while @NumberToTest <> 1 begin if @NumberToTest%2 = 0 --even set @NumberToTest = @NumberToTest / 2 else if @NumberToTest%2 = 1 --odd set @NumberToTest = (@NumberToTest * 3) + 1 select @NumberToTest as Result end
According to the wiki article, one of the things I am interested in is the count of how many repetitions of the sequence it took to reach 1. This is called the total stopping time. So I'll put the results into a table and tally them at the end of the script:
--Collatz conjecture test. iteration 8 set nocount on declare @NumberToTest bigint = 10 select 'starting number: ' + cast(@NumberToTest as varchar(50)) if exists(select 1 from sys.objects where name = 'SequenceValues') drop table SequenceValues create table SequenceValues (SequenceValue bigint) while @NumberToTest <> 1 begin if @NumberToTest%2 = 0 --even set @NumberToTest = @NumberToTest / 2 else if @NumberToTest%2 = 1 --odd set @NumberToTest = (@NumberToTest * 3) + 1 insert into SequenceValues (SequenceValue) select @NumberToTest end select SequenceValue from SequenceValues select count (*) as TotalStoppingTime from SequenceValues
Life is too short to run the script for individual values so I'll get it to run the sequence for a range of numbers, not just one. To start with, I'll test the numbers from 10 to 20
--Collatz conjecture test. iteration 9 set nocount on declare @NumberRangeStart bigint = 10 declare @NumberRangeEnd bigint = 20 declare @TotalStoppingTime bigint declare @NumberToTest bigint if exists(select 1 from sys.objects where name = 'TotalStoppingTimes') drop table TotalStoppingTimes create table TotalStoppingTimes (StartingNumber bigint,TotalStoppingTime bigint) while @NumberRangeStart < @NumberRangeEnd begin if exists(select 1 from sys.objects where name = 'SequenceValues') drop table SequenceValues create table SequenceValues (SequenceValue bigint) set @NumberToTest = @NumberRangeStart select 'number to test: ' + cast(@NumberToTest as varchar(50)) while @NumberToTest <> 1 begin if @NumberToTest%2 = 0 --even set @NumberToTest = @NumberToTest / 2 else if @NumberToTest%2 = 1 --odd set @NumberToTest = (@NumberToTest * 3) + 1 insert into SequenceValues (SequenceValue) select @NumberToTest end select SequenceValue from SequenceValues select count (*) as TotalStoppingTime from SequenceValues set @NumberRangeStart = @NumberRangeStart + 1 end
But now I'm getting a bit swamped with the output so I'll select the summary results out of the table each time the sequence reaches one and only look at the total counts. To make it a bit more interesting, I'll try it with a range of larger numbers:
--Collatz conjecture test. iteration 10 set nocount on declare @NumberRangeStart bigint = 30000000000000 declare @NumberRangeEnd bigint = 30000000001000 declare @TotalStoppingTime bigint declare @NumberToTest bigint if exists(select 1 from sys.objects where name = 'TotalStoppingTimes') drop table TotalStoppingTimes create table TotalStoppingTimes (StartingNumber bigint, TotalStoppingTime bigint) while @NumberRangeStart < @NumberRangeEnd begin if exists(select 1 from sys.objects where name = 'SequenceValues') drop table SequenceValues create table SequenceValues (SequenceValue bigint) set @NumberToTest = @NumberRangeStart while @NumberToTest <> 1 begin if @NumberToTest%2 = 0 --even set @NumberToTest = @NumberToTest / 2 else if @NumberToTest%2 = 1 --odd set @NumberToTest = (@NumberToTest * 3) + 1 insert into SequenceValues (SequenceValue) select @NumberToTest end set @TotalStoppingTime = (select count(*) from SequenceValues) insert into TotalStoppingTimes(StartingNumber, TotalStoppingTime) values(@NumberRangeStart, @TotalStoppingTime) set @NumberRangeStart = @NumberRangeStart + 1 end select StartingNumber, TotalStoppingTime from TotalStoppingTimes
I now have a script that tests the Collatz conjecture on whatever range of numbers I give it. It is not exactly an efficient script but it has served the purpose of demonstrating that the task was achievable. My next step will be to rework it to be more efficient. I won't do that here as that is not the point of this article.
Note that my script does not actually prove or disprove the Collatz conjecture in the mathematical sense. It merely cycles through values on the assumption that the sequence will eventually reach a value of 1. If I happen to run it for a value that disproves the Collatz conjecture (ie a starting value that runs through the sequence but never reaches a value of 1), my only clue that this has happened will be a blank screen while my query window is busy running which is not helpful. A simple way to monitor its progress while running is to execute this script in another query window:
select max(StartingNumber) + 1 as NumberBeingTested from TotalStoppingTimes select count(*) as CountOfRepetitions from SequenceValues
The Wikipedia page says that the Collatz conjecture has already been tested using numbers much bigger than can be managed by SQL Server so there is no risk of my script putting any mathematicians out of a job. So I'll get back to the point of the article.
Back to the point
Remember that this article is a look at one example of how to convert plain English into T-SQL. For this particular task, I chose not to map out my variables or architecture in advance. I started with just a couple of lines of code that performed the operation that I decided is at the heart of the problem at hand. I then evolved that code by enhancing it in small iterations, each of which was runnable and produced its own results.
I find this a good way to build code, especially when I am building a solution for a problem where there is no certainty that tsql is the right tool for the job - a prototype or proof of concept. The results of each iteration are easily comparable to the results of the previous iteration which gives me a higher degree of confidence that my solution is correct as I build it.
It is a fair comment that developing code in iterations like this might not produce the most efficient code for the first few iterations. Indeed there are numerous opportunities for improvement of my script. The advantage is that having built a working prototype, it will not be much extra effort to rework it to be more efficient.
Conclusion
When converting a plain English problem into T-SQL, it may be effective to start coding right away by building the most central or important operation then enhancing the code by adding to it in iterations. This might especially be appropriate when building a prototype or proof of concept.