This editorial was originally published on 25 May 2015. It is being re-published as Steve is out of the office.
The scientific method is a key aspect of modern experimental and observational sciences. In academia, presenting a statement as fact without citation or proof would be enough to get a paper thrown out. Maybe it's time we, in IT, started holding ourselves to the same standards.
At its core, the Scientific Method of investigation is as follows:
- Define a hypothesis, a theory as to how something works or behaves.
- Design experiments to test whether or not the hypothesis is true.
- Conduct the experiments, capture the data, and analyze it.
- Either revise the hypothesis and repeat the process, or conclude on the basis of the data that the hypothesis is true or false. Proving a hypothesis false is just as valid and important a result as proving one true. A negative outcome in no way implies "failure" on the part of the hypothesizer.
So why, when we work in a field that is so conducive to experimentation, do so many people stop after step 1? I've lost count of the number of articles, blog posts and forum posts I've seen that stated some hypothesis as if it were absolute truth, without any testing at all. Personally, I'd like to see a renewed focus on verifying what we say, either by testing it ourselves or by referencing someone who has already done the testing. In other words, "cite it or prove it".
Sounds grand in theory, but what does it mean in practice? For example, let's say that I intend to write a blog post based on the hypothesis that "Joins on string columns perform the same as joins on integer columns."
The first step, once the hypothesis is defined, is to design the experiment. Our test requires two pairs of tables (parent and child), one pair with join columns using the integer data type, the other with a string data type. It's important to fix as many variables as possible. That means the same server, the same database version and no other user load (i.e. an isolated Azure VM or a SQL instance on a desktop or laptop). It also means the data types chosen for each pair of tables should be the same size. Integers are 4 bytes, so the equivalent string would be char(4). Ideally, we'd put the same values in both columns. However, that would limit the test data set to 9999 rows, as '10000' doesn't fit into a char(4). To fix that, we'd change the integer column to a bigint and the string column to a char(8), stick a few million rows into the child tables, and put the distinct list of join values into the parent tables.
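That setup could be sketched roughly as follows. All table and column names here are hypothetical, and the row count and value range are arbitrary choices for illustration, not part of the original design:

```sql
-- Illustrative schema only; names, row counts and value ranges are assumptions.
-- Integer pair: bigint join columns (8 bytes).
CREATE TABLE dbo.ParentInt (ID bigint NOT NULL PRIMARY KEY);
CREATE TABLE dbo.ChildInt
(
    ChildID int IDENTITY PRIMARY KEY,
    ParentID bigint NOT NULL
);

-- String pair: char(8) join columns, the same 8-byte size as bigint.
CREATE TABLE dbo.ParentStr (ID char(8) NOT NULL PRIMARY KEY);
CREATE TABLE dbo.ChildStr
(
    ChildID int IDENTITY PRIMARY KEY,
    ParentID char(8) NOT NULL
);

-- A few million rows into the child tables. The cross join of
-- sys.all_columns with itself is just a convenient row source.
INSERT INTO dbo.ChildInt (ParentID)
SELECT TOP (2000000) ABS(CHECKSUM(NEWID())) % 100000
FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b;

-- Same values, zero-padded to fill the char(8).
INSERT INTO dbo.ChildStr (ParentID)
SELECT RIGHT('00000000' + CAST(ParentID AS varchar(8)), 8)
FROM dbo.ChildInt;

-- Parents get the distinct list of join values.
INSERT INTO dbo.ParentInt (ID)
SELECT DISTINCT ParentID FROM dbo.ChildInt;

INSERT INTO dbo.ParentStr (ID)
SELECT DISTINCT ParentID FROM dbo.ChildStr;
```

Zero-padding the string values keeps every row at the full 8 characters, which keeps the row sizes in the two child tables identical.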
Step 3 is to run the tests. One test is no good; run each test multiple times to rule out any unexpected influences on the results. The easiest way is to use "GO 25", record the performance characteristics of each run using Extended Events to capture duration and CPU (reads will be the same if we got the row sizes the same, and if we didn't, that's a flaw in the experiment), and then average the results.
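As a rough sketch, the test batches might look like the following. The object names match the hypothetical schema above, and the minimal Extended Events session is an assumption about how one might capture duration and CPU, not a prescription:

```sql
-- A minimal Extended Events session (illustrative) capturing batch
-- duration and CPU to an event file.
CREATE EVENT SESSION JoinTest ON SERVER
ADD EVENT sqlserver.sql_batch_completed
ADD TARGET package0.event_file (SET filename = N'JoinTest');
GO
ALTER EVENT SESSION JoinTest ON SERVER STATE = START;
GO

-- Integer join, repeated 25 times by the GO batch-repeat syntax (SSMS/sqlcmd).
SELECT COUNT(*)
FROM dbo.ChildInt AS c
INNER JOIN dbo.ParentInt AS p ON c.ParentID = p.ID;
GO 25

-- String join, also repeated 25 times.
SELECT COUNT(*)
FROM dbo.ChildStr AS c
INNER JOIN dbo.ParentStr AS p ON c.ParentID = p.ID;
GO 25

ALTER EVENT SESSION JoinTest ON SERVER STATE = STOP;
```

Note that "GO 25" is a client-side feature of SSMS and sqlcmd, not T-SQL itself; the captured event file can then be read and the duration and CPU figures averaged per query.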
Finally, we have the data that will prove the hypothesis true or false. In this case, the results indicate that the hypothesis is false: joins on string columns are slower than joins on integer columns, and the blog post will show the tests and our conclusions.
Yes, the Scientific Method is a lot more work than making unsubstantiated claims and hitting publish, but it's also a lot more likely to produce a post that's useful, and less likely to attract a pile of comments refuting it.
So, who's in?
Gail Shaw (Guest Editor).