The Scientific Method: a call to action

  • I'm in -- except I wouldn't want everybody doing these tests to prove something that has already been proved.

    I would be more inclined to have a mix of best practices and scientific practices. If someone needs to prove an int is faster than a string in a join (in your organization) because there is a debate, then prove it. Then make it a best practice so that you don't have to keep proving it over and over.

    Also, if you are debating join performance of a string vs. an int, maybe your just having the wrong conversation altogether. Maybe the approach ought to be to ask "why do you want a string vs. an int for your key?". The answer might be "because I don't have to join to "Customer" to get CustomerName" when querying "Orders" for a specific use case.

    If this is the answer, then the root of the concern is not necessarily string vs. int. The question might be "to join or not to join".

    So the complexion of the concern might be something really different than how the individual is attempting to express his or her concern.

  • I'm right there with you, Gail. I've seen so many posts and articles on this and other sites that claim one performance fact or another and a great many of them are flat out wrong. I've also seen people that post execution plans as "proof of performance" based on 10 rows. Neither is good. There has to be a run, it has to be measured, and the correct tools have to be used to do the measurements (for example, never use SET STATISTICS to measure performance of a scalar or multi-statement UDF).

    You points about datatypes are also spot on. There have been a number of posts and articles that claim that (for example) a While loop beats certain set-based techniques and, even with supposed "proof" in the form of large dataset testing, the experiment itself was flawed and weighed heavily in favor of the While loop because of datatype mismatches.

    I'll also add that one cannot just say "I tested it and here are the results". If we all believed in such claims, we'd all believe that cold fusion was a reality. The test absolutely MUST be repeatable by others and that means the data, the code, and the test harness MUST be published.

    To wit, the words "It's a best practice to..." are some of my least favorite words. My take on those words is frequently "Just because a million people are doing it, doesn't make it a best practice because they could all be doing it wrong". A great example of that can be found in the posts that use a Recursive CTE that increments some value. It's frequently slower and more resource intensive than a While loop yet there are hundreds of posts that outright claim or strongly infer that it's a best practice especially if there are "only a small number of rows" (one of my other least favorite expressions).

    Shifting gears a bit, one of the greatest advantages of using the scientific method and posting the proof in the form of runnable code is that you can learn a lot from the community. For example, I wrote an article about a method to remove duplicated spaces that blew the doors off several While loop methods. Someone else pointed out the nested REPLACE method that blew the doors off my method in the article and then proved it with the very same million row test table that I'd created for the article.

    Yeah... I'm with you... "Cite it or prove it" and if you can't prove what's been cited (we all know that everything on the internet, especially MS articles and articles written by MVPs and other experts, are always 100% correct, right? :sick:), then it's time to get busy and make a real proof. So I'll modify the "Cite it or prove it" and add that some good bit of research should be done before one cites something written by one of the proverbial cool kids.

    Yes, there are certain people out there that I have an inherent trust for in what they write. For example, Paul Randal's and Kimberly Tripp's articles are some of the most trustworthy articles out there and it's not because they were former MS employee's or wrote parts of SQL Server(although that does help). It's because most of their articles PROVE what they have said and they've coughed extensive information on how the readers can prove it to themselves. Hear-say does not constitute a fact or best practice no matter how often it's repeated.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • meilenb (5/23/2015)


    I'm in -- except I wouldn't want everybody doing these tests to prove something that has already been proved.

    That's where the "Cite it" part of "Cite it or prove it" comes in. And, like I said, one has to make sure that what they're citing is actually correct.

    As a bit of a sidebar, though, someone attempting a re-proof may actually come up with a better method. That's one of the very strong benefits of the Scientific Method.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • meilenb (5/23/2015)


    Also, if you are debating join performance of a string vs. an int, maybe your just having the wrong conversation altogether.

    Why?

    Is there no value in knowing that there is (or is not) a difference in performance when joining on data type 1 vs data type 2?

    Maybe the approach ought to be to ask "why do you want a string vs. an int for your key?".

    That's a completely different question with it's own discussion, pros, cons, experiments, results and conclusions. If I ask the first, you can't mention benefit not having to join at all and still answer the first question

    "Should I have pizza or cheesburgers for lunch?"

    "Junk food is unhealthy, you need to make an appointment with your doctor ASAP"

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Great post Gail and thank you for making the effort!

    Far too often (now echoing Jeff) tests are shaped to the expected results regardless of the actual results (if that makes (any) sense). The bottom line in that one cannot force anyone to conform to a standard or a good practice, one can only try to lead by giving a good example, which by the way you Gail, Jeff and so many others are constantly doing, hats off to all of you!

    😎

  • My Point is that it "could" be a completely different conversation.

    Thanks for making my point without understanding it.

  • meilenb (5/23/2015)


    Thanks for making my point without understanding it.

    Would you be so kind to elaborate a little further on this, I for one don't think I fully understand your point here.

    😎

  • meilenb (5/23/2015)


    Also, if you are debating join performance of a string vs. an int, maybe your just having the wrong conversation altogether. Maybe the approach ought to be to ask "why do you want a string vs. an int for your key?". The answer might be "because I don't have to join to "Customer" to get CustomerName" when querying "Orders" for a specific use case.

    If this is the answer, then the root of the concern is not necessarily string vs. int. The question might be "to join or not to join".

    If you've been hanging around forums long enough, then you know that the question of "does it actually need to be a sting" is normally the first question. When someone asks, "Why does that matter?", the first answer out of most people's mouths is "because integers are faster for lookups" and they offer no proof and THAT's the point that Gail was posting about. 😉

    The article certainly isn't about CHAR v.s. VARCHAR v.s. INT PKs. She wisely picked an example that's not only common and, sometimes, hotly contested, but one that's incredibly simple and easy to understand by people that might have no clue as to how to use the Scientific Method.

    And, the article isn't necessarily about all the questions that led up to the normally missing proof that requires the Scientific Method.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I completely agree that there needs to be more of the scientific method applied to claims/questions about SQL Server, whether general (like Gail's example of joining on a string compared to joining on an integer), or more specific (e.g., will making a particular change to this stored procedure result in shorter execution time, and if so, is it at the cost of increased CPU, etc.?).

    Now if I could get people on board with that AND with not testing in production, I'd be incredibly happy 🙂

  • Jacob Wilkins (5/23/2015)


    .. with not testing in production, I'd be incredibly happy 🙂

    +100

    😎

  • meilenb (5/23/2015)


    My Point is that it "could" be a completely different conversation.

    It is a completely different conversation, because INT doesn't automatically mean artificial key and String doesn't automatically mean natural key. I've seen strings used as artificial keys (most commonly in the date dimension of a data warehouse), I've seen joins on date columns because people want to use them as natural keys and dates *should* behave more like ints than strings (though I haven't tested it and hence can't say for sure)

    Thanks for making my point without understanding it.

    Oh I understood your point (and thank you, no one's tried to insult me yet today), I just don't agree with it. Besides, the example of a scenario that can be tested was not the point of the article. It was an example of how to go about it. I could just as well have discussed a test on whether or not full backups break the log chain (they don't[/url]) or whether clustered indexes are best for range scans (they're not[/url]) without changing the point I was making.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Jacob Wilkins (5/23/2015)


    (e.g., will making a particular change to this stored procedure result in shorter execution time, and if so, is it at the cost of increased CPU, etc.?).

    All I'll say on that is, watch for my next article. 😉

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I don't require you to agree with me.

    I don't require you to be right either.

  • That's a great editorial, and I totally agree with it. If everyone based design and build of systems only on verified (or at least adequately tested) hypotheses we would suffer a lot less from bugs and other unintended behaviour, the cost of software/system development and support would be significantly reduced, and it is really crazy to push out assertions without any supporting evidence, which ultimately has to be text and measurement based. Unfortunately, while many developers, dbas, and engineers understand that, there are a lot of managers out there who either have forgotten it or never understood it and drive their staff into forgetting it.

    But the scientific method doesn't generally lead to proofs - the hypotheses are falsifiable by test (or at least they were until string theorists got hold of physics) but a failure to falsify merely means that the predictions the hypothesis leads to are good for certain experiments and only within the limits of accuracy of measurement, not that it's some sort of absolute truth. Many experiments may be needed to give a decent degree of confidence, and even after that we may hit circumstances where the hypothesis turns out not to fit - for example Newtonian mechanics survived expirement for centuries, but is now known to be false (although it's still valid for many purposes) because it doesn't work for things on the atomic or smaller scale and it doesn't work on the astronomical scale (gets the orbit of Mercury wrong, for example).

    Yes, I prefer to prove things - I'm still a mathematician at heart; and yes, I can prove some statements about performance because they may be simple statements of pure mathematics. But there are far more performance statements where all I can prove is that there's a decent probability that a particluar algrithm will give better performance than another, because which performs best often depends on what data is fed to them and what environment they run in, and in some cases I can't even do that to any useful extent (in which case I'd prefer not to produce any software for that problem). But I can formulate hypotheses and test them and publish the hypotheses and the tests even though by doing that I get no proof of correctness, only either a better degree of confidence or a proof of incorrectness - which is of course what happens in real science (as opposed to in maths and logic and string theory) and is what the scientific method is (or perhaps used to be) all about.

    And sometimes I want to throw some software together in order to get some measurements in order to formulate some hypotheses - experimentation to allow hypotheses to be formulated is just as much part of the scientific method as is experiment to test, and is clearly applicable in the computing world. "I did this and got these results and can't make head or tail of them" is a perfectly reasonable thing to say in a scientific paper too, and if computing and/or software engineering are science based that sort of paper must be allowed too, not just papers describing hypotheses and the experiments carried out to check them.

    Tom

  • I think managers are operating on a different set of mathematics. One formula in particular:

    ROI = (Gain from Investment - Cost of Investment) / Cost of Investment

    Yes, I hear the "to be" replies (It costs less to catch a flaw up front than to fix it later, etc.). But if you don't get the product out the door you will never have any money to fix it later. Successful people don't wait around while scientists argue the merits of their experiments.

    Everyone knows that CO2, CH4 and N2O are factors in climate change. CH4 (refrigerants) in particular is a 21 multiplier to CO2, R-508B (refrigerant) is a 10,350 multiplier, etc. etc. yet there are still thousands of scientists who argue climate change is not human caused, that it is the natural cycle of the earth. Ice sheets are breaking off, sea level rises, on and on...

    We may very well see the end of civilization while 2% of climate scientists banter about on the science of climate change.

    Sometimes we just need to take the data and combine it with common sense, make some hard decisions, and move on.

    It is in this spirit that I agree with everything Crazy 8's says.

Viewing 15 posts - 1 through 15 (of 168 total)

You must be logged in to reply to this topic. Login to reply