Despite the low number of stars on this article, I think the concept of this article is actually very good but the methods used in the code bit you.
Before we start, I have to break out my soapbox a bit, and this isn't directed at you. It's directed at the people who gave this article a low rating. If you look at the rest of the posts in this discussion so far, they're only from people who either thought the article was ok or mentioned that they've used another tool. It would appear that none of the people who gave this article low marks could find even a minute to offer any constructive criticism to help a fellow member of the SQL community at large on an article that is conceptually good but has some technical difficulties. I realize that many of us simply don't have the time to make a helpful comment on every article we give a low grade to, but no one? I truly hope that's not what this community is becoming.
Ok… stepping off the soapbox…
The end result of this article is a predictable and stable pair of test tables. You need something predictable and stable to easily support unit testing, especially when you first start coding. The article also takes you step by step through the process and, except for creating a test database (which is no biggee), provides all of the code, requiring no particular prior knowledge except for what a CROSS JOIN does, and you provided that information with demonstrable code as well. For what the article is about, it could, with some modification, be a good teaching tool for the basic concept of generating large amounts of test data.
I can see why some folks may have voted this article so low, though, and I thought I might give some insight. Of course, this is just my opinion. It’s not meant as any form of ridicule but rather as constructive criticism and suggestions to help you make future articles more successful in the areas of audience perception and understanding.
1. The first problem may be because of the following expectation that you set up in the article.
The purpose of this article is to provide a means of generating random data which can be inserted into a database for testing purposes.
I believe that people were expecting to see all of the data being produced to actually be random data. Instead, they see 10 hardcoded first names and 10 hardcoded last names. That’s just not “random data”.
I believe that setting the expectation of having predictable data and the reason why predictable data is such a good idea at the beginning of a code project would have gone a long way in setting the correct expectation. Something like…
When first venturing into the unknown at the beginning of a project, it’s important to have large amounts of reasonably predictable and repeatable data. Additionally, it needs to be relatively quick to set up and it needs to be even easier to reuse to regenerate clean test data to support multiple retests in the event the code changes the original test data. The data must also support the idea that names and addresses can be duplicated in the real world. For those reasons, the test data generated by the code in this article ISN’T totally random.
I think that would have gotten you an extra half or full star by itself because it sets a correct expectation in the reader’s eyes.
2. The next problem may be a problem that plagues many writers of SQL Server code-based articles. They start out with a wonderful concept and then show that they might not have done the necessary research by resorting to the use of WHILE loops (or recursive CTEs (rCTE) that “count”) to generate test data. In the case of your article, it’s a double whammy because it would appear the reason why you resorted to a WHILE loop is because of the way you generated the “Numbers” table and random data.
You have the right idea of using a "Numbers" table but you just got done explaining the power of CROSS JOINs. Even though it matters little for performance in this case (300+ ms instead of 24 ms or less, and it's used only once), using a WHILE loop to generate such a useful set-based tool as a "Numbers" table can be perceived as you not having done the proper research or maybe being less than qualified to write the article. There are at least 4 comparatively high-performance methods to create a "Numbers" table or function, and none of them requires the use of a loop or rCTE. In fact, all of them involve the use of a CROSS JOIN of one form or another. Since a good part of your article is based on the power of CROSS JOINs, it really leaves people wondering why you fell back on the "beginner's" method of using a loop (RBAR) and didn't use the power that you just explained.
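For illustration, here's a sketch of one of those CROSS JOIN-based methods (the "cascading CTE" style). None of this code is from the article; the table and CTE names are my own, and the VALUES row constructor shown requires 2008 or later (a SELECT/UNION ALL would do the same on older versions).

```sql
-- Build a 1,000,000-row "Numbers" table with cascaded CROSS JOINs; no loop, no rCTE.
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),
                                     (1),(1),(1),(1),(1)) v(N)), -- 10 rows
     E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),              -- 100 rows
     E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)               -- 10,000 rows
SELECT TOP (1000000)
       N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
  INTO #Numbers
  FROM E4 a CROSS JOIN E4 b;                                     -- up to 100,000,000 rows
```

Because the CTEs only ever project the constant 1, the optimizer never materializes more rows than the TOP asks for, which is why this pattern stays fast.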
The other whammy is also pretty big insofar as your audience's perception goes. Many people already know the power of a CROSS JOIN in the generation of data and are left wondering why you resorted to using a WHILE loop to generate the transaction data. Upon closer inspection, they also find that you've created a Scalar Function to support the generation of a properly constrained random number when a simple and very popular formula (indicating, again, that you may not have done the necessary research) would have done. Again, performance-wise, it doesn't matter much in this article, but the use of the formula I'm talking about would have totally eliminated the need for both the WHILE loop and the Scalar Function.
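The formula in question looks like this (a sketch; the variable names and the assumption that a "Numbers" table like the article's already exists are mine):

```sql
-- ABS(CHECKSUM(NEWID())) % @Range + @MinValue produces a random integer
-- in [@MinValue, @MinValue + @Range - 1]. NEWID() is re-evaluated for every
-- row, so each row gets its own value with no loop and no scalar function.
DECLARE @MinValue INT = 1,
        @Range    INT = 20;

SELECT N,
       ABS(CHECKSUM(NEWID())) % @Range + @MinValue AS RandomInt
  FROM #Numbers;  -- assumed pre-built "Numbers" table
```

The same expression works inline for random dates and amounts by adding the result to a base date or scaling it, which is what makes it such a handy replacement for a constrained-random-number function.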
As a bit of a sidebar in this area, without the understanding that you're generating mostly predictable data, people are left wondering why you have 20 identical transactions for 4 million customers. That limits the actual number of dates to a maximum of 20, and that's not going to support performance testing for reports and the like in a broad enough scope.
For more information on the basic formula that I'm talking about and some of its permutations, please see the following articles:
http://www.sqlservercentral.com/articles/Data+Generation/87901/
http://www.sqlservercentral.com/articles/Test+Data/88964/
3. The next thing might pertain to knowing the scope of the audience. You used a feature (SEQUENCE) that's only available in 2012 in an article that has good generic application across many versions of SQL Server, without showing an alternative. SQL Server 2012 is only a year or so old and there's still a majority of people "stuck" with 2005 and 2008. Showing how to use the IDENTITY function (which has been available in every version as far back as I can remember) would have been points in your favor.
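As a version-independent sketch of that alternative (table and column names here are hypothetical, not the article's): the IDENTITY function, used with SELECT/INTO, numbers the rows as the table is built.

```sql
-- IDENTITY(data_type, seed, increment) works only in a SELECT ... INTO,
-- and runs on every version of SQL Server, unlike a 2012-only SEQUENCE.
SELECT CustomerID = IDENTITY(INT, 1, 1),
       fn.FirstName,
       ln.LastName
  INTO #Customers
  FROM #FirstNames fn          -- hypothetical 10-row name tables,
 CROSS JOIN #LastNames ln;     -- as in the article's CROSS JOIN step
```

The resulting #Customers table carries a sequential surrogate key with no extra objects to create or drop between test runs.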
The other perception problem in this area is that while you did show being "up to speed" by using a feature available only in the latest production version of SQL Server, you didn't use some of the more common basic functionality that some folks expect to see in the construction of data and in the assignment of values to multiple variables. For example, you used multiple single-row INSERT/VALUES statements instead of using SELECT/UNION ALL (all versions) or a multi-row VALUES constructor (2008+). You also used multiple adjacent SELECTs to assign values to multiple variables instead of using a single SELECT to do the same job. Again, performance-wise, it's not going to matter much for what the code is doing, but it does seriously affect the impressions people walk away with.
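For the record, here's a sketch of the more compact constructs I mean (all names here are hypothetical, not taken from the article):

```sql
-- Multi-row insert with a VALUES row constructor (2008 and later)...
INSERT INTO #FirstNames (FirstName)
VALUES ('Ann'), ('Bob'), ('Carl');

-- ...or with SELECT/UNION ALL, which works on every version.
INSERT INTO #FirstNames (FirstName)
SELECT 'Dawn' UNION ALL
SELECT 'Ed'   UNION ALL
SELECT 'Fay';

-- Assign several variables in one SELECT instead of several adjacent SELECTs.
DECLARE @MinDate DATETIME, @MaxDate DATETIME;
SELECT @MinDate = MIN(TransactionDate),
       @MaxDate = MAX(TransactionDate)
  FROM #Transactions;
```

None of this changes what the code produces; it just reads the way experienced T-SQL folks expect it to read.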
4. Another serious perception breaker is based on some of the not-so-obvious errors in the write-up.
For example, you do explain how many rows will be produced and why but instead of the code generating the 8 million rows you said it would, it actually generates 80 million rows because you forgot to add in the additional CROSS JOIN you used to generate the street name. That “shows” people that you either didn’t pay attention to what you were writing or you didn’t test your own code enough to know what it actually does regardless of whether you actually did or not. If nothing else, it takes them by great surprise and they begin to wonder what else may be wrong with the code. Rather than taking the time to find out (people ARE busy), they give you a bad mark and flee.
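The multiplication effect is easy to demonstrate on a trivial scale (a minimal sketch, not the article's code): a CROSS JOIN always returns the product of its inputs' row counts, so one overlooked 10-row CROSS JOIN multiplies the whole result by 10.

```sql
-- CROSS JOIN cardinality is the product of the inputs: 3 rows x 4 rows = 12 rows.
-- Adding one more 10-row CROSS JOIN would multiply that by 10 again,
-- which is exactly how 8 million intended rows become 80 million actual rows.
SELECT COUNT(*) AS RowCnt
  FROM (VALUES (1),(2),(3))     a (n)
 CROSS JOIN
       (VALUES (1),(2),(3),(4)) b (n);  -- returns 12
```

Multiplying out the row counts of every joined set before publishing is a quick sanity check that would have caught the discrepancy.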
In that same vein, you say "This step uses a CROSS JOIN to create 100 names". Yes, it does do that, but people are really taken aback when that bit of code generates 4 million rows. It would have been better to say that the code "generates 100 name combinations across thousands of combinations of address information for a total of 4 million rows". Again, the difference is quite subtle but can have a huge impact on people's perception of the article. Of course, that may be a Phase II of the test data, but it's not explained anywhere and so people get the wrong perception.
To summarize, I hope the low marks on this article don't turn you away from writing. You have the basic concepts of how to write and just need to take it to the next level. A bit more research on some of the tools you used to pull this article off would go a long way, along with checking what the code actually returns and setting better expectations for what the code in the article will provide.
Thanks for taking the time to write the article and the time to read my suggestions. As “Red Green” would say, “We’re all in this together and I’m pullin’ for ya.”
RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell, doesn't mean you should.

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs