SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Generating Dummy Data


Generating Dummy Data

Author
Message
Jamie Ingram-729524
Jamie Ingram-729524
Old Hand
Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)

Group: General Forum Members
Points: 319 Visits: 424
Comments posted to this topic are about the item Generating Dummy Data

We are the pilgrims, master.
We shall go always, a little further.

Tony Savoie
Tony Savoie
SSC Rookie
SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)SSC Rookie (40 reputation)

Group: General Forum Members
Points: 40 Visits: 1037
I've used this tool in the past for the same type of thing, as well as anonymizing data for testing:

http://www.matturbanowski.co.uk/?page=AnonymousData

It works well and is cost efficient Wink
sqlprof123
sqlprof123
SSC Rookie
SSC Rookie (38 reputation)SSC Rookie (38 reputation)SSC Rookie (38 reputation)SSC Rookie (38 reputation)SSC Rookie (38 reputation)SSC Rookie (38 reputation)SSC Rookie (38 reputation)SSC Rookie (38 reputation)

Group: General Forum Members
Points: 38 Visits: 91
it's a nice article. thanks.

i use the Data Generation Plan (DGP) feature available out of Visual Studio. it's much easy and no script-writing..any pros of using cross joins over DGPs?
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86816 Visits: 41103
Bhushan-594968 (8/22/2013)
it's a nice article. thanks.

i use the Data Generation Plan (DGP) feature available out of Visual Studio. it's much easy and no script-writing..any pros of using cross joins over DGPs?


It depends, I suppose. I don't use DGP for such a thing (actually, I don't bother with Visual Studio, either :-P) so I can't say much about DGPs but I can ask a couple of questions. How fast are DGPs? For example, if you want to create a transaction table with a million rows, how long does it take to do the setup (including the table layout) and how long does it take to run? How easy is it to specify a range of transaction dates, how many "customers" you want to include, and how many transactions to include for each customer? Can it setup more or less realistic data when it comes to things like names or does it use GUIDs as so many other data generators seem to do?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
rossmcneely
rossmcneely
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 1
Thanks for the article. I will be trying it out in a couple of days.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86816 Visits: 41103
Hi Jamie,

Despite the low number of stars on this article, I think the concept of this article is actually very good but the methods used in the code bit you.

Before we start, I have to break out my soapbox a bit and this isn’t directed at you. It’s directed at the people who gave this article a low rating. If you look at the rest of the posts in this discussion, so far, they are only from people that either thought the article was ok or that have identified that they’ve used another tool. It would appear that none of the people that gave this article low marks could find even a minute to offer any constructive criticism to help a fellow member of the SQL Community at large on an article that is conceptually good but has some technical difficulties. I realize that many of us simply don’t have the time to make a helpful comment on every article that we give a low grade to but no one? I truly hope that’s not what this community is becoming.

Ok… stepping off the soapbox…

The end result of this article is that it makes a predictable and stable pair of test tables. You need something that's predictable and stable to easily support unit testing, especially when you first start coding. It also takes you step by step through the process and, except for creating a test database (which is no biggee), provides all of the code requiring no particular prior knowledge except for what a CROSS JOIN does and you provided that information with demonstrable code, as well. For what the article is about, it could, with some modification, be a good teaching tool for the basic concept of generating large amounts of test data.

I can see why some folks may have voted this article so low, though, and I thought I might give some insight. Of course, this is just my opinion. It’s not meant as any form of ridicule but rather as constructive criticism and suggestions to help you make future articles more successful in the areas of audience perception and understanding.

1. The first problem may be because of the following expectation that you setup in the article.
The purpose of this article is to provide a means of generating random data which can be inserted into a database for testing purposes.


I believe that people were expecting to see all of the data being produced to actually be random data. Instead, they see 10 hardcoded first names and 10 hardcoded last names. That’s just not “random data”.

I believe that setting the expectation of having predictable data and the reason why predictable data is such a good idea at the beginning of a code project would have gone a long way in setting the correct expectation. Something like…
When first venturing into the unknown at the beginning of a project, it’s important to have large amounts of reasonably predictable and repeatable data. Additionally, it needs to be relatively quick to set up and it needs to be even easier to reuse to regenerate clean test data to support multiple retests in the event the code changes the original test data. The data must also support the idea that names and addresses can be duplicated in the real world. For those reasons, the test data generated by the code in this article ISN’T totally random.


I think that would have gotten you an extra half or full star by itself because it sets a correct expectation in the reader’s eyes.

2. The next problem may be a problem that plagues many writers of SQL Server code-based articles. They start out with a wonderful concept and then show that they might not have done the necessary research by resorting to the use of WHILE loops (or recursive CTEs (rCTE) that “count”) to generate test data. In the case of your article, it’s a double whammy because it would appear the reason why you resorted to a WHILE loop is because of the way you generated the “Numbers” table and random data.

You have the right idea of using a “Numbers” table but you just got done explaining the power of CROSS JOINs. Even though it matters little for performance, in this case (300+ ms instead of 24 ms or less and used only once), using a WHILE loop to generate such a useful set-based tool as a “Numbers” table can be perceived as you not having done the proper research or maybe being less than qualified to write the article. There are at least 4 different comparatively high performance methods to create a “Numbers” table or function and none of them require the use of a loop or rCTE. In fact, all of them involve the use of a CROSS JOIN of one form or another. Since a good part of your article is based on the power of CROSS JOINs, it really leaves people wondering because you fell back on the “beginners” method of using a loop (RBAR) and didn’t use the power that you just explained.

The other whammy is also pretty big insofar as perception of your audience. Many people already know the power of a CROSS JOIN in the generation of data and are left wondering why you resorted to using a WHILE loop to generate the transaction data. Upon closer inspection, they also find that you’ve created a Scalar Function to support the generation of a properly constrained random number when a simple and very popular formula (indicating, again, that you may not have done the necessary research) would have done. Again, performance wise, it doesn’t matter much in this article but the use of the formula I’m talking about would have totally eliminated the need for both the WHILE loop and the Scalar Function.

As a bit of a side bar in this area, without the understanding that you’re generating mostly predictable data, people are left wondering why you have 20 identical transactions for 4 million customers. That limits the actual number of dates to a maximum of 20 and that’s not going to support performance testing for reports and the like in a broad enough scope.

For more information on the basic formula that I’m talking about and some of its permutations, please see the following articles.

http://www.sqlservercentral.com/articles/Data+Generation/87901/
http://www.sqlservercentral.com/articles/Test+Data/88964/

3. The next thing might pertain to knowing the scope of the audience. You used a function (SEQUENCE) that’s only available in 2012 in an article that has good generic application across many versions of SQL Server without showing an alternative. SQL Server 2012 is only a year or so old and there’s still a majority of people that are still “stuck” with 2005 and 2008. Showing how to use the IDENTITY function (which is available in all versions as far back as I can remember) would have been points in your favor.

The other perception problem in this area is that while you did show being “up to speed” by using a function available only in the latest production version of SQL Server, you didn’t use some of the more common basic functionality that some folks expect to see in the construction of data and the assignment of values to multiple variables. For example, you used multiple INSERT/VALUE statements instead of using SELECT/UNION ALL (all versions) or VALUES (2008+). You also used multiple adjacent SELECTs to assign values to multiple variables instead of using a single SELECT to do the same job. Again, performance wise, it’s not going to matter much for what the code is doing but it does seriously affect the impressions people walk away with.

4. Another serious perception breaker is based on some of the not-so-obvious errors in the write-up.

For example, you do explain how many rows will be produced and why but instead of the code generating the 8 million rows you said it would, it actually generates 80 million rows because you forgot to add in the additional CROSS JOIN you used to generate the street name. That “shows” people that you either didn’t pay attention to what you were writing or you didn’t test your own code enough to know what it actually does regardless of whether you actually did or not. If nothing else, it takes them by great surprise and they begin to wonder what else may be wrong with the code. Rather than taking the time to find out (people ARE busy), they give you a bad mark and flee.

In that same vein, you say “This step uses a CROSS JOIN to create 100 names”. Yes, it does do that but people are really taken back when that bit of code generates 4 million rows. It would have been better to say that the code “Generates 100 name combinations across thousands of combinations of address information for a total of 4 million rows". Again, the difference is quite subtle but can have a huge impact on people’s perception about the article. Of course, that may be a Phase II of the test data but it’s not explained anywhere and so people get the wrong perception.

To summarize, I hope the low marks on this article don’t turn you away from writing. You have the basic concepts of how to write and just need to take it to the next level. A bit more research on some of the tools you used to pull this article off would go a long way along with checking what the code returns and setting better expectations for what the code in the article will actually provide.

Thanks for taking the time to write the article and the time to read my suggestions. As “Red Green” would say, “We’re all in this together and I’m pullin’ for ya.”

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jamie Ingram-729524
Jamie Ingram-729524
Old Hand
Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)

Group: General Forum Members
Points: 319 Visits: 424
Thanks for the feedback Jeff. I appologise to the community for not checking my work more thoroughly.
My next arrticle will be of better quality.


Cheers

Jamie

We are the pilgrims, master.
We shall go always, a little further.

JJ B
JJ B
Mr or Mrs. 500
Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)Mr or Mrs. 500 (527 reputation)

Group: General Forum Members
Points: 527 Visits: 2859
Jeff: Thanks for your sensitive and helpful reply. Other people (like myself!) than just the author will get a lot of benefit from it.

I also second your encouragement to the author not to give up. I appreciated the article.
webrunner
webrunner
SSCarpal Tunnel
SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)SSCarpal Tunnel (4.2K reputation)

Group: General Forum Members
Points: 4167 Visits: 3886
Great reply, Jeff. I wondered the same things (low number of stars, hard-coded data), but think this is a great topic and I applaud the author for taking a crack at it.

I would love to see this mature into a tool that can randomize or otherwise transform real production data into realistic, but safer, data for non-production use.

Thanks again,
webrunner

-------------------
"I love spending twice as long and working twice as hard to get half as much done!" – Nobody ever.
Ref.: http://www.adminarsenal.com/admin-arsenal-blog/powershell-how-to-write-your-first-powershell-script

"Operator! Give me the number for 911!" - Homer Simpson

"A SQL query walks into a bar and sees two tables. He walks up to them and says 'Can I join you?'"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html
Jamie Ingram-729524
Jamie Ingram-729524
Old Hand
Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)Old Hand (319 reputation)

Group: General Forum Members
Points: 319 Visits: 424
On other projects that I have worked on, the dev data has occasionally been scrubbed so that names and phone numbers and other account details are gibberish.

Eg John Smith 0406 320453 becomes Joh$ $m$th 9999 999999, in which case you lose a lot of value that could be obtained by analyzing the data anyway.

If this prompts someone to write a better tool then I'm happy.

Cheers

We are the pilgrims, master.
We shall go always, a little further.

Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search