Generating Dummy Data

Question

Post reply

Generating Dummy Data

Jamie Ingram-729524

SSCommitted

Points: 1575
More actions
August 21, 2013 at 10:30 pm

#298060

Comments posted to this topic are about the item Generating Dummy Data
We are the pilgrims, master.
We shall go always, a little further.

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply

Tony Savoie SSChasing Mays Points: 630 More actions · Answer 1

I've used this tool in the past for the same type of thing, as well as anonymizing data for testing:

http://www.matturbanowski.co.uk/?page=AnonymousData

It works well and is cost efficient 😉

sqlprof123 SSC Eights! Points: 832 More actions · Answer 2

it's a nice article. thanks.

i use the Data Generation Plan (DGP) feature available out of Visual Studio. it's much easy and no script-writing..any pros of using cross joins over DGPs?

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 3

Bhushan-594968 (8/22/2013)
it's a nice article. thanks.
i use the Data Generation Plan (DGP) feature available out of Visual Studio. it's much easy and no script-writing..any pros of using cross joins over DGPs?

It depends, I suppose. I don't use DGP for such a thing (actually, I don't bother with Visual Studio, either :-P) so I can't say much about DGPs but I can ask a couple of questions. How fast are DGPs? For example, if you want to create a transaction table with a million rows, how long does it take to do the setup (including the table layout) and how long does it take to run? How easy is it to specify a range of transaction dates, how many "customers" you want to include, and how many transactions to include for each customer? Can it setup more or less realistic data when it comes to things like names or does it use GUIDs as so many other data generators seem to do?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

rossmcneely SSC Journeyman Points: 81 More actions · Answer 4

Thanks for the article. I will be trying it out in a couple of days.

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 5

Hi Jamie,

Despite the low number of stars on this article, I think the concept of this article is actually very good but the methods used in the code bit you.

Before we start, I have to break out my soapbox a bit and this isn’t directed at you. It’s directed at the people who gave this article a low rating. If you look at the rest of the posts in this discussion, so far, they are only from people that either thought the article was ok or that have identified that they’ve used another tool. It would appear that none of the people that gave this article low marks could find even a minute to offer any constructive criticism to help a fellow member of the SQL Community at large on an article that is conceptually good but has some technical difficulties. I realize that many of us simply don’t have the time to make a helpful comment on every article that we give a low grade to but no one? I truly hope that’s not what this community is becoming.

Ok… stepping off the soapbox…

The end result of this article is that it makes a predictable and stable pair of test tables. You need something that's predictable and stable to easily support unit testing, especially when you first start coding. It also takes you step by step through the process and, except for creating a test database (which is no biggee), provides all of the code requiring no particular prior knowledge except for what a CROSS JOIN does and you provided that information with demonstrable code, as well. For what the article is about, it could, with some modification, be a good teaching tool for the basic concept of generating large amounts of test data.

I can see why some folks may have voted this article so low, though, and I thought I might give some insight. Of course, this is just my opinion. It’s not meant as any form of ridicule but rather as constructive criticism and suggestions to help you make future articles more successful in the areas of audience perception and understanding.

1. The first problem may be because of the following expectation that you setup in the article.

The purpose of this article is to provide a means of generating random data which can be inserted into a database for testing purposes.

I believe that people were expecting to see all of the data being produced to actually be random data. Instead, they see 10 hardcoded first names and 10 hardcoded last names. That’s just not “random data”.

I believe that setting the expectation of having predictable data and the reason why predictable data is such a good idea at the beginning of a code project would have gone a long way in setting the correct expectation. Something like…

When first venturing into the unknown at the beginning of a project, it’s important to have large amounts of reasonably predictable and repeatable data. Additionally, it needs to be relatively quick to set up and it needs to be even easier to reuse to regenerate clean test data to support multiple retests in the event the code changes the original test data. The data must also support the idea that names and addresses can be duplicated in the real world. For those reasons, the test data generated by the code in this article ISN’T totally random.

I think that would have gotten you an extra half or full star by itself because it sets a correct expectation in the reader’s eyes.

2. The next problem may be a problem that plagues many writers of SQL Server code-based articles. They start out with a wonderful concept and then show that they might not have done the necessary research by resorting to the use of WHILE loops (or recursive CTEs (rCTE) that “count”) to generate test data. In the case of your article, it’s a double whammy because it would appear the reason why you resorted to a WHILE loop is because of the way you generated the “Numbers” table and random data.

You have the right idea of using a “Numbers” table but you just got done explaining the power of CROSS JOINs. Even though it matters little for performance, in this case (300+ ms instead of 24 ms or less and used only once), using a WHILE loop to generate such a useful set-based tool as a “Numbers” table can be perceived as you not having done the proper research or maybe being less than qualified to write the article. There are at least 4 different comparatively high performance methods to create a “Numbers” table or function and none of them require the use of a loop or rCTE. In fact, all of them involve the use of a CROSS JOIN of one form or another. Since a good part of your article is based on the power of CROSS JOINs, it really leaves people wondering because you fell back on the “beginners” method of using a loop (RBAR) and didn’t use the power that you just explained.

The other whammy is also pretty big insofar as perception of your audience. Many people already know the power of a CROSS JOIN in the generation of data and are left wondering why you resorted to using a WHILE loop to generate the transaction data. Upon closer inspection, they also find that you’ve created a Scalar Function to support the generation of a properly constrained random number when a simple and very popular formula (indicating, again, that you may not have done the necessary research) would have done. Again, performance wise, it doesn’t matter much in this article but the use of the formula I’m talking about would have totally eliminated the need for both the WHILE loop and the Scalar Function.

As a bit of a side bar in this area, without the understanding that you’re generating mostly predictable data, people are left wondering why you have 20 identical transactions for 4 million customers. That limits the actual number of dates to a maximum of 20 and that’s not going to support performance testing for reports and the like in a broad enough scope.

For more information on the basic formula that I’m talking about and some of its permutations, please see the following articles.

http://www.sqlservercentral.com/articles/Data+Generation/87901/

http://www.sqlservercentral.com/articles/Test+Data/88964/

3. The next thing might pertain to knowing the scope of the audience. You used a function (SEQUENCE) that’s only available in 2012 in an article that has good generic application across many versions of SQL Server without showing an alternative. SQL Server 2012 is only a year or so old and there’s still a majority of people that are still “stuck” with 2005 and 2008. Showing how to use the IDENTITY function (which is available in all versions as far back as I can remember) would have been points in your favor.

The other perception problem in this area is that while you did show being “up to speed” by using a function available only in the latest production version of SQL Server, you didn’t use some of the more common basic functionality that some folks expect to see in the construction of data and the assignment of values to multiple variables. For example, you used multiple INSERT/VALUE statements instead of using SELECT/UNION ALL (all versions) or VALUES (2008+). You also used multiple adjacent SELECTs to assign values to multiple variables instead of using a single SELECT to do the same job. Again, performance wise, it’s not going to matter much for what the code is doing but it does seriously affect the impressions people walk away with.

4. Another serious perception breaker is based on some of the not-so-obvious errors in the write-up.

For example, you do explain how many rows will be produced and why but instead of the code generating the 8 million rows you said it would, it actually generates 80 million rows because you forgot to add in the additional CROSS JOIN you used to generate the street name. That “shows” people that you either didn’t pay attention to what you were writing or you didn’t test your own code enough to know what it actually does regardless of whether you actually did or not. If nothing else, it takes them by great surprise and they begin to wonder what else may be wrong with the code. Rather than taking the time to find out (people ARE busy), they give you a bad mark and flee.

In that same vein, you say “This step uses a CROSS JOIN to create 100 names”. Yes, it does do that but people are really taken back when that bit of code generates 4 million rows. It would have been better to say that the code “Generates 100 name combinations across thousands of combinations of address information for a total of 4 million rows". Again, the difference is quite subtle but can have a huge impact on people’s perception about the article. Of course, that may be a Phase II of the test data but it’s not explained anywhere and so people get the wrong perception.

To summarize, I hope the low marks on this article don’t turn you away from writing. You have the basic concepts of how to write and just need to take it to the next level. A bit more research on some of the tools you used to pull this article off would go a long way along with checking what the code returns and setting better expectations for what the code in the article will actually provide.

Thanks for taking the time to write the article and the time to read my suggestions. As “Red Green” would say, “We’re all in this together and I’m pullin’ for ya.”

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jamie Ingram-729524 SSCommitted Points: 1575 More actions · Answer 6

Thanks for the feedback Jeff. I appologise to the community for not checking my work more thoroughly.

My next arrticle will be of better quality.

Cheers

Jamie

We are the pilgrims, master.
We shall go always, a little further.

JJ B SSCertifiable Points: 5158 More actions · Answer 7

Jeff: Thanks for your sensitive and helpful reply. Other people (like myself!) than just the author will get a lot of benefit from it.

I also second your encouragement to the author not to give up. I appreciated the article.

webrunner SSC-Dedicated Points: 32231 More actions · Answer 8

Great reply, Jeff. I wondered the same things (low number of stars, hard-coded data), but think this is a great topic and I applaud the author for taking a crack at it.

I would love to see this mature into a tool that can randomize or otherwise transform real production data into realistic, but safer, data for non-production use.

Thanks again,

webrunner

-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

Jamie Ingram-729524 SSCommitted Points: 1575 More actions · Answer 9

On other projects that I have worked on, the dev data has occasionally been scrubbed so that names and phone numbers and other account details are gibberish.

Eg John Smith 0406 320453 becomes Joh$ $m$th 9999 999999, in which case you lose a lot of value that could be obtained by analyzing the data anyway.

If this prompts someone to write a better tool then I'm happy.

Cheers

We are the pilgrims, master.
We shall go always, a little further.

archangel3 Valued Member Points: 53 More actions · Answer 10

Nice article. A quick and simple way to get lots of data into tables.

webrunner SSC-Dedicated Points: 32231 More actions · Answer 11

Yes, that is usually the challenge. Especially a lot of repeats of the same dummy data.

Perhaps an idea would be to have a framework, and depending on what data needs to be tested, have different, customizable scripts for it.

So, if you want to test addresses, have a script that generates dummy, but realistic addresses, and so on. I'm thinking a "dummy data" generation database of its own that would house lookup tables to serve as the basis for the different types of dummy data that is needed. I'd think that for most cases, a set of common data would suffice (names, phones, emails, addresses). But the basic idea could be tailored for different uses.

Not the whole answer, I know, but maybe it's a start.

- webrunner

-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

Jamie Ingram-729524 SSCommitted Points: 1575 More actions · Answer 12

That's sort of what I was aiming at.

You can populate as many tables as you want with data and just use cartesian joins to give the bulk.

I have reused last names as street names.

The keyword here is bulk.

I'm happy to work with people on this framework.

Cheers

Jamie:-D

We are the pilgrims, master.
We shall go always, a little further.

webrunner SSC-Dedicated Points: 32231 More actions · Answer 13

Jamie Ingram-729524 (8/28/2013)
That's sort of what I was aiming at.
You can populate as many tables as you want with data and just use cartesian joins to give the bulk.
I have reused last names as street names.
The keyword here is bulk.
I'm happy to work with people on this framework.
Cheers
Jamie:-D

Oh, sorry! I hadn't looked at the code details, was just commenting on the general idea. I will take a closer look and test it out to see if I can offer further feedback.

- webrunner

-------------------
A SQL query walks into a bar and sees two tables. He walks up to them and asks, "Can I join you?"
Ref.: http://tkyte.blogspot.com/2009/02/sql-joke.html

kenny.ratliff1 SSC Rookie Points: 39 More actions · Answer 14

Greetings Jamie and anyone on this string -

I'm hoping to create dummy data for use in some health examples. This article look is great in getting me started that direction! However, I'm getting an error when I use the CREATE SEQUENCE that states 'Unknown object type 'sequence' used in a CREATE, DROP, or ALTER statement'. I am using SQL 2012, which I understand is the only version the SEQUENCE statement works.

Forgive me, as I'm a GIS guy and not a SQL guy! So maybe I'm missing something pretty simple?

Thanks!

Kenny