SQLServerCentral Article

Generating Dummy Data

,

 Generating Dummy Data for Testing

I have worked on a number of projects where there hasn’t been enough data to test reports and extracts and other database applications. The database was a third party database and the vendor wouldn’t allow data to be entered via the back end. We didn’t have enough data to test response times or to tune SQL until after GO-Live. This caused a lot of rework after GO-Live which in turn affected the perception of quality of the project as a whole.

The purpose of this article is to provide a means of generating random data which can be inserted into a database for testing purposes.

The scripts and methodology allow sample data to be generated. The scripts below create a number of tables for storing different data descriptors, e.g. name, street type, number, transactions. Some tables are reused to generate street names and post codes.

If you wanted data that was only based on data type and had no other meaning you could classify the tables by data type; e.g. datetime, character and numeric.

Data is generated by creating Cartesian joins on the tables. Using the generated dataset this gives 8,000,000 (10 x 10 x 1000 x 4 x 20 x 1) rows of data. Id values are generated using a sequence. Child – Parent relationships are created by including the parent id in a Cartesian product with the child data (See below).

Diagram 1 - Populating the Application database

Diagram 2 - Use of Synonyms where changes to the application database are not allowed.

Architecture

The solution is broken into a number of layers. They are:

  • the Base Data Layer
  • the Reporting Data Layer

These layers can be within different schemas within the database.

Base Data Layer

The base data layer contains the dummy data tables, any data generation scripts, and a sequence or sequences for generating Ids.

Reporting Data Layer

This layer contains derived tables and views which create data similar to what would be seen in the application that you are testing. There may be a number of steps involved in getting data from the base tables into a format that can be imported into your application database.

If data cannot be created in the application database, create tables in the Reporting Data Layer and then create synonyms to them, populate the tables and use the synonyms for reporting. When the system goes live, point the synonyms to the application database. In this case, the Reporting Data Layer and Base Data Layers may be in a separate database to the application database.

How Cartesian joins work

A Cartesian join multiplies every row in a dataset with every row in the joined dataset. Imagine I have two tables with data as shown below:

First_Name                                        Last_Name

===========                                       ===============

Bill                                              S Preston Esq

Ted                                               Theodore Logan

Attila the Hun                                    Attila the Hun

SQL Server TSQL provides the CROSS JOIN which operates as a Cartesian join, unless you include a WHERE clause which links the two tables; e.g.

Case 1 Cartesian join

Select First_Name.Name + ' ' + Last_Name.Name Surname
 From  First_name
  Cross Join Last_Name

Result set

Surname

============

Bill S Preston Esq

Ted S Preston Esq

Attila the Hun S Preston Esq

Bill Theodore Logan

Ted Theodore Logan

Attila the Hun Theodore Logan

Bill Attila the Hun

Ted Attila the Hun

Attila the Hun Attila the Hun

Case 2 Inner join

Select First_Name.Name + ' ' + Last_Name.Name Surname
 From  First_name
  Cross Join Last_Name
 Where First_Name.Name = Last_Name.Name

Result set

Surname

========

Attila the Hun Attila the Hun

Case 3 Cartesian join with where clause

Select First_Name.Name + ' ' + Last_Name.Name Surname
 From  First_name
  Cross Join Last_Name
 Where First_Name.Name = ‘Attila the Hun’

Result set

Surname

===============================

Attila the Hun S Preston Esq

Attila the Hun Theodore Logan

Attila the Hun Attila the Hun

Creating the data

The attached scripts will generate the following database items:

Tables

There are 5 tables created with the names and rowcounts shown below.

Name Rows
First_Name 10
Last_Name 10
Numbers 1000
Street_Types 4
Transactions  20

In addition, there is a function and a sequence needed.

Function 

  • Random_Number

Sequences

  • dbo.Dummy     

  

Step 1 Create a new database.

No TSQL is included here as nearly everyone’s hardware setup will be different. The database in this example is called Dummy_Data.

Step 2 Create Database Objects

These scripts create the objects listed above.

USE Dummy_Data
GO
CREATE TABLE dbo.First_Names(
 Name VARCHAR(250) NOT NULL PRIMARY KEY
)
GO
CREATE TABLE dbo.Last_Names(
 Name VARCHAR(250) NOT NULL PRIMARY KEY
)
GO
CREATE TABLE dbo.Numbers(
 Num SMALLINT NOT NULL PRIMARY KEY
)
GO
CREATE TABLE dbo.Street_Type(
 Street_Type VARCHAR(20) NOT NULL PRIMARY KEY
)
GO
CREATE TABLE dbo.Transactions(
 Date1 DATETIME NULL,
 Date2 DATETIME NULL,
 Amount1 MONEY NULL,
 Amount2 MONEY NULL,
 Amount3 MONEY NULL
)
GO
create sequence dbo.Dummy
 start with 1
 increment by 1
GO
CREATE FUNCTION Random_Number(@Lower INT, @Upper INT, @seed FLOAT)
 RETURNS INT
AS
BEGIN
 DECLARE @Random INT;
 SELECT @Random = ROUND(((@Upper - @Lower -1) * @Seed + @Lower), 0)
 RETURN @Random
END
GO

Step 3 Populate Names Data

Next we need to insert data into these tables that will be used to generate our other data. Please feel free to modify these values as needed.

INSERT INTO First_Names VALUES('Bill')
INSERT INTO First_Names VALUES('Bob')
INSERT INTO First_Names VALUES('Buffy')
INSERT INTO First_Names VALUES('Ethan')
INSERT INTO First_Names VALUES('Jamie')
INSERT INTO First_Names VALUES('John')
INSERT INTO First_Names VALUES('Kevin')
INSERT INTO First_Names VALUES('Lee')
INSERT INTO First_Names VALUES('Lincoln')
INSERT INTO First_Names VALUES('Megan')
GO
INSERT INTO Last_Names VALUES('Blutofski')
INSERT INTO Last_Names VALUES('Brown')
INSERT INTO Last_Names VALUES('Fitz-Waller')
INSERT INTO Last_Names VALUES('Fluffy')
INSERT INTO Last_Names VALUES('Grey')
INSERT INTO Last_Names VALUES('Guinea-Pig')
INSERT INTO Last_Names VALUES('Holmes')
INSERT INTO Last_Names VALUES('Whalloper')
INSERT INTO Last_Names VALUES('Ingram')
INSERT INTO Last_Names VALUES('Bonkers')
GO
INSERT INTO street_type VALUES ('Drive')
INSERT INTO street_type VALUES ('Place')
INSERT INTO street_type VALUES ('Road')
INSERT INTO street_type VALUES ('Street')
GO

Step 4 Populate Numbers Table

The numbers table can fulfil a number of uses. The data in this table has been limited to 1000 entries, but you can change that if you feel the need. This script populates the table with consecutive numbers

DECLARE @iVal INT
BEGIN
 SET @iVal = 1;
 WHILE @iVal <= 1000
  BEGIN
   INSERT INTO Numbers VALUES(@iVal)
   SET @iVal = @iVal + 1
  END
END
GO

Step 5 Populate Transaction Data

The transaction data consists of two dates and three amounts. The first date occurs in the first half of the month and the second date occurs in the last half of the month. The three amounts have been determined at random. The second amount is 10% of the third amount, and the first amount = the second amount + the third amount. These numbers can be used to emulate sales values, profits etc. 

The script to create this data is below.

BEGIN
DECLARE @ddate DATETIME
DECLARE @ddate2 DATETIME
DECLARE @amt MONEY
DECLARE @amt2 MONEY
DECLARE @amt3 MONEY
DECLARE @dollars INT
DECLARE @cents INT
DECLARE @year INT
DECLARE @month INT
DECLARE @day INT
DECLARE @iLoop INT
SELECT @iLoop = 1
WHILE (@iLoop <= 20)
 BEGIN
  --dates
  SELECT @day = dbo.Random_Number(1,14,Rand())
  SELECT @month = dbo.Random_Number(1,12,Rand())
  SELECT @year = dbo.Random_Number(2010,2013,Rand())
  SELECT @ddate = cast(cast(@month AS VARCHAR(2)) + '/' + CAST(@day AS VARCHAR(2)) + '/' + CAST(@year AS VARCHAR(4)) AS DATETIME)
  SELECT @day = dbo.Random_Number(15,28,Rand())
  SELECT @ddate2 = CAST(CAST(@month AS VARCHAR(2)) + '/' + CAST(@day as VARCHAR(2)) + '/' + CAST(@year AS VARCHAR(4)) AS DATETIME)
  --amounts
  SELECT @dollars = dbo.Random_Number(15,1000,RAND())
  SELECT @cents = dbo.Random_Number(0,99,RAND())
  SELECT @amt = CONVERT(MONEY,CAST(@dollars AS VARCHAR(4)) + '.'
  + RIGHT('00' + CAST(@cents AS VARCHAR(2)),2))
  SELECT @dollars = @amt /1.1
  SELECT @amt3 = CONVERT(MONEY,CAST(@dollars AS VARCHAR(4)) + '.' + RIGHT('00' + CAST(@cents AS VARCHAR(2)),2))
  SELECT @amt2 = @amt - @amt3
  INSERT INTO Transactions VALUES(@ddate,@ddate2,@amt,@amt2,@amt3)
  SELECT @iLoop = @iLoop + 1
 END --loop
END
GO

Step 6 Creating Names

This step uses a CROSS JOIN to create 100 names. The id is generated using the sequence.

SELECT NEXT VALUE FOR dbo.Dummy Id,
 fn.name First_Name,
 ln.name Last_Name,
 nm.Num Street_no,
 sn.Name as street_name,
 st.Street_Type
 INTO Customer_Data
 FROM first_names fn
  CROSS JOIN last_names ln
  CROSS JOIN numbers nm
  CROSS JOIN last_names sn
  CROSS JOIN street_type st

Step 7 Creating Transactions

This step uses a CROSS JOIN to populate a new table with transactions and IDs. The Ids will match the customer Ids. Every customer will end up with a full set of 20 transactions.

SELECT nd.Id,
       tr.*
  INTO Transaction_Data
 FROM Customer_Data nd
  CROSS JOIN Transactions tr

Conclusion

This article has highlighted three things:

  • The use of CROSS JOIN

    • Without a where clause table A is multiplied by table B
    • With a where clause that links the two tables the result is the same as an inner join
    • With a where clause that limits the data in either table
  • Generating sample data by using CROSS JOIN
  • The use of synonyms to provide a level of abstraction to your reporting layer, remembering that any optimisation via indexes will still need to occur on the application database.

I hope this helps some of you.

Rate

2.42 (19)

You rated this post out of 5. Change rating

Share

Share

Rate

2.42 (19)

You rated this post out of 5. Change rating