Generate Random Test Data

Steps to follow to create a randomised dataset for testing.

An online data generator:  www.generatedata.com.

Prepare the Database

We will be loading a lot of data into the tables. So

  • Set Instant File Initialisation
  • Set the data file size sufficient to contain the generated data set.
  • Set Bulk Logged recovery mode.
  • Set an appropriate log file size. Need not be too large if inserting to new tables in Bulk Logged recovery mode. Otherwise may need to be size of largest single insert.

Create a Numbers Table

This is a standard numbers/tally table populated with integers 1 – n with the addition of a random number. This random number can be used to select other values for the data set.  See below for rationale on choosing the random number range.

CREATE TABLE Numbers
	(Nbr int NOT NULL,
	RandomNbr int NOT NULL)

-- Load Numbers Table
DECLARE @TotalRows bigint = 2000000
DECLARE @RandomNumberRange int = 4000
;WITH
  Pass0 as (select 1 as C union all select 1), --2 rows
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),--4 rows
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),--16 rows
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),--256 rows
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),--65536 rows
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),--4,294,967,296 rows 
  Tally as (select row_number() over(order by C) as Number from Pass5)
INSERT Numbers
        (Nbr
		,RandomNbr)
    SELECT TOP(@TotalRows) Number, 
		(ABS(CHECKSUM(NewId())) % @RandomNumberRange) + 1
        FROM Tally

-- Add clustered index
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Nbr)

The recursive CTE number generator is taken from here, inter alia.

Create Random Scalar Values

Assemble a SELECT statement to suit the table you are populating from the following snippets:

INSERT INTO
  -- YourTable
SELECT
  N.Nbr,   -- incrementing primary key
  NewID(), -- random uniqueidentifier
  CHECKSUM(NewId()) % 99,  -- random integer between 0 and 99.  
     -- Dates of birth up to 75 yoa
  DATEADD(dd, - ABS(CHECKSUM(NewId())) % (365 * 75), CAST(GETDATE() AS date)), 
     -- Value in only 10 in 4000 rows (see RandomNumberRange above)
  CASE WHEN N.RandomNbr <= 10 THEN 1 ELSE 0 END
FROM Numbers N
ORDER BY N.Nbr

Select Values from a Seed File

It helps to have plausible names, addresses and contact details.  Begin by populating another staging table with values – e.g. a set of names gathered from some source.

Note how many rows are in the source table: use this number for the upper bound of Numbers.RandomNbr (see above).  The RandomNbr column is used to join to the source table.

INSERT INTO
  -- YourTable
SELECT
  ...
  PN.FirstName,
  PN.LastName,
  ..
FROM Numbers N
  INNER JOIN PersonalNames PN ON PN.PersonalNameId = N.RandomNbr
ORDER BY N.Nbr

 

How Random?

Using CHECKSUM(NewId()) %  will not produce mathematically random values; it depends on the modulo.  In fact for testing it helps not to have random distribution, unless that is the characteristic of the data being modelled.  If modelling a set of customers or sales agents it is naturally that some will be involved in more sales than another.  Creating an unbalanced distribution involves applying some function to the random number generator to control its distribution.  I have used trigonometry operators to weight a random distribution to one end of the scale.

E.g. the following weights the distribution at higher numbers.  The distribution is not very convincing. It could be improved.

CAST(ABS(SIN(CHECKSUM(NewId()) % 5000) * 1000) AS INT) + 1

References