Steps to follow to create a randomised dataset for testing.

An online data generator: www.generatedata.com.

## Prepare the Database

We will be loading a lot of data into the tables. So

- Set Instant File Initialisation
- Set the data file size sufficient to contain the generated data set.
- Set Bulk Logged recovery mode.
- Set an appropriate log file size. Need not be too large if inserting to new tables in Bulk Logged recovery mode. Otherwise may need to be size of largest single insert.

## Create a Numbers Table

This is a standard numbers/tally table populated with integers 1 – *n* with the addition of a random number. This random number can be used to select other values for the data set. See below for rationale on choosing the random number range.

CREATE TABLE Numbers (Nbr int NOT NULL, RandomNbr int NOT NULL) -- Load Numbers Table DECLARE @TotalRows bigint = 2000000 DECLARE @RandomNumberRange int = 4000 ;WITH Pass0 as (select 1 as C union all select 1), --2 rows Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),--4 rows Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),--16 rows Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),--256 rows Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),--65536 rows Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),--4,294,967,296 rows Tally as (select row_number() over(order by C) as Number from Pass5) INSERT Numbers (Nbr ,RandomNbr) SELECT TOP(@TotalRows) Number, (ABS(CHECKSUM(NewId())) % @RandomNumberRange) + 1 FROM Tally -- Add clustered index ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Nbr)

The recursive CTE number generator is taken from here, inter alia.

## Create Random Scalar Values

Assemble a SELECT statement to suit the table you are populating from the following snippets:

INSERT INTO -- YourTable SELECT N.Nbr, -- incrementing primary key NewID(), -- random uniqueidentifier CHECKSUM(NewId()) % 99, -- random integer between 0 and 99. -- Dates of birth up to 75 yoa DATEADD(dd, - ABS(CHECKSUM(NewId())) % (365 * 75), CAST(GETDATE() AS date)), -- Value in only 10 in 4000 rows (see RandomNumberRange above) CASE WHEN N.RandomNbr <= 10 THEN 1 ELSE 0 END FROM Numbers N ORDER BY N.Nbr

## Select Values from a Seed File

It helps to have plausible names, addresses and contact details. Begin by populating another staging table with values – e.g. a set of names gathered from some source.

Note how many rows are in the source table: use this number for the upper bound of Numbers.RandomNbr (see above). The RandomNbr column is used to join to the source table.

INSERT INTO -- YourTable SELECT ... PN.FirstName, PN.LastName, .. FROM Numbers N INNER JOIN PersonalNames PN ON PN.PersonalNameId = N.RandomNbr ORDER BY N.Nbr

## How Random?

Using CHECKSUM(NewId()) % *n * will not produce mathematically random values; it depends on the modulo. In fact for testing it helps not to have random distribution, unless that is the characteristic of the data being modelled. If modelling a set of customers or sales agents it is naturally that some will be involved in more sales than another. Creating an unbalanced distribution involves applying some function to the random number generator to control its distribution. I have used trigonometry operators to weight a random distribution to one end of the scale.

E.g. the following weights the distribution at higher numbers. The distribution is not very convincing. It could be improved.

CAST(ABS(SIN(CHECKSUM(NewId()) % 5000) * 1000) AS INT) + 1

## References

- Online Test Data Generator: http://www.convertcsv.com/generate-test-data.htm (from the very useful format-mutating site.