Random fill.

Question

Post reply

Random fill.

ben.brugman

SSChampion

Points: 13369
More actions
April 18, 2014 at 6:37 am

#290412

Two tables,
First table has one numeric field from 1 to n (consequtive).
(thousands of rows).
And one or more columns to hold values.
Second table has a large number of rows (millions).
A number of fields. One or more fields have to be filled randomly with a value from the first table.
In the simplest scenario one column from table two has to get a random value from table 1.
Example Table1
1 B
2 E
3 N
Table2 Before
x y z
p g m
a b c
p q r
d e f
Second column is target column after the fill table 2 could look like
Table2
x E z
p B m
a B c
p N r
d E f
Any simple and fast script for this ?
Ben

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply

spaghettidba SSC Guru Points: 105732 More actions · Answer 1

This could be one way to do it.

UPDATE T2

SET SomeColumn = RandomValues.SomeValue

FROM Table2 AS T2

CROSS APPLY (

SELECT TOP(1) SomeValue

FROM Table1

ORDER BY CHECKSUM(NEWID())

) AS RandomValues

-- Gianluca Sartori

ben.brugman SSChampion Points: 13369 More actions · Answer 2

spaghettidba (4/18/2014)
This could be one way to do it.
UPDATE T2
SET SomeColumn = RandomValues.SomeValue
FROM Table2 AS T2
CROSS APPLY (
SELECT TOP(1) SomeValue
FROM Table1
ORDER BY CHECKSUM(NEWID())
) AS RandomValues

Thanks for your code example,

But it produces only a single value in the target column.

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

Create table Target

(

something varchar(30),

SomethingElse varchar(30),

full_name_txt varchar(30)

)

insert into Target values('a','g','m')

insert into Target values('b','h','n')

insert into Target values('c','i','o')

insert into Target values('d','j','p')

insert into Target values('d','k','q')

insert into Target values('f','l','r')

Create table Use_value

(

number int identity,

column1 varchar(30),

column2 varchar(30)

)

insert into Use_value values('rx','r1')

insert into Use_value values('ry','r2')

insert into Use_value values('rp','r3')

insert into Use_value values('rg','r4')

insert into Use_value values('rc','r5')

insert into Use_value values('rd','r6')

UPDATE T2

SET full_name_txt = RandomValues.column1

FROM Target AS T2

CROSS APPLY (

SELECT TOP(1) column1

FROM Use_Value

ORDER BY CHECKSUM(NEWID())

) AS RandomValues

select * from Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

somethingSomethingElsefull_name_txt

agry

bhry

ciry

djry

dkry

flry

A working example:

But this is not simple.

Although it is fast.

With a CTE I could use a virtual column RND, but that was slow in performance.

--

-- Step 1 add an extra column (rnd) to a table.

-- Step 2 determine the maximum number of rows in the feed table. (count(*) and max(number) should have the same value).

-- Step 3 give the extra column a random value.

-- Step 4 set the column with the 'appointed' random value'

-- Step 5 Remove the extra column

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

Create table Target

(

something varchar(30),

SomethingElse varchar(30),

full_name_txt varchar(30)

)

insert into Target values('a','g','m')

insert into Target values('b','h','n')

insert into Target values('c','i','o')

insert into Target values('d','j','p')

insert into Target values('d','k','q')

insert into Target values('f','l','r')

Create table Use_value

(

number int identity,

column1 varchar(30),

column2 varchar(30)

)

insert into Use_value values('rx','r1')

insert into Use_value values('ry','r2')

insert into Use_value values('rp','r3')

insert into Use_value values('rg','r4')

insert into Use_value values('rc','r5')

insert into Use_value values('rd','r6')

BEGIN TRANSACTION

ALTER TABLE Target ADD

rnd int NULL

COMMIT

declare @Max_number int

select @max_number = count(*) from Target

UPDATE Target SET rnd =

convert(integer,(ROUND(rand(checksum(newid()))* @max_number,0,1)))+1

-- fill the column according to the random number

UPDATE T SET full_name_txt = V.column1

FROM Target T

JOIN Use_value V ON rnd = V.number

select * from Target

BEGIN TRANSACTION

ALTER TABLE Target

DROP COLUMN rnd

COMMIT

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

/*

-- produces

somethingSomethingElsefull_name_txtrnd

agrc5

bhry2

circ5

djry2

dkry2

flrx1

*/

Sorry that I didn't come up with an example.

And now have changed the names of the tables/columns Sorry. (:blush:)

Ben

spaghettidba SSC Guru Points: 105732 More actions · Answer 3

My bad.

The subquery returns always the same value because it is not correlated with the outer query.

Well, looks like you solved your issue.

-- Gianluca Sartori

ben.brugman SSChampion Points: 13369 More actions · Answer 4

spaghettidba (4/18/2014)
My bad.
The subquery returns always the same value because it is not correlated with the outer query.
Well, looks like you solved your issue.

This was the old solution.

But I often run into trouble with this old script. **)

It is part of a larger and more complex script.

I would prefere a 'simpler' script without the extra columns.

So I am trying to get rid of the extra RND column.

Functional I succeeded with a CTE script but the performance is far less than the solution with the extra columns.

So I am still looking for a simpler script which does perform well.

(Have not succeeded yet).

Ben.

**)

The extra columns can give all sort of problems.

Collation problems.

Code can not be rerun if aborted in a previous run.

Makes rows longer and therefore can create deferred rows.

ScottPletcher SSC Guru Points: 100942 More actions · Answer 5

You don't want to modify the target table structure, as that will have lots of overhead.

Does the target table have a unique key column(s)? If not, can you add an IDENTITY column to the target table to serve as a key?

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

spaghettidba SSC Guru Points: 105732 More actions · Answer 6

WITH T1 AS (

SELECT *, randowCol = NEWID()

FROM Target

),

T2 AS (

SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

FROM T1

)

UPDATE T2

SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

-- Gianluca Sartori

ben.brugman SSChampion Points: 13369 More actions · Answer 7

ScottPletcher (4/18/2014)
You don't want to modify the target table structure, as that will have lots of overhead.

Actually adding a RND column to the table, using it and removing it hardly produces overhead. That solution was a lot faster than using a virtual RND column.

(The extra RND column can become a real problem in some scenario's, it's not a performance problem, but does disturb a number of things).

ScottPletcher (4/18/2014)
Does the target table have a unique key column(s)? If not, can you add an IDENTITY column to the target table to serve as a key?

No I cannot change the table permanently, only temporary as adding and later on removing the RND column. Adding an IDENTITY gives far far more overhead than adding the RND, so that is not a desired solution. The script is used on serveral very different databases on a number of tables. Some have an identity, some have unique key columns, but there are variations. (Code is used for more than 60 or 70 tables in three different databases).

But how would a Unique key and/or an identity help. Maybe I can adjust the code to that. I have full control over the code/script not over the table desing.

Ben.

**)

In my current version I have made the code for adding and removing the RND column more dynamic removing a part of the issues. But some issues still remain.

ben.brugman SSChampion Points: 13369 More actions · Answer 8

spaghettidba (4/18/2014)
WITH T1 AS ( SELECT *, randowCol = NEWID() FROM Target ), T2 AS ( SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol) FROM T1 ) UPDATE T2 SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

Just came home for dinner and just saw this solution.

I'll try this after dinner.

Thanks, hope that this works better than my CTE solution, this would reduce the complexity of the code.

(Better for maintenance and usage of the code).

Thanks, I'll report later on.

Ben

ben.brugman SSChampion Points: 13369 More actions · Answer 9

ben.brugman (4/18/2014)
spaghettidba (4/18/2014)
WITH T1 AS ( SELECT *, randowCol = NEWID() FROM Target ), T2 AS ( SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol) FROM T1 ) UPDATE T2 SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)
Just came home for dinner and just saw this solution.
I'll try this after dinner.
Thanks, hope that this works better than my CTE solution, this would reduce the complexity of the code.
(Better for maintenance and usage of the code).
Thanks, I'll report later on.
Ben

Sorry but this did not work.

The Rn number was 1 to the max of the target table. (Not the scale of the Use_value table).

Numbers of RN Ran far higher than the number of rows in the Use_table, producing an error that a NULL could not be inserted.

I could take a modula function on that but then it would be a repeating list not a random list.

Thanks, but this did not work.

I'll show the slow CTE sollution which worked but very slowly.

But have to adjust that to the names of the example.

(And run a test how long that is taking).

Ben

ben.brugman SSChampion Points: 13369 More actions · Answer 10

print '-- Start of use_value.full_name_txt : '+convert(varchar(30),getdate(),126)

--

-- Get the max from the use_value table.

--

declare @Max_number int

select @max_number = count(*) from use_value

print 'number use_value, full_name_txt :'+convert(varchar(10), @max_number)

;

WITH

T AS (SELECT *, convert(integer,(ROUND(rand(checksum(newid()))* @max_number,0,1))) AS RND2 FROM Target)

UPDATE T SET full_name_txt = b.column1

FROM T

JOIN use_value B

ON rnd2 = B.number

print '-- END of use_value.full_name_txt : '+convert(varchar(30),getdate(),126)

This code is now allready running for over 10 minutes. (not finished yet) (still running 20 minutes)(aborted it after 27 minutes)

The code with the extra RND field run for less than 2 minutes.

Ben

spaghettidba SSC Guru Points: 105732 More actions · Answer 11

Well, it worked on the sample data you provided.

If you gave us more significant sample data, maybe someone could come up with something better.

-- Gianluca Sartori

ben.brugman SSChampion Points: 13369 More actions · Answer 12

spaghettidba (4/18/2014)
Well, it worked on the sample data you provided.
If you gave us more significant sample data, maybe someone could come up with something better.

Both tables where of the same size. I should have provided larger sample tables and of different sizes.

But even with the tables of the same size your algoritm always produces a table were each value from the use_value table is used exactly once.

If the target table would be 3 rows, only the first three rows from the use_value table would be used one time each.

So this is not very random.

Had I used tables of different sizes his effect would have been more visible.

Sorry to have provided tables of the same sizes.

See the code below.

(Only the first rows of use_value are used and each only once).

Ben

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

Create table Target

(

something varchar(30),

SomethingElse varchar(30),

full_name_txt varchar(30)

)

insert into Target values('a','g','m')

insert into Target values('b','h','n')

insert into Target values('c','i','o')

insert into Target values('d','j','p')

-- insert into Target values('d','k','q')

-- insert into Target values('f','l','r')

-- insert into Target values('f2','l3','r3')

-- insert into Target values('f1','l2','r3')

Create table Use_value

(

number int identity,

column1 varchar(30),

column2 varchar(30)

)

insert into Use_value values('111','r1')

insert into Use_value values('222','r2')

insert into Use_value values('333','r3')

insert into Use_value values('444','r4')

insert into Use_value values('555','r5')

insert into Use_value values('666','r6')

insert into Use_value values('777','r6')

insert into Use_value values('888','r6')

insert into Use_value values('999','r6')

;

WITH T1 AS (

SELECT *, randowCol = NEWID()

FROM Target

),

T2 AS (

SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

FROM T1

)

UPDATE T2

SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

select * from Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

spaghettidba SSC Guru Points: 105732 More actions · Answer 13

Oh, I see what you mean.

Try this:

WITH T1 AS (

SELECT *, randowCol = NEWID()

FROM Target

),

T2 AS (

SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

FROM T1

),

T3 AS (

SELECT *, randowCol = NEWID()

FROM Use_value

),

T4 AS (

SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

FROM T3

)

UPDATE T2

SET full_name_txt = (SELECT column1 FROM T4 AS V WHERE V.RN = T2.RN )

-- Gianluca Sartori

ben.brugman SSChampion Points: 13369 More actions · Answer 14

spaghettidba (4/19/2014)
Oh, I see what you mean.
Try this:

Tried the solution.

And I am trying to understand the solution.

First the solution does not work if the target table is larger than the use_value table.

Understanding the solution.

At first I did not (and maybe still do not) understand the solution.

T2 gives an unique incremental number for each row.

T4 gives an unique incremental number for each row.

So connecting these two numbers 'should' give only one unique combination for each row.

Running the software the result is that some number are used more then once or not. (As we want for the random requirement).

I did not understand this.

So I made a breakdown of the solution.

Created a T2 table by using : SELECT * INTO T2 from T2

Created a T2 table by using : SELECT * INTO T4 from T4

Both at the end of the with statement.

Then I run the update statement which can then be run exactly as it is at the end of the with statement.

Then the affected column contains unique values.

So clearly breaking the With statement down to distinct statements gives a different result.

A conclusion for the moment could be:

Within the WITH statement the T3 and T4 component are executed once for each target row.

This could be an explanation why my slow solution is so extremely slow. Also using this solution on a target table with a lot of rows, this solution would be very slow.

This is an important learning moment for me. Here is an example were the WITH construction behaves different from a 'stepped' constructing.

The example:

The example is kept 'small' so it is easier to inspect the result. The actual number or rows varies a lot but typical for the target is millions of rows and for the use_value it is thousands of row. But there are also occurences where the target and or the use_value table have a very limited number of rows. The target table even might be empty. The use_value table should at least have one row.

There was only one example, I kept the target small so that it would be easy to inspect the endresult.

I hoped that the problem was stated clearly enought to give this limited example. Sorry for that.

Thanks for your time and attention,

at least now I have an excelent example where the WITH statement behaves differently from a stepped solution.

And this could be the anwser why my solution was extremely slow. (Thanks for your help with that).

(Probably because the statements are not deterministic, by the random function).

Thanks,

Ben