Random fill.

  • Two tables,

    First table has one numeric field from 1 to n (consequtive).

    (thousands of rows).

    And one or more columns to hold values.

    Second table has a large number of rows (millions).

    A number of fields. One or more fields have to be filled randomly with a value from the first table.

    In the simplest scenario one column from table two has to get a random value from table 1.

    Example Table1

    1 B

    2 E

    3 N

    Table2 Before

    x y z

    p g m

    a b c

    p q r

    d e f

    Second column is target column after the fill table 2 could look like

    Table2

    x E z

    p B m

    a B c

    p N r

    d E f

    Any simple and fast script for this ?

    Ben

  • This could be one way to do it.

    UPDATE T2

    SET SomeColumn = RandomValues.SomeValue

    FROM Table2 AS T2

    CROSS APPLY (

    SELECT TOP(1) SomeValue

    FROM Table1

    ORDER BY CHECKSUM(NEWID())

    ) AS RandomValues

    -- Gianluca Sartori

  • spaghettidba (4/18/2014)


    This could be one way to do it.

    UPDATE T2

    SET SomeColumn = RandomValues.SomeValue

    FROM Table2 AS T2

    CROSS APPLY (

    SELECT TOP(1) SomeValue

    FROM Table1

    ORDER BY CHECKSUM(NEWID())

    ) AS RandomValues

    Thanks for your code example,

    But it produces only a single value in the target column.

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

    Create table Target

    (

    something varchar(30),

    SomethingElse varchar(30),

    full_name_txt varchar(30)

    )

    insert into Target values('a','g','m')

    insert into Target values('b','h','n')

    insert into Target values('c','i','o')

    insert into Target values('d','j','p')

    insert into Target values('d','k','q')

    insert into Target values('f','l','r')

    Create table Use_value

    (

    number int identity,

    column1 varchar(30),

    column2 varchar(30)

    )

    insert into Use_value values('rx','r1')

    insert into Use_value values('ry','r2')

    insert into Use_value values('rp','r3')

    insert into Use_value values('rg','r4')

    insert into Use_value values('rc','r5')

    insert into Use_value values('rd','r6')

    UPDATE T2

    SET full_name_txt = RandomValues.column1

    FROM Target AS T2

    CROSS APPLY (

    SELECT TOP(1) column1

    FROM Use_Value

    ORDER BY CHECKSUM(NEWID())

    ) AS RandomValues

    select * from Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

    somethingSomethingElsefull_name_txt

    agry

    bhry

    ciry

    djry

    dkry

    flry

    A working example:

    But this is not simple.

    Although it is fast.

    With a CTE I could use a virtual column RND, but that was slow in performance.

    --

    -- Step 1 add an extra column (rnd) to a table.

    -- Step 2 determine the maximum number of rows in the feed table. (count(*) and max(number) should have the same value).

    -- Step 3 give the extra column a random value.

    -- Step 4 set the column with the 'appointed' random value'

    -- Step 5 Remove the extra column

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

    Create table Target

    (

    something varchar(30),

    SomethingElse varchar(30),

    full_name_txt varchar(30)

    )

    insert into Target values('a','g','m')

    insert into Target values('b','h','n')

    insert into Target values('c','i','o')

    insert into Target values('d','j','p')

    insert into Target values('d','k','q')

    insert into Target values('f','l','r')

    Create table Use_value

    (

    number int identity,

    column1 varchar(30),

    column2 varchar(30)

    )

    insert into Use_value values('rx','r1')

    insert into Use_value values('ry','r2')

    insert into Use_value values('rp','r3')

    insert into Use_value values('rg','r4')

    insert into Use_value values('rc','r5')

    insert into Use_value values('rd','r6')

    BEGIN TRANSACTION

    ALTER TABLE Target ADD

    rnd int NULL

    COMMIT

    declare @Max_number int

    select @max_number = count(*) from Target

    UPDATE Target SET rnd =

    convert(integer,(ROUND(rand(checksum(newid()))* @max_number,0,1)))+1

    -- fill the column according to the random number

    UPDATE T SET full_name_txt = V.column1

    FROM Target T

    JOIN Use_value V ON rnd = V.number

    select * from Target

    BEGIN TRANSACTION

    ALTER TABLE Target

    DROP COLUMN rnd

    COMMIT

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

    /*

    -- produces

    somethingSomethingElsefull_name_txtrnd

    agrc5

    bhry2

    circ5

    djry2

    dkry2

    flrx1

    */

    Sorry that I didn't come up with an example.

    And now have changed the names of the tables/columns Sorry. (:blush:)

    Ben

  • My bad.

    The subquery returns always the same value because it is not correlated with the outer query.

    Well, looks like you solved your issue.

    -- Gianluca Sartori

  • spaghettidba (4/18/2014)


    My bad.

    The subquery returns always the same value because it is not correlated with the outer query.

    Well, looks like you solved your issue.

    This was the old solution.

    But I often run into trouble with this old script. **)

    It is part of a larger and more complex script.

    I would prefere a 'simpler' script without the extra columns.

    So I am trying to get rid of the extra RND column.

    Functional I succeeded with a CTE script but the performance is far less than the solution with the extra columns.

    So I am still looking for a simpler script which does perform well.

    (Have not succeeded yet).

    Ben.

    **)

    The extra columns can give all sort of problems.

    Collation problems.

    Code can not be rerun if aborted in a previous run.

    Makes rows longer and therefore can create deferred rows.

  • You don't want to modify the target table structure, as that will have lots of overhead.

    Does the target table have a unique key column(s)? If not, can you add an IDENTITY column to the target table to serve as a key?

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • WITH T1 AS (

    SELECT *, randowCol = NEWID()

    FROM Target

    ),

    T2 AS (

    SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

    FROM T1

    )

    UPDATE T2

    SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

    -- Gianluca Sartori

  • ScottPletcher (4/18/2014)


    You don't want to modify the target table structure, as that will have lots of overhead.

    Actually adding a RND column to the table, using it and removing it hardly produces overhead. That solution was a lot faster than using a virtual RND column.

    (The extra RND column can become a real problem in some scenario's, it's not a performance problem, but does disturb a number of things).

    ScottPletcher (4/18/2014)


    Does the target table have a unique key column(s)? If not, can you add an IDENTITY column to the target table to serve as a key?

    No I cannot change the table permanently, only temporary as adding and later on removing the RND column. Adding an IDENTITY gives far far more overhead than adding the RND, so that is not a desired solution. The script is used on serveral very different databases on a number of tables. Some have an identity, some have unique key columns, but there are variations. (Code is used for more than 60 or 70 tables in three different databases).

    But how would a Unique key and/or an identity help. Maybe I can adjust the code to that. I have full control over the code/script not over the table desing.

    Ben.

    **)

    In my current version I have made the code for adding and removing the RND column more dynamic removing a part of the issues. But some issues still remain.

  • spaghettidba (4/18/2014)


    WITH T1 AS (

    SELECT *, randowCol = NEWID()

    FROM Target

    ),

    T2 AS (

    SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

    FROM T1

    )

    UPDATE T2

    SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

    Just came home for dinner and just saw this solution.

    I'll try this after dinner.

    Thanks, hope that this works better than my CTE solution, this would reduce the complexity of the code.

    (Better for maintenance and usage of the code).

    Thanks, I'll report later on.

    Ben

  • ben.brugman (4/18/2014)


    spaghettidba (4/18/2014)


    WITH T1 AS (

    SELECT *, randowCol = NEWID()

    FROM Target

    ),

    T2 AS (

    SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

    FROM T1

    )

    UPDATE T2

    SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

    Just came home for dinner and just saw this solution.

    I'll try this after dinner.

    Thanks, hope that this works better than my CTE solution, this would reduce the complexity of the code.

    (Better for maintenance and usage of the code).

    Thanks, I'll report later on.

    Ben

    Sorry but this did not work.

    The Rn number was 1 to the max of the target table. (Not the scale of the Use_value table).

    Numbers of RN Ran far higher than the number of rows in the Use_table, producing an error that a NULL could not be inserted.

    I could take a modula function on that but then it would be a repeating list not a random list.

    Thanks, but this did not work.

    I'll show the slow CTE sollution which worked but very slowly.

    But have to adjust that to the names of the example.

    (And run a test how long that is taking).

    Ben

  • print '-- Start of use_value.full_name_txt : '+convert(varchar(30),getdate(),126)

    --

    -- Get the max from the use_value table.

    --

    declare @Max_number int

    select @max_number = count(*) from use_value

    print 'number use_value, full_name_txt :'+convert(varchar(10), @max_number)

    ;

    WITH

    T AS (SELECT *, convert(integer,(ROUND(rand(checksum(newid()))* @max_number,0,1))) AS RND2 FROM Target)

    UPDATE T SET full_name_txt = b.column1

    FROM T

    JOIN use_value B

    ON rnd2 = B.number

    print '-- END of use_value.full_name_txt : '+convert(varchar(30),getdate(),126)

    This code is now allready running for over 10 minutes. (not finished yet) (still running 20 minutes)(aborted it after 27 minutes)

    The code with the extra RND field run for less than 2 minutes.

    Ben

  • Well, it worked on the sample data you provided.

    If you gave us more significant sample data, maybe someone could come up with something better.

    -- Gianluca Sartori

  • spaghettidba (4/18/2014)


    Well, it worked on the sample data you provided.

    If you gave us more significant sample data, maybe someone could come up with something better.

    Both tables where of the same size. I should have provided larger sample tables and of different sizes.

    But even with the tables of the same size your algoritm always produces a table were each value from the use_value table is used exactly once.

    If the target table would be 3 rows, only the first three rows from the use_value table would be used one time each.

    So this is not very random.

    Had I used tables of different sizes his effect would have been more visible.

    Sorry to have provided tables of the same sizes.

    See the code below.

    (Only the first rows of use_value are used and each only once).

    Ben

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

    Create table Target

    (

    something varchar(30),

    SomethingElse varchar(30),

    full_name_txt varchar(30)

    )

    insert into Target values('a','g','m')

    insert into Target values('b','h','n')

    insert into Target values('c','i','o')

    insert into Target values('d','j','p')

    -- insert into Target values('d','k','q')

    -- insert into Target values('f','l','r')

    -- insert into Target values('f2','l3','r3')

    -- insert into Target values('f1','l2','r3')

    Create table Use_value

    (

    number int identity,

    column1 varchar(30),

    column2 varchar(30)

    )

    insert into Use_value values('111','r1')

    insert into Use_value values('222','r2')

    insert into Use_value values('333','r3')

    insert into Use_value values('444','r4')

    insert into Use_value values('555','r5')

    insert into Use_value values('666','r6')

    insert into Use_value values('777','r6')

    insert into Use_value values('888','r6')

    insert into Use_value values('999','r6')

    ;

    WITH T1 AS (

    SELECT *, randowCol = NEWID()

    FROM Target

    ),

    T2 AS (

    SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

    FROM T1

    )

    UPDATE T2

    SET full_name_txt = (SELECT column1 FROM Use_value V WHERE RN = V.number)

    select * from Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Target') drop table Target

    If exists(select * from information_schema.TABLES where TABLE_NAME = 'Use_value') drop table Use_value

  • Oh, I see what you mean.

    Try this:

    WITH T1 AS (

    SELECT *, randowCol = NEWID()

    FROM Target

    ),

    T2 AS (

    SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

    FROM T1

    ),

    T3 AS (

    SELECT *, randowCol = NEWID()

    FROM Use_value

    ),

    T4 AS (

    SELECT *, RN = ROW_NUMBER() OVER (ORDER BY randowCol)

    FROM T3

    )

    UPDATE T2

    SET full_name_txt = (SELECT column1 FROM T4 AS V WHERE V.RN = T2.RN )

    -- Gianluca Sartori

  • spaghettidba (4/19/2014)


    Oh, I see what you mean.

    Try this:

    Tried the solution.

    And I am trying to understand the solution.

    First the solution does not work if the target table is larger than the use_value table.

    Understanding the solution.

    At first I did not (and maybe still do not) understand the solution.

    T2 gives an unique incremental number for each row.

    T4 gives an unique incremental number for each row.

    So connecting these two numbers 'should' give only one unique combination for each row.

    Running the software the result is that some number are used more then once or not. (As we want for the random requirement).

    I did not understand this.

    So I made a breakdown of the solution.

    Created a T2 table by using : SELECT * INTO T2 from T2

    Created a T2 table by using : SELECT * INTO T4 from T4

    Both at the end of the with statement.

    Then I run the update statement which can then be run exactly as it is at the end of the with statement.

    Then the affected column contains unique values.

    So clearly breaking the With statement down to distinct statements gives a different result.

    A conclusion for the moment could be:

    Within the WITH statement the T3 and T4 component are executed once for each target row.

    This could be an explanation why my slow solution is so extremely slow. Also using this solution on a target table with a lot of rows, this solution would be very slow.

    This is an important learning moment for me. Here is an example were the WITH construction behaves different from a 'stepped' constructing.

    The example:

    The example is kept 'small' so it is easier to inspect the result. The actual number or rows varies a lot but typical for the target is millions of rows and for the use_value it is thousands of row. But there are also occurences where the target and or the use_value table have a very limited number of rows. The target table even might be empty. The use_value table should at least have one row.

    There was only one example, I kept the target small so that it would be easy to inspect the endresult.

    I hoped that the problem was stated clearly enought to give this limited example. Sorry for that.

    Thanks for your time and attention,

    at least now I have an excelent example where the WITH statement behaves differently from a stepped solution.

    And this could be the anwser why my solution was extremely slow. (Thanks for your help with that).

    (Probably because the statements are not deterministic, by the random function).

    Thanks,

    Ben

Viewing 15 posts - 1 through 15 (of 28 total)

You must be logged in to reply to this topic. Login to reply