Performance Question

Thanks for all the good hints:

I modified my UDF and now I can show the code:

public static SqlString RNGCharacterMask(int codesize, string characters)
{
    int maxSize = codesize;
    int size = maxSize;
    char[] chars = characters.ToCharArray();
    byte[] data = new byte[size];
    RNGCryptoServiceProvider crypto = new RNGCryptoServiceProvider();
    crypto.GetNonZeroBytes(data);
    StringBuilder result = new StringBuilder(size);
    foreach (byte b in data)
    {
        result.Append(chars[b % chars.Length]);
    }
    return result.ToString();
}

Hope that clarifies my problem a little bit.

Thanks in advance
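For comparison, the byte-to-character mapping such a function aims for is usually written as an indexed lookup into the alphabet. A quick Python sketch of the idea (illustrative names, not the poster's code); note that because 256 is not a multiple of 20, a plain modulo slightly favours the first 256 % 20 = 16 characters of the alphabet:

```python
import secrets

ALPHABET = "34679CDFGHJKLNPRTVXY"  # the 20-character set from this thread

def rng_character_mask(code_size: int, characters: str) -> str:
    """Map each cryptographically random byte onto the alphabet by index.

    Mirrors the chars[b % chars.Length] idiom; residues 0..15 occur
    13 times per 256 byte values, residues 16..19 only 12 times, so the
    distribution is very slightly biased toward the first 16 characters.
    """
    data = secrets.token_bytes(code_size)
    return "".join(characters[b % len(characters)] for b in data)

code = rng_character_mask(8, ALPHABET)
```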

  • Ouch.

I would modify the CLR UDF to return a data set of @amount codes and store those in a temp table.

Then delete the duplicates against the codes table and re-execute the UDF for the rowcount returned by the delete.

Repeat until there are no duplicate rows.

This will minimize the number of round trips to the UDF, plus the SQL logic is much more set-based.

I've never used the RNGCryptoServiceProvider class, but you may only need one instance of it if GetBytes returns a different value on each call.

As another thought, though: does all this function do is return a string of random characters from '34679CDFGHJKLNPRTVXY'?

If so, a pure T-SQL solution should be pretty simple.

Post back if this isn't clear.



    Clear Sky SQL
My Blog

Can you please provide a sample T-SQL application which replaces the UDF?

Thanks

Try this; I'll be interested to hear how it performs in comparison.

Note that the CASE statement on the call to GetRandomCode is required, otherwise you will get the same value returned for every row.

On my dev server it creates 15,000,000 rows in about 11 minutes; cutting the union down to your limited character set should improve matters.

    CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);

    GO

    DECLARE @i INT;

    SELECT @i = 1;

    WHILE @i <= 10000

    BEGIN

    INSERT INTO dbo.Numbers(Num) VALUES (@i);

    SELECT @i = @i + 1;

    END;

    Create View VwNewId

    as

    Select New_Id = newid()

    go

    Drop Function GetRandomCode

    go

    Create Function GetRandomCode(@CharsNeeded integer)

    returns table

    as

    return

    (

    with cteCharsNeeded(Character)

    as(

select '1' union all select '2' union all select '3' union all select '4' union all select '5' union all select '6' union all select '7' union all select '8' union all select '9' union all select '0' union all

    select 'A' union all select 'B' union all select 'C' union all select 'D' union all select 'E' union all select 'F' union all select 'G' union all select 'H' union all Select 'I' union all

    select 'J' union all select 'K' union all select 'L' union all select 'M' union all select 'N' union all select 'O' union all select 'P' union all select 'Q' union all Select 'R' union all

    select 'S' union all select 'T' union all select 'U' union all select 'V' union all select 'W' union all select 'X' union all select 'Y' union all select 'Z' union all

    select 'a' union all select 'b' union all select 'c' union all select 'd' union all select 'e' union all select 'f' union all select 'g' union all select 'h' union all select 'i' union all

    select 'j' union all select 'k' union all select 'l' union all select 'm' union all select 'n' union all select 'o' union all select 'p' union all select 'q' union all select 'r' union all

    select 's' union all select 't' union all select 'u' union all select 'v' union all select 'w' union all select 'x' union all select 'y' union all select 'z'

    ),

    ctenumbers(num)

    as

    (

    Select Num from numbers

    where Num <= @CharsNeeded

    ),

    cteRandomChars(num,c)

    as

    (

    select num,c = chars.c

    from ctenumbers cross apply (select top 1 Character from cteCharsNeeded,vwNewId where num = num order by new_id) as chars(c)

    )

    select (

    select c as [text()]

    from cteRandomChars

    for xml path('')) as Random

    )

    go

    drop table #res

    go

    create table #res

    (

    random char(8)

    )

    insert into #res

    select Random from sysobjects cross apply GetRandomCode(case when id is not null then 8 else null end)



    Clear Sky SQL
My Blog

OK, give this a try...

1,000,000 random strings in 24 seconds

    CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);

    GO

    DECLARE @i INT;

    SELECT @i = 1;

    WHILE @i <= 1000

    BEGIN

    INSERT INTO dbo.Numbers(Num) VALUES (@i);

    SELECT @i = @i + 1;

    END;

    go

    update statistics Numbers with fullscan

    go

    drop View VwNewCheck

    go

    Create View VwNewCheck

    as

    Select abs(checksum(NewId())) as New_Id

    go

    Drop Function GetRandomCode

    go

    Create Function GetRandomCode(@StrLen integer,@CharsNeeded char(62))

    returns table

    as

    return

    (

    with cteRandomChars(num,c)

    as

    (

    Select Num,substring(@CharsNeeded,(Select VwNewCheck.new_id%(len(@CharsNeeded)-1)+1 from VwNewCheck where num = num ),1)

    from numbers

    where Num <= @StrLen

    )

    select (

    select c as [text()]

    from cteRandomChars

    for xml path('')) as random

    )

    go

    create table #random

    (

    random char(8)

    )

    insert into #random

    select top 15000000 Random from sysobjects cross apply sysobjects so1 cross apply GetRandomCode(case when so1.id is not null or sysobjects.id is not null then 8 else 0 end,'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789')

    select @@rowcount



    Clear Sky SQL
My Blog

  • Once you have a way of generating many codes all at once, you need a way to efficiently check for duplicates, and to reach your target for new unique codes.

    The script below illustrates some techniques. I wrote this before seeing Dave Ballantyne's posts, so I use a different method to generate 'random' codes.

    This site will ruin the formatting, so I have attached it as a file as well.

    Paul

    -- Safe database

    USE tempdb;

    GO

    -- Drop our test table if it exists

    IF OBJECT_ID(N'dbo.UniqueCodes', N'U') IS NOT NULL DROP TABLE dbo.UniqueCodes;

    GO

    -- Create the test table

    CREATE TABLE dbo.UniqueCodes

    (

code VARCHAR(10) NOT NULL UNIQUE, -- our unique codes

hash_code INTEGER NOT NULL -- a hash of the unique code to optimize lookups

    );

    GO

    -- Create an index on the hash code

    CREATE NONCLUSTERED INDEX nc1 ON dbo.UniqueCodes (hash_code);

    --

    -- Main script start

    -- (you may have to run this several times before unique code collisions start to occur)

    --

DECLARE @RowsToAdd BIGINT, -- Unique codes to add to the table

@MaxGen BIGINT, -- Maximum number of random codes to generate (some may already exist)

@MaxPasses INT, -- Maximum number of passes to make before giving up

@Magic BIGINT, -- A small number of extra rows to cope with generated duplicates and duplicates in the table (optimization, not critical)

@pass INT; -- The current pass

    --

    -- Change these values to experiment

    --

SELECT @RowsToAdd = 25000, -- We want to add this many codes

@MaxGen = 25100, -- Prepared to generate up to this many codes on each pass

@Magic = 25, -- This magic number can help produce a plan with a flow distinct if not many rows are needed

@MaxPasses = 5, -- Prepared to try five times

@pass = 0; -- First pass (don't change this)

    -- Optional

    -- UPDATE STATISTICS dbo.UniqueCodes;

    -- Loop while we have unique codes to find, and have passes remaining

WHILE (@RowsToAdd > 0 AND @pass < @MaxPasses)

BEGIN

-- Generate a batch of candidate codes, each 1 to 10 random upper-case characters

WITH Random (data) AS

(

SELECT TOP (@MaxGen)

CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) +

CASE WHEN RAND() > 0.1 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.2 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.3 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.4 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.5 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.6 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.7 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.8 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END +

CASE WHEN RAND() > 0.9 THEN CHAR(RAND(CHECKSUM(NEWID())) * 26 + 65) ELSE '' END

FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3

    ),

    A (data) AS

    (

-- Generate the hash code (a simple CHECKSUM will do here)

-- and filter out any duplicate codes we generated

SELECT DISTINCT

TOP (@RowsToAdd + @Magic)

N.data

FROM Random N

    ),

    B (data, hash_code, may_clash) AS

    (

    -- Create a flag which is set if the code may already exist in the target table

    -- If the flag is zero, the code definitely does not already exist, so we can add it

SELECT TOP (@RowsToAdd + @Magic)

data,

CHECKSUM(data),

CASE WHEN EXISTS (SELECT * FROM dbo.UniqueCodes U WHERE U.hash_code = CHECKSUM(A.data)) THEN 1 ELSE 0 END

FROM A

    )

    -- Add a maximum of @RowsToAdd rows from CTE 'B'

    -- The flag allows the optimizer to use start-up filters to optimize lookups on the existing data

INSERT dbo.UniqueCodes (code, hash_code)

SELECT TOP (@RowsToAdd)

B.data, B.hash_code

FROM B

WHERE may_clash = 0 -- Safe to add if the flag is zero

    OR(

    -- If the flag is set, we need to check the code explicitly, since

    -- the hash shows a possible match

    may_clash = 1

    AND NOT EXISTS

    (

    -- Can add if the code does not already exist

SELECT *

FROM dbo.UniqueCodes U

WHERE U.code = B.data

    )

    )

OPTION (RECOMPILE); -- The optimal plan may change significantly between calls

    -- Adjust the number of codes we need to find, and increment the pass counter

SELECT @RowsToAdd = @RowsToAdd - @@ROWCOUNT,

    @pass = @pass + 1;

    -- Informational: pass number & rows remaining

    IF @RowsToAdd > 0 SELECT [Pass #] = @pass, [Rows Remaining] = @RowsToAdd;

    END;

    -- Error if we failed

IF @RowsToAdd > 0 RAISERROR('Failed to add the required number of codes - %I64d remain', 16, 1, @RowsToAdd);

    GO

    SELECT [Total Unique Codes] = COUNT_BIG(*) FROM dbo.UniqueCodes;
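The flow of the script above (generate a batch, use a cheap hash as a pre-filter, and only fall back to an exact comparison when the hash matches) can be sketched in Python; the names here are illustrative, not part of the original script:

```python
def add_unique_codes(existing: dict, candidates: list, rows_to_add: int) -> int:
    """Insert up to rows_to_add new codes, using a hash index as a pre-filter.

    existing maps hash_code -> set of codes sharing that hash; the
    nonclustered index on hash_code in the T-SQL script plays this role.
    """
    added = 0
    for code in dict.fromkeys(candidates):  # de-duplicate the batch, keep order
        if added >= rows_to_add:            # TOP (@RowsToAdd)
            break
        h = hash(code)                      # stand-in for CHECKSUM(code)
        bucket = existing.get(h)
        # may_clash = 0: hash unseen, the code definitely does not exist yet
        # may_clash = 1: hash seen, so check the actual code before inserting
        if bucket is not None and code in bucket:
            continue                        # true duplicate, skip it
        existing.setdefault(h, set()).add(code)
        added += 1
    return added

store = {}
n = add_unique_codes(store, ["AAA", "BBB", "AAA", "CCC"], 10)  # "AAA" repeated
```

The point of the flag is the same as in the script: the exact comparison only runs for the (rare) candidates whose hash already exists.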

  • Hi,

Since your table has more than 100 million rows, why don't you try creating triggers and another table? Changes on the primary table, such as deletes or inserts, would update a row in the other table.

The new table can have a single column of type int or bigint holding the total number of rows in the primary table. After that you don't need to scan all 100 million rows.

  • It's not the function that takes time here.

It's the LOOP. See this line of code:

if ((select count(code) from codes where code = @tempcode) = 0)

For each created code, you COUNT the Codes table to see if the code already exists. And THAT will take a LONG TIME.

    Try this instead

    IF NOT EXISTS (SELECT * FROM Codes WHERE Code = @TempCode)

And you will be better off using IGNORE_DUP_KEY on your table and inserting all created tempcodes. If the insert succeeds, @@ROWCOUNT will be 1 and @@ERROR = 0; if not, @@ROWCOUNT = 0 and @@ERROR <> 0.

    Then you can keep track of the successful inserts and increment a variable.

    Rewrite the code and make it work.
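The difference described here (counting every matching row versus stopping at the first hit) can be sketched outside SQL as well; a small Python illustration with names of my choosing. With a proper index SQL Server can often seek in both cases, but EXISTS states the intent and lets the engine stop at the first row:

```python
def code_exists_count(codes: list, target: str) -> bool:
    # COUNT-style check: evaluates every element, even after a match is found
    return sum(1 for c in codes if c == target) > 0

def code_exists_any(codes: list, target: str) -> bool:
    # EXISTS-style check: any() short-circuits at the first match
    return any(c == target for c in codes)
```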


    N 56°04'39.16"
    E 12°55'05.25"

  • Try this

    CREATE TABLE [dbo].[Codes](

    [Code] [char](10) NOT NULL,

    CONSTRAINT [PK_Codes] PRIMARY KEY CLUSTERED

    (

    [Code] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = ON, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]

    ) ON [PRIMARY]

    ALTER PROCEDURE [dbo].[CreateCodes]

    (

    @amount bigint

    )

    AS

    SET NOCOUNT ON

set @amount = abs(@amount)

    while @amount > 0

    begin

insert codes

    (

    code

    )

    SELECT [dbo].[RNGCharacterMask](8, '34679CDFGHJKLNPRTVXY')

    IF @@ERROR = 0 AND @@ROWCOUNT = 1

    SET @Amount = @Amount - 1

    end


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    It's not the function that takes time here.

    It's the LOOP.

    That has already been covered in a previous post.

    It rather misses the point anyway - there is no need for a loop, a set-based solution will be much faster (especially since the expensive random-generating class is instantiated for each call). That has also already been covered by several people.

    Peso (7/10/2009)


    And you will be better off using IGNORE_DUP_KEY on your table and insert all created tempcodes

There are some important caveats to this. (Many people avoid IGNORE_DUP_KEY for these, and other, reasons.)

First, IGNORE_DUP_KEY should be set on a clustered index only. It would be unwise to change the definition to make the primary key non-clustered, and adding a future non-clustered index with IGNORE_DUP_KEY would produce the same effect: the query plan changes dramatically, making it cheaper to check for and eliminate potential duplicates in the query than to rely on the engine to handle them. See this blog entry by Craig Freeman for full details.

    The second caveat is to note that IGNORE_DUP_KEY only works with INSERT statements. Any UPDATE statement that would result in a duplicate key will raise an error and roll back as normal.

    Third, one needs to be aware that if a key does not exist in the table, but the INSERT set contains duplicates for that key value, one row will be inserted, and one will be rejected. Exactly which row is inserted is non-deterministic. One could argue that this breaks the normally atomic behaviour of a transaction where either all rows are inserted or none are.

    Fourth, you should be aware that some client APIs misinterpret SQL Server's 'duplicate key was ignored' warning as an error.

    Paul

Given that the above considerations are neither well documented nor well known, it would make sense to highlight these issues when recommending a solution based on IGNORE_DUP_KEY.

  • Yes, I have read the previous posts.

But why suggest an alternative that returns an error saying "Sorry mate, couldn't create as many codes as you wanted"?

This behaviour wasn't there before, so why introduce it at all?

The OP just wants the stored procedure to create the number of codes he wants with no fuss.

I admit the set-based way to create the codes is WAY more efficient, but doing this you also need to take care of duplicates in the temp table.

One way or another, you're bound to check the number of unique records created.

And the CHECKSUM approach for the hash value? What can I say other than it's a BAD choice.

See here how CHECKSUM is calculated; it will produce very many false positives: CHECKSUM('bQ') = CHECKSUM('aa') and CHECKSUM('ABA') = CHECKSUM('ACQ').

    http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=70832


    N 56°04'39.16"
    E 12°55'05.25"

  • Or this pseudo code..

    INSERT INTO #Temp (Code)

    SELECT ... (record count equals the existing number of codes in Codes table plus the new wanted number).

    insert into Codes

    SELECT TOP (@wantedrecords)

    from #Temp AS t

where not exists (select * from codes as c where c.code = t.code)

It only requires two set-based operations and will always work without errors, as long as the first creation step guarantees unique output.
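This two-step pseudo code (over-generate into a staging area, then take the top @wantedrecords that do not already exist) amounts to a set difference. A minimal Python sketch with illustrative names; as noted, it relies on the generation step producing enough distinct values:

```python
import itertools

def top_up_codes(codes: set, generate, wanted: int) -> set:
    """Mirror the two-step pseudo code: over-generate into a temp set,
    then insert only rows absent from the main Codes set."""
    # The #Temp fill: existing count plus the wanted number of new codes
    temp = {generate() for _ in range(len(codes) + wanted)}
    # NOT EXISTS anti-join, then TOP (@wantedrecords)
    fresh = list(temp - codes)[:wanted]
    codes.update(fresh)
    return codes

# Usage with a deterministic stand-in generator:
counter = itertools.count()
codes = top_up_codes({"C0", "C1"}, lambda: f"C{next(counter)}", 3)
```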


    N 56°04'39.16"
E 12°55'05.25"

  • Peso (7/10/2009)


    But why suggest an alternative that returns an error saying "Sorry mate, couldn't create as many codes as you wanted." This behaviour wasn't there before and why introduce it at all?

    Because it is a script to illustrate technique, not a cut-and-paste solution, as I stated explicitly in that post.

    Peso (7/10/2009)


    I admit the set-based way to create the codes is WAY more efficient but doing this, you also need to take care of the duplicates in the temp table. One way or another, you're bound to check for number of unique records created in one way or another.

    Absolutely. So one might as well do it in a robust and efficient manner.

    Peso (7/10/2009)


    And the CHECKSUM approach for hash value? What can I say more than it's a BAD choice.

The comment in the code states that CHECKSUM is sufficient for the demo script. If you are that concerned about collisions, use HASHBYTES instead. I would argue that CHECKSUM is faster and eliminates enough joins to make it well worth doing. Try running the script (instead of making assumptions): CHECKSUM is a perfectly usable hash function.

    Please also see my comments regarding your choice of IGNORE_DUP_KEY in my previous post.

    Paul

  • Hi Peso ,

I think you are being a little unfair.

Checksumming the newid value is pretty safe due to the size (in terms of bytes) of the uniqueidentifier type. You could reverse-engineer two ids to have the same checksum, but after abs(checksum()) over 1 million newid()s I have 217 collisions. I think generating an 8-character random string with this method is as random as any method can be. Having populated 2 million rows with my test script, I have only 1 duplicate string. I think that it is 'fit for purpose'.
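The 217 collisions observed here are close to what the birthday approximation predicts, assuming abs(checksum(newid())) behaves like a uniform hash over roughly 2^31 values: about n(n-1)/2 divided by 2^31 expected colliding pairs, i.e. around 233 for one million values. A quick check of that arithmetic:

```python
def expected_collisions(n: int, space: int) -> float:
    """Birthday approximation: expected number of colliding pairs
    among n values hashed uniformly into 'space' buckets."""
    return n * (n - 1) / (2 * space)

# abs(checksum(newid())) spans roughly 2**31 distinct values
estimate = expected_collisions(1_000_000, 2**31)  # ~233, vs 217 observed
```

The closeness of the estimate to the observed figure supports the "fit for purpose" claim that the checksum is behaving like a uniform hash at this scale.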



    Clear Sky SQL
My Blog

  • Hey Dave,

Your logic also applies to the hash_code calculated on my UniqueCodes table, since the hash is of the code column, which is VARCHAR(10).

    In my current UniqueCodes table I have 845K rows. Duplicate hash values based on the CHECKSUM exist for 29K hash keys, which is 3%. Seems reasonable.

In a new run, 25K potentially unique codes generate 280 matches on the hash key, with 19 codes turning out to be unique. That compares well with the alternative of joining those 25K rows against the 845K rows in the main table.

    Paul

Viewing 15 posts - 16 through 30 (of 50 total)
