Performance Question

  • The checksum can be considered "Good enough".

    But you will still have to compensate for the duplicates; i.e. delete the duplicates, insert 217 new records (in your example), check those for duplicates, and so on, until exactly 1 million unique records are created.

    See my pseudocode in previous post. That's one example of how to do this fast and set-based.

    The first insert to a temp table uses a cross join, similar to creating a tally numbers table. The number of records created must be at least the number of records already existing in the Codes table, plus the number of new records wanted, to guarantee that @NewRecordsWantedCount is met exactly.

    Then insert TOP (@NewRecordsWantedCount) into the Codes table with a NOT EXISTS check for already used/stored codes, as sketched below.
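
    A minimal sketch of that final step, assuming a hypothetical target table dbo.Codes(Code) and a temp table #Candidates already filled with the surplus candidate codes:

    -- Take only as many candidates as requested, skipping codes already stored.
    -- DISTINCT also removes duplicates within the candidate batch itself.
    DECLARE @NewRecordsWantedCount INT
    SET @NewRecordsWantedCount = 1000000

    INSERT dbo.Codes (Code)
    SELECT DISTINCT TOP (@NewRecordsWantedCount) c.Code
    FROM #Candidates AS c
    WHERE NOT EXISTS (SELECT * FROM dbo.Codes AS x WHERE x.Code = c.Code)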

    Also, purely mathematically, there are a total of 5,079,110,400 combinations (20 × 19 × 18 × ... × 13) when you want 8 characters out of the original 20 and don't allow a duplicate character within a code.

    If you don't care about duplicate characters within a code, there are 20^8 = 25,600,000,000 combinations in total.
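
    Both figures are easy to verify with plain multiplication (cast to BIGINT to avoid integer overflow):

    SELECT CAST(20 AS BIGINT) * 19 * 18 * 17 * 16 * 15 * 14 * 13 AS NoRepeatedChars, -- 5,079,110,400
           CAST(20 AS BIGINT) * 20 * 20 * 20 * 20 * 20 * 20 * 20 AS RepeatsAllowed   -- 25,600,000,000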

    So another way to do this is to pre-generate all possible combinations once and for all, and add a status column to the table denoting "free or taken". This way the "creation" of 1 million records would be as simple as:

    UPDATE TOP (1000000) Codes
    SET Taken = 1
    OUTPUT inserted.Code
    WHERE Taken = 0
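
    For reference, the pre-generated table might look something like this (a sketch only; the names and types are assumptions, not taken from an actual schema):

    CREATE TABLE dbo.Codes
    (
        Code  CHAR(8) NOT NULL PRIMARY KEY CLUSTERED,
        Taken BIT NOT NULL DEFAULT 0  -- 0 = free, 1 = taken
    )

    -- On SQL Server 2008 and later, a filtered index keeps the UPDATE TOP (...) cheap
    CREATE INDEX IX_Codes_Free ON dbo.Codes (Code) WHERE Taken = 0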


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    The checksum can be considered "Good enough".

    So it's not a "BAD choice" any more, then? 😀

    Peso (7/10/2009)


    But you will still have to compensate for the duplicates; i.e. delete the duplicates, insert 217 new records (in your example), check those for duplicates, and so on, until exactly 1 million unique records are created.

    No. I would encourage you to run my posted script and then examine the execution plan to see how it works. I don't know where you get 217 from (there were in fact only 19 duplicates - from 280 matches on the hash - from 25K candidate codes).

    Peso (7/10/2009)


    See my pseudocode in previous post. That's one example of how to do this fast and set-based.

    The same code exists in my script! With the hash_code optimization added.

    Peso (7/10/2009)


    The first insert to a temp table uses a cross join, similar to creating a tally numbers table. The number of records created must be at least the number of records already existing in the Codes table, plus the number of new records wanted, to guarantee that @NewRecordsWantedCount is met exactly.

    Recall that the code-generation CLR function is expensive. Generating enough codes up front for a 1M-row table is inefficient given the likelihood of collisions. My script uses an iterative approach with 'knobs and levers' to tune the query to the application; the overall shape is sketched below. Note that more codes than requested are generated for exactly this reason. Once again, please run it.
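
    Roughly, the iterative shape is this (a sketch only, not the posted script; dbo.GenerateCodes stands in for the real CLR generator, and @MaxGen is a stand-in for the tuning knob):

    DECLARE @Wanted INT,   -- codes still needed
            @MaxGen INT    -- candidates generated per pass
    SET @Wanted = 1000000
    SET @MaxGen = 50000

    WHILE @Wanted > 0
    BEGIN
        -- Generate a batch of candidates, keep only the unused ones
        INSERT dbo.Codes (Code)
        SELECT DISTINCT TOP (@Wanted) g.Code
        FROM dbo.GenerateCodes(@MaxGen) AS g
        WHERE NOT EXISTS (SELECT * FROM dbo.Codes AS c WHERE c.Code = g.Code)

        SET @Wanted = @Wanted - @@ROWCOUNT
    END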

    Paul

  • In fact I'll do it for you. Actual execution plan attached.

    Note that the @MaxGen tuning parameter is deliberately set low - at 100 - to allow the script to demonstrate multiple passes. This would not be desirable in practice; this is a demo.

    Even so, also note that on subsequent passes, if a small number of target codes is required, SQL Server will choose a Hash Match Flow Distinct - so a minimal number of extra codes are in fact generated.

    Anyway, on to the plan:

    On this run, our aim is to insert 25K new unique codes into the table containing 125K unique codes already.

    The code generation part of the script generates 25,100 rows.

    Two of these rows match on the hash_code, so 25,002 rows flow through the first TOP.

    Both rows turn out to be real matches to existing codes.

    (Note that only 2 loop join singleton lookups are done to determine the matches.)

    The remaining 25,000 unique codes flow on to be inserted.

    So, to insert 25K new unique codes, we generate 25,100 codes, use 25,002 of them, and perform only 2 code comparisons; the rest of the work is carried out using the hash codes. The shape of that check is sketched below.
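
    The prefilter pattern being described looks roughly like this (a sketch, assuming the Codes table carries a persisted hash_code column and #NewCodes holds the generated candidates with their CHECKSUM values; all names are illustrative):

    DECLARE @WantedCodes INT
    SET @WantedCodes = 25000

    -- Compare cheap integer hashes first; the expensive exact code comparison
    -- only runs for candidates whose hash collides with a stored one.
    INSERT dbo.Codes (Code, hash_code)
    SELECT TOP (@WantedCodes) n.Code, n.hash_code
    FROM #NewCodes AS n
    WHERE NOT EXISTS
    (
        SELECT *
        FROM dbo.Codes AS c
        WHERE c.hash_code = n.hash_code -- seekable hash match first
          AND c.Code = n.Code           -- exact check only on hash collisions
    )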

    Paul

  • More math on CHECKSUM.

    When you calculate the CHECKSUM of an 8-character string, you utilize the full INT (32 bits) without overflowing the cyclic XOR.

    It means that any given 8-character string has at least 16,777,215 other strings giving the same checksum, if the full ASCII range 0-255 is used.


    N 56°04'39.16"
    E 12°55'05.25"

  • Heh. Ok, so that was the version based on HashBytes using SHA1. My mistake.

    The attached plan is a run using CHECKSUM.

    This time 175K codes existed already; we're still adding 25K new ones.

    This run happened to require a second execution - note that the second pass uses the Flow Distinct, so to generate the missing 27 rows, only 28 codes are generated.

    With actual execution plan turned off, and a warm data cache, my very ordinary single-core 2GHz Pentium 4 Mobile laptop completes the whole process of inserting 25K new unique codes into a table with 250K existing records in less than one second.

    Paul

  • Peso (7/10/2009)


    When you calculate the CHECKSUM of an 8-character string, you utilize the full INT (32 bits) without overflowing the cyclic XOR. It means that any given 8-character string has at least 16,777,215 other strings giving the same checksum, if the full ASCII range 0-255 is used.

    And how many other strings don't have the same checksum? :laugh:

    Fact is, it works. Forget the theory (interesting as it is) - check the plans.

    BTW the codes are VARCHAR(10) and the length varies, so I'm not sure why the 8-character thing is significant.

    Hi Peso, I pretty much agree with your pseudo-code, but your statement of...

    as long as the first creation guarantees unique output.

    Gets things a little circular 🙂

    Maybe when I get a chance I'll play around with a full solution, i.e. inserting the values into a table as a primary key, but I suspect that iterating until total rows inserted = number required would be best. I admit that I haven't even considered IGNORE_DUP_KEY; it would be interesting to experiment with, though - see the sketch below.
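
    For the curious, IGNORE_DUP_KEY might look something like this (a minimal sketch with hypothetical names; duplicate keys are silently discarded with a warning instead of failing the whole insert):

    CREATE TABLE dbo.Codes
    (
        Code CHAR(8) NOT NULL,
        CONSTRAINT PK_Codes PRIMARY KEY CLUSTERED (Code)
            WITH (IGNORE_DUP_KEY = ON)
    )

    -- A batch containing duplicates inserts only the new rows;
    -- @@ROWCOUNT reveals how many actually went in.
    INSERT dbo.Codes (Code)
    SELECT Code FROM #Candidates

    SELECT @@ROWCOUNT AS RowsActuallyInserted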

    I'm keen to hear back from the OP on whether we have even helped to solve his issues.



    Clear Sky SQL
    My Blog

    Something similar to this, if "random" also means unique.

    The drawback is that the codes generated per "batch" are similar, but still unique within the batch.

    -- Target table of unique codes
    DECLARE @Codes TABLE
    (
        Code CHAR(8) PRIMARY KEY CLUSTERED
    )

    -- Each ID is decomposed into eight base-19 digits, one per code position
    DECLARE @Sample TABLE
    (
        ID INT PRIMARY KEY CLUSTERED,
        Char1 AS ((ID / POWER(CAST(19 AS BIGINT), 7)) % 19) PERSISTED,
        Char2 AS ((ID / POWER(CAST(19 AS BIGINT), 6)) % 19) PERSISTED,
        Char3 AS ((ID / POWER(CAST(19 AS BIGINT), 5)) % 19) PERSISTED,
        Char4 AS ((ID / POWER(CAST(19 AS BIGINT), 4)) % 19) PERSISTED,
        Char5 AS ((ID / POWER(CAST(19 AS BIGINT), 3)) % 19) PERSISTED,
        Char6 AS ((ID / POWER(CAST(19 AS BIGINT), 2)) % 19) PERSISTED,
        Char7 AS ((ID / POWER(CAST(19 AS BIGINT), 1)) % 19) PERSISTED,
        Char8 AS ((ID / POWER(CAST(19 AS BIGINT), 0)) % 19) PERSISTED,
        Code CHAR(8) NOT NULL
    )

    DECLARE @WantedCodes INT,
            @x INT

    SET @WantedCodes = 333

    -- Number of codes already stored
    SELECT @x = COUNT(*)
    FROM @Codes

    -- Create enough sample rows to cover existing codes plus the wanted ones
    INSERT @Sample (ID, Code)
    SELECT Number, ''
    FROM dbo.TallyNumbers
    WHERE Number < @x + @WantedCodes

    -- Randomly ordered alphabet; the IDENTITY assigns each character a digit 0-18
    DECLARE @Token TABLE
    (
        ID TINYINT IDENTITY(0, 1) PRIMARY KEY CLUSTERED,
        CHR CHAR(1) NOT NULL
    )

    DECLARE @Chars CHAR(19)
    SET @Chars = '4679CDFGHJKLNPRTVXY'

    INSERT @Token (CHR)
    SELECT SUBSTRING(@Chars, Number + 1, 1) AS CHR
    FROM dbo.TallyNumbers
    WHERE Number < DATALENGTH(@Chars)
    ORDER BY NEWID()

    -- Translate the eight digits of each ID into characters
    UPDATE s
    SET s.Code = t1.CHR + t2.CHR + t3.CHR + t4.CHR + t5.CHR + t6.CHR + t7.CHR + t8.CHR
    FROM @Sample AS s
    INNER JOIN @Token AS t1 ON t1.ID = s.Char1
    INNER JOIN @Token AS t2 ON t2.ID = s.Char2
    INNER JOIN @Token AS t3 ON t3.ID = s.Char3
    INNER JOIN @Token AS t4 ON t4.ID = s.Char4
    INNER JOIN @Token AS t5 ON t5.ID = s.Char5
    INNER JOIN @Token AS t6 ON t6.ID = s.Char6
    INNER JOIN @Token AS t7 ON t7.ID = s.Char7
    INNER JOIN @Token AS t8 ON t8.ID = s.Char8

    -- Keep only codes not already stored
    INSERT @Codes (Code)
    SELECT TOP (@WantedCodes) s.Code
    FROM @Sample AS s
    WHERE NOT EXISTS (SELECT * FROM @Codes AS c WHERE c.Code = s.Code)

    SELECT *
    FROM @Codes


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    Something similar to this, if "random" also means unique.

    The drawback is that the codes generated per "batch" are similar, but still unique within the batch.

    Adding the missing definition for dbo.TallyNumbers, and populating with 1M rows:

    CREATE TABLE dbo.TallyNumbers (Number INT NOT NULL PRIMARY KEY)

    INSERT dbo.TallyNumbers (Number)
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master.sys.all_columns C1, master.sys.all_columns C2, master.sys.all_columns C3

    Running the script produces:

    [font="Courier New"]Msg 2627, Level 14, State 1, Line 69

    Violation of PRIMARY KEY constraint 'PK__#164452B__A25C5AA6182C9B23'. Cannot insert duplicate key in object 'dbo.@Codes'.

    The statement has been terminated.

    [/font]

    :blink:

  • Your TallyNumbers table must begin with 0 (zero).
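
    (A zero-based tally can be populated the same way, just shifted by one - a minimal variant of the script quoted above:)

    INSERT dbo.TallyNumbers (Number)
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1  -- start at 0, not 1
    FROM master.sys.all_columns C1, master.sys.all_columns C2, master.sys.all_columns C3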


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    Your TallyNumbers table must begin with 0 (zero).

    Sheesh.

    *exits*

    If CHECKSUM is deemed an acceptable function for hashing out values (1 chance in 16,777,215), using this code gives one chance in 3,047,466,240 of getting a duplicate value!

    DECLARE @Codes TABLE
    (
        Code CHAR(8) PRIMARY KEY CLUSTERED
    )

    DECLARE @Sample TABLE
    (
        ID INT PRIMARY KEY CLUSTERED,
        Char1 AS ((ID / POWER(CAST(19 AS BIGINT), 7)) % 19) PERSISTED,
        Char2 AS ((ID / POWER(CAST(19 AS BIGINT), 6)) % 19) PERSISTED,
        Char3 AS ((ID / POWER(CAST(19 AS BIGINT), 5)) % 19) PERSISTED,
        Char4 AS ((ID / POWER(CAST(19 AS BIGINT), 4)) % 19) PERSISTED,
        Char5 AS ((ID / POWER(CAST(19 AS BIGINT), 3)) % 19) PERSISTED,
        Char6 AS ((ID / POWER(CAST(19 AS BIGINT), 2)) % 19) PERSISTED,
        Char7 AS ((ID / POWER(CAST(19 AS BIGINT), 1)) % 19) PERSISTED,
        Char8 AS ((ID / POWER(CAST(19 AS BIGINT), 0)) % 19) PERSISTED
    )

    DECLARE @WantedCodes INT
    SET @WantedCodes = 333

    -- No Code column this time; the code string is assembled directly at insert
    INSERT @Sample (ID)
    SELECT Number
    FROM dbo.TallyNumbers
    WHERE Number < @WantedCodes

    DECLARE @Token TABLE
    (
        ID TINYINT IDENTITY(0, 1) PRIMARY KEY CLUSTERED,
        CHR CHAR(1) NOT NULL
    )

    DECLARE @Chars CHAR(19)
    SET @Chars = '4679CDFGHJKLNPRTVXY'

    INSERT @Token (CHR)
    SELECT SUBSTRING(@Chars, Number + 1, 1) AS CHR
    FROM dbo.TallyNumbers
    WHERE Number < DATALENGTH(@Chars)
    ORDER BY NEWID()

    INSERT @Codes (Code)
    SELECT t1.CHR + t2.CHR + t3.CHR + t4.CHR + t5.CHR + t6.CHR + t7.CHR + t8.CHR
    FROM @Sample AS s
    INNER JOIN @Token AS t1 ON t1.ID = s.Char1
    INNER JOIN @Token AS t2 ON t2.ID = s.Char2
    INNER JOIN @Token AS t3 ON t3.ID = s.Char3
    INNER JOIN @Token AS t4 ON t4.ID = s.Char4
    INNER JOIN @Token AS t5 ON t5.ID = s.Char5
    INNER JOIN @Token AS t6 ON t6.ID = s.Char6
    INNER JOIN @Token AS t7 ON t7.ID = s.Char7
    INNER JOIN @Token AS t8 ON t8.ID = s.Char8


    N 56°04'39.16"
    E 12°55'05.25"

    My problem is that my base table has 130 million codes, and I have to check that the 100 million created codes are unique against my 130 million codes and against each other.

    Therefore it all runs very slowly.

  • scziege (7/20/2009)


    My problem is that my base table has 130 million codes, and I have to check that the 100 million created codes are unique against my 130 million codes and against each other.

    Therefore it all runs very slowly.

    Welcome back, so which of the multiple hints and suggestions given to you have you tried?



    Clear Sky SQL
    My Blog

    Every hint. The problem is that I have to check whether the key exists in a table of 130 million rows, and after that I have to create 100 million unique keys.
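
    (At that scale a single set-based pass is usually the fastest option; purely as an illustration - all table names here are hypothetical - EXCEPT removes both the in-batch duplicates and the codes already stored, in one statement:)

    INSERT dbo.BaseCodes (Code)
    SELECT Code FROM dbo.StagedCodes  -- the 100 million candidates
    EXCEPT
    SELECT Code FROM dbo.BaseCodes    -- the 130 million existing codes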

    Regards
