Performance Question

  • The checksum can be considered "Good enough".

    But you will still have to compensate for the duplicates; i.e. delete the duplicates, insert 217 new records (in your example), check those for duplicates, and so on, until exactly 1 million unique records are created.

    See my pseudocode in previous post. That's one example of how to do this fast and set-based.

    The first insert to a temp table uses a cross join, similar to creating a tally numbers table. The number of records created must be at least the number of records already existing in the Codes table, plus the number of new records wanted, to guarantee that @NewRecordsWantedCount is met exactly.

    Then insert TOP (@NewRecordsWantedCount) into the Codes table with a NOT EXISTS check for already used/stored codes, as sketched below.
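
    A minimal sketch of that final step, assuming a hypothetical target table dbo.Codes(Code) and a temp table #Candidates already filled with the surplus candidate codes:

    -- Take only as many candidates as requested, skipping codes already stored.
    -- DISTINCT also removes duplicates within the candidate batch itself.
    DECLARE @NewRecordsWantedCount INT
    SET @NewRecordsWantedCount = 1000000

    INSERT dbo.Codes (Code)
    SELECT DISTINCT TOP (@NewRecordsWantedCount) c.Code
    FROM #Candidates AS c
    WHERE NOT EXISTS (SELECT * FROM dbo.Codes AS x WHERE x.Code = c.Code)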

    Also, purely mathematically, there are a total of 5,079,110,400 combinations (20 × 19 × 18 × ... × 13) when you want 8 characters out of the original 20 and don't allow a duplicate character within a code.

    If you don't care about duplicate characters within a code, there are 20^8 = 25,600,000,000 combinations in total.
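
    Both figures are easy to verify with plain multiplication (cast to BIGINT to avoid integer overflow):

    SELECT CAST(20 AS BIGINT) * 19 * 18 * 17 * 16 * 15 * 14 * 13 AS NoRepeatedChars, -- 5,079,110,400
           CAST(20 AS BIGINT) * 20 * 20 * 20 * 20 * 20 * 20 * 20 AS RepeatsAllowed   -- 25,600,000,000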

    So another way to do this is to pre-generate all possible combinations once and for all, and add a status column to the table denoting "free or taken". This way the "creation" of 1 million records would be as simple as:

    UPDATE TOP (1000000) Codes
    SET Taken = 1
    OUTPUT inserted.Code
    WHERE Taken = 0
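
    For reference, the pre-generated table might look something like this (a sketch only; the names and types are assumptions, not taken from an actual schema):

    CREATE TABLE dbo.Codes
    (
        Code  CHAR(8) NOT NULL PRIMARY KEY CLUSTERED,
        Taken BIT NOT NULL DEFAULT 0  -- 0 = free, 1 = taken
    )

    -- On SQL Server 2008 and later, a filtered index keeps the UPDATE TOP (...) cheap
    CREATE INDEX IX_Codes_Free ON dbo.Codes (Code) WHERE Taken = 0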


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    The checksum can be considered "Good enough".

    So it's not a "BAD choice" any more, then? 😀

    Peso (7/10/2009)


    But you will still have to compensate for the duplicates; i.e. delete the duplicates, insert 217 new records (in your example), check those for duplicates, and so on, until exactly 1 million unique records are created.

    No. I would encourage you to run my posted script and then examine the execution plan to see how it works. I don't know where you get 217 from (there were in fact only 19 duplicates - from 280 matches on the hash - from 25K candidate codes).

    Peso (7/10/2009)


    See my pseudocode in previous post. That's one example of how to do this fast and set-based.

    The same code exists in my script! With the hash_code optimization added.

    Peso (7/10/2009)


    The first insert to a temp table uses a cross join, similar to creating a tally numbers table. The number of records created must be at least the number of records already existing in the Codes table, plus the number of new records wanted, to guarantee that @NewRecordsWantedCount is met exactly.

    Recall that the code-generation CLR function is expensive. Generating enough codes up front for a 1M-row table is inefficient given the likelihood of collisions. My script uses an iterative approach with 'knobs and levers' to tune the query to the application; the overall shape is sketched below. Note that more codes than requested are generated for exactly this reason. Once again, please run it.
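
    Roughly, the iterative shape is this (a sketch only, not the posted script; dbo.GenerateCodes stands in for the real CLR generator, and @MaxGen is a stand-in for the tuning knob):

    DECLARE @Wanted INT,   -- codes still needed
            @MaxGen INT    -- candidates generated per pass
    SET @Wanted = 1000000
    SET @MaxGen = 50000

    WHILE @Wanted > 0
    BEGIN
        -- Generate a batch of candidates, keep only the unused ones
        INSERT dbo.Codes (Code)
        SELECT DISTINCT TOP (@Wanted) g.Code
        FROM dbo.GenerateCodes(@MaxGen) AS g
        WHERE NOT EXISTS (SELECT * FROM dbo.Codes AS c WHERE c.Code = g.Code)

        SET @Wanted = @Wanted - @@ROWCOUNT
    END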

    Paul

  • In fact I'll do it for you. Actual execution plan attached.

    Note that the @MaxGen tuning parameter is deliberately set low - at 100 - to allow the script to demonstrate multiple passes. This would not be desirable in practice; this is a demo.

    Even so, also note that on subsequent passes, if a small number of target codes is required, SQL Server will choose a Hash Match Flow Distinct - so a minimal number of extra codes are in fact generated.

    Anyway, on to the plan:

    On this run, our aim is to insert 25K new unique codes into the table containing 125K unique codes already.

    The code generation part of the script generates 25,100 rows.

    Two of these rows match on the hash_code, so 25,002 rows flow through the first TOP.

    Both rows turn out to be real matches to existing codes.

    (Note that only 2 loop join singleton lookups are done to determine the matches.)

    The remaining 25,000 unique codes flow on to be inserted.

    So, to insert 25K new unique codes, we generate 25,100 codes, use 25,002 of them, and perform only 2 code comparisons; the rest of the work is carried out using the hash codes. The shape of that check is sketched below.
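
    The prefilter pattern being described looks roughly like this (a sketch, assuming the Codes table carries a persisted hash_code column and #NewCodes holds the generated candidates with their CHECKSUM values; all names are illustrative):

    DECLARE @WantedCodes INT
    SET @WantedCodes = 25000

    -- Compare cheap integer hashes first; the expensive exact code comparison
    -- only runs for candidates whose hash collides with a stored one.
    INSERT dbo.Codes (Code, hash_code)
    SELECT TOP (@WantedCodes) n.Code, n.hash_code
    FROM #NewCodes AS n
    WHERE NOT EXISTS
    (
        SELECT *
        FROM dbo.Codes AS c
        WHERE c.hash_code = n.hash_code -- seekable hash match first
          AND c.Code = n.Code           -- exact check only on hash collisions
    )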

    Paul

  • More math on CHECKSUM.

    When you calculate the CHECKSUM of an 8-character string, you utilize the full INT (32 bits) without overflowing the cyclic XOR.

    It means that any given 8-character string has at least 16,777,215 other strings giving the same checksum, if the full ASCII range 0-255 is used.


    N 56°04'39.16"
    E 12°55'05.25"

  • Heh. Ok, so that was the version based on HashBytes using SHA1. My mistake.

    The attached plan is a run using CHECKSUM.

    This time 175K codes existed already; we're still adding 25K new ones.

    This run happened to require a second execution - note that the second pass uses the Flow Distinct, so to generate the missing 27 rows, only 28 codes are generated.

    With actual execution plan turned off, and a warm data cache, my very ordinary single-core 2GHz Pentium 4 Mobile laptop completes the whole process of inserting 25K new unique codes into a table with 250K existing records in less than one second.

    Paul

  • Peso (7/10/2009)


    When you calculate the CHECKSUM of an 8-character string, you utilize the full INT (32 bits) without overflowing the cyclic XOR. It means that any given 8-character string has at least 16,777,215 other strings giving the same checksum, if the full ASCII range 0-255 is used.

    And how many other strings don't have the same checksum? :laugh:

    Fact is, it works. Forget the theory (interesting as it is) - check the plans.

    BTW the codes are VARCHAR(10) and the length varies, so I'm not sure why the 8-character thing is significant.

    Hi Peso, I pretty much agree with your pseudo-code, but your statement of...

    as long as the first creation guarantees unique output.

    Gets things a little circular 🙂

    Maybe when I get a chance I'll play around with a full solution, i.e. inserting the values into a table as a primary key, but I suspect that iterating until total rows inserted = number required would be best. I admit that I haven't even considered IGNORE_DUP_KEY; it would be interesting to experiment with, though - see the sketch below.
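
    For the curious, IGNORE_DUP_KEY might look something like this (a minimal sketch with hypothetical names; duplicate keys are silently discarded with a warning instead of failing the whole insert):

    CREATE TABLE dbo.Codes
    (
        Code CHAR(8) NOT NULL,
        CONSTRAINT PK_Codes PRIMARY KEY CLUSTERED (Code)
            WITH (IGNORE_DUP_KEY = ON)
    )

    -- A batch containing duplicates inserts only the new rows;
    -- @@ROWCOUNT reveals how many actually went in.
    INSERT dbo.Codes (Code)
    SELECT Code FROM #Candidates

    SELECT @@ROWCOUNT AS RowsActuallyInserted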

    I'm keen to hear back from the OP on whether we have even helped to solve his issues.



    Clear Sky SQL
    My Blog

    Something similar to this, if "random" also means unique.

    The drawback is that the codes generated per "batch" are similar, but still unique within the batch.

    -- Target table of unique codes
    DECLARE @Codes TABLE
    (
        Code CHAR(8) PRIMARY KEY CLUSTERED
    )

    -- Each ID is decomposed into eight base-19 digits, one per code position
    DECLARE @Sample TABLE
    (
        ID INT PRIMARY KEY CLUSTERED,
        Char1 AS ((ID / POWER(CAST(19 AS BIGINT), 7)) % 19) PERSISTED,
        Char2 AS ((ID / POWER(CAST(19 AS BIGINT), 6)) % 19) PERSISTED,
        Char3 AS ((ID / POWER(CAST(19 AS BIGINT), 5)) % 19) PERSISTED,
        Char4 AS ((ID / POWER(CAST(19 AS BIGINT), 4)) % 19) PERSISTED,
        Char5 AS ((ID / POWER(CAST(19 AS BIGINT), 3)) % 19) PERSISTED,
        Char6 AS ((ID / POWER(CAST(19 AS BIGINT), 2)) % 19) PERSISTED,
        Char7 AS ((ID / POWER(CAST(19 AS BIGINT), 1)) % 19) PERSISTED,
        Char8 AS ((ID / POWER(CAST(19 AS BIGINT), 0)) % 19) PERSISTED,
        Code CHAR(8) NOT NULL
    )

    DECLARE @WantedCodes INT,
            @x INT

    SET @WantedCodes = 333

    -- Number of codes already stored
    SELECT @x = COUNT(*)
    FROM @Codes

    -- Create enough sample rows to cover existing codes plus the wanted ones
    INSERT @Sample (ID, Code)
    SELECT Number, ''
    FROM dbo.TallyNumbers
    WHERE Number < @x + @WantedCodes

    -- Randomly ordered alphabet; the IDENTITY assigns each character a digit 0-18
    DECLARE @Token TABLE
    (
        ID TINYINT IDENTITY(0, 1) PRIMARY KEY CLUSTERED,
        CHR CHAR(1) NOT NULL
    )

    DECLARE @Chars CHAR(19)
    SET @Chars = '4679CDFGHJKLNPRTVXY'

    INSERT @Token (CHR)
    SELECT SUBSTRING(@Chars, Number + 1, 1) AS CHR
    FROM dbo.TallyNumbers
    WHERE Number < DATALENGTH(@Chars)
    ORDER BY NEWID()

    -- Translate the eight digits of each ID into characters
    UPDATE s
    SET s.Code = t1.CHR + t2.CHR + t3.CHR + t4.CHR + t5.CHR + t6.CHR + t7.CHR + t8.CHR
    FROM @Sample AS s
    INNER JOIN @Token AS t1 ON t1.ID = s.Char1
    INNER JOIN @Token AS t2 ON t2.ID = s.Char2
    INNER JOIN @Token AS t3 ON t3.ID = s.Char3
    INNER JOIN @Token AS t4 ON t4.ID = s.Char4
    INNER JOIN @Token AS t5 ON t5.ID = s.Char5
    INNER JOIN @Token AS t6 ON t6.ID = s.Char6
    INNER JOIN @Token AS t7 ON t7.ID = s.Char7
    INNER JOIN @Token AS t8 ON t8.ID = s.Char8

    -- Keep only codes not already stored
    INSERT @Codes (Code)
    SELECT TOP (@WantedCodes) s.Code
    FROM @Sample AS s
    WHERE NOT EXISTS (SELECT * FROM @Codes AS c WHERE c.Code = s.Code)

    SELECT *
    FROM @Codes


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    Something similar to this, if "random" also means unique.

    The drawback is that the codes generated per "batch" are similar, but still unique within the batch.

    Adding the missing definition for dbo.TallyNumbers, and populating with 1M rows:

    CREATE TABLE dbo.TallyNumbers (Number INT NOT NULL PRIMARY KEY)

    INSERT dbo.TallyNumbers (Number)
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master.sys.all_columns C1, master.sys.all_columns C2, master.sys.all_columns C3

    Running the script produces:

    [font="Courier New"]Msg 2627, Level 14, State 1, Line 69

    Violation of PRIMARY KEY constraint 'PK__#164452B__A25C5AA6182C9B23'. Cannot insert duplicate key in object 'dbo.@Codes'.

    The statement has been terminated.

    [/font]

    :blink:

  • Your TallyNumbers table must begin with 0 (zero).
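
    (A zero-based tally can be populated the same way, just shifted by one - a minimal variant of the script quoted above:)

    INSERT dbo.TallyNumbers (Number)
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1  -- start at 0, not 1
    FROM master.sys.all_columns C1, master.sys.all_columns C2, master.sys.all_columns C3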


    N 56°04'39.16"
    E 12°55'05.25"

  • Peso (7/10/2009)


    Your TallyNumbers table must begin with 0 (zero).

    Sheesh.

    *exits*

    If CHECKSUM is deemed an acceptable function for hashing out values (1 chance in 16,777,215), using this code gives one chance in 3,047,466,240 of getting a duplicate value!

    DECLARE @Codes TABLE
    (
        Code CHAR(8) PRIMARY KEY CLUSTERED
    )

    DECLARE @Sample TABLE
    (
        ID INT PRIMARY KEY CLUSTERED,
        Char1 AS ((ID / POWER(CAST(19 AS BIGINT), 7)) % 19) PERSISTED,
        Char2 AS ((ID / POWER(CAST(19 AS BIGINT), 6)) % 19) PERSISTED,
        Char3 AS ((ID / POWER(CAST(19 AS BIGINT), 5)) % 19) PERSISTED,
        Char4 AS ((ID / POWER(CAST(19 AS BIGINT), 4)) % 19) PERSISTED,
        Char5 AS ((ID / POWER(CAST(19 AS BIGINT), 3)) % 19) PERSISTED,
        Char6 AS ((ID / POWER(CAST(19 AS BIGINT), 2)) % 19) PERSISTED,
        Char7 AS ((ID / POWER(CAST(19 AS BIGINT), 1)) % 19) PERSISTED,
        Char8 AS ((ID / POWER(CAST(19 AS BIGINT), 0)) % 19) PERSISTED
    )

    DECLARE @WantedCodes INT
    SET @WantedCodes = 333

    -- No Code column this time; the code string is assembled directly at insert
    INSERT @Sample (ID)
    SELECT Number
    FROM dbo.TallyNumbers
    WHERE Number < @WantedCodes

    DECLARE @Token TABLE
    (
        ID TINYINT IDENTITY(0, 1) PRIMARY KEY CLUSTERED,
        CHR CHAR(1) NOT NULL
    )

    DECLARE @Chars CHAR(19)
    SET @Chars = '4679CDFGHJKLNPRTVXY'

    INSERT @Token (CHR)
    SELECT SUBSTRING(@Chars, Number + 1, 1) AS CHR
    FROM dbo.TallyNumbers
    WHERE Number < DATALENGTH(@Chars)
    ORDER BY NEWID()

    INSERT @Codes (Code)
    SELECT t1.CHR + t2.CHR + t3.CHR + t4.CHR + t5.CHR + t6.CHR + t7.CHR + t8.CHR
    FROM @Sample AS s
    INNER JOIN @Token AS t1 ON t1.ID = s.Char1
    INNER JOIN @Token AS t2 ON t2.ID = s.Char2
    INNER JOIN @Token AS t3 ON t3.ID = s.Char3
    INNER JOIN @Token AS t4 ON t4.ID = s.Char4
    INNER JOIN @Token AS t5 ON t5.ID = s.Char5
    INNER JOIN @Token AS t6 ON t6.ID = s.Char6
    INNER JOIN @Token AS t7 ON t7.ID = s.Char7
    INNER JOIN @Token AS t8 ON t8.ID = s.Char8


    N 56°04'39.16"
    E 12°55'05.25"

    My problem is that my base table has 130 million codes, and I have to check that the 100 million created codes are unique against my 130 million codes and against each other.

    Therefore it all runs very slowly.

  • scziege (7/20/2009)


    My problem is that my base table has 130 million codes, and I have to check that the 100 million created codes are unique against my 130 million codes and against each other.

    Therefore it all runs very slowly.

    Welcome back, so which of the multiple hints and suggestions given to you have you tried?



    Clear Sky SQL
    My Blog

    Every hint. The problem is that I have to check whether the key exists in a table of 130 million rows, and after that I have to create 100 million unique keys.
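
    (At that scale a single set-based pass is usually the fastest option; purely as an illustration - all table names here are hypothetical - EXCEPT removes both the in-batch duplicates and the codes already stored, in one statement:)

    INSERT dbo.BaseCodes (Code)
    SELECT Code FROM dbo.StagedCodes  -- the 100 million candidates
    EXCEPT
    SELECT Code FROM dbo.BaseCodes    -- the 130 million existing codes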

    Regards
