mister.magoo (10/6/2015)
I don't know whether your setup would suit this, but how about converting to XML, then using HASHBYTES?
select MBT.refID,
       hashbytes('MD5', (select MBT.* from (values(null)) foo(bar) for xml auto)) as [Hash]
from MyBaseTable as MBT;
You need the "from (values(null)) foo(bar)" clause to use xml auto; it serves no other purpose.
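A minimal, self-contained sketch of the pattern as posted, run against a hypothetical temp table (#Demo); the table name, columns, and sample values here are invented purely for illustration:

if object_id('tempdb..#Demo') is not null drop table #Demo;

create table #Demo
(
    refID int          not null primary key,
    col1  varchar(50)  null,
    col2  int          null,
    col3  datetime     null
);

insert into #Demo (refID, col1, col2, col3)
values (1, 'abc', 42, '20151006'),
       (2, null,  42, '20151006'),
       (3, 'abc', null, null);

-- The dummy "from (values(null)) foo(bar)" gives FOR XML AUTO a table source;
-- the correlated MBT.* reference is what actually gets serialized and hashed.
select MBT.refID,
       hashbytes('MD5',
                 (select MBT.* from (values(null)) foo(bar) for xml auto)) as [Hash]
from #Demo as MBT;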
Bravo, Mr. Magoo!
Leveraging FOR XML to serialize all the columns in each row is brilliant. When I tested this against a large table, the execution plan showed that 90% of the cost was the table scan, while the UDX operator (the part that serializes the XML) was only 2%. So it appears we can compute an MD5 or SHA hash dynamically across all columns in a row.
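As a hedged aside on the "MD5 or SHA" point: HASHBYTES also accepts 'SHA1', 'SHA2_256', and 'SHA2_512', and the same serialization trick can drive a row comparison between two tables. The sketch below assumes two hypothetical tables, SourceTable and TargetTable, with identical column lists and a refID key; reusing the same alias (MBT) on both sides keeps the XML element name, and therefore the hash input, comparable.

with src as
(
    select MBT.refID,
           hashbytes('SHA2_256',
                     (select MBT.* from (values(null)) foo(bar) for xml auto)) as RowHash
    from SourceTable as MBT
),
tgt as
(
    select MBT.refID,
           hashbytes('SHA2_256',
                     (select MBT.* from (values(null)) foo(bar) for xml auto)) as RowHash
    from TargetTable as MBT  -- same alias as above so the serialized XML is comparable
)
select src.refID
from src
inner join tgt
        on tgt.refID = src.refID
where tgt.RowHash <> src.RowHash;  -- rows whose column values differ between the tables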
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
Eric M Russell (10/7/2015)
Bravo, Mr. Magoo!
Yes, I was surprised how well it worked, and just think of all the things we couldn't do without XML support in SQL.
MM
select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);
mister.magoo (10/7/2015)
Yes, I was surprised how well it worked, and just think of all the things we couldn't do without XML support in SQL.
I just tested this against a 15 million row table having 115 columns and found no hash collisions, so it appears solid.
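For anyone wanting to repeat that kind of check, here is a hedged sketch of one way to look for collisions, reusing MyBaseTable and refID from the earlier example: if refID is unique, every row's serialized XML is distinct, so any shortfall of distinct hashes against total rows would indicate a collision.

with hashed as
(
    select MBT.refID,
           hashbytes('MD5',
                     (select MBT.* from (values(null)) foo(bar) for xml auto)) as RowHash
    from MyBaseTable as MBT
)
select count(*)                as TotalRows,      -- all rows hashed
       count(distinct RowHash) as DistinctHashes  -- should equal TotalRows when refID is unique
from hashed;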
"Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
Eric M Russell (10/7/2015)
I just tested this against a 15 million row table having 115 columns and found no hash collisions, so it appears solid.
Good to know, thanks for testing that.
mister.magoo (10/6/2015)
I don't know whether your setup would suit this, but how about converting to XML, then using HASHBYTES?
Brilliant! This is exactly why I posted here. I was hoping someone on the forum would come up with a new approach.
I would never have thought of converting to XML as part of the method.
Thank you! I have marked your post as the solution.
Kim Crosser (10/7/2015)
Brilliant! This is exactly why I posted here. I was hoping someone on the forum would come up with a new approach. I would never have thought of converting to XML as part of the method.
Thank you! I have marked your post as the solution.
Most welcome, I enjoyed the challenge.
For anyone interested, I ran several tests on one of the actual data tables involved: 1,526,652 rows, 49 columns, and an average of approximately 470 bytes of data per row. (Definitely NOT my design - this is a legacy third-party database from which I am converting data.)
The hard-coded expression (explicit conversion, ISNULL, and concatenation) ran in ~44 seconds, while Mister Magoo's generic solution completed in ~66 seconds (I did several runs, all within 1-2 seconds of these numbers).
Given the amount of additional text and overhead generated by the conversion to XML (all the column name tags), that is very good performance.
I am really impressed with this approach for a few reasons:
1. It is generic and can be applied to ANY table without having to generate hard-coded expressions (tables with [n]varchar(max) or varbinary(max) columns may need to be truncated at 8000 characters)
2. It avoids the possibility of accidentally "missing" one of the columns in the hard-coded expressions
3. If a column is added to a table later, this will result in a different hash, where the hard-coded expressions would miss new columns
4. It handles a minor (unlikely) data pattern that the hard-coded approach could miss - null columns in the hard-coded version were discarded, so ('a'), (null,'a'), and ('a',null) would all yield the same hash value (easily solved by adding column delimiters to the hard-coded expression, as in the sketch after this post, but no longer needed)
This is a very useful code pattern I expect to use in more than just this project.
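For comparison with point 4 above, a hedged sketch of the hard-coded style being described, using hypothetical columns (col1 varchar, col2 int, col3 datetime) on the same MyBaseTable; the '|' delimiters are what prevent (null,'a') and ('a',null) from collapsing to the same concatenated string.

select MBT.refID,
       hashbytes('MD5',
                 isnull(MBT.col1, '')                            + '|' +
                 isnull(convert(varchar(12), MBT.col2), '')      + '|' +
                 isnull(convert(varchar(23), MBT.col3, 121), '')) as [Hash]
from MyBaseTable as MBT;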
Further "interest" comes from comparing XML to JSON in SQL 2016.
context  avg_duration (ms)  avg_cpu_time (ms)
xml      4147.091600        3987.400000
json     4406.513600        4334.400000
This was a test of 1 million rows from a narrow table (3 small columns), run 5 times for each query, with
checkpoint;
dbcc dropcleanbuffers;
dbcc freeproccache;
run between each test (i.e. twice).
That appears to show a roughly 6% win for XML over JSON in duration and about 8% in CPU time for this test.
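For anyone wanting to reproduce the comparison, a hedged sketch of the two queries being timed, written against the MyBaseTable example from earlier in the thread (FOR JSON requires SQL Server 2016; the actual test table and measurement harness were not posted, so this only shows the shape of the queries):

-- XML serialization, as in the accepted answer
select MBT.refID,
       hashbytes('MD5',
                 (select MBT.* from (values(null)) foo(bar) for xml auto)) as [HashXml]
from MyBaseTable as MBT;

-- JSON serialization of the same row (SQL Server 2016 and later)
select MBT.refID,
       hashbytes('MD5',
                 (select MBT.* from (values(null)) foo(bar) for json path)) as [HashJson]
from MyBaseTable as MBT;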