STRINGS - Common portion of a string within a group....

Question

Post reply

STRINGS - Common portion of a string within a group....

sgmunson

SSC Guru

Points: 110639
More actions
June 14, 2010 at 8:04 am

#236437

Here's the grief: How might you go about determining, using T-SQL queries, the portion of all strings within a group of records (only 1 field involved, and always starting from the 1st character) that are in common?
The data represents file paths, and I need to identify for each group of records (group based on another field in the table), what all values for a given string field have in common for that group, starting at the 1st character of the field. There is no need to seek characters in common beyond the 1st character that differs for any given comparison between 2 records.
This just sounds like one rather painful task, and I'm seeking an approach that might be easily implemented. Anyone?
Steve
(aka sgmunson)
:w00t::w00t::w00t:
Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 1

Hi Steve

Any chance of laying this out visually, mate? I reckon I know what you mean, but a picture would really help.

Cheers

ChrisM

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

sgmunson SSC Guru Points: 110639 More actions · Answer 2

It's like this - pretend your data is like this:

GRP_FLD PATH_FLD

AABBCCD \\server1\share1\apath1\abc

AABBCCD \\server1\share1\apath1\def

AABBCCD \\server1\share1\apath1\ghi

EEFFGGH \\server2\share2\apath1\abc

EEFFGGH \\server2\share2\bpath2\abc

What is needed is for each value in the GRP_FLD, determine the characters in the PATH_FLD that are in common for ALL records where GRP_FLD has that value. Given the above data, here's the expected result:

GRP_FLD PATH_FLD

AABBCCD \\server1\share1\apath1EEFFGGH \\server2\share2

Does that make sense? Also, because these are file paths, what is in common has to end in a backslash, so even if two records in the same group were to look like:

\\server1\share1\abcdef_ghi

\\server1\share1\abcdef_jkl

The items in common would not be: \\server1\share1\abcdef_

but instead would be: \\server1\share1

Does that clarify?

Steve

(aka sgmunson)

:-):-):-)

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

Yeah that's awesome, thanks Steve.

This is what I'd do: break each string into words (as columns) using the string-splitter of your choice. First delimiter is space-backslash-backslash, subsequent delimiters are backslash. Run an aggregate against the result, grouping on the new columns. This gives 5 new columns in your sample data set. In theory it could be much higher, in practice it's ... heh have a look and see 🙂

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

sgmunson SSC Guru Points: 110639 More actions · Answer 4

Ok, that's a pretty decent idea, but then I have to see which group level is common across the group, which amounts to the same problem I have to begin with. Thus I've opted for something a little simpler. I realized that within a group, the maximum and minimum file path string values are sufficient for a comparison, so I used the following function to do the comparison and return the portion of the file path I'm looking for, and sure enough, it worked like a charm...

Here's the function I used:

CREATE FUNCTION dbo.fn_FILEPATH_COMPARE(

@PATH1 AS varchar(500),

@PATH2 AS varchar(500)

)

RETURNS varchar(500)

AS

BEGIN

DECLARE @POS AS INT, @LENP1 AS INT, @LENP2 AS INT, @RESULT AS varchar(500)

DECLARE @MAX_POS AS INT, @SPOS AS INT

SET @PATH1 = LOWER(@PATH1)

SET @PATH2 = LOWER(@PATH2)

SET @POS = 1

SET @LENP1 = LEN(@PATH1)

SET @LENP2 = LEN(@PATH2)

IF @LENP1 <= @LENP2

BEGIN

SET @MAX_POS = @LENP1

END

ELSE

BEGIN

SET @MAX_POS = @LENP2

END

SET @SPOS = 0

WHILE @POS <= @MAX_POS

BEGIN

IF SUBSTRING(@PATH1, @POS, 1) = '\'

BEGIN

SET @SPOS = @POS

END

IF SUBSTRING(@PATH1, @POS, 1) = SUBSTRING(@PATH2, @POS, 1)

BEGIN

SET @POS = @POS + 1

END

ELSE

BEGIN

SET @MAX_POS = @POS - 1

END

SET @RESULT = SUBSTRING(@PATH1, 1, @POS)

IF RIGHT(@RESULT,1) <> '\'

BEGIN

SET @RESULT = LEFT(@RESULT, @SPOS)

END

RETURN @RESULT

END

Please don't just bash the while loop... it may be a necessity, and having already eliminated what I thought would be a rather huge number of record comparisons down to just one per group, I've eliminated 99% of the performance problem right away. On the server involved, producing 56 output records from a source with over 3000 showed zero elapsed time. Even if this were scaled up 10 times, the result would likely remain nearly instantaneous, and even if it took 30 seconds, that would be no big deal, as this is more of a one-off than some kinid of production process.

Thanks for the inspiration Chris, as just hearing about the idea of splitting the path into sections got me thinking about the comparison process, and that was the key to realiziing that there only needed to be one comparison per group - minimum vs. maximum. This then short-circuited worrying about how many comparisons there would be, so just getting down and dirty and going char to char in the strings would be more than adequate.

Steve

(aka sgmunson)

:-):-):-)

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

WayneS SSC Guru Points: 95409 More actions · Answer 5

Since you're on 2008, and dealing with file system paths, I'm wondering if you can do something with the new hierarchyid data type.

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

WayneS SSC Guru Points: 95409 More actions · Answer 6

A set-based approach (i.e. getting rid of those while loops) for doing this would be:

declare @Path1 varchar(100), @Path2 varchar(100), @PathLength int

select @Path1 = '\\server1\share1\apath1\abc',

@Path2 = '\\server1\share1\apath1\def',

@PathLength = LEN(@Path1),

@PathLength = CASE WHEN LEN(@Path1) > LEN(@Path2) THEN LEN(@Path1) ELSE LEN(@Path2) END

-- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

-- at http://www.sqlservercentral.com/articles/T-SQL/62867/.

;WITH

Tens (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

Thousands(N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),

Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),

Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions),

Diff (N) AS (SELECT MIN(N) FROM Tally

WHERE SUBSTRING(@Path1, N, 1) <> SUBSTRING(@Path2, N, 1)

AND N <= @PathLength)

SELECT LEFT(@Path1, N-1) --<<<<< change to -2 if you don't want the trailing backslash >>>>>--

FROM Diff

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2

sgmunson SSC Guru Points: 110639 More actions · Answer 7

Wayne,

My WITH clause that precedes the function calls had to be prefixed to your code, and as yours has to do ALL the extra comparisons, it loses on performance by a huge factor (10016 ms vs. just 46 ms for my code). Once I took out the Millions CTE, yours got down to just 90 ms, which is still almost twice the time my code takes. Sometimes set-based is no match for a well designed loop. Admittedly, I don't see such circumstances often, but this is another one to add to the list. Thanks for your efforts!

Steve

(aka sgmunson)

:-):-):-)

Steve (aka sgmunson) 🙂 🙂 🙂
Rent Servers for Income (picks and shovels strategy)

WayneS SSC Guru Points: 95409 More actions · Answer 8

Steve,

Thanks for the comparison.

Note that I include the in-line tally table so that it will run for folks without a real one. If you have a permanent one, yank out all of the code that builds it (just leave the code that references it), and it should be a lot faster. (A few months back, doing just this took a 6hr process to 13 seconds.) (Or, just prefix the schema (dbo.) in front of tally in the line it's being selected from.)

Wayne
Microsoft Certified Master: SQL Server 2008
Author - SQL Server T-SQL Recipes

If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!

Links:
For better assistance in answering your questions
Performance Problems
Common date/time routines
Understanding and Using APPLY Part 1 & Part 2