STRINGS - Common portion of a string within a group....

  • Here's the grief: How might you go about determining, using T-SQL queries, the portion of all strings within a group of records (only 1 field involved, and always starting from the 1st character) that are in common?

    The data represents file paths, and I need to identify for each group of records (group based on another field in the table), what all values for a given string field have in common for that group, starting at the 1st character of the field. There is no need to seek characters in common beyond the 1st character that differs for any given comparison between 2 records.

    This just sounds like one rather painful task, and I'm seeking an approach that might be easily implemented. Anyone?

    Steve

    (aka sgmunson)

    :w00t::w00t::w00t:

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • Hi Steve

    Any chance of laying this out visually, mate? I reckon I know what you mean, but a picture would really help.

    Cheers

    ChrisM

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • It's like this - pretend your data is like this:

    GRP_FLD PATH_FLD

    AABBCCD \\server1\share1\apath1\abc

    AABBCCD \\server1\share1\apath1\def

    AABBCCD \\server1\share1\apath1\ghi

    EEFFGGH \\server2\share2\apath1\abc

    EEFFGGH \\server2\share2\bpath2\abc

    What is needed is for each value in the GRP_FLD, determine the characters in the PATH_FLD that are in common for ALL records where GRP_FLD has that value. Given the above data, here's the expected result:

    GRP_FLD PATH_FLD

    AABBCCD \\server1\share1\apath1EEFFGGH \\server2\share2

    Does that make sense? Also, because these are file paths, what is in common has to end in a backslash, so even if two records in the same group were to look like:

    \\server1\share1\abcdef_ghi

    \\server1\share1\abcdef_jkl

    The items in common would not be: \\server1\share1\abcdef_

    but instead would be: \\server1\share1

    Does that clarify?

    Steve

    (aka sgmunson)

    :-):-):-)

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • Yeah that's awesome, thanks Steve.

    This is what I'd do: break each string into words (as columns) using the string-splitter of your choice. First delimiter is space-backslash-backslash, subsequent delimiters are backslash. Run an aggregate against the result, grouping on the new columns. This gives 5 new columns in your sample data set. In theory it could be much higher, in practice it's ... heh have a look and see πŸ™‚

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Ok, that's a pretty decent idea, but then I have to see which group level is common across the group, which amounts to the same problem I have to begin with. Thus I've opted for something a little simpler. I realized that within a group, the maximum and minimum file path string values are sufficient for a comparison, so I used the following function to do the comparison and return the portion of the file path I'm looking for, and sure enough, it worked like a charm...

    Here's the function I used:

    CREATE FUNCTION dbo.fn_FILEPATH_COMPARE(

    @PATH1 AS varchar(500),

    @PATH2 AS varchar(500)

    )

    RETURNS varchar(500)

    AS

    BEGIN

    DECLARE @POS AS INT, @LENP1 AS INT, @LENP2 AS INT, @RESULT AS varchar(500)

    DECLARE @MAX_POS AS INT, @SPOS AS INT

    SET @PATH1 = LOWER(@PATH1)

    SET @PATH2 = LOWER(@PATH2)

    SET @POS = 1

    SET @LENP1 = LEN(@PATH1)

    SET @LENP2 = LEN(@PATH2)

    IF @LENP1 <= @LENP2

    BEGIN

    SET @MAX_POS = @LENP1

    END

    ELSE

    BEGIN

    SET @MAX_POS = @LENP2

    END

    SET @SPOS = 0

    WHILE @POS <= @MAX_POS

    BEGIN

    IF SUBSTRING(@PATH1, @POS, 1) = '\'

    BEGIN

    SET @SPOS = @POS

    END

    IF SUBSTRING(@PATH1, @POS, 1) = SUBSTRING(@PATH2, @POS, 1)

    BEGIN

    SET @POS = @POS + 1

    END

    ELSE

    BEGIN

    SET @MAX_POS = @POS - 1

    END

    END

    SET @RESULT = SUBSTRING(@PATH1, 1, @POS)

    IF RIGHT(@RESULT,1) <> '\'

    BEGIN

    SET @RESULT = LEFT(@RESULT, @SPOS)

    END

    RETURN @RESULT

    END

    Please don't just bash the while loop... it may be a necessity, and having already eliminated what I thought would be a rather huge number of record comparisons down to just one per group, I've eliminated 99% of the performance problem right away. On the server involved, producing 56 output records from a source with over 3000 showed zero elapsed time. Even if this were scaled up 10 times, the result would likely remain nearly instantaneous, and even if it took 30 seconds, that would be no big deal, as this is more of a one-off than some kinid of production process.

    Thanks for the inspiration Chris, as just hearing about the idea of splitting the path into sections got me thinking about the comparison process, and that was the key to realiziing that there only needed to be one comparison per group - minimum vs. maximum. This then short-circuited worrying about how many comparisons there would be, so just getting down and dirty and going char to char in the strings would be more than adequate.

    Steve

    (aka sgmunson)

    :-):-):-)

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • Since you're on 2008, and dealing with file system paths, I'm wondering if you can do something with the new hierarchyid data type.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • A set-based approach (i.e. getting rid of those while loops) for doing this would be:

    declare @Path1 varchar(100), @Path2 varchar(100), @PathLength int

    select @Path1 = '\\server1\share1\apath1\abc',

    @Path2 = '\\server1\share1\apath1\def',

    @PathLength = LEN(@Path1),

    @PathLength = CASE WHEN LEN(@Path1) > LEN(@Path2) THEN LEN(@Path1) ELSE LEN(@Path2) END

    -- See Jeff Moden's article The "Numbers" or "Tally" Table: What it is and how it replaces a loop.

    -- at http://www.sqlservercentral.com/articles/T-SQL/62867/.

    ;WITH

    Tens (N) AS (SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL

    SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0 UNION ALL SELECT 0),

    Thousands(N) AS (SELECT 1 FROM Tens t1 CROSS JOIN Tens t2 CROSS JOIN Tens t3),

    Millions (N) AS (SELECT 1 FROM Thousands t1 CROSS JOIN Thousands t2),

    Tally (N) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM Millions),

    Diff (N) AS (SELECT MIN(N) FROM Tally

    WHERE SUBSTRING(@Path1, N, 1) <> SUBSTRING(@Path2, N, 1)

    AND N <= @PathLength)

    SELECT LEFT(@Path1, N-1) --<<<<< change to -2 if you don't want the trailing backslash >>>>>--

    FROM Diff

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Wayne,

    My WITH clause that precedes the function calls had to be prefixed to your code, and as yours has to do ALL the extra comparisons, it loses on performance by a huge factor (10016 ms vs. just 46 ms for my code). Once I took out the Millions CTE, yours got down to just 90 ms, which is still almost twice the time my code takes. Sometimes set-based is no match for a well designed loop. Admittedly, I don't see such circumstances often, but this is another one to add to the list. Thanks for your efforts!

    Steve

    (aka sgmunson)

    :-):-):-)

    Steve (aka sgmunson) πŸ™‚ πŸ™‚ πŸ™‚
    Rent Servers for Income (picks and shovels strategy)

  • Steve,

    Thanks for the comparison.

    Note that I include the in-line tally table so that it will run for folks without a real one. If you have a permanent one, yank out all of the code that builds it (just leave the code that references it), and it should be a lot faster. (A few months back, doing just this took a 6hr process to 13 seconds.) (Or, just prefix the schema (dbo.) in front of tally in the line it's being selected from.)

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply