Returning longest substring match in join

  • Imagine we have a table called PathMetaData that describes what's in certain directories (i.e., this directory belongs to Mark and is regarding Project Acme).  We also have a table called Files that lists all the files and subdirectories in these directories. 

    PathMetaData can contain information that might "overlap" multiple directories.  For example, we might have a couple entries like:

    \\One\Two
    \\One\Two\Buckle
    \\One\Two\Buckle\My\Shoe

    In the Files table we might have:

    \\One\Two\document.doc
    \\One\Two\Three\document.doc
    \\One\Two\Buckle
    \\One\Two\Buckle\My\Shoe\document.doc

    I would like to write a query that joins the Files table with PathMetaData table *at the deepest level*.  So even though \\One\Two and \\One\Two\Buckle in PathMetaData might both match \\One\Two\Buckle\document.doc, I want only the join resulting from \\One\Two\Buckle.

    These tables are based on some ideas for manipulating hierarchies in O'Reilly's "Transact-SQL Cookbook", so in both PathMetaData and Files the path is actually the clustered index, so I want to do these joins based on LIKE (especially because both of these tables will contain potentially millions of rows).  Here are some simplified versions of the tables and data:

    DECLARE @PathMetaData TABLE (

     PathID Int Identity,

     Path VarChar(128),

     Author VarChar(64),

     Project VarChar(64))

    DECLARE @Files TABLE (

     FileID Int Identity,

     FilePath VarChar(128))

    INSERT INTO @PathMetaData Values('\\One\Two\', '0100', '001')

    INSERT INTO @PathMetaData Values('\\One\Two\Buckle\', '0100', '002')

    INSERT INTO @PathMetaData Values('\\One\Two\Buckle\My\Shoe\', '0100', '004')

    INSERT INTO @PathMetaData Values('\\Three\Four\', '0200', '001')

    INSERT INTO @PathMetaData Values('\\Three\Four\Close\The\Door\', '0300', '005')

    INSERT INTO @Files VALUES('\\One\Two\document.doc')

    INSERT INTO @Files VALUES('\\One\Two\Three\document.doc')

    INSERT INTO @Files VALUES('\\One\Two\Buckle\document.doc')

    INSERT INTO @Files VALUES('\\One\Two\Buckle\My\document.doc')

    INSERT INTO @Files VALUES('\\Three\Four\document.doc')

    INSERT INTO @Files VALUES('\\Three\Four\Close\document.doc')

    INSERT INTO @Files VALUES('\\Three\Four\Close\The\Door\document.doc')

     

    Here's one query that I've been using and it gives correct results but surely there's a better way to write this:

    SELECT maxdepth.FilePath, pmd.* FROM

     (

     SELECT fi.FilePath, MAX(Len(pmd.Path)) MaxLen

      FROM @PathMetaData pmd JOIN @Files fi

       ON fi.FilePath LIKE pmd.Path + '%'

      GROUP BY fi.FilePath) MaxDepth JOIN @PathMetaData pmd

      ON MaxDepth.FilePath LIKE pmd.Path + '%' AND MaxDepth.MaxLen = Len(pmd.Path)

     ORDER BY maxdepth.FilePath

    Any thoughts on this would be greatly appreciated.

    Thanks

    Mark

  •  
    select maxdepth.FilePath, pmd.* FROM

    (

    SELECT fi.FilePath, MAX(pmd.Path) [Path]

    FROM @PathMetaData pmd
    JOIN @Files fi
    ON fi.FilePath LIKE pmd.Path + '%'
    GROUP BY fi.FilePath) maxdepth

    JOIN

    @PathMetaData pmd

    ON MaxDepth.Path = pmd.Path
    ORDER BY maxdepth.FilePath
     
    --this version of the would be more efficient with an indexed  calculated column:
    --better still, have separate path and filename columns in the table.
    --even better, remove the filepath literal from the @Files table
    --and replace it with a foreign key pointing to 
    --a narrow surrogate key on @PathMetaData.
     
    SELECT fi.FilePath, pmd.*
    FROM @PathMetaData pmd
    JOIN @Files fi
    ON left(fi.FilePath,len(fi.Filepath) + 1 - charindex('/',reverse(fi.FilePath))) = pmd.Path

    [edit: the second one requires an exact match for the longest subpath]

     

    Tim Wilkinson

    "If it doesn't work in practice, you're using the wrong theory"
    - Immanuel Kant

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply