Join two tables that do not have matching keys

  • I have two tables like this:

    Items

    ItemID (PK)Title
    101iPhone 5
    102iPhone 6
    103Blackberry Storm
    104Samsung Galaxy S1
    105Samsung Galaxy S6

    Sales

    SaleID (PK)ReferenceTitle
    1103-B00Blackberry Storm
    2101-IK5iPhone 5
    3105-ISGSamsung Galaxy S6
    4104-UN3Samsung Galaxy S1
    5101-IK5iPhone 5
    6102-HWBiPhone 6
    7105-ISGSamsung Galaxy S6
    8101-IK5iPhone 5
    9102-HWBiPhone 6

    I want to join them using "ItemID" column in Items table and "Reference" column in Sales table. As you can see, the values of two columns are not matching, but the part before the hyphen in each value of "Reference" column is matched with values in "ItemID" column. How can I join them? I know it's odd to have two columns with such values, but I did not create these tables and must play with what's available. Thanks

  • Join on LEFT(3, reference) - assuming the nrs will always be less than a 1000.

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    This thing is addressing problems that dont exist. Its solution-ism at its worst. We are dumbing down machines that are inherently superior. - Gilfoyle

  • DECLARE @Ref VARCHAR(7) = '103-B00';
    SELECT LEFT(@Ref,CHARINDEX('-',@Ref,1)-1);

  • The problem is that isolation of the part of the column from your Sales table is going to need to be calculated for every row in the table.  This isn't every row in the result set (because it isn't built yet) but every row in the table.  As the number of rows goes up, so will the performance problems.  This is only of the problems with having values concatenated into one column instead of keeping the values separate.

    If you're stuck with the existing structure, can you add a  column?  Specifically, can you add a persisted computed column to hold the value of the foreign key?  If so, you'll isolate the calculation of your value to when the row is written and save yourself from having to calculate it over the whole table every time the tables are joined.  If not, then you'll save yourself some work in the future by using the approach pietlinden suggested.  The work you save will be when the values to isolate goes above 999, when you'll have to switch to finding the position of the delimiter anyway.

    The best way is to fix the table structure, but I understand if you're stuck with it.

  • This was removed by the editor as SPAM

  •  CREATE TABLE #items ---- Build test tables, credit to the great Jeff Moden for this bit.
    (
        ItemID INT IDENTITY(101,1)
        ,Title VARCHAR(6)
    )
    ALTER TABLE #items
    ADD CONSTRAINT PK_ItemID PRIMARY KEY CLUSTERED (ItemID) WITH FILLFACTOR = 90
    GO

    INSERT INTO #items
    SELECT TOP 100 -- Create 100 unique items
      Title   = CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
      + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
         + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
          + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
          + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
          + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))

    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2
    GO

    CREATE TABLE #sales
    (
      SaleID INT IDENTITY(1,1)
      ,Reference  VARCHAR(20)
      ,Title   VARCHAR(40)
    )

    ALTER TABLE #sales
    ADD CONSTRAINT PK_SaleID PRIMARY KEY CLUSTERED (SaleID) WITH FILLFACTOR = 90

    GO
    INSERT INTO #sales
     
    SELECT ---- Add 10000 sales for each item, 1000000 rows in total
    CAST(i.itemID AS VARCHAR(6)) + '-' + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
      + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A'))
         + CHAR(ABS(CHECKSUM(NEWID())) % (ASCII('Z')-ASCII('A')+1) +ASCII('A')) AS Reference
      ,i.Title 
    FROM #Items i
    GO 10000

    /*
    The actual query starts here
    */
    SELECT
      i.ItemID
      ,i.Title
      ,der.CleanReference
      ,der.Reference
      ,der.Title
      ,der.SaleID
    FROM #Items i
    JOIN (
      SELECT
      CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) AS CleanReference
      ,s.Reference
      ,s.Title
      ,s.SaleID
      FROM #Sales s
     
    ) der ON der.CleanReference = i.ItemID
    DROP TABLE
        #Items
        ,#sales

    Does this do what you want?  I can't help thinking it's too obvious and there's a glaring flaw though,  The whole thing runs in about 22 seconds when the tables are built each time and in about 10 seconds when the tables aren't dropped.  That's on a crappy test box so on something with a bit of oomph it might be pretty quick.

    --- Edited to make the data more realistic


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • SELECT *

    FROM Items i

    INNER JOIN Sales s

    ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

    -- Retains SARGability of s.Reference

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • ChrisM@Work - Wednesday, March 22, 2017 6:25 AM

    SELECT *

    FROM Items i

    INNER JOIN Sales s

    ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

    -- Retains SARGability of s.Reference

    I'm not sure what the advantages of that are Chris, I was under the impression that functions in JOINs was to be avoided.  It's considerably slower than my version on my laptop against the same million row tables.

    --Edited to correct a mistake in terminology.


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • Neil Burton - Wednesday, March 22, 2017 6:53 AM

    ChrisM@Work - Wednesday, March 22, 2017 6:25 AM

    SELECT *

    FROM Items i

    INNER JOIN Sales s

    ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

    -- Retains SARGability of s.Reference

    I'm not sure what the advantages of that are Chris, I was under the impression that functions in WHERE clauses were to be avoided.  It's considerably slower than my version on my laptop against the same million row tables.

    Put an index on s.reference and restrict the output columns to i.ItemID and s.Reference.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • CREATE INDEX ix_Stuff ON #sales (Reference) INCLUDE (Title)

    SELECT i.ItemID, s.Title

    FROM #Items i

    INNER JOIN #Sales s

    ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

    -- 1,000,000 / 00:00:04

    Execution plan shows an index seek for #Sales.

    Edit: "blackholing" the output to eliminate return & display of the 1000000-row result set.

    SET

    STATISTICS TIME ON

    DECLARE @ItemID INT, @Title VARCHAR(40)

    SELECT @ItemID = i.ItemID, @Title = s.Title

    FROM #Items i

    INNER JOIN #Sales s

    ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'

    SET STATISTICS TIME OFF

    -- (1 row(s) affected)

    -- SQL Server Execution Times:

    -- CPU time = 1214 ms, elapsed time = 402 ms.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • I see. It's logical that a function on the right of an ON (which is what I meant when I said WHERE clause above) won't affect SARGability.

    It certainly speeds things up having the index in place.  Over the million rows my version still seems to edge it though.  When there is a WHERE clause in place mine is definitely faster.

    SELECT
    i.ItemID
    ,der.Reference
    FROM #Items i
    JOIN (
    SELECT
    CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) AS CleanReference
    ,s.Reference
    FROM #Sales s

    ) der ON der.CleanReference = i.ItemID
    WHERE der.Reference LIKE '10%'

    SELECT
    i.itemid
    ,s.reference
    FROM #Items i

    INNER JOIN #Sales s ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'
    WHERE s.Reference LIKE '10%'

    Out of those two queries mine is consistently much quicker.


    On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
    —Charles Babbage, Passages from the Life of a Philosopher

    How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

  • Neil Burton - Wednesday, March 22, 2017 8:33 AM

    I see. It's logical that a function on the right of an ON (which is what I meant when I said WHERE clause above) won't affect SARGability.

    It certainly speeds things up having the index in place.  Over the million rows my version still seems to edge it though.  When there is a WHERE clause in place mine is definitely faster.

    SELECT
    i.ItemID
    ,der.Reference
    FROM #Items i
    JOIN (
    SELECT
    CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT) AS CleanReference
    ,s.Reference
    FROM #Sales s

    ) der ON der.CleanReference = i.ItemID
    WHERE der.Reference LIKE '10%'

    SELECT
    i.itemid
    ,s.reference
    FROM #Items i

    INNER JOIN #Sales s ON s.Reference LIKE CAST(i.ItemID AS VARCHAR(50)) + '-%'
    WHERE s.Reference LIKE '10%'

    Out of those two queries mine is consistently much quicker.

    It's quicker because a hash match will process 1000000 rows faster than 1000000 individual seeks. LIKE won't play with a hash join.
    If you change the row restriction to a filter on #Item (quite likely in the wild) you see the nested loops join (with seeks) overtakes the blunt-force hash join, by a significant amount:

    --Run each batch individually several times

    DECLARE@ItemID INT, @Reference VARCHAR(50),@Title VARCHAR(50) = (SELECT TOP 1Title FROM #Items)

    SET STATISTICS TIME ON

    SELECT

      @ItemID = i.ItemID

      ,@Reference= der.Reference

    FROM#Items i

    JOIN (

      SELECT

      CAST(LEFT(s.Reference,CHARINDEX('-',s.Reference)-1) AS INT)  ASCleanReference

      ,s.Reference

      FROM#Sales s

     

    )der ON der.CleanReference = i.ItemID

    WHERE i.Title= @Title

    SET STATISTICS TIME OFF

    --CPU time = 672 ms,  elapsed time = 134ms.

    GO

     

    DECLARE@ItemID INT, @Reference VARCHAR(50),@Title VARCHAR(50) = (SELECT TOP 1Title FROM #Items)

    SET STATISTICS TIME ON

    SELECT 

           @ItemID = i.itemid

           ,@Reference= s.reference

    FROM#Items i

    INNER JOIN#Sales s

           ON s.ReferenceLIKE CAST(i.ItemIDAS VARCHAR(50)) + '-%'

    WHERE i.Title= @Title

    SET STATISTICS TIME OFF

    --CPU time = 16 ms,  elapsed time = 9 ms.

    GO

     

    DECLARE@ItemID INT, @Reference VARCHAR(50),@Title VARCHAR(50) = (SELECT TOP 1Title FROM #Items)

    SET STATISTICS TIME ON

    SELECT 

           @ItemID = i.itemid

           ,@Reference= s.reference

    FROM#Items i

    INNER JOIN#Sales s

           ON s.ReferenceBETWEEN CAST(i.ItemIDAS VARCHAR(50)) AND CAST(i.ItemIDAS VARCHAR(50)) + '¬'

    WHERE i.Title= @Title

    SET STATISTICS TIME OFF

    --CPU time = 16 ms,  elapsed time = 5 ms.

    GO

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply