Find a sub string in a column based on values in a reference table

  • Good morning, 
      I am not quite sure how to approach the problem. I have a table for customer and vendor with addresses in them. In many cases the city name is not in the right place. I have a table with a list of valid cities by country.

    What I want to do is for each customer or vendor look in one column and see if there is a city name in there based on my reference table.

    So if in Address it has '123 Ocean BL Long Beach' I want to be able to find Long Beach in there.

    Thoughts on how I go after this?

    TIA

  • SELECT v.Address, c.City
    FROM Vendors v
    LEFT JOIN Cities c ON v.Address = '%' + c.City + '%'

    Edit:- added c.City to SELECT list.

    John

  • John Mitchell-245523 - Thursday, December 21, 2017 8:41 AM

    SELECT v.Address
    FROM Vendors v
    LEFT JOIN Cities c ON v.Address = '%' + c.City + '%'

    John

    Important that you understand that with a large number of rows, this will not perform well, because a normal index will not help.   You'd need a FULL TEXT INDEX to be able to get that kind of search to perform better, which would take a rather sizable chunk of disk space and a lot of complexity.   However, if this is a one-time search, then it takes as long as it takes.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I figured performance would be sluggish. 
    This isn't a regular production run. We are moving data to a new system and would like to clean some stuff up. The legacy systems did not account for non USA based addresses so a lot of them are buggered up with city in the address line and country in the state or zip code line. 
    Performance is not an issue. It is being run on a non prod server and if it takes a week to run that is fine.

  • This is not working with the % but it does work without it. 

    SELECT   ,[Name]  ,[Address]  ,[Address2]  ,c.[City]  ,r.[city_name]  ,[State]  ,[ZipCode]  ,[TerritoryCode]  ,[CountryCode]
    FROM [Customer] c
    left join [xRefCityNames] r ON c.Address2 = r.[city_name]
    where c.city = '' and address2 = 'new delhi'

    This works but when I use   ON c.Address2 = '%' + r.[city_name] + '%' 
    I get NULL for city_name. 

    Thoughts?

  • fparker 20089 - Thursday, December 21, 2017 9:31 AM

    This is not working with the % but it does work without it. 

    SELECT   ,[Name]  ,[Address]  ,[Address2]  ,c.[City]  ,r.[city_name]  ,[State]  ,[ZipCode]  ,[TerritoryCode]  ,[CountryCode]
    FROM [Customer] c
    left join [xRefCityNames] r ON c.Address2 = r.[city_name]
    where c.city = '' and address2 = 'new delhi'

    This works but when I use   ON c.Address2 = '%' + r.[city_name] + '%' 
    I get NULL for city_name. 

    Thoughts?

    Yes, it should be LIKE not =
     ON c.Address2 LIKE '%' + r.[city_name] + '%' 

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • That does work except I will get 2 results for that one. I have both New Dehli and Dehli as cities which are both in my city list for India. Not bad for a cleanup effort though.

  • fparker 20089 - Thursday, December 21, 2017 9:52 AM

    That does work except I will get 2 results for that one. I have both New Dehli and Dehli as cities which are both in my city list for India. Not bad for a cleanup effort though.

    That is unavoidable when trying to clean up data.  You will get false positives and may miss records.  You could rewrite your query to only give one, but that might cause you to miss more records.

    Here is one rewrite that will only ever give you one possible match.
    SELECT [Name] ,[Address] ,[Address2] ,c.[City] ,r.[city_name] ,[State] ,[ZipCode] ,[TerritoryCode] ,[CountryCode]
    FROM [Customer] c
    CROSS APPLY
    (
        SELECT TOP(1) r.city_name
        FROM [xRefCityNames] r
        WHERE c.Address2 LIKE '%' + r.[city_name] + '%'
        ORDER BY LEN(r.city_name) DESC
    ) r

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Thanks a bunch. I think I am good, this gives me a better look than I had before. Some of these will need to be eyeballed anyway so having a few extras won't hurt. 

    I think taking that top on the longest one will work.

  • fparker 20089 - Thursday, December 21, 2017 10:17 AM

    Thanks a bunch. I think I am good, this gives me a better look than I had before. Some of these will need to be eyeballed anyway so having a few extras won't hurt. 

    I think taking that top on the longest one will work.

    That's exactly what the CROSS APPLY is doing.  You can also use a CTE with ROW_NUMBER().

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Is it possible to turn this around and get city names  as a substring that are not in my database at all?

  • fparker 20089 - Tuesday, January 2, 2018 8:49 AM

    Is it possible to turn this around and get city names  as a substring that are not in my database at all?

    I'm not sure - are there any delimiters in the string?  If not, how are you going to know what part of the string represents the city?

    John

  • I would just use the entire string in the city field. I could possibly also read up to a comma in the field as in Springfield, MO. I was also thinking the same thing for postal code in the postal code field.
    I have a large number of systems with duplicate data and I am trying to match it up and also to validate it.

  • fparker 20089 - Thursday, December 21, 2017 9:52 AM

    That does work except I will get 2 results for that one. I have both New Dehli and Dehli as cities which are both in my city list for India. Not bad for a cleanup effort though.

    EDIT: In my excitement to contribute, I didn't notice that Drew had essentially the same solution!  Apologies!

    I had this EXACT dilemma with something this week (in my case it was finding vendor names in a string) and almost posted here as well until I came up with a solution that should also work for you.  I use PATINDEX and the trick is that when I find more than one match, I rank the matches in descending order by the length of the city name I'm trying to match.  If I match the longest version, then that's the one I use.  This solved my problem so hopefully it will help you as well!

    Crude tables below simply for demonstration purposes.


    -- Create a sample City table
    IF OBJECT_ID('tempdb..#CityList', 'U') IS NOT NULL
      DROP TABLE #CityList

    CREATE TABLE #CityList
      (CityName VARCHAR(50),
      StateName VARCHAR(50))

    INSERT INTO #CityList
    VALUES ('Acton', 'Massachusetts'),
       ('West Acton', 'Massachusetts'),
       ('San Francisco', 'California'),
       ('Portland', 'Oregon')

    -- Create a sample Customer table
    IF OBJECT_ID('tempdb..#Customer', 'U') IS NOT NULL
      DROP TABLE #Customer

    CREATE TABLE #Customer
      (CustomerID INT,
      CustomerAddress VARCHAR(100))

    INSERT INTO #Customer
    VALUES (1, '1000 Avalon Dr, Acton'),
       (2, '1315 NW 11th Ave, Portland'),
       (3, '900 Stanton Blvd., West Acton'),
       (4, '53 Brook St, Acton'),
       (5, '100 Polk St, San Francisco'),
       (6, '355 Berry St, San Francisco'),
       (7, '117 Central St, Acton'),
       (8, '685 Belmont St, Portland'),
       (9, '925 Brooks, West Acton'),
       (10, '923 Folsom, San Francisco')

    -- Find the correct city based on CutsomerAddress
    ; WITH CustomerAddressMatch AS
    (SELECT c.CustomerID,
       c.CustomerAddress,
       cl.CityName,
       CityNameLength = LEN(cl.CityName),
       MatchRank = ROW_NUMBER() OVER (PARTITION BY c.CustomerID ORDER BY LEN(cl.CityName) DESC)
    FROM #Customer c
    CROSS JOIN #CityList cl
    WHERE PATINDEX('%' + cl.CityName + '%', c.CustomerAddress) > 0
    )

    -- Just select from the cte to review the details for the logic
    -- SELECT * FROM CustomerAddressMatch

    SELECT *
    FROM CustomerAddressMatch
    WHERE MatchRank = 1

    Results:

    Good luck!
    Lisa

  • Thanks a bunch! One can never have too many tools in the toolbox

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply