Character matching into patterns plus counts

  • In my case I am trying to see if there has been anyone who has devised a way to obtain an output that would traverse through all records for a given column and find all the patterns (whether it returns REGEX or the VALUES), then count how many times that exists. I know a lot of 1 counts would come back, so I more than likely would include a HAVING > 5 for instance. I do realize I need to have this formatted a certain way and have failed to do so, but I'm under the gun and promise the next time I will format correctly (or come back and edit this).

    Table name = dbo.PEOPLE

    Columns =

    LogID [int] IDENTITY(1,1) NOT NULL

    ,SourceData [varchar] (50) NOT NULL --which could include values that have numeric, alpha, and special characters

    ,Account [nvarchar] (50) NULL

    EXAMPLE DATA SET of table (pipe delimited)

    1|123X-456|56789

    2|456/123|A345

    3|ABC 345|X567

    4|456 HELP-123|YCHB

    5|456/987|2345

    6|CDE 345|XX987

    7|345X-456|YY6767

    So what I would like my output to look like is two columns - PATTERN|COUNT

    %X-456|2 --pulling in LogID rows 1 and 7, finding the like pattern and omitting everything else replacing with wildcard

    4%|3 --pulling in LogID rows 2,4 & 5

    45%|3 --pulling in LogID rows 2,4 & 5

    456%|3 --pulling in LogID rows 2,4 & 5

    456/%|2 --pulling in LogID rows 2 and 5

    % 345|2 --pulling in LogID rows 3 and 6

    %-4%|2 --pulling in LogID rows 1 and 7

    %-45%|2 --pulling in LogID rows 1 and 7

    %-456|2 --pulling in LogID rows 1 and 7

    Now, if someone has something that returns REGEX pattern vs. actual pattern values, then I would work with that.

    Essentially I figured I would need to traverse through each character of a field, after reading the LEN of the field, and starting with position/char 1 reading each character until the end and writing each character to a TEMP with a cursor/CTE. Then coming back through and seeing how many times I can match that pattern where it is, for instance > 5 as stated at first. I know the execution plan will be robust and consume a lot of my SQL server resources, but it has to be easier than hiring people to find patterns.

  • Since your using 2005...

    DECLARE @people TABLE (Logid INT IDENTITY(1,1), SourceData NVARCHAR(50), Account NVARCHAR(50))

    INSERT INTO @people (SourceData, Account)

    SELECT '123X-456','56789' UNION ALL

    SELECT '456/123','A345' UNION ALL

    SELECT 'ABC 345','X567' UNION ALL

    SELECT '456 HELP-123','YCHB' UNION ALL

    SELECT '456/987','2345' UNION ALL

    SELECT 'CDE 345','XX987' UNION ALL

    SELECT '345X-456','YY6767'

    SELECT * FROM @people

    You're welcome. 😉


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • Here's what I came up with. You'll have to populate a table with your search patterns.

    DECLARE @people TABLE (Logid INT IDENTITY(1,1), SourceData NVARCHAR(50), Account NVARCHAR(50))

    DECLARE @check TABLE (Pattern NVARCHAR(10))

    INSERT INTO @check

    SELECT '%X-456' UNION ALL

    SELECT '4%' UNION ALL

    SELECT '45%' UNION ALL

    SELECT '456%' UNION ALL

    SELECT '456/%' UNION ALL

    SELECT '% 345' UNION ALL

    SELECT '%-4%' UNION ALL

    SELECT '%-45%' UNION ALL

    SELECT '%-456'

    INSERT INTO @people (SourceData, Account)

    SELECT '123X-456','56789' UNION ALL

    SELECT '456/123','A345' UNION ALL

    SELECT 'ABC 345','X567' UNION ALL

    SELECT '456 HELP-123','YCHB' UNION ALL

    SELECT '456/987','2345' UNION ALL

    SELECT 'CDE 345','XX987' UNION ALL

    SELECT '345X-456','YY6767'

    --SELECT * FROM @people

    --SELECT * FROM @check

    SELECT

    c.pattern,

    COUNT(p.LogID) AS [Count]

    FROM

    @people p

    LEFT JOIN @check c ON p.SourceData LIKE c.pattern

    GROUP BY

    c.pattern

    Cheers,


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • Thank you for posting the code formatted correctly - do you, by chance, have a link to the formatting code for this forum regarding "How-To", so that I can do just as you did?

  • To create a table that houses all the patterns would take much too long. The rows in the actual tables are in the millions and I am looking for the logic that would do the work and traverse through finding all the characters and building all the patterns. Sort of like the Monte-Carlo simulation. http://westclintech.com/Blog/tabid/132/EntryId/28/Monte-Carlo-Simulation-in-SQL-Server.aspx

  • TCcool (6/26/2015)


    Thank you for posting the code formatted correctly - do you, by chance, have a link to the formatting code for this forum regarding "How-To", so that I can do just as you did?

    Not a problem...I really should add it to my signature...so I did. 😀


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • TCcool (6/26/2015)


    To create a table that houses all the patterns would take much too long. The rows in the actual tables are in the millions and I am looking for the logic that would do the work and traverse through finding all the characters and building all the patterns. Sort of like the Monte-Carlo simulation. http://westclintech.com/Blog/tabid/132/EntryId/28/Monte-Carlo-Simulation-in-SQL-Server.aspx

    There is no good way I know of doing this and certainly not very efficiently. You want to extract every possible pattern derived from your column. Now multiply that by every possible way you could extract a pattern from a single string. So let's take a simple 3 letter string 'ABC', some of your patterns could look like.

    ABC, _BC, A_C, AB_, A__, _B_, __C, ___, A%, AB%, A%C, %C, %BC, %B% ***Remember even though AB_ & AB% would return the same for three letter words you want to compare them to all other strings with varying lengths so now AB_ & AB% would yield different results.

    So with that in mind ONE string up to 50 characters long (in your case) could generate hundreds of possible patterns. Multiply that out by the millions of rows you indicated and you end up with the mother of all cursors that would take forever to finish. That's before you even eliminated your duplicate patterns and then compared them back to the original strings so you could see how many each of those appear.

    That is if I understand you correctly...


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • You understood me correctly - now, I can say that the percentage of varying matches, where I would put a having clause > X to reduce what is returned, would be about 10% of all records - which are the distinct values where there would be a solid pattern match occurring multiple times. I am in the same boat as many others are, which is why I have posed this question to many forums, with no good answer. If I had a good starting point though I could work off that.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply