SQL help - how do we find that a particular string appears x times in a certain field

  • Matt Miller (#4) (8/25/2016)


    I'm sorry if I am taking the fun out of building a SQL only solution, but there are lots of basic samples of SQLCLR showing how to expose the Regex.Matches() method from .NET CLR.

    A good evrsion is located here courtesy of Phil Factor:

    https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/[/url]

    NO, NO, NO!

    Use NGrams8K[/url]. It will perform better than any CLR or regex solution Ive ever seen and you dont need to install anything, learn any new programming languages, change any SQL Server settings or add any new assemblies.

    I'm on a train without SQL Server access at the moment but I'll post something later tonight unless someone beats me to it.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • The content column of the catalog is xml data, so why not query it as such?

    I don't have a sql 2012 RS database to grab the correct namespace and I don't have any DataSourceReference tags in my 2016 data, but the general format of the query would be this for 2016.

    WITH XMLNAMESPACES( default 'http://schemas.microsoft.com/sqlserver/reporting/2016/01/reportdefinition')

    SELECT

    cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    FROM [ReportServer].[dbo].[Catalog]

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Thx but that is not really the challenge...

    What I need is an expression that will give me the 3 letters that follow the nth occurrence of the key word if it exists..

    I mean see the code below, I need the 3 letters that follow the 2nd occurrence of the word <ConnectString>

    ...

    Sorry forgot to paste the code in my previous post, so here it is. What I meant was after the word <DataReference>

    Ok, to get 3 letters that follow the nth occurrence of the key word if it exists using NGrams8K[/url]:

    -- User supplied variables

    DECLARE

    @searchString varchar(100) = '<DataReference>', -- string to search for

    @N tinyint = 2; -- occurance

    SELECT [str], newstring = SUBSTRING([str], position+LEN(@searchstring), 3)

    FROM

    (

    SELECT

    tokenNumber = ROW_NUMBER() OVER (PARTITION BY [str] ORDER BY position),

    [str],

    position

    FROM #t AS t

    CROSS APPLY dbo.NGrams8k([str], LEN(@searchString))

    WHERE token = @searchString

    ) ng

    WHERE tokenNumber = @N;

    As Magoo suggested, because you're working with XML you can just shred it using XQuery. First some sample data:

    -- create sample XML data

    if object_id('tempdb..#t2') is not null DROP TABLE #t2;

    WITH iTally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM (VALUES (1),(1),(1),(1),(1),(1)) x(a)

    ),

    D AS

    (

    SELECT t2.N, LEFT(newid(),5) AS DataReference

    FROM iTally t1, iTally t2

    WHERE t2.N < 6

    )

    SELECT SomeID = N, SampleXML =

    (

    SELECT DataReference

    FROM D D2

    WHERE D.N = D2.N

    FOR XML PATH(''), ROOT('References'), TYPE

    )

    INTO #t2

    FROM D

    GROUP BY N;

    ... and the solution

    DECLARE @N tinyint = 2;

    SELECT SomeId, txt --, SampleXML -- uncomment for testing

    FROM

    (

    SELECT

    SomeId,

    txt = LEFT(x.value('(./text())[1]', 'varchar(100)'),3),

    SampleXML,

    rn =

    ROW_NUMBER() OVER

    (

    PARTITION BY SomeId

    ORDER BY x.value('(../DataReference)[1]', 'varchar(100)')

    )

    FROM #t2 t

    CROSS APPLY t.SampleXML.nodes('*:References/DataReference') tx(x)

    ) AS splitNodes

    WHERE rn = @N;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Matt Miller (#4) (8/25/2016)


    I'm sorry if I am taking the fun out of building a SQL only solution, but there are lots of basic samples of SQLCLR showing how to expose the Regex.Matches() method from .NET CLR.

    A good evrsion is located here courtesy of Phil Factor:

    https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/[/url]

    Hey there ol' friend... you and I went through something similar years ago where regex didn't do so well performance-wise. Have you tried the performance for this solution?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • This seems like a close solution to what I want. However I am getting a syntax error ( listed below ) Can you help me with the syntax.

    WITH XMLNAMESPACES( 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' as pd )

    SELECT

    cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    FROM [ReportServer].[dbo].[Catalog]

    --------

    Msg 9420, Level 16, State 1, Line 1

    XML parsing: line 1, character 3, illegal xml character

  • The problem with the proposed xml solutions, is that the sample data is not xml. It has to be formatted to be xml and that would be more work.

    That's if the sample data actually resembles the real data.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • mw112009 (8/26/2016)


    This seems like a close solution to what I want. However I am getting a syntax error ( listed below ) Can you help me with the syntax.

    WITH XMLNAMESPACES( 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' as pd )

    SELECT

    cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    FROM [ReportServer].[dbo].[Catalog]

    --------

    Msg 9420, Level 16, State 1, Line 1

    XML parsing: line 1, character 3, illegal xml character

    That looks like you might have some invalid xml in the Content of your Catalog.

    Perhaps try it just for one row first, either using TOP(1) or by selecting a specific report.

    Alternatively, you could try this

    SELECT

    cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    FROM (

    SELECT * --Sorry, don't have the actual column names to hand

    FROM [ReportServer].[dbo].[Catalog]

    WHERE TRY_CONVERT(XML, cast([Content] as varbinary(max)) ) IS NOT NULL

    ) Catalog

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • All set for now ... Just by adding a line to the where clause I managed to get rid of that error..

    WORKS fine.. but I do have a second question will post it in a few minutes.

    So the following query works fine.

    WITH XMLNAMESPACES( DEFAULT 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' )

    SELECT

    name

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    FROM [ReportServer].[dbo].[Catalog]

    WHERE

    type = 2

  • Question

    Consider the segment of xml pasted below

    <DataSources>

    <DataSource Name="EDW">

    <DataSourceReference>EDW</DataSourceReference>

    <rd:SecurityType xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner">None</rd:SecurityType>

    <rd:DataSourceID xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner">91905df0-3a05-4d1b-a59b-568a018c7ad1

    </rd:DataSourceID>

    </DataSource>

    </DataSources>

    The Query works fine, except i am not getting a value for the <DataSourceReference> ,

    So the last item in my SELECT STMT needs some syntax correction.

    WITH XMLNAMESPACES( DEFAULT 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' )

    SELECT

    name

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[3]','varchar(100)') as ThirdDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[4]','varchar(100)') as FourthDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[5]','varchar(100)') as FifthDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(DataSourceReference)[1]','VARCHAR(250)') as DataSourceReference1

    FROM [ReportServer].[dbo].[Catalog]

    WHERE

    type = 2

  • mw112009 (8/26/2016)


    Question

    Consider the segment of xml pasted below

    <DataSources>

    <DataSource Name="EDW">

    <DataSourceReference>EDW</DataSourceReference>

    <rd:SecurityType xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner">None</rd:SecurityType>

    <rd:DataSourceID xmlns:rd="http://schemas.microsoft.com/SQLServer/reporting/reportdesigner">91905df0-3a05-4d1b-a59b-568a018c7ad1

    </rd:DataSourceID>

    </DataSource>

    </DataSources>

    The Query works fine, except i am not getting a value for the <DataSourceReference> ,

    So the last item in my SELECT STMT needs some syntax correction.

    WITH XMLNAMESPACES( DEFAULT 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' )

    SELECT

    name

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[3]','varchar(100)') as ThirdDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[4]','varchar(100)') as FourthDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[5]','varchar(100)') as FifthDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(DataSourceReference)[1]','VARCHAR(250)') as DataSourceReference1

    FROM [ReportServer].[dbo].[Catalog]

    WHERE

    type = 2

    You have the wrong XPATH and need to grab the text() node

    WITH XMLNAMESPACES( DEFAULT 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' )

    SELECT

    name

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[1]','varchar(100)') as FirstDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[2]','varchar(100)') as SecondDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[3]','varchar(100)') as ThirdDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[4]','varchar(100)') as FourthDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/@Name)[5]','varchar(100)') as FifthDataSource

    , cast(cast([Content] as varbinary(max)) as xml).value('(//DataSources/DataSource/DataSourceReference/text())[1]','VARCHAR(250)') as DataSourceReference1

    FROM [ReportServer].[dbo].[Catalog]

    WHERE

    type = 2

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Thanks a lot , I need some syntax help here ....

    The syntax is wrong below,

    How would I extract the INITIAL CATALOG and the [Data Source] from the <ConnectString> node ?

    Help is appreciated

    /*

    <?xml version="1.0" encoding="utf-8"?>

    <DataSourceDefinition xmlns="http://schemas.microsoft.com/sqlserver/reporting/2006/03/reportdatasource">

    <Extension>SQL</Extension>

    <ConnectString>Data Source=MHPDATA2;Initial Catalog=CCMSPROD</ConnectString>

    <CredentialRetrieval>Integrated</CredentialRetrieval>

    <Enabled>True</Enabled>

    </DataSourceDefinition>

    */

    If object_id('tempdb..#DataSources') IS NOT NULL DROP TABLE #DataSources;

    create table #DataSources( name varchar(500) , DataSource varchar(500), InitialCatalog varchar(500) );

    WITH XMLNAMESPACES( DEFAULT 'http://schemas.microsoft.com/sqlserver/reporting/2008/01/reportdefinition' )

    SELECT

    name

    , cast(cast([Content] as varbinary(max)) as xml).value('(//ConnectString/text())[1]','varchar(100)') as [Data Source]

    , cast(cast([Content] as varbinary(max)) as xml).value('(//ConnectString/text())[1]','varchar(100)') as [Initial Catalog]

    FROM [ReportServer].[dbo].[Catalog]

    WHERE

    type = 5

  • You need to extract the <ConnectString> text() node as shown previously, then use something like DelimitedSplit8K[/url] to split that string into pieces.

    I'm sorry but I don't have time right now to code that, maybe someone else can help?

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Thanks.. I think I can handle from here.. Shall post the solution later..

  • Jeff Moden (8/26/2016)


    Matt Miller (#4) (8/25/2016)


    I'm sorry if I am taking the fun out of building a SQL only solution, but there are lots of basic samples of SQLCLR showing how to expose the Regex.Matches() method from .NET CLR.

    A good evrsion is located here courtesy of Phil Factor:

    https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/[/url]

    Hey there ol' friend... you and I went through something similar years ago where regex didn't do so well performance-wise. Have you tried the performance for this solution?

    Actually I just did. It's actually impressive, and having revalidated against the CLR solution, at least from my test - Alan's solution does beat CLR. But the interesting thing is that while running it on 2014, on large datasets (20M rows, 2-3k strings with avg 7 matches), on my machine - they are pretty darned close (total run took 8:30):

    Ngrams8k came in at 4:08, CLR @ 4:22.

    Which was about the same gap got on "small" datasets (100K rows, 250 char length, 3-4 matches). About 4 secs for Ngrams, 19 secs on CLR.

    So - NGRAMS8k clearly does seem to perform a bit better, just not the blowout I was half expecting to see. Still it was the first time I browsed through Alan's article on it, so will kick the tires a bit more.

    Thanks for the heads up.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Thanks, Matt. I sure do appreciate the testing and the feedback.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Viewing 15 posts - 16 through 29 (of 29 total)

    You must be logged in to reply to this topic. Login to reply