Streaming itvf in CLR to replace special characters

  • Does anyone have a streaming itvf CLR function that can replace special characters in an nvarchar(max) (or smaller) column?

    Requirements are basically: NCHAR(0) to ''; NCHAR(1) - NCHAR(31), except NCHAR(9), NCHAR(10), and NCHAR(13), to ' '; and specific Unicode characters to an encoded value. We are still working on identifying those specific characters that are causing issues in our data.

    I may have a solution that will still allow us to use T-SQL, but I am thinking it may not be very efficient if we have any data reaching the 2 GB limit of the NVARCHAR(MAX) data type.

    I am lucky to be able to spell C# at the moment.
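    While those characters are being identified, here is a hedged sketch of a detection query (the dbo.SomeTable / SomeKey / SomeText names are hypothetical): a binary collation makes the LIKE ranges compare by code point rather than by linguistic sort order.

    -- Flag rows containing any targeted character: the low control characters
    -- (minus tab/LF/CR) or the 57344-63743 range mentioned later in the thread.
    -- NCHAR(0) is checked separately because it can behave oddly inside LIKE
    -- patterns under non-binary collations.
    SELECT t.SomeKey, t.SomeText
    FROM   dbo.SomeTable AS t
    WHERE  t.SomeText COLLATE Latin1_General_BIN2 LIKE
               N'%[' + NCHAR(1) + N'-' + NCHAR(8)
                     + NCHAR(11) + NCHAR(12)
                     + NCHAR(14) + N'-' + NCHAR(31)
                     + NCHAR(57344) + N'-' + NCHAR(63743) + N']%'
       OR  CHARINDEX(NCHAR(0), t.SomeText COLLATE Latin1_General_BIN2) > 0;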

  • I have a T-SQL solution.

    Nested REPLACEs in an in-line table valued function.

    May still revisit this as an exercise later when I have time to try and learn C#.

  • Still, it turns out there may be a limit to the number of REPLACEs you can nest. Originally I was looking at 53 different characters; now that is up to 6453: the original 53 already identified, plus the values between 57344 and 63743 (6400 more) that need to be converted to spaces.
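    A single-pass alternative (a minimal sketch, assuming Jeff's dbo.fnTally function from the signature links below, and encoding the upper range as '&#NNNNN;' per the requirement that emerges later in the thread): split the string into characters with a tally function, classify each one by its UNICODE() value, and reassemble with FOR XML PATH, so the number of targeted code points no longer dictates nesting depth. Performance against NVARCHAR(MAX) would still need testing.

    DECLARE @s nvarchar(max) = N'abc' + NCHAR(1) + NCHAR(57344) + N'def';

    SELECT @s =
    (
        SELECT CASE
                   WHEN cp.cp = 0 THEN N''                               -- nchar(0): remove
                   WHEN cp.cp BETWEEN 1 AND 31
                        AND cp.cp NOT IN (9, 10, 13) THEN N' '           -- other low control characters: space
                   WHEN cp.cp BETWEEN 57344 AND 63743                    -- the 6400-value upper range
                        THEN N'&#' + CONVERT(nvarchar(5), cp.cp) + N';'
                   ELSE SUBSTRING(@s, t.N, 1)
               END
        FROM dbo.fnTally(1, LEN(@s + N'x') - 1) AS t                     -- +'x' guards LEN against trailing spaces
        CROSS APPLY (SELECT UNICODE(SUBSTRING(@s, t.N, 1))) AS cp(cp)
        ORDER BY t.N
        FOR XML PATH(''), TYPE
    ).value('.', 'nvarchar(max)');

    SELECT @s;  -- N'abc &#57344;def'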

  • I didn't remember this when you first posted your problem, but have you tried using SQL#? It has Replace functions that might help you.

    http://www.sqlsharp.com/

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (7/18/2016)


    I didn't remember this when you first posted your problem, but have you tried using SQL#? It has Replace functions that might help you.

    http://www.sqlsharp.com/

    Although claims of performance have been made, I've never seen either a performance or accuracy test. I know you already know this but it'll make me feel better to say it out loud... test the hell out of it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lynn Pettis (7/18/2016)


    Still, it turns out there may be a limit to the number of REPLACEs you can nest. Originally I was looking at 53 different characters; now that is up to 6453: the original 53 already identified, plus the values between 57344 and 63743 (6400 more) that need to be converted to spaces.

    Even well written C# code is going to spend some time on that.

    --Jeff Moden



  • Jeff Moden (7/18/2016)


    Lynn Pettis (7/18/2016)


    Still, it turns out there may be a limit to the number of REPLACEs you can nest. Originally I was looking at 53 different characters; now that is up to 6453: the original 53 already identified, plus the values between 57344 and 63743 (6400 more) that need to be converted to spaces.

    Even well written C# code is going to spend some time on that.

    The hard part is that we need to fix the data at rest in the system. This "cleaning" of data should occur during data entry, not after it gets into the database.

  • Lynn Pettis (7/18/2016)


    Still, it turns out there may be a limit to the number of REPLACEs you can nest. Originally I was looking at 53 different characters; now that is up to 6453: the original 53 already identified, plus the values between 57344 and 63743 (6400 more) that need to be converted to spaces.

    Hi Lynn. I just now saw this thread. Are you still needing help on this? If so, and if you have the exact Code Points for those initial 53, then I will do what I can to find the most efficient solution.

    Also, why are you converting those Code Points to a space? Just curious.

    Just FYI: SQLCLR does not have the concept of Inline TVFs. There are Streaming TVFs, but I don't believe, at the moment, that a TVF will help in this case. And in fact, SQLCLR Scalar UDFs that are marked as Deterministic can participate in parallel plans, which can help. That does not mean that they are always faster than T-SQL based solutions, but sometimes they are. It depends on several factors.

    Take care,

    Solomon..

    SQL# https://SQLsharp.com/ ( SQLCLR library of over 340 Functions and Procedures )
    Sql Quantum Lift https://SqlQuantumLift.com/ ( company )
    Sql Quantum Leap https://SqlQuantumLeap.com/ ( blog )
    Info sites: Collations     •     Module Signing     •     SQLCLR
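    For reference, a minimal sketch of the T-SQL half of deploying such a deterministic SQLCLR scalar UDF (the assembly path and the StringCleaner / UserDefinedFunctions / CleanString names are hypothetical; the IsDeterministic flag itself is declared in the C# source via the SqlFunction attribute, not in this script):

    CREATE ASSEMBLY StringCleaner
    FROM 'C:\CLR\StringCleaner.dll'   -- hypothetical path to the compiled C# assembly
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE FUNCTION dbo.CleanStringCLR (@pTargetString nvarchar(max))
    RETURNS nvarchar(max)
    AS EXTERNAL NAME StringCleaner.[UserDefinedFunctions].CleanString;
    GO

    -- Usage: SELECT dbo.CleanStringCLR(SomeColumn) FROM dbo.SomeTable;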

  • Solomon Rutzky (8/9/2016)


    Lynn Pettis (7/18/2016)


    Still, it turns out there may be a limit to the number of REPLACEs you can nest. Originally I was looking at 53 different characters; now that is up to 6453: the original 53 already identified, plus the values between 57344 and 63743 (6400 more) that need to be converted to spaces.

    Hi Lynn. I just now saw this thread. Are you still needing help on this? If so, and if you have the exact Code Points for those initial 53, then I will do what I can to find the most efficient solution.

    Also, why are you converting those Code Points to a space? Just curious.

    Just FYI: SQLCLR does not have the concept of Inline TVFs. There are Streaming TVFs, but I don't believe, at the moment, that a TVF will help in this case. And in fact, SQLCLR Scalar UDFs that are marked as Deterministic can participate in parallel plans, which can help. That does not mean that they are always faster than T-SQL based solutions, but sometimes they are. It depends on several factors.

    Take care,

    Solomon..

    Actually, we have identified 6453 plus a few more (the actual count I don't remember at the moment). Basically, converting to a space for the lower control codes is what I was instructed to do, with the exception of the null (nchar(0)), which needs to be eliminated (replaced with an empty string). I can give you the specs, but it may take a bit as I am working on another high-priority project at the moment.

  • Lynn Pettis (8/9/2016)


    Actually, we have identified 6453 plus a few more (the actual count I don't remember at the moment). Basically, converting to a space for the lower control codes is what I was instructed to do, with the exception of the null (nchar(0)), which needs to be eliminated (replaced with an empty string). I can give you the specs, but it may take a bit as I am working on another high-priority project at the moment.

    Yes, you had mentioned that there were 6453 total, but you had only specified the values for the range covering 6400 of those, and the Code Points for approximately 30 of the initial 53. Either way, yes, if you could provide the specific values / ranges, that would help. Or maybe I can come up with something from what you have specified so far here and you can add to it.

    I kinda get the space as a replacement for the lower control codes. I guess I was wondering why that upper range of 6400 was being converted to spaces. Though I understand that you might not have been told the reasoning. I am mainly curious as to why this is being done because it is possible that there might be a better / more appropriate solution to whatever problem we are attempting to solve here. Meaning, out of the 65,536 Code Points, why are those 6453 (plus some) bad but the others are ok?

    Just let me know whenever you have time.

    Take care,

    Solomon..


  • Solomon Rutzky (8/9/2016)


    Lynn Pettis (8/9/2016)


    Actually, we have identified 6453 plus a few more (the actual count I don't remember at the moment). Basically, converting to a space for the lower control codes is what I was instructed to do, with the exception of the null (nchar(0)), which needs to be eliminated (replaced with an empty string). I can give you the specs, but it may take a bit as I am working on another high-priority project at the moment.

    Yes, you had mentioned that there were 6453 total, but you had only specified the values for the range covering 6400 of those, and the Code Points for approximately 30 of the initial 53. Either way, yes, if you could provide the specific values / ranges, that would help. Or maybe I can come up with something from what you have specified so far here and you can add to it.

    I kinda get the space as a replacement for the lower control codes. I guess I was wondering why that upper range of 6400 was being converted to spaces. Though I understand that you might not have been told the reasoning. I am mainly curious as to why this is being done because it is possible that there might be a better / more appropriate solution to whatever problem we are attempting to solve here. Meaning, out of the 65,536 Code Points, why are those 6453 (plus some) bad but the others are ok?

    Just let me know whenever you have time.

    Take care,

    Solomon..

    Actually the upper ranges are being converted to '&#NNNNN;' where NNNNN is the numeric value of the Unicode character.

    Here is the itvf I created for the first 53 values identified:

    USE CIDNE2
    GO

    if object_id('dbo.fnCleanString') is not null
        drop function dbo.fnCleanString;
    go

    /****** Object: UserDefinedFunction [dbo].[fnCleanString] Script Date: 07/01/2016 17:01:52 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO

    CREATE function [dbo].[fnCleanString]
    /**********************************************************************************************************************************
     Function: dbo.fnCleanString
     Site:     ISS, Inc - Colorado Springs, CO
     Author:   Lynn A Pettis
     Date:     2016-06-29

     The purpose of the function is to replace the lower control characters nchar(0) through nchar(31), not including nchar(9),
     nchar(10), or nchar(13), with an empty string ('') for nchar(0) or a space (' ') for all other control characters.
     The purpose of the function is also to replace the upper Unicode characters identified as a potential issue with WebServices and
     other applications with an encoded value. For example, nchar(8220) in a string will be replaced with '&#8220;'.

     Revision History
     Date       Developer                   Jira        Version Notes
     ---------- --------------------------- ----------- ------- ----------------------------------------------------------------------
     2016-06-29 Lynn A Pettis               ROAR-406    1.0     Initial version
    **********************************************************************************************************************************/
    (
        @pTargetString nvarchar(max)
    )
    returns table with schemabinding
    as return (
        select
            -- nchar(160) (non-breaking space) -> regular space
            replace(
            -- nchar(127) and the C1 controls 129, 141, 143, 144, 157 -> empty string
            replace(replace(replace(replace(replace(replace(
            -- selected upper Unicode characters -> encoded '&#NNNN;' values
            replace(replace(replace(replace(replace(replace(replace(replace(replace(
            replace(replace(replace(replace(replace(replace(replace(replace(
            -- nchar(1)-nchar(8), nchar(11), nchar(12), nchar(14)-nchar(31) -> space
            replace(replace(replace(replace(replace(replace(replace(replace(replace(
            replace(replace(replace(replace(replace(replace(replace(replace(replace(
            replace(replace(replace(replace(replace(replace(replace(replace(replace(
            replace(
            -- nchar(0) -> empty string
            replace(@pTargetString collate latin1_general_bin2, nchar(0), ''),
            nchar(1), ' '), nchar(2), ' '), nchar(3), ' '), nchar(4), ' '),
            nchar(5), ' '), nchar(6), ' '), nchar(7), ' '), nchar(8), ' '),
            nchar(11), ' '), nchar(12), ' '), nchar(14), ' '), nchar(15), ' '),
            nchar(16), ' '), nchar(17), ' '), nchar(18), ' '), nchar(19), ' '),
            nchar(20), ' '), nchar(21), ' '), nchar(22), ' '), nchar(23), ' '),
            nchar(24), ' '), nchar(25), ' '), nchar(26), ' '), nchar(27), ' '),
            nchar(28), ' '), nchar(29), ' '), nchar(30), ' '), nchar(31), ' '),
            nchar(188), '&#188;'), nchar(189), '&#189;'), nchar(190), '&#190;'),
            nchar(8211), '&#8211;'), nchar(8212), '&#8212;'), nchar(8216), '&#8216;'),
            nchar(8217), '&#8217;'), nchar(8218), '&#8218;'), nchar(8220), '&#8220;'),
            nchar(8221), '&#8221;'), nchar(8222), '&#8222;'), nchar(8223), '&#8223;'),
            nchar(8224), '&#8224;'), nchar(8225), '&#8225;'), nchar(8226), '&#8226;'),
            nchar(8240), '&#8240;'), nchar(8482), '&#8482;'),
            nchar(127), ''), nchar(129), ''), nchar(141), ''), nchar(143), ''),
            nchar(144), ''), nchar(157), ''),
            nchar(160), ' ')
            as TargetString
    );
    GO

    Originally I was told to convert the higher 6400 values to a space, but then the individual who set that requirement changed it to '&#NNNNN;'.

  • Lynn Pettis (8/9/2016)


    Actually the upper ranges are being converted to '&#NNNNN;' where NNNNN is the numeric value of the Unicode character.

    Here is the itvf I created for the first 53 values identified:

    The purpose of the function is also to replace the upper Unicode characters identified as a potential issue with WebServices and other applications with an encoded value. For example, nchar(8220) in a string will be replaced with '&#8220;'.

    Hi Lynn. Thanks for providing that code as it gave the context that I was looking for. I am curious where the "bad" characters are coming from. The comment in the function, which I left in the quoted area above, mentions that this fix is to alleviate potential issues with Web Services and "other applications". Are those "other applications" web-based? If not, then will they be able to handle HTML-encoded entities (i.e. '&#8220;')? Now that I have more info, this really feels like something that is most likely the wrong fix. I am guessing that the real issue is just a matter of web page/form encoding, and that encodings are being mixed. The reason I suspect this is that most of the code points that you are targeting are the ones that differ between Code Page 1252 (Latin1_General) and Unicode (the lower range is the same as ISO-8859-1), as well as the Unicode Code Points that those "differing" characters map to. For example:

    NCHAR(8240) == NCHAR(0x2030) == '‰' == CHAR(137) == CHAR(0x89)

    The full range is 0x80 (or decimal 128) through 0x9F (or decimal 159). You can see the Unicode version of them here (top 2 rows): http://unicode-table.com/en/blocks/latin-1-supplement/

    and the Code Page 1252 version of them here: https://en.wikipedia.org/wiki/Windows-1252#Code_page_layout (where you will see this range highlighted with a green border, and character 0x89 (137 in decimal) equates to 0x2030 in Unicode (8240 in decimal)).

    And that upper block of 6400 characters (57344 through 63743 in decimal equates to E000 - F8FF in hex)? That is the Unicode Private Use Area block, as shown here: http://unicode-table.com/en/blocks/private-use-area/

    Hence, I am willing to bet that one or more web forms are missing the "charset" property (or maybe "content-encoding" ?), or have it set incorrectly, or maybe somewhere in the app code it is getting mangled. But these characters by themselves shouldn't be causing any issues.

    However, if this is how it shall be and nobody minds converting data into HTML-encoded entities in the database, as opposed to on the way out, or fixing it on the way in (as you also mentioned), then I think I can help you do this efficiently.

    Take care,

    Solomon...

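    The mapping described above can be confirmed from T-SQL (a minimal sketch, assuming a database whose default collation uses code page 1252, e.g. any Latin1_General collation):

    -- Byte 0x89 in code page 1252 is the per mille sign, which maps to
    -- Unicode code point 0x2030 (8240 decimal) when widened to nvarchar.
    SELECT UNICODE(CONVERT(nvarchar(1), CONVERT(varchar(1), 0x89))) AS CodePoint,  -- 8240
           NCHAR(8240) AS Symbol;                                                  -- '‰'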

  • Solomon Rutzky (8/12/2016)


    Lynn Pettis (8/9/2016)


    Actually the upper ranges are being converted to '&#NNNNN;' where NNNNN is the numeric value of the Unicode character.

    Here is the itvf I created for the first 53 values identified:

    The purpose of the function is also to replace the upper Unicode characters identified as a potential issue with WebServices and other applications with an encoded value. For example, nchar(8220) in a string will be replaced with '&#8220;'.

    Hi Lynn. Thanks for providing that code as it gave the context that I was looking for. I am curious where the "bad" characters are coming from. The comment in the function, which I left in the quoted area above, mentions that this fix is to alleviate potential issues with Web Services and "other applications". Are those "other applications" web-based? If not, then will they be able to handle HTML-encoded entities (i.e. '&#8220;')? Now that I have more info, this really feels like something that is most likely the wrong fix. I am guessing that the real issue is just a matter of web page/form encoding, and that encodings are being mixed. The reason I suspect this is that most of the code points that you are targeting are the ones that differ between Code Page 1252 (Latin1_General) and Unicode (the lower range is the same as ISO-8859-1), as well as the Unicode Code Points that those "differing" characters map to. For example:

    NCHAR(8240) == NCHAR(0x2030) == '‰' == CHAR(137) == CHAR(0x89)

    The full range is 0x80 (or decimal 128) through 0x9F (or decimal 159). You can see the Unicode version of them here (top 2 rows): http://unicode-table.com/en/blocks/latin-1-supplement/

    and the Code Page 1252 version of them here: https://en.wikipedia.org/wiki/Windows-1252#Code_page_layout (where you will see this range highlighted with a green border, and character 0x89 (137 in decimal) equates to 0x2030 in Unicode (8240 in decimal)).

    And that upper block of 6400 characters (57344 through 63743 in decimal equates to E000 - F8FF in hex)? That is the Unicode Private Use Area block, as shown here: http://unicode-table.com/en/blocks/private-use-area/

    Hence, I am willing to bet that one or more web forms are missing the "charset" property (or maybe "content-encoding" ?), or have it set incorrectly, or maybe somewhere in the app code it is getting mangled. But these characters by themselves shouldn't be causing any issues.

    However, if this is how it shall be and nobody minds converting data into HTML-encoded entities in the database, as opposed to on the way out, or fixing it on the way in (as you also mentioned), then I think I can help you do this efficiently.

    Take care,

    Solomon...

    We aren't exactly sure. Some of them are from copy/paste from other sources into our application; for some we have no idea at this point, and we are continuing to research the issue.

    Some of the codes cause a failure of the FOR XML AUTO used to create the XML for transferring the data via our web services; some of the codes prevent the application itself from displaying the data. You may be right about the "charset" property or "content-encoding", but those things are out of my control. Cold Fusion is simply Greek to me. I can't get the developers to use me for checking SQL code; I doubt they'd listen to me about CF areas.
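    A minimal repro of that FOR XML failure mode (a sketch, not from the original posts): code points below 0x20, other than tab, line feed, and carriage return, are illegal in XML 1.0, so FOR XML AUTO refuses to serialize them.

    -- Fails: FOR XML could not serialize the data because it contains a
    -- character (0x0001) which is not allowed in XML.
    SELECT bad FROM (SELECT NCHAR(1)) AS t(bad) FOR XML AUTO;

    -- Succeeds once the offending character has been cleaned:
    SELECT bad
    FROM (SELECT REPLACE(NCHAR(1) COLLATE Latin1_General_BIN2, NCHAR(1), N' ')) AS t(bad)
    FOR XML AUTO;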

  • Lynn Pettis (8/12/2016)


    We aren't exactly sure. Some of them are from copy/paste from other sources into our application; for some we have no idea at this point, and we are continuing to research the issue.

    Some of the codes cause a failure of the FOR XML AUTO used to create the XML for transferring the data via our web services; some of the codes prevent the application itself from displaying the data. You may be right about the "charset" property or "content-encoding", but those things are out of my control. Cold Fusion is simply Greek to me. I can't get the developers to use me for checking SQL code; I doubt they'd listen to me about CF areas.

    Cold Fusion, eh? Well, that could be part of the problem right there ;-). Interestingly enough, I found this post which seems to be attempting the same basic "cleaning" that you are doing, but in Cold Fusion:

    Cleaning High Ascii Values For Web Safeness In ColdFusion

    And then I found this post, which seems to be more in line with what I am getting at regarding the encoding setting, so that the characters come in correctly to begin with, or can be understood properly on the way out:

    UTF-8 and Other Character Sets with ColdFusion

    That last one might not fix the FOR XML AUTO issue as that depends on where the control characters (those are the ones most likely causing problems with FOR XML AUTO) are coming in from, but it might. If nothing else, setting the encoding properly should at least greatly reduce the scope of what you need to clean up.

    Take care,

    Solomon..


  • Solomon Rutzky (8/12/2016)


    Lynn Pettis (8/12/2016)


    We aren't exactly sure. Some of them are from copy/paste from other sources into our application; for some we have no idea at this point, and we are continuing to research the issue.

    Some of the codes cause a failure of the FOR XML AUTO used to create the XML for transferring the data via our web services; some of the codes prevent the application itself from displaying the data. You may be right about the "charset" property or "content-encoding", but those things are out of my control. Cold Fusion is simply Greek to me. I can't get the developers to use me for checking SQL code; I doubt they'd listen to me about CF areas.

    Cold Fusion, eh? Well, that could be part of the problem right there ;-). Interestingly enough, I found this post which seems to be attempting the same basic "cleaning" that you are doing, but in Cold Fusion:

    Cleaning High Ascii Values For Web Safeness In ColdFusion

    And then I found this post, which seems to be more in line with what I am getting at regarding the encoding setting, so that the characters come in correctly to begin with, or can be understood properly on the way out:

    UTF-8 and Other Character Sets with ColdFusion

    That last one might not fix the FOR XML AUTO issue as that depends on where the control characters (those are the ones most likely causing problems with FOR XML AUTO) are coming in from, but it might. If nothing else, setting the encoding properly should at least greatly reduce the scope of what you need to clean up.

    Take care,

    Solomon..

    Your first link looks familiar, but I don't have time to go scanning back through all the emails regarding this issue to verify. The second one, however, does not look familiar. I will pass both of them along since this is still an ongoing issue.
