• Thanks for your input,

    Suggestions how to proceed are very welcome.

    My goal at the moment is finding similar strings.

    For a long reply see under the signature,

    Ben Brugman

    Sean Lange (1/17/2013)


    I would like to have an interface like:

    special_replace(@string, 'abcdefgh','12345678')

    where all the 'a' get replaced by a 1,

    all be 'b' get replaced by a 2, etc.

    or

    special_replace2(@string, 'aaa|bb|c|d|e|f|g|h','11|2|3|4444|5|6|7|8')

    where the 'aaa' get's replaced by a 11

    This was a fantasie example. (Correctly spotted)

    or

    special_replace3(@string, '1111|111|11|1|2222|222|22|2','41|31|21|11|42|32|22|12')

    where the '1111' get's replaced by a 41

    where the '111' get's replaced by a 31

    This is a part of an actual script I made.

    What a bizarre requirement!!! I can't come up with a real world scenario where any of these make sense. The regex suggestion won't really help here because you need to modify the values. All the regex is going to do is tell you if it matches some pattern, which you don't have.

    The example code was actual code I used.

    First to study what characters were used.

    (Alphabetic, Numeric, +-.,; etc)

    This was for me the fastest way to find out which characters where used in a column. First getting rid of all 'normal' characters then eliminating the others.

    Afterwards only the others were removed so that the columns could be compared more easy.

    I do however think that CLR would be the way to go here. You are going to have to loop through these strings character by character and t-sql just isn't the best way to do that.

    Processing speed was no issue.

    The speed I can (or could) write the code and rerun and change the code made the script as it was.

    And I do not master CLR code.

    I understand the requirements for the first and third versions but the second one I don't get why 'aaa' becomes 11 and why does 'd' become 4444????

    Correctly spotted this was an artificial example.

    I didn't want to use a predictive structure to avoid not fitting solutions. (sorry).

    For the first one you could do a pretty straight forward ascii calculation. Just force the string to upper first and then for each character take the ascii value and subtract 16.

    Again you spotted the correct function. Indeed this is the count of a sequence of same characters. This actually is the anwser to a puzzle, I think I will throw this puzzle in this forum. The above was part of the solution. (So don't tell anyone yet:-))

    (My solution for the puzzle was rather cumbersom with loads of replaces, I am wondering if other come up with other solutions to learn from).

    This would probably require a function and then a loop within the function for each position. Now I did a copy/past 26 times for the alphabeth, which for me works faster than writing a function with a loop.

    But sometime ago I saw that somebody had allready solved this problem. (With functions ?) And I was hoping to find that solution again and re-using that solution.

    The number of replaces I use in largest script is over 50 which is nog very elegant.

    In stringworkshop there was a far more elegant solution but that was for a single string parameter, not for a column in a select. I could adapt that, but that would take more time.

    The third one would be a modification to that logic, just need to track which character you are working with and a counter. When the character changes you output the counter and the same ascii calculation.

    Now it is within a single select statement, doing this for a column would make the code more complex again.

    So as you can read I am building these examples on the fly running them and then modifying them to my next requirements. Changes are almost continues. Running speed is not a problem, because the building takes some time.

    I use the technique with nested replaces fairly often, because I can do it inline and build the code fast. Building this code inline although not elegant makes the code easy portable. And if speed is no issue, then it works fast enough.

    A problem with this code is overview (to much replaces) and it is difficult to keep count of the correct number of replaces and the matching number of parameters for the replace. (A miscount is easely made).

    The actual use this time was.

    Excel with over 25 000 rows.

    Containing strings and the translations of the strings.

    But some strings (or equivalent strings) appear more than once, sometimes with small differences like a comma, or another special character.

    So I wanted to eliminate the special characters.

    (But wanted first to know which special characters where used, I found two different spaces for example, still have to study that).

    Also in the next step I am thinking of eliminating repeating characters, to exclude some writing mistakes and perhaps something like replacing 'qu' with 'q'

    and replacing all 'c' with an 'k' ('disc' and 'disk' for example is the same word for me).

    And maybe substituting all numeric strings for the same numeric string. For example replacing any nummeric string with 123.

    (I am thinking of breaking up the strings without special characters in triads (sets of three characters) and test how many of them match up, and give this a score if allmost all triads are the same, the difference might be a small spelling error or a typing mistake).

    Example.

    I am a fish.

    Would become.

    iamafish

    And then:

    iam

    ama

    maf

    afi

    fis

    ish

    A sentence like

    I am the fish

    or

    I am an fish

    Would give a large number of the same triads in the same order, so a large likelyhood of the 'same' sentence.

    Thanks for your input,

    Suggestions how to proceed are very welcome.

    My goal at the moment is finding similar strings.

    Ben Brugman