Implementation of regular expressions

  • I have a project where the schema is the report.

    IOW, it's got Addr1, Addr2, Addr3, Phone1, Phone2... you get the idea.

    This schema, named appropriately enough [Table For Form], is fed by a front end with zero integrity.

    There are tons of dupes based on nothing more than spaces between suffixes and periods, near misspellings, or tax ID transpositions: a jumble of different integrity lapses that need to be reconciled and standardized, so that Prospect Heights Hospital comes out that way and the six variations of that name are consolidated, i.e. merged into the master record for that entity.
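
    To make it concrete, the kind of query I've been using to surface the clusters looks roughly like this (table and column names simplified for the example; only the crudest normalization shown):

      -- Strip periods, collapse runs of spaces, fold case, then group
      -- to surface clusters of near-duplicate names.  The nested
      -- REPLACE trick reduces any run of spaces to a single space.
      SELECT x.NormalizedName,
             COUNT(*)                 AS RowsInCluster,
             COUNT(DISTINCT f.[Name]) AS Variations
      FROM dbo.[Table For Form] AS f
      CROSS APPLY (SELECT UPPER(REPLACE(REPLACE(REPLACE(
                       REPLACE(f.[Name], '.', ''),
                       ' ', '<>'), '><', ''), '<>', ' ')) AS NormalizedName
                  ) AS x
      GROUP BY x.NormalizedName
      HAVING COUNT(DISTINCT f.[Name]) > 1
      ORDER BY Variations DESC;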

    My first thought was that this was a regex job; my question is where to implement the expressions.

    I just read an article about making them UDFs in CLR, but that looked like excessive work to me.

    I thought about dumping the rows to text and using PowerShell to iterate over the lines, invoking regex to match, store, and look ahead (the things regex does well), all housed in PowerShell.

    Another alternative is to hack my way through an SSIS script task in C#, trying to squeeze correctness from chaos.

    Of the three, which approach best lends itself to expanding with the new cases you discover as collateral damage from a prior fix? E.g. sometimes a name redundancy can be fixed by filtering on name, but re-filtering the table by the cleaned row for a new comparison, say on tax ID, yields a new variation of the name you thought you had just fixed.
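
    E.g. after a name-based pass, something like this (again with illustrative names) keeps turning up variations the earlier pass missed:

      -- Any tax ID that still maps to more than one name is a
      -- variation the name-based cleanup did not catch.
      SELECT f.TaxID,
             COUNT(DISTINCT f.[Name]) AS NamesForThisTaxID
      FROM dbo.[Table For Form] AS f
      WHERE f.TaxID IS NOT NULL
      GROUP BY f.TaxID
      HAVING COUNT(DISTINCT f.[Name]) > 1;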

    Which is best for repetition, scheduling, and consistency?

    Thanks in advance for your consideration.

    Drew

    I've had Phil Factor's Regular Expressions CLRs on my servers for a while now.

    In several of my projects the regular expressions, since they are inline table-valued functions, are blazingly fast; I end up creating functions that call the CLRs to do my common cleanups, so the regex string doesn't need to be a parameter.
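
    For example, a wrapper can look something like this. This sketch assumes a CLR scalar function dbo.RegexReplace(@input, @pattern, @replacement) along the lines of Phil Factor's; check the names and signatures in your own deployment:

      -- Bake the patterns in so callers never pass regex strings around:
      -- collapse runs of whitespace, then drop the period from common
      -- suffix abbreviations (Inc., Corp., Ltd.).
      CREATE FUNCTION dbo.CleanCompanyName (@name NVARCHAR(200))
      RETURNS NVARCHAR(200)
      AS
      BEGIN
          RETURN dbo.RegexReplace(
                     dbo.RegexReplace(@name, N'\s+', N' '),
                     N'\b(Inc|Corp|Ltd)\.', N'$1');
      END;

    After that, a cleanup is just SELECT dbo.CleanCompanyName([Name]), with no regex in sight.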

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and an INSERT INTO... statement for that table, to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question! Asking the question the right way gets you a tested answer the fastest way possible!

    You mention the core issue: a front end with no entry validation.

    And understand that it is an ever-growing guessing game.

    Also, you have several distinct issues, and as you fix one piece, another may become questionable.

    Your example of Tax ID is a good one.

    I would lean towards first seeing whether the design of the front end could be influenced to include some key edits.

    Any edits done on your end should probably also be shared with the Business.

    They will likely notice differences, and will probably start questioning the numbers.

    I have seen cases where, although they understand that GE and General Electric may be the same, maybe they aren't.

    In our case, we had an End Customer record, and were able to have it changed so users could only pick existing Address Book entries.

    Only a few people could enter these, so there was more control.

    There was also a Parent Child Hierarchy, so this had an even larger benefit.

    If you do implement cleansing on the back end, is there anything that stands out as a key?

  • First, give the front end a combo box dropdown, or a search, or some way to pick the right existing record.

    On the SQL side, you've got a classic case of name and address matching, with the benefit of a few "unique" fields, like Tax ID, that "should not" be duplicated or mistyped (hah hah hah).

    You will never get good results without a human involved, as there is no sweet spot: if you don't want false positives, you're going to get a lot of false negatives, and if you don't want false negatives, you're going to get some number of false positives.

    I usually set this up as a series of matching rules: I loop through them in more or less descending order of specificity, and remove items from the "match to" set (i.e. whatever your front end gives you) with each rule, so any one item only gets matched by the most specific rule that finds at least one match. If you have only one item, I suppose you could apply more than one rule at a time.

    Then I report the "most specific" rule that found each match, and if it's not specific enough, a human familiar with the data should double-check.
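
    A minimal sketch of that cascade, with made-up tables (dbo.Incoming, dbo.Master) and rules just to show the shape:

      -- Rule 1 (most specific): exact tax ID match.  This sketch assumes
      -- TaxID is unique in Master; each pass touches only rows no earlier
      -- rule has claimed, and records which rule matched.
      UPDATE i
      SET    i.MasterID = m.MasterID, i.MatchRule = 'TaxID exact'
      FROM   dbo.Incoming AS i
      JOIN   dbo.Master   AS m ON m.TaxID = i.TaxID
      WHERE  i.MasterID IS NULL AND i.TaxID IS NOT NULL;

      -- Rule 2 (looser): normalized name plus ZIP.
      UPDATE i
      SET    i.MasterID = m.MasterID, i.MatchRule = 'Name+Zip'
      FROM   dbo.Incoming AS i
      JOIN   dbo.Master   AS m ON  m.NormalizedName = i.NormalizedName
                               AND m.Zip = i.Zip
      WHERE  i.MasterID IS NULL;

      -- Unmatched rows, and anything claimed only by a loose rule,
      -- go to the human review queue.
      SELECT *
      FROM   dbo.Incoming
      WHERE  MasterID IS NULL OR MatchRule <> 'TaxID exact';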

  • Thanks so much Lowell for the link!

    It was a great reference and a huge leg up, a great tactic, but I'm still mulling strategy.

    Greg and Nadrek:

    I don't have the juice to command UI changes; the app is sunset and being replaced.

    I have to enable ongoing processing while matching and cleaning continue, with as little residual as possible left for human inspection.

    There is a potential logical primary key that has not been tested physically yet; meaning, it is alleged that license type, license state, and license number are unique, but I cannot swear there is at most one NULL in any of those attributes, so it may or may not survive the trip from logical to physical primary key. There is also NPI, but it can be dependent on levels of hierarchy, so I'm not sure that can work yet either; that is, a key based on hierarchy... have to try that <g>.
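
    The uniqueness test I have in mind is something like this (illustrative column names):

      -- Does (LicenseType, LicenseState, LicenseNumber) hold up as a key?
      -- Anything returned is either a NULL gap or an outright duplicate.
      SELECT LicenseType, LicenseState, LicenseNumber, COUNT(*) AS Dupes
      FROM   dbo.[Table For Form]
      GROUP BY LicenseType, LicenseState, LicenseNumber
      HAVING COUNT(*) > 1
         OR  LicenseType   IS NULL
         OR  LicenseState  IS NULL
         OR  LicenseNumber IS NULL;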

    Thanks everyone for your thoughts.

    Drew
