Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Matching rather unstructured character data against lookup tables


Matching rather unstructured character data against lookup tables

Author
Message
Martin_A
Martin_A
Grasshopper
Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)

Group: General Forum Members
Points: 19 Visits: 441
Hello all,

I am looking for advice on possible approaches to take to solve a specific problem. The problem is in short about matching the contents of a string, that does not follow a strict structure, against known geographic information. Below I explain in more detail.

Background
I have a ~20M row SQL Server table with a nvarchar column, let's call it col1. This column contains information, separated by comma-signs, about:
- universities/medical departments/institutions
- cities
- zip-codes (if in US)
- states (if in US)
- country
- e-mail address

I want to parse the text in the column and match it's sub-parts against known information that I have prepared in tables for cities, zip-codes, states and countries, and output the result to separate columns. As you understand this would be a rather easy task if all values in col1 always contained the same amount of separators and we knew for sure which type of information was contained in each position. Unfortunately this is not the case. There is "kind" of a template that describes how a "correct" value in col1 should look like: "Department, Institution/University, City, State (if in US), zip (if in US), country, e-mail" but far from all values follow this pattern. The contents in this column comes from free-text input, ranging from several decades back in time up to today. Therefore the structure of the string in this column may look very different from row to row. To summarize: all elements are optional and any element can exist at any "position". To give you an idea here are a few examples:

1. "Department of Anesthesiology, University of Virginia Health Sciences Center Charlottesville 22908, USA. someemail@someemail.com"
2. "Tower Hamlets District Geriatric Service."
3. "Institute of Occupational Health, Helsinki, Finland."
4. "Institut für Physiologie, Physiologische Chemie und Ernährungsphysiologie im Fachbereich Tiemedizin, Universität München, Germany."
5. "Aberdeen Oomycete Laboratory, University of Aberdeen, Institute of Medical Sciences, Foresterhill, Aberdeen AB25 2ZD, UK. someemail@someemail.com"
6. "someemail@someemail.com"
7. "Chengdu, China."

As you can see the elements in the string are separated by commas except for the optional email-address at the very end which always seem to follow a punctuation mark.

I have prepared geographic lookup tables with countries, cities, states and zip-codes and their relations, ready for use to match against contents in col1.

What I am asking for
What I kindly ask for is general advice on approaches and strategies to take to solve this problem, as efficient as possible. How would you go about it? Which tools and functions would you use and why?

All comments, tips and recommendations are very much welcome!

Many thanks,
Martin
Stewart "Arturius" Campbell
Stewart "Arturius" Campbell
SSCertifiable
SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)SSCertifiable (5.8K reputation)

Group: General Forum Members
Points: 5769 Visits: 7130
In SSIS there is a component for use with Fuzzy logic.
consider using this, in conjunction with your lookup tables, setting the confidence level, to find which value should go to which column...

____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”
Martin_A
Martin_A
Grasshopper
Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)

Group: General Forum Members
Points: 19 Visits: 441
Thanks for your reply Stewart. Will let you know if I try SSIS and fuzzy logic to solve this.

Martin
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search