Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Matching rather unstructured character data against lookup tables Expand / Collapse
Author
Message
Posted Tuesday, December 4, 2012 4:12 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, June 16, 2014 8:43 AM
Points: 17, Visits: 394
Hello all,

I am looking for advice on possible approaches to take to solve a specific problem. The problem is in short about matching the contents of a string, that does not follow a strict structure, against known geographic information. Below I explain in more detail.

Background
I have a ~20M row SQL Server table with a nvarchar column, let's call it col1. This column contains information, separated by comma-signs, about:
- universities/medical departments/institutions
- cities
- zip-codes (if in US)
- states (if in US)
- country
- e-mail address

I want to parse the text in the column and match it's sub-parts against known information that I have prepared in tables for cities, zip-codes, states and countries, and output the result to separate columns. As you understand this would be a rather easy task if all values in col1 always contained the same amount of separators and we knew for sure which type of information was contained in each position. Unfortunately this is not the case. There is "kind" of a template that describes how a "correct" value in col1 should look like: "Department, Institution/University, City, State (if in US), zip (if in US), country, e-mail" but far from all values follow this pattern. The contents in this column comes from free-text input, ranging from several decades back in time up to today. Therefore the structure of the string in this column may look very different from row to row. To summarize: all elements are optional and any element can exist at any "position". To give you an idea here are a few examples:

1. "Department of Anesthesiology, University of Virginia Health Sciences Center Charlottesville 22908, USA. someemail@someemail.com"
2. "Tower Hamlets District Geriatric Service."
3. "Institute of Occupational Health, Helsinki, Finland."
4. "Institut für Physiologie, Physiologische Chemie und Ernährungsphysiologie im Fachbereich Tiemedizin, Universität München, Germany."
5. "Aberdeen Oomycete Laboratory, University of Aberdeen, Institute of Medical Sciences, Foresterhill, Aberdeen AB25 2ZD, UK. someemail@someemail.com"
6. "someemail@someemail.com"
7. "Chengdu, China."

As you can see the elements in the string are separated by commas except for the optional email-address at the very end which always seem to follow a punctuation mark.

I have prepared geographic lookup tables with countries, cities, states and zip-codes and their relations, ready for use to match against contents in col1.

What I am asking for
What I kindly ask for is general advice on approaches and strategies to take to solve this problem, as efficient as possible. How would you go about it? Which tools and functions would you use and why?

All comments, tips and recommendations are very much welcome!

Many thanks,
Martin
Post #1392349
Posted Tuesday, December 4, 2012 5:19 AM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 8:37 AM
Points: 3,905, Visits: 5,079
In SSIS there is a component for use with Fuzzy logic.
consider using this, in conjunction with your lookup tables, setting the confidence level, to find which value should go to which column...


____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”
Post #1392389
Posted Tuesday, February 12, 2013 3:30 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, June 16, 2014 8:43 AM
Points: 17, Visits: 394
Thanks for your reply Stewart. Will let you know if I try SSIS and fuzzy logic to solve this.

Martin
Post #1419222
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse