I am looking for advice on possible approaches to take to solve a specific problem. The problem is in short about matching the contents of a string, that does not follow a strict structure, against known geographic information. Below I explain in more detail.
I have a ~20M row SQL Server table with a nvarchar column, let's call it col1. This column contains information, separated by comma-signs, about:
- universities/medical departments/institutions
- zip-codes (if in US)
- states (if in US)
- e-mail address
I want to parse the text in the column and match it's sub-parts against known information that I have prepared in tables for cities, zip-codes, states and countries, and output the result to separate columns. As you understand this would be a rather easy task if all values in col1 always contained the same amount of separators and we knew for sure which type of information was contained in each position. Unfortunately this is not the case. There is "kind" of a template that describes how a "correct" value in col1 should look like: "Department, Institution/University, City, State (if in US), zip (if in US), country, e-mail" but far from all values follow this pattern. The contents in this column comes from free-text input, ranging from several decades back in time up to today. Therefore the structure of the string in this column may look very different from row to row. To summarize: all elements are optional and any element can exist at any "position". To give you an idea here are a few examples:
1. "Department of Anesthesiology, University of Virginia Health Sciences Center Charlottesville 22908, USA. email@example.com"
2. "Tower Hamlets District Geriatric Service."
3. "Institute of Occupational Health, Helsinki, Finland."
4. "Institut für Physiologie, Physiologische Chemie und Ernährungsphysiologie im Fachbereich Tiemedizin, Universität München, Germany."
5. "Aberdeen Oomycete Laboratory, University of Aberdeen, Institute of Medical Sciences, Foresterhill, Aberdeen AB25 2ZD, UK. firstname.lastname@example.org"
7. "Chengdu, China."
As you can see the elements in the string are separated by commas except for the optional email-address at the very end which always seem to follow a punctuation mark.
I have prepared geographic lookup tables with countries, cities, states and zip-codes and their relations, ready for use to match against contents in col1.
What I am asking for
What I kindly ask for is general advice on approaches and strategies to take to solve this problem, as efficient as possible. How would you go about it? Which tools and functions would you use and why?
All comments, tips and recommendations are very much welcome!