I'm not sure if I'm posting this in the right section but I could really do with some advice on how to progress this project. Any suggestions/ideas would be greatly appreciated!
I have been tasked with automating the following process:Existing setup:
Our data services department offer various services such as:
1) Batch duplicate processing
2) Data cleansing, address standardisation / validation.
3) Matching routines (i.e. Individual, Business, Family & Household)
The way the current process works is like this:
* Company ABC will provide us with a mailing list and a job spec (e.g. cleanse names, validate addresses and merge Household level)
* A network folder is created for the job which contains the source data files, documentation and also a SQL folder
* A new database is created for the customer (if one doesn't already exist) which will store the staging and output tables.
The SQL folder for the job will contain a dozen scripts to perform various data cleansing/matching routines, for example:
1) Data Hygiene script
2) Hygiene Quality Check script
3) merge script
4) Post merge script
5) Final Hygiene script
6) Output script
7) Hygiene Quality Check output script
There are 7 data processing staff members using 4 SQL server instances with each instance holding over 30 databases. The database for a job could be on any of these 4 SQL servers! The 4 SQL Servers are 2005 but we also have SQL Server 2008 instances used for other projects.
What I have found so far is that the scripts in each folder are all kind of similar but each one is specific to the job it was used for. So for example a merge script in one folder will contain various update statements to perform further data cleansing for that job while another have less etc...
The same applies to the other scripts. All the folders contain their own versions of these scripts and they're all similar but not the same.
What I am tasked with is automate the following process:
1) Data processing staff imports the raw file into SQL (this bit I don't have to deal with as we already have a data loader program implemented which does the job well.
2) They gather the required scripts from various folders and manually run each one (adding further updates as they go along)
3) Another staff member performs the Quality Check after each step
4) If another job comes from the same customer then the same database can be used and the same scripts get reused (again they get modified if further data cleansing is required)
I'm sure you guys get the picture by now!
So what we have is 4 SQL Servers each with 30+ databases in each one (some customers will have multiple databases whilst others will have just one database for recurring jobs)
40+ SQL folders and each folder will have its own version of the merge, data hygiene scripts etc...
I am now in my third week in this job and I can't make head no tail of this mess! I've spent hours going through scripts, then I started looking at the merge process last week and now I'm back to the beginning.
Now I am beginning to stress big time! I'm really keen to impress but I've not seen anything like this and I'm beginning to doubt myself. I've spent the last 4 years working in a Service Desk as a SQL Support Analyst so I have strong problem solving skills but I'm struggling to design a process to automate this mess.
Basically the ultimate goal is to automate the process so a source table goes through:
1) Data Cleansing.
2) Data Standardisation
3) Data Merge
Nothing is committed at this stage but a rich report should be provided to the data processing staff member so they get counts of:
1) Count of record with missing title/forename/surname etc.
2) Count of cleansed forenames/surnames/addresses etc...
3) Count of individual records/Business records or Family & Household.
4) Sample of matches and probably matches based on a match score
Once the user is happy with the result they can then commit the result which would ultimately merge it with the Single Customer View database for that customer.
Thanks for taking the time to read this and if anyone can offer help/advice I would be eternally grateful!
It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda. David Edwards - Media lens
Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.Howard Zinn