Need to minimize processing time

  • I have a mailing list with 1.5 million entries, representing 10 counties in California. The fields are the usual stuff: first name, last name, address, city, state, zip, date solicited. This list was generated about a year ago, and I was informed that they now want me to use the same criteria and the same counties and dump the data again. (Some folks will have moved, but many will remain at the same address.) They want to re-solicit the people in the new dump. Furthermore, they want to track the number of times a given person was solicited across both dumps, as well as the dates of solicitation. The trick here is recognizing that a person in the second dump also appeared in the first dump. The hard part is that this looks like one massive join to me. (Each dump has about 1.5 million records.) Any thoughts on how to minimize the required processing time?
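
    For concreteness, the brute-force match I'm dreading looks something like this (table and column names invented for illustration; Dump1 is last year's list, Dump2 is the new one):

        -- Match the new dump against the old one field by field.
        -- Both tables hold about 1.5 million rows.
        SELECT d2.FirstName, d2.LastName, d2.Address, d2.Zip,
               d1.DateSolicited AS FirstSolicited,
               d2.DateSolicited AS SecondSolicited
        FROM   Dump2 d2
               JOIN Dump1 d1
                 ON  d1.FirstName = d2.FirstName
                 AND d1.LastName  = d2.LastName
                 AND d1.Address   = d2.Address
                 AND d1.Zip       = d2.Zip;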

    TIA,

    Bill

  • I've had to do something like this during gas-market deregulation in this country. We couldn't find a complete answer; the best we managed was to identify a unique 'hit' by a combined key of Name, House-Number and Post-Code (zip to you). This works well in the UK, as the PostCode combined with a House-Number is unique.

    So we had a table of records containing this key, a solicitation-dump identifier (varchar), and a contact date. It was then relatively simple to identify all the people included in particular solicitation runs, and the combined key gave us a single join onto the master address table, roughly as sketched below.
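
    Roughly, the structure looked like this (names invented for illustration, not our actual schema):

        -- Tracking table: combined key, which run, and when contacted.
        CREATE TABLE SolicitationContact (
            MatchKey    varchar(100) NOT NULL,  -- Name + House-Number + Post-Code
            DumpId      varchar(20)  NOT NULL,  -- solicitation-dump identifier
            ContactDate datetime     NOT NULL
        );

        -- How many times was each person solicited, and when?
        SELECT MatchKey,
               COUNT(*)         AS TimesSolicited,
               MIN(ContactDate) AS FirstContact,
               MAX(ContactDate) AS LastContact
        FROM   SolicitationContact
        GROUP BY MatchKey;

        -- And the single join back onto the master address table:
        SELECT a.*, s.DumpId, s.ContactDate
        FROM   SolicitationContact s
               JOIN MasterAddress a ON a.MatchKey = s.MatchKey;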

    HOWEVER, this doesn't find the folks who had changed address. Hope this gives the grey cells something to gnaw over.


    "I have found it is possible to please all of the people all of the time if you do exactly what they want." - Harold Macmillan, 1961

  • Is the issue the join size, or that you don't have a unique ID for the data in both sets to join on? If you have IDs, I'd be curious to see how long it took. Possibly not as long as you'd think! If you don't have IDs, it's harder. We use about a 10-step process to match on different combinations. The problem in the US is that addresses are barely standardized: you'll have St vs. Street, Springs vs. Spngs vs. Spgs, etc. We do a cleanup operation and remove all spaces prior to the compare (in a separate table, of course).
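
    As a rough sketch of that cleanup step (table and column names invented, and only a couple of the many abbreviations shown):

        -- Standardize common abbreviations, then strip all spaces,
        -- writing the result into a separate work table.
        SELECT Id,
               REPLACE(REPLACE(REPLACE(UPPER(Address),
                   ' STREET', ' ST'),
                   ' SPRINGS', ' SPGS'),
                   ' ', '') AS CleanAddress
        INTO   AddressWork
        FROM   MailingList;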

    Andy

  • Yes, a little more detail about the tables involved and how the data is related would help us understand how to get you to a solution, or point you down the right path if the previous suggestions don't help.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Andy,

    I can generate a "pseudo"-key based on first and last name, address, and zip. I'm concerned that the join involved may take "forever" to complete or blow through the available resources on the machine. (Basically, I have 1.5 million records in each of the tables I'm joining.) I'm not sure which fields are best indexed in this situation.
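
    What I have in mind is something like this (table and column names invented for illustration, and assuming zip is stored as character data):

        -- Add a pseudo-key column to the dump and populate it by
        -- concatenating the identifying fields.
        ALTER TABLE Dump2 ADD MatchKey varchar(200) NULL;
        GO

        UPDATE Dump2
        SET    MatchKey = UPPER(LastName) + '|' + UPPER(FirstName)
                        + '|' + UPPER(Address) + '|' + Zip;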

    Bill

  • If you're going to take the time to build the match key, index it. I keep mine in a separate table along with the primary key it goes to. Nice narrow table that way. I think you'll be surprised by how fast it goes.
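
    The shape of it, with illustrative names:

        -- One narrow key table per dump: just the match key and the
        -- primary key it points back to.
        CREATE TABLE Dump1Key (MatchKey varchar(200) NOT NULL, RecordId int NOT NULL);
        CREATE TABLE Dump2Key (MatchKey varchar(200) NOT NULL, RecordId int NOT NULL);

        CREATE INDEX IX_Dump1Key_MatchKey ON Dump1Key (MatchKey);
        CREATE INDEX IX_Dump2Key_MatchKey ON Dump2Key (MatchKey);

        -- The match then becomes an indexed join between two narrow tables:
        SELECT k1.RecordId AS OldId, k2.RecordId AS NewId
        FROM   Dump1Key k1
               JOIN Dump2Key k2 ON k2.MatchKey = k1.MatchKey;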

    Andy
