• Gail, I don't know if I would call this example of dirty data "weird" or "worrying" but it was certainly odd. We called it "The Curse of IVAN YEARS".

    In the early years of online trading, I was Test Manager on a project building a website for a company that sold records, CDs and so on. To test the online title search we imported the title catalogue from the production system. (By the way, in the old production data, all the titles were UPPER CASE.) The testers noticed that every so often the system would find titles which looked like this: "ALBUM TITLE IVAN YEARS". The album title was correct, but at the very end it had the text "IVAN YEARS"! We investigated and found that the "IVAN YEARS" bit was in the original production data, so it wasn't a bug in the new system, but it puzzled everyone, including the Customer. There were hundreds of these records scattered, seemingly at random, over the database, all "...IVAN YEARS". We all wondered who or what IVAN YEARS was.

    In odd moments I investigated the problem and eventually found the cause. Let's say the album title column was char(80). In one (but only one!) of the maintenance screens in a green-screen system, the album title field was (say) char(60). It turned out that for a time the data-input people had been in the habit of not creating new records, but copying an old one AND BLANKING OUT THE TITLE before typing in the new one! (It saved them quite a lot of keying.) Unfortunately, what they didn't know was that the screen they were using had the 60-character field, and their favourite record was titled "I can't remember...SULLIVAN YEARS", with character 60 falling on the second "L" of SULLIVAN! The screen could only blank the first 60 characters, so every time they copied one of these records they were creating a title with an invisible (to them) "IVAN YEARS" at the end. The customer wasn't Amazon, but if you go there, you can still find CDs which (correctly) end ".... Sullivan Years" because that was a popular series of records.
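
    For anyone who wants to see the mechanism in action, here is a minimal sketch in Python. It is only an illustration: the widths are the "say 80" and "say 60" figures above, the function name `save_from_screen` is made up, and the template title is a hypothetical stand-in padded so that the second "L" of SULLIVAN lands on character 60.

```python
DB_WIDTH = 80      # assumed width of the title column ("say" char(80))
SCREEN_WIDTH = 60  # assumed width of the title field on the one odd screen


def save_from_screen(stored_title: str, typed_title: str) -> str:
    """Simulate a screen that only sees, and only rewrites, 60 characters."""
    stored = stored_title.ljust(DB_WIDTH)[:DB_WIDTH]
    visible = typed_title.upper().ljust(SCREEN_WIDTH)[:SCREEN_WIDTH]
    # Characters 61-80 were never displayed, so blanking the field on
    # screen leaves them untouched in the stored value.
    return (visible + stored[SCREEN_WIDTH:]).rstrip()


# Hypothetical stand-in for the favourite record: 56 filler characters
# put the second "L" of SULLIVAN at character 60.
template = "X" * 56 + "SULLIVAN YEARS"

new_title = save_from_screen(template, "Album Title")
collapsed = " ".join(new_title.split())  # squeeze the run of padding spaces
print(collapsed)
```

    The stored value still has a run of padding spaces between the typed title and the leftover tail; squeezing the whitespace, as the last lines do, gives the kind of title the testers were seeing.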

    The root cause of the problem was a mismatch in field length between the database and the screen, combined with a slightly dubious but innocent practice in data entry. The solution was a _carefully tested_ data-fix update done at about the same time as we converted the text in the database from UPPER CASE to Mixed Case. As dirty data goes, it was harmless, but it had me puzzled for quite a while!
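
    The reason the data fix needed careful testing is that genuine "...SULLIVAN YEARS" titles must be left alone. The sketch below is not the actual fix, just one way such a check could look in Python, assuming the stray text always starts at character 61 and that a genuine title has "SULL" immediately before it. The names `RESIDUE` and `clean_title` are invented for the illustration.

```python
SCREEN_WIDTH = 60       # assumed width of the short screen field
RESIDUE = "IVAN YEARS"  # the leftover tail seen in the dirty titles


def clean_title(title: str) -> str:
    """Strip a stray tail, but keep genuine '...SULLIVAN YEARS' titles."""
    head, tail = title[:SCREEN_WIDTH], title[SCREEN_WIDTH:]
    # Only treat the tail as residue when it is exactly the leftover text
    # and the visible part does not end in the first half of SULLIVAN.
    if tail.rstrip() == RESIDUE and not head.endswith("SULL"):
        return head.rstrip()
    return title.rstrip()


dirty = "ALBUM TITLE".ljust(SCREEN_WIDTH) + RESIDUE
genuine = "X" * 56 + "SULLIVAN YEARS"  # hypothetical genuine title

print(clean_title(dirty))
print(clean_title(genuine) == genuine)
```

    A real fix would also want a dry run that lists every title it intends to change, so that someone can eyeball the list before anything is updated.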

    Tom Gillies
    www.DuhallowGreyGeek.com