Data Quality on the Open Web

  • Comments posted to this topic are about the item Data Quality on the Open Web

  • At work we have a quote about quality: "All data sucks, some data just sucks less."

    Just dealing with millions of names and addresses from hundreds of different systems, it's amazing how many ways just a few fields can be mangled. A good percentage of my time is spent writing data scrubbing routines and being a human ETL machine. The rest is spent explaining to folks why Excel workbooks and PDF files are not great sources of quality data.

  • On the point of Linux "not taking over the world", keep in mind that iOS and Android are both *nix-based systems, as is OS X. So *nix may very well be taking over the personal computing universe, just not in the form of "desktop Linux".

    On the point of "crowdsourcing data", Wikipedia was the ultimate proving ground that it approaches uselessness. As Tycho and Gabe (Penny Arcade) pointed out, an encyclopedia with the reliability of Schrödinger's cat is essentially useless (in this case, Wikipedia is in a quantum state of being right and wrong at the same time, depending on the unpredictable state of the data at the exact moment your page request hits the engine). (I'm not linking to Penny Arcade because it tends towards NSFW language, and I don't want SSC or Steve in trouble for any of that. Anyone who goes there on their own has been warned. They're funny, offensive, and worth reading, especially if you've ever played a computer game or RPG of any sort.)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Data provided from 3rd party sources can be pruned or perhaps even corrected by cross referencing it with other more official or trusted databases. For example when an anonymous user submits a business for listing on a website, the business name and address can be cross referenced with a database of businesses currently registered in that county.
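A minimal sketch of that cross-referencing idea, in Python. Everything named here is hypothetical: the `REGISTRY` list stands in for whatever trusted county database is available, and `normalize` is just one plausible way to make the two sources comparable before matching. A real registry would likely need fuzzier matching than an exact lookup.

```python
import re

def normalize(text: str) -> str:
    """Uppercase, turn punctuation into spaces, collapse runs of whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", text.upper())
    return " ".join(cleaned.split())

def build_index(registry):
    """Build a set of (name, address) keys from the trusted records."""
    return {(normalize(r["name"]), normalize(r["address"])) for r in registry}

def is_registered(submission, index) -> bool:
    """True if the submitted business matches a trusted record exactly
    after normalization."""
    key = (normalize(submission["name"]), normalize(submission["address"]))
    return key in index

# Hypothetical stand-in for the county's list of registered businesses.
REGISTRY = [
    {"name": "Acme Hardware, Inc.", "address": "12 Main St."},
    {"name": "Blue Door Cafe", "address": "400 Oak Ave"},
]
INDEX = build_index(REGISTRY)

# An anonymous submission that differs only in case and punctuation.
submitted = {"name": "ACME hardware inc", "address": "12 main st"}
print(is_registered(submitted, INDEX))
```

Submissions that fail the lookup are not necessarily wrong; they are the ones worth routing to a human or a fuzzier matching pass before the listing goes live.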

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Good points all. We are, and have been, in the "integrity of process" business, with nominal data quality enforced as best it can be by automation. In some systems, if the data will fit into the bucket, we take it. Many businesses and systems are not overly concerned with the "integrity of data" when this should be the real focal point of the systems we develop.

    We develop data systems with quality data, and the center of the entire issue is the collection, evaluation, classification, presentation, summarization, protection and preservation of data. We are not developing systems because they operate well; we are building data systems to the advantage and advancement of the data and those who use it.

    To that end we should require a more prudent and complete process to evaluate and certify data than we have today. How do you feel about this?

    We scrutinize our systems and processes; we must also use that same measure on our data.

    Not all gray hairs are Dinosaurs!

  • chrisn-585491 (9/21/2011)


    At work we have a quote about quality: "All data sucks, some data just sucks less."

    Just dealing with millions of names and addresses from hundreds of different systems, it's amazing how many ways just a few fields can be mangled. A good percentage of my time is spent writing data scrubbing routines and being a human ETL machine. The rest is spent explaining to folks why Excel workbooks and PDF files are not great sources of quality data.

    We have an ERP system that shall remain nameless, it sucks so bad they would sue me if I gave their name.

    US W-2 reporting requires companies to adhere to a standard format: capitals, no punctuation, something like that. When we first did W-2s after going to this system, we discovered that all of them would be unacceptable due to not meeting government requirements.

    Now, the requirements were not new. We implemented a system that allowed the end user to enter information in a format that was unusable. We had to spend time manually editing every record to meet the requirements, and then asked the HR staff to follow processes that should be unnecessary.
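For illustration, here is a rough Python sketch of checking that kind of rule at data entry instead of cleaning up afterwards. The rule itself (uppercase letters, digits and spaces only) is an assumption drawn from the loose description above, not from the actual W-2 specification, and the function names are made up for the example.

```python
import re

# Assumed rule, not the official spec: uppercase letters, digits, spaces.
ALLOWED = re.compile(r"[A-Z0-9 ]*")

def is_acceptable(value: str) -> bool:
    """True if the value already satisfies the assumed reporting format."""
    return ALLOWED.fullmatch(value) is not None

def coerce(value: str) -> str:
    """Uppercase the value, drop disallowed characters, tidy the spacing."""
    no_punct = re.sub(r"[^A-Z0-9 ]", "", value.upper())
    return " ".join(no_punct.split())

raw = "123 Main St., Apt. #4"
print(is_acceptable(raw))          # mixed case and punctuation fail the check
print(coerce(raw))                 # cleaned form that could be stored instead
print(is_acceptable(coerce(raw)))  # the cleaned form passes
```

Dropping punctuation outright is a blunt choice (a hyphenated surname gets run together), so a real entry screen would probably show the coerced value back to the user for confirmation rather than store it silently.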

    I got in this field because of the poor quality of programmers. I continue to have a job because of the poor quality of programmers.

    Dave

  • My initial thought was - what do you expect? Look at society. The majority of the media blast the one group that is willing to actually be journalists, and support everything the so-called US president does despite the fact that he has spent our country into overwhelming debt, destroyed the world economy, and wants to make it worse. His current jobs plan raises taxes $3 for every $1 cut, from a budget that is $1 TRILLION more than the country brings in! Yet the media continues to talk about how great he is. The UN recently was found to have lied about global warming, when space shuttle experiments proved that heat loss from the Earth is orders of magnitude greater than the UN claimed. Al Qaeda receives support from France, Britain and the US to overthrow the Libyan terrorist that led the country, and all we hear about is how great it is that they are helping the African people.

    And you want us to worry about things like whether a business is closed?

    It ain't happening, folks. There are people in this world who do not want us thinking for ourselves, and want us constantly fighting in class warfare struggles rather than realizing that our leaders suck. The low hanging fruit like Wikipedia and Google and Yahoo just aren't things we can do much about.

    More on point: should we endeavor to ensure our data is accurate? Yes. Can we do that in a society where dishonesty is the rule and ethics is the exception? Probably not.

    More directly, if our managers allocate an hour for testing, how are we supposed to find time to actually figure out how someone might nefariously use a product we develop?

    We are fighting against significant odds of success. I would love to see our profession improve, I just don't see how it is going to happen to any large degree.

    Dave

  • GSquared (9/21/2011)


    On the point of Linux "not taking over the world", keep in mind that iOS and Android are both *nix-based systems, as is OS X. So *nix may very well be taking over the personal computing universe, just not in the form of "desktop Linux".

    Perhaps, but these aren't really open source takeovers. We are looking at companies here that are standing behind a specific version of the product. There's some good networking code from OpenBSD (or FreeBSD) in Windows.

  • djackson 22568 (9/21/2011)


    ...

    And you want us to worry about things like whether a business is closed?

    ...

    We are fighting against significant odds of success. I would love to see our profession improve, I just don't see how it is going to happen to any large degree.

    It's not about worrying whether a business is closed; the point was that data quality can be an issue if you are pulling data from unreliable sources, like social/community/crowd sources.

    We are definitely going uphill, but it takes each person making a small effort to make a change.

  • Steve Jones - SSC Editor (9/21/2011)


    djackson 22568 (9/21/2011)


    ...

    And you want us to worry about things like whether a business is closed?

    ...

    We are fighting against significant odds of success. I would love to see our profession improve, I just don't see how it is going to happen to any large degree.

    It's not about worrying whether a business is closed; the point was that data quality can be an issue if you are pulling data from unreliable sources, like social/community/crowd sources.

    We are definitely going uphill, but it takes each person making a small effort to make a change.

    I got the point; I was attempting to use what might be called an idiom. "You" did not mean you, Steve Jones. Maybe I should have said "someone" or "they" or something else, but none of those fit either.

    Yes it does take each person making a small effort. My point is that while that goal is laudable, I don't believe it is realistic. That doesn't change the fact that we should try, rather it is meant to speak to what I believe is the likely outcome. I think there are far more people expending energy on hurting others through a number of means, and the majority of us have learned to either sit back and ignore it, or to watch it for entertainment. Sad, but unfortunately true.

    Dave

  • Steve Jones - SSC Editor (9/21/2011)


    djackson 22568 (9/21/2011)


    ...

    And you want us to worry about things like whether a business is closed?

    ...

    We are fighting against significant odds of success. I would love to see our profession improve, I just don't see how it is going to happen to any large degree.

    It's not about worrying whether a business is closed; the point was that data quality can be an issue if you are pulling data from unreliable sources, like social/community/crowd sources.

    We are definitely going uphill, but it takes each person making a small effort to make a change.

    Websites compete for members and ad revenue. If users get the impression that one specific website is filled with crap information (for example a Wi-Fi hotspot directory or real estate listing that leads them across town to a dead end), then they will stop using the site and turn elsewhere.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (9/22/2011)


    Steve Jones - SSC Editor (9/21/2011)


    djackson 22568 (9/21/2011)


    ...

    And you want us to worry about things like whether a business is closed?

    ...

    We are fighting against significant odds of success. I would love to see our profession improve, I just don't see how it is going to happen to any large degree.

    It's not about worrying whether a business is closed; the point was that data quality can be an issue if you are pulling data from unreliable sources, like social/community/crowd sources.

    We are definitely going uphill, but it takes each person making a small effort to make a change.

    Websites compete for members and ad revenue. If users get the impression that one specific website is filled with crap information (for example a Wi-Fi hotspot directory or real estate listing that leads them across town to a dead end), then they will stop using the site and turn elsewhere.

    The free market works, but it doesn't necessarily work fast enough. Sometimes, usually due to government interference, the free market isn't free and it fails.

    Dave

  • djackson 22568 (9/22/2011)


    Eric M Russell (9/22/2011)


    Steve Jones - SSC Editor (9/21/2011)


    djackson 22568 (9/21/2011)


    ...

    And you want us to worry about things like whether a business is closed?

    ...

    We are fighting against significant odds of success. I would love to see our profession improve, I just don't see how it is going to happen to any large degree.

    It's not about worrying whether a business is closed; the point was that data quality can be an issue if you are pulling data from unreliable sources, like social/community/crowd sources.

    We are definitely going uphill, but it takes each person making a small effort to make a change.

    Websites compete for members and ad revenue. If users get the impression that one specific website is filled with crap information (for example a Wi-Fi hotspot directory or real estate listing that leads them across town to a dead end), then they will stop using the site and turn elsewhere.

    The free market works, but it doesn't necessarily work fast enough. Sometimes, usually due to government interference, the free market isn't free and it fails.

    Hmmm? Ain't anyone in D.C. telling me which hotspot directory or restaurant review site to use.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • djackson 22568 (9/21/2011)


    Yes it does take each person making a small effort. My point is that while that goal is laudable, I don't believe it is realistic. That doesn't change the fact that we should try, rather it is meant to speak to what I believe is the likely outcome. I think there are far more people expending energy on hurting others through a number of means, and the majority of us have learned to either sit back and ignore it, or to watch it for entertainment. Sad, but unfortunately true.

    I think you're correct overall. And it's sad. I do think that those who try to make the world a better place do so, if only in small cases, but it's better than nothing.

  • There are open-source datasets out there but by their nature they can harbour crap data.

    I know of an occasion where someone didn't like the way that the Balearic Islands were listed under Spain, so they duplicated them in the geographic hierarchy, thereby creating inaccurate data.
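One cheap way to catch that sort of edit is to flag any place name that hangs off more than one parent. The Python sketch below assumes the hierarchy is available as simple (place, parent) pairs; the sample rows are invented for illustration and are not taken from any real dataset.

```python
from collections import defaultdict

def find_duplicate_places(rows):
    """Return place names that appear under more than one parent.

    rows is an iterable of (place_name, parent_name) pairs. Names are
    compared case-insensitively with surrounding whitespace ignored.
    """
    parents_by_name = defaultdict(set)
    for name, parent in rows:
        parents_by_name[name.strip().casefold()].add(parent)
    return {
        name: sorted(parents)
        for name, parents in parents_by_name.items()
        if len(parents) > 1
    }

# Invented sample hierarchy with one duplicated entry.
ROWS = [
    ("Spain", "Europe"),
    ("Madrid", "Spain"),
    ("Balearic Islands", "Spain"),
    ("Balearic Islands", "Europe"),
]
print(find_duplicate_places(ROWS))
```

Some names legitimately occur under several parents, so the output is better treated as a review list than as a set of errors to delete automatically.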

