More Open Data

  • Comments posted to this topic are about the item More Open Data

  • Great article Steve.  I love open data. I'm working on a data set of my local county's voting data.  It's presented in a series of PDFs for each election.  Good data, but not in a form that can be easily queried, sorted, grouped, etc.  One of the foreign keys in my main table will point to a table listing the source of each record in the main data set.  In this new world of fake data and alternate facts I believe it's crucial to know the source of things and how they were gathered.  In this case the source is my local election board, not me.

    Your article reminded me of a joke (I'm paraphrasing as best I can remember):
    A company was hiring for chief Business Analyst.  They thought they would weed out the candidates with a simple question, "What is 2+2?"
    An Engineer whipped out his slide rule and declared 2+2 was between 3.95 and 4.05.
    A Mathematician said he could supply a proof it was 4 in about 2 hours.
    A Therapist said, "I don't have the answers, but I'm glad we had this chance to talk."
    A Trader asked, "Are we buying or selling?"
    A Data Scientist (in the original it was a statistician) said, "What do you want it to be?"
    It's truly amazing how our biases affect the answers ... and how blind we are to them. 

    Data analysis needs to be part of every child's formal education.  Maybe not rigid statistical methods, but we need to at least be able to spot BS.
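
The source-tracking design described in this post (a main table whose foreign key points to a table of sources) might be sketched as follows. This is only an illustration, not the poster's actual schema: the table names `data_source` and `vote_result`, the column names, the PDF file name, and all vote counts are made up for the example.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with hypothetical provenance tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE data_source (
            source_id INTEGER PRIMARY KEY,
            publisher TEXT NOT NULL,   -- who produced the data
            document  TEXT NOT NULL    -- which file it was transcribed from
        );
        CREATE TABLE vote_result (
            result_id INTEGER PRIMARY KEY,
            precinct  TEXT NOT NULL,
            candidate TEXT NOT NULL,
            votes     INTEGER NOT NULL,
            source_id INTEGER NOT NULL REFERENCES data_source(source_id)
        );
    """)
    # Invented sample rows, purely for illustration.
    conn.execute(
        "INSERT INTO data_source VALUES (1, 'County Election Board', 'general_2016.pdf')"
    )
    conn.executemany(
        "INSERT INTO vote_result (precinct, candidate, votes, source_id) "
        "VALUES (?, ?, ?, ?)",
        [
            ("P1", "Candidate A", 120, 1),
            ("P1", "Candidate B", 80, 1),
            ("P2", "Candidate A", 180, 1),
            ("P2", "Candidate B", 70, 1),
        ],
    )
    return conn

def totals_with_source(conn):
    """Total votes per candidate, with the publisher of the underlying records."""
    return conn.execute("""
        SELECT r.candidate, SUM(r.votes), s.publisher
        FROM vote_result AS r
        JOIN data_source AS s ON s.source_id = r.source_id
        GROUP BY r.candidate, s.publisher
        ORDER BY r.candidate
    """).fetchall()

if __name__ == "__main__":
    for row in totals_with_source(build_demo_db()):
        print(row)
```

Because `source_id` is declared `NOT NULL` with a foreign key, a record cannot be loaded without naming where it came from, so every aggregate can be traced back to a publisher and document.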

  • One must always consider the bias of the person supplying information.  After spending a few minutes looking at some areas where I expected to see that bias, I am disappointed to say that it is clearly present. 

    OK, so what does that have to do with the post? Well, assuming that the data we are provided is accurate is our biggest issue in society today.  If you look up "crime rates" using Mr Ballmer's site, and then compare it to the actual DOJ/FBI and other sites, you will immediately notice that Mr Ballmer has an agenda.  When you compare what his site shows to the actual data provided by the government, you quickly notice trends that our government, Mr Ballmer, and "our media" do not want people to see.  This concerns me because I live close to Chicago, I have worked with a lot of people there, and I am aware of how difficult it is for them to raise their families in an area where violent crime is so rampant.  Instead of talking about the causes, the media, and I include Ballmer in this group, want to place the blame away from where it belongs.  Citizens in Chicago, especially minorities, are paying the price for our reluctance to point at the root cause. 

    When we fabricate data, or even hide what we don't want others to see, we do more harm than good.  Taking data that is freely available and "picking and choosing" what we allow people to see is harmful at best.

    The concept might be a good one, if only we could trust those who spend their money on such sites, to disassociate themselves from their agenda.

    Dave

  • It would be great to have a non-partisan source for aggregating public data-sets, performing professional analysis, and publishing visualizations; if that's where Steve Ballmer is ultimately intending to go with this. So, for example, if a politician or activist on a cable news show cites a statistic about how ".. drunk drivers kill more people each year in America than handguns..", they can reference a specific research report on USAFacts.org, and viewers can then dig into the actual data and even post their own comments. I wouldn't expect the website to definitively settle many (if any) of the great debates going on in society, but at least it would provide a forum where folks can frame their back and forth dialog within the context of a common set of data (even if that includes challenging the validity of the underlying data or analysis).

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • steveo250k - Monday, May 1, 2017 7:21 AM

    Great article Steve.  I love open data. I'm working on a data set of my local county's voting data.  It's presented in a series of PDFs for each election.  Good data, but not in a form that can be easily queried, sorted, grouped, etc.  One of the foreign keys in my main table will point to a table listing the source of each record in the main data set.  In this new world of fake data and alternate facts I believe it's crucial to know the source of things and how they were gathered.  In this case the source is my local election board, not me.

    Your article reminded me of a joke (I'm paraphrasing as best I can remember):
    A company was hiring for chief Business Analyst.  They thought they would weed out the candidates with a simple question, "What is 2+2?"
    An Engineer whipped out his slide rule and declared 2+2 was between 3.95 and 4.05.
    A Mathematician said he could supply a proof it was 4 in about 2 hours.
    A Therapist said, "I don't have the answers, but I'm glad we had this chance to talk."
    A Trader asked, "Are we buying or selling?"
    A Data Scientist (in the original it was a statistician) said, "What do you want it to be?"
    It's truly amazing how our biases affect the answers ... and how blind we are to them. 

    Data analysis needs to be part of every child's formal education.  Maybe not rigid statistical methods, but we need to at least be able to spot BS.

    +1, and kudos for linking in sources.

  • djackson 22568 - Monday, May 1, 2017 7:47 AM

    One must always consider the bias of the person supplying information.  After spending a few minutes looking at some areas where I expected to see that bias, I am disappointed to say that it is clearly present. 

    OK, so what does that have to do with the post? Well, assuming that the data we are provided is accurate is our biggest issue in society today.  If you look up "crime rates" using Mr Ballmer's site, and then compare it to the actual DOJ/FBI and other sites, you will immediately notice that Mr Ballmer has an agenda.  When you compare what his site shows to the actual data provided by the government, you quickly notice trends that our government, Mr Ballmer, and "our media" do not want people to see.  This concerns me because I live close to Chicago, I have worked with a lot of people there, and I am aware of how difficult it is for them to raise their families in an area where violent crime is so rampant.  Instead of talking about the causes, the media, and I include Ballmer in this group, want to place the blame away from where it belongs.  Citizens in Chicago, especially minorities, are paying the price for our reluctance to point at the root cause. 

    When we fabricate data, or even hide what we don't want others to see, we do more harm than good.  Taking data that is freely available and "picking and choosing" what we allow people to see is harmful at best.

    The concept might be a good one, if only we could trust those who spend their money on such sites, to disassociate themselves from their agenda.

    Certainly always an issue when interpreting. However, each of us has to be aware. Two different views of crime (or anything) will result in two different reports. I also think that not every trend or pattern can be easily surfaced. If you think something is missing, let them know. Most of the reports I see listed don't try to provide fairly raw data. Perhaps not with any emphasis you would like to see, or that concerns you, which is expected. Each of us, or most of us, place a different emphasis on different aspects of the data.

    What I'd like to see is people take the data and analyze it themselves. It would be great to have others look at things, and even compare reports with original data. I certainly see plenty of sources listed, so for people that want to go to the government data and then try to determine how accurate the reports are, there are ways to do this. I don't know how to link forward from USAFacts, but certainly analysis can link back, or link back to government sources.

  • Eric M Russell - Monday, May 1, 2017 8:00 AM

    It would be great to have a non-partisan source for aggregating public data-sets, performing professional analysis, and publishing visualizations; if that's where Steve Ballmer is ultimately intending to go with this. So, for example, if a politician or activist on a cable news show cites a statistic about how ".. drunk drivers kill more people each year in America than handguns..", they can reference a specific research report on USAFacts.org, and viewers can then dig into the actual data and even post their own comments. I wouldn't expect the website to definitively settle many (if any) of the great debates going on in society, but at least it would provide a forum where folks can frame their back and forth dialog within the context of a common set of data (even if that includes challenging the validity of the underlying data or analysis).

    Would be nice if they could a) build reports quickly in response to reports or b) make something like Power BI that allows someone to drop a few items together for analysis.

    The bigger issue is that most of the analysis of facts is way more complex than a two axis graph. I'd hope we'd get deeper analysis than that, though not likely on any broadcast news, perhaps not in written journalism either.

  • This is a semi-helpful, awkward, bland, "pre-school" view of government data. As someone who digs through various government data for a living, a hobby and as a voter, it's better to look at original sources and collaborate with others that understand them.

    So far I've had fun with Census, voting records, crime stats (local, state and federal), budgets (local, state and federal), FDIC, FCC, NCUA, NASA and a multitude of other sources.

    The best training, other than the data training that we data/ETL developers and DBAs have, would be statistics and accounting.  Here's a better source:
    https://www.data.gov/

  • I went to a SQL Server user group once, where they were collecting and mapping crime statistics. It was quite interesting that they were not mapping rapes and murders.

    After the meeting I looked up the topic of murders in the area and found an excellent web page that showed the statistics on a map and in other ways:

    Homicide: Pittsburgh

    It would do well if people tracked the subjects they are interested in to see if the media is already reporting them.

    412-977-3526 call/text

  • Hrrm, maybe you're looking in the wrong locations for those analyzing data? There are a lot of data science blogs and articles out there that cover a great deal on how people are analyzing data. Just take a look at all the content on Kaggle for example.

    I think what is in limited supply right now--that sites such as Kaggle don't touch on--is getting data, cleaning it, and making it into a workable dataset. Kaggle for example mostly provides you with the data and even to the point of the questions you want to ask of the data. It doesn't provide you the original set of data and allow you to put it together before you can start asking the questions you want to ask of the data.

    The cool thing with that aspect is the fact most of us work in that section of the problem. We are the ones gathering that data, cleaning it, storing it and even exposing it to the end users. As more and more data collections happen--both private and open--what are the approaches we are taking even before someone analyzes it?

    One of the topics I know a lot of people like me are in search of is not along the lines of data acquisition, but scaling aggregation across really large, granular, and sometimes high-velocity data. For example, how can I compute over a billion records fast? How can I hash my data to prevent data skew so I can really slice and dice this data at scale? How can I really streamline analytics? I mean, what is the most dirty and screwed up data out there and how can I clean and then store it without X, Y, and Z? Etc.

    I know that sounds like a big data pitch, but really, I mean cleaning and prepping data when you are not using all these crazy big data tools, which really aren't a fix for every problem in one box. That's the good stuff right there. We need more of that before we even get to analyzing it, methinks.
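
On the question of hashing to prevent data skew: one common approach (named here as an illustration, not as anything the poster describes using) is to "salt" keys known to be very frequent, so that their rows hash to many partitions instead of one. A minimal sketch, assuming the hot keys are known in advance; the function names, the `chicago` key, and the bucket counts are all hypothetical.

```python
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (same key, same partition)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def salted_key(key: str, row_number: int, hot_keys: set, salt_buckets: int) -> str:
    """Append a rotating salt to hot keys so their rows spread across partitions."""
    if key in hot_keys:
        return f"{key}#{row_number % salt_buckets}"
    return key

def partition_counts(keys, num_partitions, hot_keys=frozenset(), salt_buckets=1):
    """Count how many rows land on each partition, with optional salting."""
    counts = Counter()
    for row_number, key in enumerate(keys):
        routed = salted_key(key, row_number, hot_keys, salt_buckets)
        counts[partition_for(routed, num_partitions)] += 1
    return counts

if __name__ == "__main__":
    # Invented workload: a single key accounts for every row.
    rows = ["chicago"] * 1000
    print("unsalted:", dict(partition_counts(rows, 4)))
    print("salted:  ", dict(partition_counts(rows, 4, {"chicago"}, 64)))
```

The trade-off is that aggregates for a salted key come back as several partial results, one per salt value, and need a second pass to combine them.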

  • Most of the open data sets I deal with aren't in the billions of records, but still require a skilled eye to scrub and put into a useful format for mere mortals.

  • Steve Jones - SSC Editor - Monday, May 1, 2017 9:09 AM

    Eric M Russell - Monday, May 1, 2017 8:00 AM

    It would be great to have a non-partisan source for aggregating public data-sets, performing professional analysis, and publishing visualizations; if that's where Steve Ballmer is ultimately intending to go with this. So, for example, if a politician or activist on a cable news show cites a statistic about how ".. drunk drivers kill more people each year in America than handguns..", they can reference a specific research report on USAFacts.org, and viewers can then dig into the actual data and even post their own comments. I wouldn't expect the website to definitively settle many (if any) of the great debates going on in society, but at least it would provide a forum where folks can frame their back and forth dialog within the context of a common set of data (even if that includes challenging the validity of the underlying data or analysis).

    Would be nice if they could a) build reports quickly in response to reports or b) make something like Power BI that allows someone to drop a few items together for analysis.

    The bigger issue is that most of the analysis of facts is way more complex than a two axis graph. I'd hope we'd get deeper analysis than that, though not likely on any broadcast news, perhaps not in written journalism either.

    Many data-sets, like the US Census, are OK as a starting point but not really rich enough standing alone for interesting analysis. What's truly powerful is when you're combining one data-set with one or more other data-sets. With Ballmer's deep pockets and connections with Microsoft, he could have a huge exabyte-sized Azure SQL Warehouse containing practically every public data-set published, and then present a web-based OLAP-style interface for combining them. In addition to that, there could also be a team of world-class data scientists for ingesting data-sets from high-end paying clients (ex: governments, corporations, media outlets, etc.) and performing more in-depth custom reporting. What's different about this business model is that they would be offering a repository of meaningful and holistic data instead of the usual chit-chat, spam, and cat videos that social media providers like Facebook and Twitter have to chew on.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Tuesday, May 2, 2017 8:19 AM

     With Ballmer's deep pockets and connections with Microsoft, he could have a huge exabyte-sized Azure SQL Warehouse containing practically every public data-set published, and then present a web-based OLAP-style interface for combining them. In addition to that, there could also be a team of world-class data scientists for ingesting data-sets from high-end paying clients (ex: governments, corporations, media outlets, etc.) and performing more in-depth custom reporting. What's different about this business model is that they would be offering a repository of meaningful and holistic data instead of the usual chit-chat, spam, and cat videos that social media providers like Facebook and Twitter have to chew on.

    Or some way to pick and choose downloads of specific columns from sets so that we can analyze them in something like PowerBI without pulling down that EB set.

  • djackson 22568 - Monday, May 1, 2017 7:47 AM

    One must always consider the bias of the person supplying information.  After spending a few minutes looking at some areas where I expected to see that bias, I am disappointed to say that it is clearly present. 

    OK, so what does that have to do with the post? Well, assuming that the data we are provided is accurate is our biggest issue in society today.  If you look up "crime rates" using Mr Ballmer's site, and then compare it to the actual DOJ/FBI and other sites, you will immediately notice that Mr Ballmer has an agenda.  When you compare what his site shows to the actual data provided by the government, you quickly notice trends that our government, Mr Ballmer, and "our media" do not want people to see.  This concerns me because I live close to Chicago, I have worked with a lot of people there, and I am aware of how difficult it is for them to raise their families in an area where violent crime is so rampant.  Instead of talking about the causes, the media, and I include Ballmer in this group, want to place the blame away from where it belongs.  Citizens in Chicago, especially minorities, are paying the price for our reluctance to point at the root cause. 

    When we fabricate data, or even hide what we don't want others to see, we do more harm than good.  Taking data that is freely available and "picking and choosing" what we allow people to see is harmful at best.

    The concept might be a good one, if only we could trust those who spend their money on such sites, to disassociate themselves from their agenda.

    I agree with the intent wholeheartedly. That said, jumping to the conclusion that the data skew you found is directly associated with a conscious decision to suppress may be dangerous.  I know the crime statistics have been controversial for some time because, among other things, there's no clear definition or handling of what to report, "when" to report it, how to classify it, what method is used to collect the content, etc.  The collection process, the reliability of the data, and the fallout of reporting data at all weigh heavily on the quality or accuracy of the content (in addition to the actual personal agendas of the data collector).

    The lack of control and visibility into how these data sets came to be becomes, I think, a limiting factor in how much we can rely on those public sources.  As of now I often use such public info to inform myself or make decisions for myself, but I would be hesitant to lean heavily on it for business decisions unless I can find a better way to determine its origin and/or detect when it's regenerated.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
