Amateur Data Analysis

  • Comments posted to this topic are about the item Amateur Data Analysis

  • Good article, Steve.

    I wrote an article on some extremely basic analysis I performed on the way we measure rainfall, and asked some equally basic questions about what that data could possibly tell us. I still have not had a response from the UK's MET Office (Meteorological Office) about their very public claims about the "rainiest year on record".

    I would value others' opinions - I wrote it quite quickly, and would probably write it differently now, but I think my points are valid! And the subject is certainly worthy of discussion, as huge amounts of money and time are probably wasted, owing to what I diplomatically call "data free analysis"!

    My "Article" is here: http://public-highway.blogspot.co.uk/2013/01/rainmasterall-since-records-began.html

  • Data, visualizations, etc. as a field gained a lot of steam in the 70's when Semiology of Graphics: Diagrams, Networks, Maps was published -- it didn't get translated into English until the early 80's, and around that time Edward Tufte started publishing his work. Most of our traditional approaches to data and how it is presented and consumed are deeply flawed, and Tufte, Few and others have spent years trying to improve them. I highly recommend their work; in fact, I think it is essential. The out-of-the-box default graphs that Excel, SSRS, etc. produce are more often than not completely inappropriate and often misleading. Building data vizzes is hard work.

  • I've found that many people already have a notion of what they want to find in the data rather than finding what the data is telling them. As Granny used to say, "Figures can lie and liars figure". Such a thing occurred at one company that I worked for and, despite my continued warnings (with the very charts and graphs and raw data they were using) for a couple of years (seriously), it cost half the people in their company their jobs (1,500 out of 3,000) and management was still smiling and doing business as normal even during the layoffs. It's one of the main reasons why I refer to "Business Intelligence" as one of the world's greatest oxymorons.

    --Jeff Moden



  • I've watched some interesting TED talks, and the visualization really helps you understand what they are talking about without being intimate with the data. I also went to a Jen Stirrup talk at PASS and wish I had time to learn more. Now that Jon Stewart is gone, maybe I can spend that 20 minutes on data vis!

  • There really is no one-size-fits-all solution to data analysis or reporting. Data analysis is supposed to answer questions, and reporting is supposed to deliver those answers in a meaningful way. Asking the wrong questions or trying to convey the wrong message will lead to bad results, but that's not something that can be fixed with a generic solution like "analyze this data in its entirety and then use this reporting solution because it's always the best."

  • Thanks Steve, I have been working with both Power Pivot and another OpenData Portal Product that provides machine readable data to the public for our state. We have been publishing databases, maps, charts, and other visualizations out for public consumption.

    With the move in the industry today to push data out into the public domain, there are some very interesting datasets coming to the surface that data analysts can look into. Not plugging my state, but Oregon has its data out on https://data.oregon.gov/ and its collection of datasets is rather extensive. Or closer to your home is https://data.colorado.gov/, the Colorado Data Marketplace.

    If you take a few minutes you can find a dataset out in the public arena that is interesting and that can build your analytical skills, and along the way it can be fun.

    Not all gray hairs are Dinosaurs!

  • The problem with amateur data analysis is that the person doing the analysis doesn't necessarily have the full business or operational context necessary to do a full analysis of the data. Data visualization skills are great to have. Data mining skills are great to have. Analytical skills in statistics are great to have as well. But having an understanding of the source of the data, of the business operations that define the data, and of the business processes (and their limitations) that actually generate the data is absolutely necessary to provide meaningful context for correctly interpreting the data. It isn't enough to merely mine the data, generate visualizations and statistics using the data, and then draw conclusions based on one's common knowledge or one's personal experience or what one assumes -- or even what one generally researches. The more in-depth knowledge one has of the actual business activities, data sources and business processes, the better any analysis will be.

    In the case of the example of college tuition and fees, there is much left out of the analysis - and many assumptions are being made. Large Universities have teaching staff and research staff. Sometimes these overlap, and sometimes they don't. Some administrative positions are split positions which also involve teaching (e.g., Department Chairs are often tenured faculty who are administrators with a reduced teaching load and coded on payroll as administrative staff rather than faculty while they hold the Chair position, depending on the State.) Research staff may be faculty or administrative staff, and may or may not be paid out of tuition and fees, depending on the funding sources (if they are funded by State funds or grants.) When looking at costs associated with public Colleges and Universities, one ALSO has to look at and analyze where the funding is coming from for the positions, and look at the overall funding streams for the College or University -- otherwise one gets a very skewed perspective of the costs associated with staffing and where the dollars are coming from and where they are going.

    For example, in some states, there may be two or more State funding streams for operations. One is supplied by student-paid tuition and fees plus the State's supplement, and the other by general state revenue. Other revenue streams exist in the form of self-funded budgets, which are internal charges that departments may levy against other departments for providing services that are required by State or Federal law (due to compliance issues, which can be quite onerous and complex) but not paid for directly by the State budget, or for services that Students or Faculty or Departments want that aren't otherwise covered by State funds. Still other revenue streams exist in the form of endowments, grants or gifts that may be ongoing or one-time streams that fund a project or service for a limited time or permanently. In many cases, administrative positions are required to support operations that fall under increasingly complex regulations (for instance, Bursar and cashiering operations at Colleges and Universities fall under many of the same regulations as Banks, and require many of the same accounting and compliance reporting as Banks - which requires accountants with appropriate degrees and CPA credentials maintained.) Increased business automation may reduce front line staff (those who deal face-to-face with students at a counter) but it also increases the need for technical staff to maintain those systems as well as staff to handle phone calls, live chats, and emails for students who have issues with online Student/Parent portals and online Billpay systems - which is more expensive than having cashiers at counters but much more convenient for students and parents.

    Modernizing systems is costly as well, and requires that staff be constantly retrained in new systems and that there be competent technical staff in house to handle the programming changes and transitions to new systems (or else that there be funding available for contractors and consultants to assist, which is another model often used to staff up projects and train in-house staff to take over maintenance of new systems once they go live.) In any case, the cost of these for both administrative operations and teaching can be high. On the teaching side, there is a constant push to keep rolling out new technologies to stay current in the classroom with the latest hardware and software for learning and teaching, from what is available for faculty to deliver instruction to what is available to students to aid in learning. This leads to increased support costs for staff to maintain equipment (who then hire students to assist to reduce costs and provide employment opportunities on campus for students) and additional administrative overhead staffing costs as new "learning and teaching technologies" organizations are formed to manage the changes in some fashion across a campus (or across multiple campuses to reduce costs.)

    The gross analysis of looking at "administrative" positions vs "teaching" positions is overly simplistic given the sorts of examples provided above -- and without knowing the funding streams for the positions, it is an utterly useless analysis, as not all positions are funded by tuition and fees. Indeed, today tuition and fees at most public Colleges and Universities don't account for the majority of the operating revenue of a University! So knowing how a business operates is imperative to a solid analysis...

  • DavidL (8/6/2015)


    Data, visualizations, etc. as a field gained a lot of steam in the 70's when Semiology of Graphics: Diagrams, Networks, Maps was published -- it didn't get translated into English until the early 80's, and around that time Edward Tufte started publishing his work. Most of our traditional approaches to data and how it is presented and consumed are deeply flawed, and Tufte, Few and others have spent years trying to improve them. I highly recommend their work; in fact, I think it is essential. The out-of-the-box default graphs that Excel, SSRS, etc. produce are more often than not completely inappropriate and often misleading. Building data vizzes is hard work.

    I took a class from Tufte. It was really interesting, but it showed me how much work is required to build a good visualization. I haven't really managed one myself, but I do think about it.

  • Miles Neale (8/6/2015)


    Thanks Steve, I have been working with both Power Pivot and another OpenData Portal Product that provides machine readable data to the public for our state. We have been publishing databases, maps, charts, and other visualizations out for public consumption.

    With the move in the industry today to push data out into the public domain, there are some very interesting datasets coming to the surface that data analysts can look into. Not plugging my state, but Oregon has its data out on https://data.oregon.gov/ and its collection of datasets is rather extensive. Or closer to your home is https://data.colorado.gov/, the Colorado Data Marketplace.

    If you take a few minutes you can find a dataset out in the public arena that is interesting and that can build your analytical skills, and along the way it can be fun.

    Thanks, need to check out my CO data.

  • casachs 74147 (8/6/2015)


    The problem with amateur data analysis is that the person doing the analysis doesn't necessarily have the full business or operational context necessary to do a full analysis of the data.

    Incorrect. The problem is only with making decisions with an incomplete analysis. Certainly amateurs may not have all the information, but making an attempt, looking at data and trying to draw conclusions is how we learn. When others point out potential flaws, we can get better at both the topic itself (the context of the data) as well as our presentation skills.

    Certainly we could argue that some of what you've written is incorrect or out of context as well.

  • Steve Jones - SSC Editor (8/6/2015)


    casachs 74147 (8/6/2015)


    The problem with amateur data analysis is that the person doing the analysis doesn't necessarily have the full business or operational context necessary to do a full analysis of the data.

    Incorrect. The problem is only with making decisions with an incomplete analysis. Certainly amateurs may not have all the information, but making an attempt, looking at data and trying to draw conclusions is how we learn. When others point out potential flaws, we can get better at both the topic itself (the context of the data) as well as our presentation skills.

    Certainly we could argue that some of what you've written is incorrect or out of context as well.

    I have to disagree with you Steve.

    You're making it sound like it's okies for anyone to analyse the data because it's so easy for any Joe Blow to do so. Let me ask you something, is it fine for me to allow an amateur to do whatever they want in your enterprise data warehouse that is supporting a multi-million dollar product? Is it fine to give them all SA and just have it in production and so forth?

    Na, likely not. They don't have that experience yet. They don't have what's needed just yet to be trusted enough to give them that level of responsibility even with your guidance as a mentor.

    Data analysts are the same and have to earn the same level of responsibility. They are there to analyse the data so we can hopefully extract answers to critical business questions and make critical business decisions from the data.

    When doing that, time, money and even clients are sometimes on the table. That means if the analysis is wrong, you could cost the company time, money and a client much like you could with letting anyone just run in your production environment that is supporting a multi-million dollar product.

    You see, I work with a data scientist as a data engineer. He has to conduct analysis on the data I help provide so teams, management and even clients can make critical business decisions. As a data scientist, he is depended on to have the business knowledge and domain experience to ensure he is not wasting time, not wasting the client's money in the time spent on the analysis and, of course, not going to cause someone to make a poor decision that costs the client or the business millions of dollars.

    While it's nice to think the buck stops with the people making the decisions, who should catch an incomplete or bad analysis, many do not. They depend on the analyst, much like someone may depend on you as the DBA, and have faith in what the analyst brings. That faith is usually based on business knowledge and domain experience, on knowing, "Hey, I know the business, I know the data and here is my analysis on something that matters."

    You either come with completion and faith or you don't come at all. :hehe:

  • xsevensinzx (8/10/2015)


    I have to disagree with you Steve.

    You're making it sound like it's okies for anyone to analyse the data because it's so easy for any Joe Blow to do so. Let me ask you something, is it fine for me to allow an amateur to do whatever they want in your enterprise data warehouse that is supporting a multi-million dollar product? Is it fine to give them all SA and just have it in production and so forth?

    You're making a leap, and please disagree. That's the point here. I can perform an analysis as an amateur and we argue and debate the facts. We look at where I may or may not have made incorrect assumptions or had issues. This is amateur analysis, as the examples I showed have done. Not professional analysis inside of your company.

    You are taking this as black or white. Either you are qualified and trusted and do it, or you don't. Either you have all access or none. That's not the real world. We have shades of gray, and certainly the data scientists make mistakes as well.

    My point was to choose a set and start to analyze it. Not state you are 100% correct in your findings, but to ask questions and seek to improve your analysis (And visualization skills).

  • Steve Jones - SSC Editor (8/11/2015)


    xsevensinzx (8/10/2015)


    I have to disagree with you Steve.

    You're making it sound like it's okies for anyone to analyse the data because it's so easy for any Joe Blow to do so. Let me ask you something, is it fine for me to allow an amateur to do whatever they want in your enterprise data warehouse that is supporting a multi-million dollar product? Is it fine to give them all SA and just have it in production and so forth?

    You're making a leap, and please disagree. That's the point here. I can perform an analysis as an amateur and we argue and debate the facts. We look at where I may or may not have made incorrect assumptions or had issues. This is amateur analysis, as the examples I showed have done. Not professional analysis inside of your company.

    You are taking this as black or white. Either you are qualified and trusted and do it, or you don't. Either you have all access or none. That's not the real world. We have shades of gray, and certainly the data scientists make mistakes as well.

    My point was to choose a set and start to analyze it. Not state you are 100% correct in your findings, but to ask questions and seek to improve your analysis (And visualization skills).

    Well, it was more along the lines of stating that you don't need to know anything about the entity you are performing the analysis on. This is where I would fully disagree. Knowledge of what you are analyzing is pretty key to performing a good analysis. Otherwise, how would you know what you are doing is even remotely in the right direction or helping you learn?

    I mean, if you're going to justify throwing random darts (i.e.: your analysis) at a dart board (i.e.: value) as a valid tactic, then where do we start drawing the line?

    Sounds like a lot of wasted effort for nothing because you have no idea what direction to take. Part of getting better as an amateur is knowing what direction to take before you take it and adjusting course based on the results you unearth. Not just doing random analysis on what you think may be important because you have no knowledge of the business or domain.

    For example, my industry is digital advertising. An amateur analyst may discover a 90% drop in impressions, clicks and sales in a certain region. He/she may report on this decrease in activity and sales. Then others may believe something is not working.

    Yet in reality, the drop is due to a certain vendor being dropped from that particular region, and the results found are just the trickle of data still arriving from the vendor after they were pulled.

    The analyst didn't know enough about the business to attribute the drop to anything other than a guess that maybe something bad is happening. Because he/she didn't know that, the person he/she reported to thought they knew better and assumed it was just the ads performing badly, etc.

    So, while it's cool to simply play with numbers, you still need that knowledge to at least validate your findings and correct your direction.

  • I agree you need to know something, but what is that bar? Can I analyze college spending? I think I can, though there are probably things I don't realize about the numbers. However, if someone points out a problem with my hypothesis, we debate it, and I change and improve, that's progress.

    Can I analyze drug test results? No. I have no concept of how to even approach the data, so I'd have to seek some help to understand what the data means. However I could go from there with some help. Or I could help someone with knowledge learn to put together a report or analysis with my skills in working with data.

    I would say there are lots of data sets on which those of us working with data could begin an analysis.

