Data Has a Dollar Value


  • Platforms like Azure have lowered the barrier to entry for companies that want to perform analytics, but we still need to consider whether it makes sense for most organizations to invest in their own analytics projects: there is still the need to load data, perhaps purchase additional datasets, and then hire staff or consultants with expertise in data science. When it comes to performing analytics on externally generated or public datasets (social media, video, IoT, newsfeeds, etc.), it probably makes more sense for most organizations to go the route of analytics as a service (AaaS). If you're in the business of manufacturing bicycles and you want to keep tabs on how your brand is trending on social media, you don't need a data lake and a data scientist; just let a company like Microsoft, Google, or Facebook crunch the numbers for you using the 100,000 servers and 1,000 analysts they already have on staff.
    Just ask Google: "How is my company trending on social media for Q2 2018?"
    Google already knows the answer to that question, in addition to 10,000 other metrics about your company. That's true analytics as a service.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • OK, both in your article, Steve, and in the video short you pointed to, the phrase "data labels" (or maybe "labeled data") was used. I've never heard that phrase. What does it mean?

    Kindest Regards, Rod
    Connect with me on LinkedIn.

  • It cannot be overstated just how valuable your data is. Entire mega-corporations live solely off "free" services offered in exchange for your voluntarily, and sometimes involuntarily, provided data. I wonder what it would take to get people to see that value; so far nothing is clicking. Even from a single piece of data such as your location, it is relatively easy to infer who you are, where you live, where you work, who you associate with, your daily patterns, who your mistress is, etc.

  • Used plastic bottles are valuable to a recycling company, while for most people (and companies) they're just a byproduct of daily life. For data to have a dollar value, there must be a business model in place. The problem with personal data is that people don't think about how it's being accumulated and used.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Rod at work - Monday, August 20, 2018 9:01 AM

    OK, both in your article, Steve, and in the video short you pointed to, the phrase "data labels" (or maybe "labeled data") was used. I've never heard that phrase. What does it mean?

    Metadata about your data. In the ML world, you might have data attached to other data. Say we have all these avatar pictures here at SSC. There's a name attached to each, which can be the label. This helps a system (or person) trying to learn how to identify a person.

    You can do this for all sorts of data. You might have a set of purchases from a user and label different items as repeat purchases or one-off purchases. Those labels would then be used later for additional analysis.

    These are really the ways we decide how to view data. Often these are attributes we already have in databases, but they might also be new types of data we think about adding.
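
    Roughly, in code (a quick Python sketch; the avatar files, names, and purchase rows here are invented for illustration):

        # A labeled dataset pairs each observation with a label.
        # Hypothetical example: avatar images identified by name.
        labeled_avatars = [
            {"image": "avatar_001.png", "label": "Steve Jones"},
            {"image": "avatar_002.png", "label": "Jeff Moden"},
        ]

        # Labels can also be derived attributes, such as purchase behavior.
        purchases = [
            {"user": 101, "item": "coffee", "count": 52},
            {"user": 101, "item": "laptop", "count": 1},
        ]

        # Tag each purchase as "repeat" or "one-off" for later analysis.
        for p in purchases:
            p["label"] = "repeat" if p["count"] > 1 else "one-off"

        print(purchases)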

  • Eric M Russell - Monday, August 20, 2018 9:41 AM

    Used plastic bottles are valuable to a recycling company, while for most people (and companies) they're just a byproduct of daily life. For data to have a dollar value, there must be a business model in place. The problem with personal data is that people don't think about how it's being accumulated and used.

    They don't know. I constantly see new uses of data I had no idea existed or had never thought about. We also have no idea how this stuff changes at scale. It's one thing to have the names of everyone who owns property in a book at the local government office. It's quite another when anyone in the world can access and scrape all that data from every office, from anywhere.

  • Let me see if I get this:
    1. We want to use mass analytics tools, but
    2. we don't have enough data for the analysis to be statistically significant, so
    3. we MAKE UP the data to have enough volume for the mass analytics to work.

    Anyone else see the problem with this approach?
    You STILL don't have enough data, and frankly you've now tilted the model with fake data. How is that defensible? I'm not sure who the data scientist is that's signing off on these, but that really sounds suspect.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user. All right - what was my emergency again?

  • Matt Miller (4) - Monday, August 20, 2018 1:50 PM

    Let me see if I get this:
    1. We want to use mass analytics tools, but
    2. we don't have enough data for the analysis to be statistically significant, so
    3. we MAKE UP the data to have enough volume for the mass analytics to work.

    Anyone else see the problem with this approach?
    You STILL don't have enough data, and frankly you've now tilted the model with fake data. How is that defensible? I'm not sure who the data scientist is that's signing off on these, but that really sounds suspect.

    Granny used to say it best... "Figures can lie... and liars figure". 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Matt Miller (4) - Monday, August 20, 2018 1:50 PM

    Let me see if I get this:
    ...
    Anyone else see the problem with this approach?
    You STILL don't have enough data, and frankly you've now tilted the model with fake data. How is that defensible? ...

    I'm sure you're being a little funny here, but if you research the field, this isn't just making up data. It's also not just used to justify some statistic or idea. The companies doing this are often trying to build models or conduct an analysis in some very specific area. ML/AI/deep analytics don't work in a general sense, and that's a point of failure for many projects.

    Companies have arisen that can produce data for development and testing. This isn't to validate your model, but rather to help build and train it. The validation would still happen with real data and would reveal flaws. As a simple example, companies sometimes struggle to find image data; search for the analysis of dogs and muffins to get an idea. What often happens is that I have 1TB of data. I split that 70% for training and 30% for validation. That likely gives me a good sense of whether things will work.
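
    That split, as a rough sketch (assuming Python and scikit-learn, neither of which is specified here):

        # Minimal sketch of a 70/30 train/validation split with scikit-learn.
        # X (features) and y (labels) stand in for a real dataset.
        import numpy as np
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(seed=42)
        X = rng.normal(size=(1000, 10))    # 1,000 rows, 10 features
        y = rng.integers(0, 2, size=1000)  # binary labels

        # Hold back 30% of the data for validation.
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.30, random_state=42
        )

        print(X_train.shape, X_val.shape)  # (700, 10) (300, 10)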

    However, if I have only 1GB of data, because we haven't been doing this very long or it's a new idea, we want ALL of that data for validation. I can't train with it; I need new data for that. So there are companies that will mock this up for you, knowing your problem domain and the type of data. They'll give you plenty of data to train your system. Some even guarantee that if your model can't use the data, or it doesn't represent your problem domain well, they'll redo their work.

    As with anything, there are the same problems with this that we find when developers get to build their own incomplete, small data sets. However, this has helped some companies do a better job of building intelligent (a poor word, but it's what we have) systems that work well. Medical companies have used this a bit because they don't always have all the data they need, so competent people can mock up MRIs, X-rays, etc., accounting for the variation in how real data presents conditions. The results are then tested against real data with known outcomes and evaluated. Then you go forward, or back to the drawing board.
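
    The workflow, sketched end to end (Python again; the toy data and model are my own invention, and in practice the synthetic data, e.g. mocked MRIs, would come from a vendor who knows the domain):

        # Train on plentiful synthetic data, then validate only against
        # the scarce real data, as described above.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score

        rng = np.random.default_rng(seed=0)

        # Scarce real data: too small to split, so reserve all of it for validation.
        X_real = rng.normal(size=(100, 5))
        y_real = (X_real[:, 0] + X_real[:, 1] > 0).astype(int)

        # Vendor-supplied (here: simulated) synthetic data mimicking the domain.
        X_synth = rng.normal(size=(10_000, 5))
        y_synth = (X_synth[:, 0] + X_synth[:, 1] > 0).astype(int)

        model = LogisticRegression().fit(X_synth, y_synth)

        # A poor score here sends you back to the drawing board.
        print("accuracy on real data:", accuracy_score(y_real, model.predict(X_real)))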

  • Steve Jones - SSC Editor - Monday, August 20, 2018 4:35 PM

    ...
    So there are companies that will mock this up for you, knowing your problem domain and the type of data. They'll give you plenty of data to train your system.
    ...

    I am being a little funny, but the reality still comes down to extrapolation. Having more of the same data doesn't add to your predictive ability: you would learn the same thing if you ran your 1GB file through the same filter 1,000 times. If anything, it could obscure what you didn't consider or what wasn't in your original data.
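
    You can see that in miniature (a toy NumPy check with my own made-up numbers, nothing from a real project): a least-squares fit learns nothing new from the same rows repeated.

        # Duplicating the same data adds no information: a least-squares
        # fit on the original rows and on the rows tiled 10x yields
        # identical coefficients.
        import numpy as np

        rng = np.random.default_rng(seed=1)
        X = rng.normal(size=(50, 3))
        y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

        coef_once, *_ = np.linalg.lstsq(X, y, rcond=None)
        coef_tiled, *_ = np.linalg.lstsq(np.tile(X, (10, 1)), np.tile(y, 10), rcond=None)

        print(np.allclose(coef_once, coef_tiled))  # True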

    Training a model is essentially building a model: you are shaping how the machine learning will react. Training the model with synthetic data means you're reinforcing whatever skew or bias may be present in the initial data set. If you know your data well enough to know where the variations will be, perhaps it works, but that probably means you already had access to the much larger, realistic data set.

    Just like with many other tools, if you have the skills and the maturity to know how and when to use this kind of service, I'm sure it can be useful. I've personally run into a lot more orgs that do not have the maturity for it, so it feels like a dangerous gambit. YMMV.


  • A good example of where extrapolation can hurt is trend lines. It can cause non-obvious trend lines to be totally obfuscated or discarded as a possibility, and you have to remember that more than one curve can fit partial data... well, until someone clutters it up with extrapolated data that was created with the wrong curve selected for the extrapolation.
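
    For instance (a toy NumPy sketch with invented points): a straight line and a quartic can both fit the same five observations, then land in very different places once you extrapolate.

        # More than one curve fits the same partial data.
        import numpy as np

        x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
        y = np.array([0.1, 1.1, 1.9, 3.2, 3.9])  # looks roughly linear

        line = np.polyfit(x, y, deg=1)     # best-fit straight line
        quartic = np.polyfit(x, y, deg=4)  # passes exactly through all five points

        # Inside the observed range the two models barely differ...
        print(np.polyval(line, 2.0), np.polyval(quartic, 2.0))
        # ...but extrapolating to x = 10 they disagree by hundreds.
        print(np.polyval(line, 10.0), np.polyval(quartic, 10.0))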

    --Jeff Moden



  • Matt Miller (4) - Monday, August 20, 2018 1:50 PM

    Let me see if I get this:
    ...
    3. we MAKE UP the data to have enough volume for the mass analytics to work.
    ...

    Sentiment analysis on social media feeds is perhaps the most dubious, over-hyped, and ultimately useless form of analytics. Bots, paid promoters, fake accounts, and a feedback-loop effect created by digital marketing companies have turned social media into an echo chamber.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Matt Miller (4) - Monday, August 20, 2018 6:19 PM

    ...
    Training a model is essentially building a model: you are shaping how the machine learning will react. Training the model with synthetic data means you're reinforcing whatever skew or bias may be present in the initial data set. If you know your data well enough to know where the variations will be, perhaps it works, but that probably means you already had access to the much larger, realistic data set.

    Just like with many other tools, if you have the skills and the maturity to know how and when to use this kind of service, I'm sure it can be useful. I've personally run into a lot more orgs that do not have the maturity for it, so it feels like a dangerous gambit. YMMV.

    Absolutely. This stuff is hard, and most people don't know how to do it. I think that's why having companies that build data sets for others is a good idea. Most of us who build internal systems don't understand the data.

    BTW, your point about bias is a good one. A bad data set has bias. However, the natural data we have also has bias, from the way humans have produced it through their decisions and actions. Better data is needed that helps work toward goals, whether it's natural or synthetic.

  • Jeff Moden - Monday, August 20, 2018 7:46 PM

    A good example of where extrapolation can hurt is trend lines. It can cause non-obvious trend lines to be totally obfuscated or discarded as a possibility, and you have to remember that more than one curve can fit partial data... well, until someone clutters it up with extrapolated data that was created with the wrong curve selected for the extrapolation.

    Absolutely, which is where the advanced researchers are working, trying to avoid linear analysis and too simple a view. Of course, they can overthink things as well, which is why this field is tough to work in. Extrapolation is very hard, and unless you are working on a narrow question, we often can't predict for individuals with any accuracy. However, you can often do better with groups, as long as you always work with groups and don't get caught up on individuals.
