More and More Data Growth

  • Comments posted to this topic are about the item More and More Data Growth

  • Interestingly enough, as data grows for most of us, we also have ways of dealing with it.

    For example, I have a rather large transactional data warehouse. Without notice, one of the data sources goes from 1 million records a day to an average of 20 million records a day. A few years ago, I would have been worried about how to handle that growth. Today, there are a few options that put me at ease, and I have not made any major changes to my infrastructure since then. I'm still on premises, on SQL Server 2008 R2, and so on.

    If it's not complex algorithms helping us filter the good from the bad, it's cloud infrastructure helping us absorb the workload. I can spin up 100 machines to handle a sudden increase in data, attach them to my infrastructure, and be on my way without batting an eyelash.

  • The data growth isn't coming from OLTP.

    I've been watching and participating in the Big Data world for a while and I see non-IT people bragging about their Big Data usage and the value they will (note the future tense) get out of it.

    When you dig a bit deeper, you find that the volume of data can be stated using a unit of measure called a sod-all. They are failing to get much out of their existing data, so why they think a digital deluge will help, God only knows.

    Many of their velocity problems are caused by daisy-chaining solutions that are coping strategies at best and filthy, bug-ridden midden heaps of tech debt at worst.

    IoT data is huge in human-readable format, but it's actually highly disciplined and tightly defined in scope. Want to see something amazing with column stores? Feed oodles of data from a set of sensors into one and be amazed at the compression achieved (a sketch follows at the end of this post).

    Yes, there are genuine Big Data generators and consumers out there, particularly in the hard science and high end financial sectors.

    Personally I'd be curious what people estimate they could get out of their existing data vs what they are currently getting.
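
    To make the column store point concrete, here is a minimal sketch, assuming SQL Server 2014 or later; the table, column, and index names are hypothetical:

    [code="sql"]
    -- Hypothetical sensor table; names are illustrative only.
    CREATE TABLE dbo.SensorReading
    (
        SensorId    INT          NOT NULL,
        ReadingTime DATETIME2(0) NOT NULL,
        Temperature DECIMAL(5,2) NOT NULL,
        Vibration   DECIMAL(9,4) NOT NULL
    );

    -- A clustered columnstore index stores each column separately, so the
    -- long runs of near-identical values that sensors produce compress heavily.
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_SensorReading ON dbo.SensorReading;

    -- After loading the data, compare the footprint with an equivalent rowstore table.
    EXEC sp_spaceused 'dbo.SensorReading';
    [/code]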

  • David.Poole (6/9/2016)


    Yes, there are genuine Big Data generators and consumers out there, particularly in the hard science and high end financial sectors.

    Personally I'd be curious what people estimate they could get out of their existing data vs what they are currently getting.

    Here is my bit on that question.

    There is also the marketing sector, specifically digital marketing, which deals with large amounts of data depending on your focus. I'm in advertising. In advertising, the data we deal with is mostly what we are acquiring rather than what we already hold. If you deal with large campaigns, you're talking about data that likely touches everyone in this forum community at some point in their lives. If you see an ad and can be tracked, you're in someone's database.

    The usage of that data is pretty high because it's how Google, Bing, Yahoo, and so forth tell us how we are performing. When it's not telling us how we are performing, it's being used to create new data products that support existing growth, such as the ability to make our own decisions and contribute our own data back into the ecosystem.

    That being said, I estimate we get a lot of value from the data we are acquiring. It's worth more than what we already hold because it feeds the existing data. It's the pipeline from the many data oceans out in the world into our own data reservoir. Without the pipeline, the existing supply likely dries up and the people dehydrate.

  • Gosh darn it, I'm feeling my data is pretty undersized right now :unsure:

  • David.Poole (6/9/2016)


    The data growth isn't coming from OLTP.

    True. I'm glad most of the data in my company still resides in normalized relational databases, none of which exceeds a TB in size. Everything about Big Data, IoT, etc. seems confusing and blurry to me.

  • Web analytics does generate a large volume of data relative to the OLTP activity that the web site produces. Products such as Speedtrap shred out the data, including any custom events and properties you have asked them to track, and put it into a dimensional schema. I'm not sure of the figures, but 20:1 would not surprise me.

    The trick is knowing how long to retain the information. It depends on how fast your website changes: if your website changes radically over time, then the value of the web analytics data for generation x-2 and earlier decreases rapidly.

    We know that storage capacity is cheap; storage performance less so. My experience suggests that we retain data "just in case", and we get castigated the 1 time in 100 when a need arises and the data has expired. If you quiz the castigator, you will probably find that

    a) The requirement to store the data was never stated

    b) The cost of storing the data (including HA/DR) has not been discussed outside of the IT department

    c) The castigator didn't pay.

    I've certainly seen auditing solutions (for unstated requirements) that held many multiples of the volume of the active data. Everyone was scared to sign off the purge of the data, but no one had ever actually used it! (A sketch of an age-based purge follows this post.)

    The other thing I'm becoming skeptical about is the worth of the data in terms of enabling some sort of action or decision to take place. I've seen loads of pretty graphs but beyond being aesthetically pleasing I'm not sure what use the information they represented was.
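
    As promised above, here is a sketch of the age-based purge, assuming a hypothetical dbo.WebAnalyticsFact table with an EventDate column; deleting in batches keeps the transaction log and blocking manageable:

    [code="sql"]
    -- Keep a rolling two-year window; everything older is purged in batches.
    -- Table and column names are hypothetical.
    DECLARE @cutoff DATE = DATEADD(YEAR, -2, CAST(GETDATE() AS DATE));

    WHILE 1 = 1
    BEGIN
        DELETE TOP (50000)
        FROM dbo.WebAnalyticsFact
        WHERE EventDate < @cutoff;

        IF @@ROWCOUNT = 0 BREAK;   -- nothing older than the window remains
    END;
    [/code]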

  • While reading, I noted that the idea of filtering data before storage reaches into some hobbies. In model rocketry, for instance, there are inexpensive altimeters that read and record altitude, and sometimes g-force, multiple times per second. But again, they are filtering out noise. Part of that filtering happens before storage, since the events they were initially looking for relied on good reads and needed the filter.

    In reading the comments, it occurs to me that more and more businesses claim to need data that would have been unfathomable not so long ago. Further, businesses are not doing proper analysis of said data. They extract something they believe might be meaningful, throw it on a chart that looks right, and eyeball some conclusions. Meanwhile they pick out "significant" events that in reality have no statistical significance, as in, the event lies within one standard deviation of the mean. (A sketch of the check I mean follows this post.)

    Perhaps it all lies in people searching for the event my teachers insisted would happen: that computers would one day make my life easy. They certainly have given me a good career, but easy just doesn't seem to come to mind.
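
    For what it's worth, here is the check I mean, sketched in T-SQL with hypothetical table and column names: call an event "significant" only when it falls more than two standard deviations from the mean.

    [code="sql"]
    -- Hypothetical table: dbo.DailyMetric (MetricDate DATE, DailyValue DECIMAL(18,2)).
    WITH stats AS
    (
        SELECT AVG(DailyValue)   AS MeanValue,
               STDEV(DailyValue) AS StdevValue
        FROM dbo.DailyMetric
    )
    SELECT m.MetricDate, m.DailyValue
    FROM dbo.DailyMetric AS m
    CROSS JOIN stats AS s
    WHERE ABS(m.DailyValue - s.MeanValue) > 2 * s.StdevValue;
    [/code]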

  • David.Poole (6/9/2016)


    The trick is knowing how long to retain the information. It depends on how fast your website changes: if your website changes radically over time, then the value of the web analytics data for generation x-2 and earlier decreases rapidly.

    The web analytics I work with are typically stored for the past two fiscal years to support year-over-year comparisons (a sketch follows at the end of this post). This is pretty much the standard for high-end websites across multiple verticals.

    David.Poole (6/9/2016)


    The other thing I'm becoming skeptical about is the worth of the data in terms of enabling some sort of action or decision to take place. I've seen loads of pretty graphs but beyond being aesthetically pleasing I'm not sure what use the information they represented was.

    Worth of data that is driving a conversion such as a purchase, a download, a signup? It's pretty critical. Sure, people make pretty graphs to show performance. That is just that: showing what has happened. It's not the same as leveraging data to show why it's happening. The why goes beyond a pretty visual.

    Most of the web data I leverage is not used just in silos. It's leveraged with everything else supporting that web property. You're talking about on-site analytics versus off-site analytics, merging them together and trying to understand, holistically, what is truly driving what. That's a whole heap of data and analytics.

    If you find yourself wondering what use your data is around web analytics, you're likely doing it wrong or simply don't understand the data and its purpose just yet. It's very easy not to see the value in basic performance metrics. You've got to go beyond them in order to unearth what is truly happening and, most importantly, why.
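
    As an illustration of the year-over-year comparison mentioned above, here is a sketch assuming a hypothetical dbo.WebSession fact table with a SessionDate column:

    [code="sql"]
    -- Month-by-month sessions for the current and prior year; this is why
    -- two fiscal years of history is the usual retention floor.
    SELECT DATEPART(MONTH, SessionDate) AS MonthNumber,
           SUM(CASE WHEN YEAR(SessionDate) = 2016 THEN 1 ELSE 0 END) AS Sessions2016,
           SUM(CASE WHEN YEAR(SessionDate) = 2015 THEN 1 ELSE 0 END) AS Sessions2015
    FROM dbo.WebSession
    WHERE SessionDate >= '20150101'
    GROUP BY DATEPART(MONTH, SessionDate)
    ORDER BY MonthNumber;
    [/code]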

  • kiwood (6/9/2016)


    While reading, I noted that the idea of filtering data before storage reaches into some hobbies. In model rocketry, for instance, there are inexpensive altimeters that read and record altitude, and sometimes g-force, multiple times per second. But again, they are filtering out noise. Part of that filtering happens before storage, since the events they were initially looking for relied on good reads and needed the filter. ...

    I wonder how much my wife's telescope's sensors filter jitter. They have two monitors that are constantly updating strip charts of telescope telemetry, including current draw. For example, if current draw spikes unusually high during an azimuth transition, it might indicate a build-up of moth carcasses on the rotator collar that need to be cleaned off.

    I know astronomers do huge amounts of post-processing when they examine the results of observing an object. When the Sloan Digital Sky Survey first got going a decade ago, it collected so much data that it was streamed directly to 5-6 DLT tapes, and the tapes were shipped every day to U of Washington for download and analysis. I'd love to know what they did in Washington with that amount of info coming in daily.

    -----
    [font="Arial"]Knowledge is of two kinds. We know a subject ourselves or we know where we can find information upon it. --Samuel Johnson[/font]

  • A few years ago we were out to dinner with some old friends from California, and my wife's friend's husband turned to me and informed me that a glass of wine a day has been proven to be good for you. He then showed me a bottle of pills where each one gave you 50 times the recommended daily dosage because... that's even better, right? Smart, huh? I changed the subject.

    Most businesses and governments don't even use what little information they do have effectively, let alone a veritable mountain of it; besides, data is just an input, and a useless one if the model sucks. The more cable channels I got over the years, the more time I spent trying to find something to watch. I don't watch TV anymore. Coincidence? Maybe. Now I just buy something if I want to watch it.

    Logging and auditing are an easier fit for most organizations, but they're an expense rather than a potential revenue generator like BI might be (if done right), so the marketing message doesn't emphasize them, even though they're probably a more practical justification for dramatically expanding storage.

  • I predict that within 5 years the Internet of Things, Social Media, and Big Data will be passé. Companies just can't find a way to build a profitable business model around them.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • xsevensinzx (6/9/2016)


    Worth of data that is driving a conversion such as a purchase, a download, a signup? It's pretty critical.

    Yes, I understand that, and having a link between web analytics data and OLTP data is pretty darn essential.

    I get that it's important to identify behaviour indicating a person is almost ready to buy, and then to identify the action we can take to tilt them into making the purchase.

    I am also aware of the risks of coincidental correlations: excited marketeers pinning their hopes on a correlation they have found, only to find it doesn't hold true when they implement something based on the observed relationship. It's the old saw about 50% of all advertising spend being wasted, but no one being able to identify which 50%. (A sketch of the out-of-sample check I have in mind follows this post.)

    There's also the bit where advertising execs claim every sale as a result of advertising spend. I've known someone brag about a 13% increase in sales and get a massive bonus; only later did someone figure out that the market had grown by 16%, so we had actually lost share and were in a worse position than if we had done nothing.
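
    Here is a sketch of the out-of-sample check, with hypothetical table and column names: compute the correlation on one period, then re-run it on a later period before believing it.

    [code="sql"]
    -- Hypothetical table: dbo.DailyPerformance (MetricDate DATE, AdSpend FLOAT, Sales FLOAT).
    -- Pearson r from aggregates; a strong r on the fit period means little
    -- unless it also holds when re-run on a later, out-of-sample period.
    SELECT (COUNT(*) * SUM(AdSpend * Sales) - SUM(AdSpend) * SUM(Sales))
           / ( SQRT(COUNT(*) * SUM(AdSpend * AdSpend) - SUM(AdSpend) * SUM(AdSpend))
             * SQRT(COUNT(*) * SUM(Sales * Sales)     - SUM(Sales)   * SUM(Sales)) )
           AS PearsonR
    FROM dbo.DailyPerformance
    WHERE MetricDate < '20160101';   -- fit period; change the filter to validate
    [/code]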

  • David.Poole (6/9/2016)


    xsevensinzx (6/9/2016)


    Worth of data that is driving a conversion such as a purchase, a download, a signup? It's pretty critical.

    There's also the bit where advertising execs claim every sale as a result of advertising spend. I've known someone brag about a 13% increase in sales and get a massive bonus; only later did someone figure out that the market had grown by 16%, so we had actually lost share and were in a worse position than if we had done nothing.

    Times are changing due to the way we are able to track users. We can surely tie more and more people to the ad-spend dollar in a single channel and across multiple channels. Google is everywhere.

  • I totally agree with the statement that hardware sensors must make decisions on what data to capture.

    I am mostly involved in industrial automation, where it is a real-world problem. It is easy for a supervisor or production manager to adjust reporting results in an Excel spreadsheet before presenting them at a daily or monthly meeting. When data are fed directly from the control system into a database and reports are always available to employees via web-based reporting tools, data integrity becomes very important.

    At the moment we are in the fourth industrial revolution (https://en.wikipedia.org/wiki/Industry_4.0), with a major shift into IoT and cloud computing. “Smart” sensors that can think the way humans do will be a key part of this revolution. These sensors must have multiple sensing abilities (optics, thermal, sound, vibration, etc.) as well as machine learning (AI) algorithms.

    They must also be very cheap, so that they can be installed in thousands of factories across the world.

    Then the burden of cleaning and verifying data will not rest solely with the server and/or database. (A sketch of that kind of filtering follows.)
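
    As that sketch, here is a simplified deadband filter expressed in T-SQL (names are hypothetical, and on a real smart sensor this logic would run in firmware before anything is transmitted): keep a reading only when it differs from the previous one by more than a tolerance.

    [code="sql"]
    -- Hypothetical table: dbo.RawReading (SensorId INT, ReadingTime DATETIME2, Temperature DECIMAL(5,2)).
    -- Requires SQL Server 2012+ for LAG.
    DECLARE @tolerance DECIMAL(5,2) = 0.5;

    WITH ordered AS
    (
        SELECT SensorId, ReadingTime, Temperature,
               LAG(Temperature) OVER (PARTITION BY SensorId
                                      ORDER BY ReadingTime) AS PrevTemperature
        FROM dbo.RawReading
    )
    SELECT SensorId, ReadingTime, Temperature
    FROM ordered
    WHERE PrevTemperature IS NULL                          -- always keep the first reading
       OR ABS(Temperature - PrevTemperature) > @tolerance; -- and any change beyond the deadband
    [/code]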
