A Good Use for Hadoop

  • David Grover (2/25/2016)


    There's an assumption that "dumping it into Hadoop" is the equivalent of "throwing junk mail on the pile for later" or "copying the contents of my desktop to the corporate file share." It's not, or doesn't have to be. You can certainly do that with a Hadoop cluster, but to make actual use of it you still need to structure the data. Even if it's only enough for the MapReduce algorithm to get some traction, you still need some structure. Usually that structure is some variation of a dimensional warehouse model, often in "super transaction" form.

    David, I couldn't agree with you more. What I explained in my earlier post is the way that too many people look at Hadoop. They probably get the idea that I'm against Hadoop, but I'm not. I'm fully in favor of using the right tool correctly for the right job.

    Now, people think there's some magical way to turn a data lake into profit by waving the Hadoop wand at it, especially marketing people. This is much the same kind of thinking you see in application developers who believe that if they're just allowed to use flexible-schema systems like Mongo or Couch, they can do JIT (just-in-time) semantics.

    You also identified the issue that too many people think of Hadoop as magic that somehow works without effort. Any data project of any complexity requires quite a bit of work regardless of the database platform. You have to put structure in the data at some point to get any value out of it, as the toy sketch below illustrates.
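
    To make that concrete, here's a toy MapReduce-style sketch in Python. The pipe-delimited log layout is a made-up example, not anything from the editorial; the point is that the map step can only emit a grouping key because each line splits into known fields:

        from collections import defaultdict

        # Hypothetical semi-structured log lines: "timestamp|user_id|action".
        log_lines = [
            "2016-02-25T09:00|u42|play",
            "2016-02-25T09:05|u42|pause",
            "2016-02-25T09:07|u17|play",
        ]

        def map_phase(line):
            # Without a known field layout, there is no key to group on.
            _, user_id, action = line.split("|")
            return (action, 1)

        def reduce_phase(pairs):
            # Sum the counts emitted by the map phase, grouped by key.
            totals = defaultdict(int)
            for key, count in pairs:
                totals[key] += count
            return dict(totals)

        print(reduce_phase(map_phase(line) for line in log_lines))  # {'play': 2, 'pause': 1}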

  • Eric M Russell (2/25/2016)


    Steven.Grzybowski (2/25/2016)


    Eric M Russell (2/25/2016)


    If we divide 1.5 PB (1,500,000,000,000,000 bytes) by the total number of global NetFlix accounts (60,000,000), that equals ~25 MB of event data gathered daily per account. Also, if we assume that 1/2 of those accounts log in and stream video on any given day, then that's ~50 MB of event data per active account daily.

    Does this mean that NetFlix is gathering a meaningful fraction of what it streams back from its subscribers? Maybe we should be putting black masking tape over our webcam while watching NetFlix, because I know I'm not clicking 50 MB worth of event data while sitting back on the sofa.

    I would guess a huge portion of that data is not so much from users as diagnostic data from their AWS infrastructure, probably some information about the quality of the streaming, etc. I would imagine Netflix would want some data about which regions are encountering latency issues due to AWS data center locations, outages, and all that wonderful stuff.

    Here is how NetFlix describes the nature of the event data. At the top is something labeled "Video viewing activities".

    There are several hundred event streams flowing through the pipeline. For example:

    - Video viewing activities

    - UI activities

    - Error logs

    - Performance events

    - Troubleshooting & diagnostic events

    Note that operational metrics don’t flow through this data pipeline. We have a separate telemetry system Atlas, ...

    Did not see that. I am at work, and don't have full internet access. That is a bit disconcerting.
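
    For reference, Eric's division above works out as a quick back-of-the-envelope check in Python (the 1.5 PB/day and 60 million account figures come from the discussion; the 50% active assumption is Eric's):

        # Back-of-the-envelope check of the per-account event data volume.
        daily_event_bytes = 1.5e15    # ~1.5 PB of event data per day
        total_accounts = 60_000_000   # global NetFlix accounts
        active_fraction = 0.5         # assume half the accounts stream on a given day

        per_account = daily_event_bytes / total_accounts
        per_active = daily_event_bytes / (total_accounts * active_fraction)

        print(f"{per_account / 1e6:.0f} MB per account per day")         # ~25 MB
        print(f"{per_active / 1e6:.0f} MB per active account per day")   # ~50 MB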

  • Eric M Russell (2/25/2016)


    ...Maybe we should be putting black masking tape over our webcam while watching NetFlix...

    Please do not elaborate!!!

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Michael Meierruth (2/25/2016)


    Maybe the birth of Hadoop was motivated by the Netflix Prize contest back in 2006.

    I dove into this contest with immense passion, and I remember doing lots of things with SQL.

    But the hardcore stuff was done with low level code.

    Boy was this fun stuff!

    Dealing with 100 million rows is never easy or quick.

    What's the largest number of rows you've ever had to deal with in the real world using SQL?

    Hadoop was alive in 2006. It actually came out of Yahoo, and research based on what Google published about MapReduce and web crawling. That was the original idea: https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/

  • Jeff Moden (2/25/2016)


    Apparently, all that predictive analysis with a PB per day didn't do squat for Netflix. Their stock dropped from a high of ~131 on 4 Dec 2015 to ~83 just two months later. That's a 37% drop in just two short months.

    To coin a phrase, "Just because you can store everything, doesn't mean you should". It also continues my belief that, unless management actually listens to what the numbers are predicting, BI is an oxymoron. πŸ˜‰

    Stock price isn't necessarily causally related to their technology. I wouldn't use one as a sign of the other. Stock price is driven by outside influences as often as inside ones.

  • Steven.Grzybowski (2/25/2016)


    Eric M Russell (2/25/2016)


    If we divide 1.5 PB (1,500,000,000,000,000 bytes) by the total number of global NetFlix accounts (60,000,000), that equals ~25 MB of event data gathered daily per account. Also, if we assume that 1/2 of those accounts log in and stream video on any given day, then that's ~50 MB of event data per active account daily.

    ...

    I would guess a huge portion of that data is not so much from users as diagnostic data from their AWS infrastructure, probably some information about the quality of the streaming, etc. I would imagine Netflix would want some data about which regions are encountering latency issues due to AWS data center locations, outages, and all that wonderful stuff.

    I think this is correct. Reading a few of their blog posts, this is a lot of infrastructure data as well as user data.

  • One domain that works with large scale is digital marketing.

    In my own environment, I push around 1 billion+ records, with over 30 million records ingested a day, on a single on-premises instance. This data gets heavy use because it's log-level data used for analytic reporting, where users typically query hundreds of millions of records daily. Due to the large read sizes and the more advanced computations on those reads, SQL Server struggles a lot as a single VM with only so much CPU and memory.

    Because of that, MPP (massively parallel processing) infrastructures are very popular for splitting those massive reads and computational problems across multiple machines. Azure, AWS, or just your own NoSQL cluster is well suited to handling the workload of these so-called "Big Data" problems.

    My system handles a billion records well in terms of ingestion and data management. The shortcoming is processing data at that scale. The only option is to constantly find ways to partition and summarize the data, but that's easier said than done. So I'm looking into Azure SQL Data Warehouse as my new data warehousing solution in the cloud, and I will likely move away from on-premises because it's just not flexible enough.
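
    As an illustration of the partition-and-summarize idea, here's a minimal sketch in Python with pandas; the column names and the daily grain are assumptions for the example, not the actual schema:

        import pandas as pd

        # Hypothetical log-level fact data: one row per marketing event.
        events = pd.DataFrame({
            "event_time": pd.to_datetime([
                "2016-02-25 09:15", "2016-02-25 09:20", "2016-02-26 11:05",
            ]),
            "campaign_id": [101, 101, 202],
            "clicks": [1, 0, 1],
            "impressions": [1, 1, 1],
        })

        # Partition by day and summarize to a daily grain: reports that would
        # otherwise scan hundreds of millions of raw rows can read a small rollup.
        daily_summary = (
            events
            .assign(event_date=events["event_time"].dt.date)
            .groupby(["event_date", "campaign_id"], as_index=False)[["clicks", "impressions"]]
            .sum()
        )

        print(daily_summary)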

  • I've found Netflix's rating predictions to be very accurate. It's enough that if they predict 3 or fewer stars for me, I won't bother watching the movie.

    I'm sure they're up to all sorts of things, e.g. if they're promoting a new show, they're likely measuring how long I leave the ad on screen, whether I let the trailer or show auto-play (and for how long), and whether I click on the show in any way (browse the episodes, etc.)

    Leonard
    Madison, WI

  • Steve Jones - SSC Editor (2/25/2016)


    Jeff Moden (2/25/2016)


    Apparently, all that predictive analysis with a PB per day didn't do squat for Netflix. Their stock dropped from a high of ~131 on 4 Dec 2015 to ~83 just two months later. That's a 37% drop in just two short months.

    To coin a phrase, "Just because you can store everything, doesn't mean you should". It also continues my belief that, unless management actually listens to what the numbers are predicting, BI is an oxymoron. πŸ˜‰

    Stock price isn't necessarily causally related to their technology. I wouldn't use one as a sign of the other. Stock price is driven by outside influences as often as inside ones.

    That's EXACTLY the point I was trying to make. They do very well on predicting which movies customers will like but all that technology hasn't been applied to predicting if they'll continue to have customers. πŸ˜‰

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    During a recent family road trip, my kids asked to keep the iPhone with them in the back seat so they could watch NetFlix. Over the course of two days, the data overage charges totaled $130. I'm guessing I ended up paying at least $50 of that for event data. Adding insult to injury, NetFlix is now recommending stuff like 'Scooby Doo' and 'Beverly Hills Chihuahua' based on my recent viewing activity. :crying:

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • We experimented with Hadoop to build a personalised product sort order.

    5 million customers × 250k product variations produced a huge dataset.

    We had an impressive POC, and we were experimenting with Apache Spark in an attempt to generate the results in the required time frame.

    I asked the question, "How many unique combinations of attributes can we actually have?" This turned out to be considerably less than the 1.25 trillion records we originally considered. Then we did some cluster analysis and verification and found that, far from 5 million customers, we had 45 distinct classifications. All of a sudden a Hadoop-sized problem became a Postgres-sized problem.

    As to how often you query huge recordsets: it depends what you do with data that lands in Hadoop. A merge-up process that takes 30 single-day files and merges them into one thirty-day file can yield big performance benefits, as sketched below.
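
    A minimal sketch of that merge-up in Python (the date-stamped CSV naming is an assumption for illustration):

        import glob

        # Hypothetical layout: thirty single-day extracts named like
        # events_2016-02-01.csv ... events_2016-03-01.csv in the working directory.
        day_files = sorted(glob.glob("events_2016-*.csv"))

        # Merge the daily files into one thirty-day file. Fewer, larger files
        # mean fewer map tasks and less per-file overhead on the next scan.
        with open("events_30day.csv", "w") as merged:
            for i, path in enumerate(day_files):
                with open(path) as f:
                    header = f.readline()
                    if i == 0:
                        merged.write(header)  # keep the header row once
                    merged.writelines(f)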

    I find this kind of thing all the time. There's a use case for Hadoop, but most of the time people want to throw a huge amount of time and money at a problem that could be solved much faster and more simply with a little thought.

    Ten years ago a TB of data was a lot of data, but it was still queryable, indexable, and highly manageable. Now it's pretty standard. Would I put PB of data into SQL Server or Oracle RDBMSs? No, but do I really, seriously need all of those PB? Figure that out first before spending a lot of effort on something that's going to end up being SQL on MapReduce anyway.

  • Eric M Russell (2/25/2016)


    ...NetFlix is now recommending stuff like 'Scooby Doo' and 'Beverly Hills Chihuahua' based on my recent viewing activity. :crying:

    Use a web browser to log in to NetFlix.com; there is a page for editing the previously viewed items that are used as the basis for recommendations.

    I had EXACTLY the same problem, and whilst I was happy to put up with the investigating gang, I was far from happy with being recommended the compact Californian canine.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Gary Varga (2/26/2016)


    Use a web browser to log in to NetFlix.com; there is a page for editing the previously viewed items that are used as the basis for recommendations.

    I had EXACTLY the same problem, and whilst I was happy to put up with the investigating gang, I was far from happy with being recommended the compact Californian canine.

    Thanks for the info, Gary! I'll be getting Netflix in the not-too-distant future, basically when Daredevil season 2 starts up (my wife was turned on to it and has become addicted). We're gearing up to ditch our satellite TV, for which I bought an Apple TV back in November. I'm waiting for Amazon Prime and BBC America to get their apps up. But since there's no new Dr. Who or Sherlock for 2016, who knows what'll happen.

    -----
    Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. --Samuel Johnson

    ~50 MB of clickstream data per NetFlix user every day should still raise some eyebrows. When I read about these PB and EB scale Big Data projects, I think about the earliest construction architects, the ancient Egyptians, building their pyramids by stacking one three-ton block on top of another, with the end result serving no purpose other than glory, or to prove to themselves that it could be done. Contrast that with modern-day skyscrapers, which are far more agile and practical. The field of Big Data isn't at that point in its evolution yet; they're still just stacking PB to see who can build the grandest monument.

    There are a variety of other video streaming websites out there which offer a similar experience (if not the same breadth of content), and they probably don't bother collecting and warehousing clickstream data. Does that 1.5 PB of data NetFlix collects daily from its users really give them a competitive edge, or are they just hoarding data?

    I know there are some highly relevant projects, like genomics and weather forecasting, that rely on the analysis of huge volumes of data while benefiting society in general, but do many of these Big Data projects, especially those in the commercial and mass-marketing sector, really matter?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

