A Good Use for Hadoop

  • Comments posted to this topic are about the item A Good Use for Hadoop

  • The reason Netflix needs to retain all this data is to be able to use data mining algorithms to predict future movie selection. So the more data, the better. Because they don't know much about us (besides our sex, age, location, the type of credit card we are using and the movies we choose (or not)), they need a huge amount of data to be able to give a fairly good prediction of the kind of movies most of us will watch to the end, or the kind of series that are likely to get us hooked.

  • That makes sense, but I think after having rated about 800 movies they should have a pretty good idea of what I like.

  • I know that most of the "big data" pundits point to this as an example of just how great Hadoop is at quickly storing tons of data, and I certainly can't argue with that. But, to my mind, that's akin to dumping loose change into a 55-gallon barrel. You can quickly and easily "store" lots of coins, but you have no idea what you have until you search through it, and the only way to do that effectively involves looking at each coin.

    Let's say that you want to identify how many quarters you have with a date prior to 2000. You could have a group of people each take a bucket of coins and look through them in parallel (a sharded database), but they still have to look at each coin. Then you would need someone to collect the count from each person with a bucket and tabulate the final count (the master node). This entire process is required each time you wish to perform a query.

    Contrast that with the RDBMS method where an ETL process would determine where to store the coin based upon its characteristics. It is slower, but to continue the example, one or more people would look at the coins coming in and would put quarters in one bucket, nickels in another, etc. As the coins are put into the bucket, their characteristics are noted (date for example). Then when you want to know just how many quarters you have with a date prior to the year 2000, it's much easier to determine.

    It all comes down to when you wish to examine the coins (or data in general): as you're storing them (the RDBMS method) or as you're querying them (the NoSQL method). Depending upon your situation, either may be viable. If you're dealing with a large influx of data and can't afford the bottleneck of an ETL process, then the Hadoop method is a logical choice. Some organizations will then perform an after-the-fact ETL process to filter and organize the data for better querying. Even with those filters in place, nothing is truly lost, because Hadoop still keeps the raw data; anything that isn't relevant at the moment is still there if it's needed later. To me that's a "best of both worlds" approach. A rough code sketch of the two approaches follows below.
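
    To make the coin analogy concrete, here is a minimal Python sketch (illustrative names only, not Hadoop's or any vendor's actual API) of the two approaches: scanning every coin at query time with parallel workers plus a final tabulation, versus noting each coin's characteristics as it arrives so the later count is cheap.

    from collections import defaultdict
    from concurrent.futures import ProcessPoolExecutor
    import random

    def make_coins(n, seed=0):
        # Coins are just (denomination, year) tuples.
        rng = random.Random(seed)
        return [(rng.choice(["penny", "nickel", "dime", "quarter"]),
                 rng.randint(1985, 2015)) for _ in range(n)]

    # "NoSQL" style: dump everything in, examine each coin at query time.
    def count_old_quarters(bucket):
        # Map step: each worker scans every coin in its own bucket.
        return sum(1 for denom, year in bucket if denom == "quarter" and year < 2000)

    def query_time_count(buckets):
        # Reduce step: the "master node" tabulates the per-bucket counts.
        with ProcessPoolExecutor() as pool:
            return sum(pool.map(count_old_quarters, buckets))

    # "RDBMS" style: examine each coin once at ingest time (a toy ETL).
    def etl_load(coins):
        # Slower on the way in: note each coin's characteristics as it is stored.
        organized = defaultdict(int)
        for denom, year in coins:
            organized[(denom, year)] += 1
        return organized

    def ingest_time_count(organized):
        # Cheap at query time: only the relevant keys are touched.
        return sum(n for (denom, year), n in organized.items()
                   if denom == "quarter" and year < 2000)

    if __name__ == "__main__":
        coins = make_coins(100_000)
        buckets = [coins[i::4] for i in range(4)]  # four people, four buckets
        assert query_time_count(buckets) == ingest_time_count(etl_load(coins))

    Both counts agree; the difference is only where the per-coin work happens.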

  • harvey.kravis 75182 (2/25/2016)


    That makes sense, but I think after having rated about 800 movies they should have a pretty good idea of what I like.

    They so often get it wrong. Perhaps that's because of skewing by tenuous links between content, taking a chance on something and finding it rubbish, or just because we watch different things depending on who is watching (by that I mean watching in groups; they do have individual profiles, of course).

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • Maybe the birth of Hadoop was motivated by the Netflix contest back in 2006.

    I dove into this contest with immense passion and I remember doing lots of things with SQL.

    But the hardcore stuff was done with low level code.

    Boy was this fun stuff!

    Dealing with 100 million rows is never easy and quick.

    What's the largest number of rows you've ever had to deal with in the real world using SQL?

  • For me, the worst I have to deal with is about 700 million rows (as of now). That grows by about 100k rows/day.

  • Michael Meierruth (2/25/2016)


    Maybe the birth of Hadoop was motivated by the Netflix contest back in 2006.

    I dove into this contest with immense passion and I remember doing lots of things with SQL.

    But the hardcore stuff was done with low level code.

    Boy was this fun stuff!

    Dealing with 100 million rows is never easy and quick.

    What's the largest number of rows you've ever had to deal with in the real world using SQL?

    Currently 2bn rows. About to start importing 1bn rows per month. Into SQL BI Edition!

  • Kyrilluk (2/25/2016)


    The reason Netflix needs to retain all this data is to be able to use data mining algorithms to predict future movie selection. So the more data, the better. Because they don't know much about us (besides our sex, age, location, the type of credit card we are using and the movies we choose (or not)), they need a huge amount of data to be able to give a fairly good prediction of the kind of movies most of us will watch to the end, or the kind of series that are likely to get us hooked.

    Crunching predictive analytics on "~500 billion events and ~1.3 PB per day" seems like a very roundabout way of arriving at conclusions that can be derived using more direct and obvious means. Geeeezz! If NetFlix wants to know what potential content users are interested in watching, then all they have to do is log what keywords are being entered into the search field that come back empty or aren't yet available for streaming. Google Analytics can also tell you what movies and TV shows folks are interested in. (A toy sketch of that kind of logging follows this post.)

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Apparently, all that predictive analysis with a PB per day didn't do squat for Netflix. Their stock dropped from a high of ~131 on 4 Dec 2015 to ~83 just two months later. That's a 37% drop in just two short months.

    To coin a phrase, "Just because you can store everything, doesn't mean you should". It also continues my belief that, unless management actually listens to what the numbers are predicting, BI is an oxymoron. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • With my own 700M rows that I deal with, the fun part is that it is currently 97% duplicates, and we haven't been given the go-ahead by our client to dedupe it. :blink:

    If we divide 1.5 PB (1,500,000,000,000,000 bytes) by the total number of global NetFlix accounts (60,000,000), that equals ~25 MB (25,000,000 bytes) of event data gathered daily per account. Also, if we assume that half of those accounts log in and stream video on any given day, then that's ~50 MB of event data per account daily.

    Does this mean that NetFlix is gathering tens of megabytes of data from each of us every day on top of what they are streaming to us? Maybe we should be putting black masking tape over our webcams while watching NetFlix, because I know I'm not clicking 50 MB worth of event data while sitting back on the sofa. (A quick arithmetic check follows this post.)

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (2/25/2016)


    If we divide 1.5 PB (1,500,000,000,000,000 bytes) by the total number of global NetFlix accounts (60,000,000), that equals ~25 MB (25,000,000 bytes) of event data gathered daily per account. Also, if we assume that half of those accounts log in and stream video on any given day, then that's ~50 MB of event data per account daily.

    Does this mean that NetFlix is gathering tens of megabytes of data from each of us every day on top of what they are streaming to us? Maybe we should be putting black masking tape over our webcams while watching NetFlix, because I know I'm not clicking 50 MB worth of event data while sitting back on the sofa.

    I would guess a huge portion of that data is not so much from users as diagnostic data from their AWS infrastructure: probably some information about the quality of the streaming, etc. I would imagine Netflix would want some data about which regions are encountering latency issues due to AWS DC locations, outages, and all that wonderful stuff.

  • There's an assumption that "dumping it into Hadoop" is the equivalent of "throwing junk mail on the pile for later" or "copying the contents of my desktop to the corporate file share." It's not, or doesn't have to be. You can certainly do that with a Hadoop cluster, but to make actual use of it you still need to structure the data. Even if it's only enough structure for the MapReduce jobs to get some traction, you still need some. Usually that structure is some variation of a dimensional warehouse model, often in "super transaction" form.

    Now, people think there's some magical way to turn a data lake into profit by waving the Hadoop wand at it. Especially Marketing people. This is much the same kind of thinking you see in application developers who think that if they're just allowed to use flexible-schema systems like Mongo or Couch they can do JIT semantics. But in both cases, to make sense of the data over time you need a consistent schema to store the data in. We all know that, as database developers. And a JIT semantics is fine if you're dealing with stuff you don't need to track over time.

    But Netflix is concerned with behavior over time, so they can't use a JIT semantics.

    This principle is independent of anything Codd might have formulated, and that's in part because to make relational work right you need some kind of consistent and stable semantics. Codd assumes you've already got it together. So the principle doesn't require relational as in SQL Server, but there still needs to be a semantics. Codd's normal-form semantics happens to be a particularly reliable way to manage semantics in general, but any stable ontology will work without being in third normal form. (It'll be a pain to manage, but oh well.) Similarly, if you want me to report on the contents of your document model, you'd better make sure I've got something to report on that doesn't require daily updates to CASE statements, or I'm going to get it wrong.

    So Netflix is using Hadoop in a virtuous way, and one that doesn't actually violate any of our relational db instincts. It's just that the application developers who are all hot on Netflix's Hadoop cluster don't understand the implications. A rough sketch of what "enough structure to get traction" means follows below.
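
    As a minimal illustration of the point above, here is a Python sketch that conforms loosely structured events to one agreed schema before a simple rollup; the field names and the ViewingFact shape are invented for the example, not Netflix's actual model.

    from dataclasses import dataclass
    from collections import Counter

    @dataclass(frozen=True)
    class ViewingFact:
        # A "super transaction" style row: one grain, consistent keys.
        account_id: str
        title_id: str
        event_date: str
        seconds_watched: int

    def conform(raw_event):
        # ETL-ish step: map whatever the app logged into the agreed schema.
        # Events that don't fit the grain return None and stay in the lake.
        if raw_event.get("type") != "view":
            return None
        return ViewingFact(
            account_id=str(raw_event["user"]),
            title_id=str(raw_event["title"]),
            event_date=raw_event["ts"][:10],
            seconds_watched=int(raw_event.get("secs", 0)),
        )

    def minutes_by_title(raw_events):
        # Only once the facts share a schema does a simple rollup work.
        totals = Counter()
        for raw in raw_events:
            fact = conform(raw)
            if fact:
                totals[fact.title_id] += fact.seconds_watched / 60
        return totals

    sample = [
        {"type": "view", "user": 1, "title": "t42", "ts": "2016-02-25T20:00:00Z", "secs": 900},
        {"type": "ui", "user": 1, "action": "scroll"},
        {"type": "view", "user": 2, "title": "t42", "ts": "2016-02-25T21:00:00Z", "secs": 1800},
    ]
    print(minutes_by_title(sample))  # Counter({'t42': 45.0})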

  • Steven.Grzybowski (2/25/2016)


    Eric M Russell (2/25/2016)


    If we divide 1.5 PB (1,500,000,000,000,000 bytes) by the total number of global NetFlix accounts (60,000,000), that equals ~25 MB (25,000,000 bytes) of event data gathered daily per account. Also, if we assume that half of those accounts log in and stream video on any given day, then that's ~50 MB of event data per account daily.

    Does this mean that NetFlix is gathering tens of megabytes of data from each of us every day on top of what they are streaming to us? Maybe we should be putting black masking tape over our webcams while watching NetFlix, because I know I'm not clicking 50 MB worth of event data while sitting back on the sofa.

    I would guess a huge portion of that data is not so much from users as diagnostic data from their AWS infrastructure: probably some information about the quality of the streaming, etc. I would imagine Netflix would want some data about which regions are encountering latency issues due to AWS DC locations, outages, and all that wonderful stuff.

    Here is how NetFlix describes the nature of the event data. At the top is something labeled "Video viewing activities".

    There are several hundred event streams flowing through the pipeline. For example:

    - Video viewing activities

    - UI activities

    - Error logs

    - Performance events

    - Troubleshooting & diagnostic events

    Note that operational metrics don't flow through this data pipeline. We have a separate telemetry system, Atlas, ...

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho
