Getting ready for Hadoop

I wrote up my first experiments with Hadoop back in June 2013.  I did what most people do in an experimentation lab:

  • Write variations of the word count example
  • Import/Export data from an RDBMS using Sqoop
  • Experiment with web logs

I have since spent a lot of time researching Hadoop from a business and architectural point of view as a precursor to using it in a production environment. My research has focussed on the following:

  • Formal workshops with the four distribution providers of Hadoop and with the big traditional vendors that offer Hadoop
  • RFI/RFP processes
  • Informal evening sessions such as the Manchester Hadoop MeetUp group
  • Talks with companies who already use Hadoop in production
  • Technology spikes and proof of concept activity
  • A great deal of reading

I have also spent time with business stakeholders and a wide range of colleagues, working out how Hadoop can be used as a tool to help our business achieve what it has set out to do in its business strategy.

I have learnt a great deal during this process and hopefully some of what I pass on in this article will be of use to you.

Step One:  Use Case & Business Case

For the adoption of any technology you must have a clearly defined use case and that use case must be backed by a solid business case.  If you don’t have this then walk away; you are not ready to adopt that technology.  Although both business and use cases are big subjects in their own right they are so intertwined that each should make constant reference to the other.

It is perfectly reasonable to perform a proof of concept in a sandbox environment such as those provided by Hortonworks, Cloudera or even an AWS EMR or Azure HDInsight environment.  Be very careful that the difference between sandbox experimentation and real world use is understood and acknowledged.

Much is made of Hadoop being low cost because it runs on commodity hardware and is open-source.  The reality is that it is less expensive than an MPP data warehouse of equivalent size, but it is still expensive.

I will expand on some parts of the business case in subsequent steps in this article, but for now the ones I would identify as key are as follows:

Attribute

Description

Business Stakeholder

A BUSINESS STAKEHOLDER IS ESSENTIAL.

The number one failure point for any data warehouse project is not having a business stakeholder and Hadoop is closely aligned with data warehousing.

Even if you are planning on near real-time streaming, HBase or Accumulo, do not proceed without a stakeholder.

The business stakeholder will help shape the business case.

Objective

From the perspective of the business stakeholder (and thus the business) what problem will Hadoop solve?

We all want exciting technology on our CVs but if you don’t focus on business needs and desires you won’t make your case. 

Urgency

A healthy business always has more ideas than it has resources to execute.

Why should your Hadoop idea be chosen over one of the other ideas?

  • How big is the business problem you are trying to solve?
  • Is that problem getting worse, staying the same or diminishing?

If the problem is diminishing then so is your business case.  If the problem is staying the same and the business has lived with it up until now, why can it not continue to live with it?

Budget & ROI

As ever, how much is it going to cost, how much will it make or save?

You may have a valid use case that will reduce costs and/or generate revenue but not to the extent where it outweighs the cost of doing so.

Timeline and milestones

How long will it take to start delivering business benefit?

What are the milestones before it starts delivering business benefit?

Is there a road map with milestones for increasing business benefit?

Teams and resources

Who do you need in order to carry out your objectives and sustain Hadoop in a production environment?

What else do you need?  This can be room in a data centre, network changes, security considerations and even changes to business process.

The items above may seem daunting but unless you can answer those questions to the satisfaction of the stakeholders you are unlikely to get the go-ahead.
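As a rough illustration of the Budget & ROI question, the sketch below compares costs with benefits over a number of years. The function names and all of the figures are hypothetical placeholders chosen for the example; a real business case would use your own estimates and your organisation's preferred appraisal method.

```python
def net_benefit(upfront_cost, annual_benefit, annual_running_cost, years):
    """Total benefit minus total cost over the given number of years."""
    return (annual_benefit - annual_running_cost) * years - upfront_cost


def payback_years(upfront_cost, annual_benefit, annual_running_cost):
    """Years needed to recover the upfront cost, or None if it never pays back."""
    net_annual = annual_benefit - annual_running_cost
    if net_annual <= 0:
        return None  # the use case costs more to run than it returns
    return upfront_cost / net_annual


# Hypothetical figures, for illustration only.
print(net_benefit(500_000, 300_000, 100_000, years=3))
print(payback_years(500_000, 300_000, 100_000))
print(payback_years(500_000, 80_000, 100_000))
```

The last call shows the situation described under Budget & ROI: a use case that does generate a benefit, but not enough to outweigh the cost of running it.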

For the use case there are also a number of key attributes to be considered and these will dovetail or even overlap with the attributes of the business case.

Attribute

Description

Data volumes

You need to know how big a Hadoop cluster you are going to require.

Data archiving is compute light but storage heavy.

Machine learning can be compute heavy but with lower storage requirements.

The vendors I talked to were at pains to suggest that real world use should not exceed 2TB of usable data per node.

If you go beyond the 2TB recommendation then rebalancing a cluster that has lost a node is going to take some time and may threaten your recovery SLAs.

Criticality of service

Are you offering 24/7 operation?

If your cluster were unavailable:

  • How long would you have to fix it?
  • How much data would you need to be able to resume critical services?

Security

  • Do you need to encrypt the data?
  • Do you have data with differing security classifications that require special handling?
  • Do you need single sign-on or authentication using Active Directory?
  • Do different classifications of data require different encryption keys?
  • Are you restricted in where you can store data, e.g. European data must not be stored in a US cloud?

Frequency of activity

  • When is your cluster going to be active? (See Criticality of Service)
  • How frequently do you load data?
  • What batch jobs do you run and on what schedules?

Data source/Targets

Where does the data come from and where does it flow to?

Do source/target systems place any constraints on your use case?  An example may be that a source system is only updated hourly, so promising a 5 minute update cycle is setting yourself up for failure.

The use case and business case attributes described so far will shape absolutely every decision in every stage of a Hadoop implementation project.
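To see why the amount of data per node matters for recovery SLAs, here is a minimal back-of-the-envelope sketch of how long it might take to re-create the copies held on a lost node. The function name and the throughput figure are assumptions made for illustration, not vendor numbers; in a real cluster the work is spread across the surviving nodes and is usually throttled, so measure your own environment.

```python
def rereplication_hours(data_on_node_tb, throughput_mb_per_s):
    """Rough time to copy a lost node's data elsewhere at a given aggregate rate."""
    seconds = data_on_node_tb * 1_000_000 / throughput_mb_per_s  # 1 TB = 1,000,000 MB
    return seconds / 3600


# Assumed aggregate re-replication rate of 200 MB/s (hypothetical).
print(round(rereplication_hours(2, 200), 2))  # 2.78
print(round(rereplication_hours(6, 200), 2))  # 8.33
```

Under these assumptions, tripling the data held on each node triples the window during which the cluster is running with reduced redundancy.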

Step Two:  Which distribution or ecosystem?

This may be pre-decided for you simply because of the relationships that already exist with your vendors.  For example, if you are dependent on Microsoft then HDInsight (based on the Hortonworks stack) might be a foregone conclusion.

If your use case is fairly basic and you have the in-house skills then perhaps Apache Hadoop plus a few of the supporting tools will suffice.

Having sat through many workshops and presentations, I can say that all the ecosystem providers offer a compelling reason for choosing their distribution.  Your task is to decide which compelling reason best fits the culture and needs of your organisation. The table below summarises each provider's main selling points.

Distribution

Selling point

Hortonworks

  • Fully open-source and with no vendor lock-in.
  • Fully tested at Yahoo on 24,000 nodes

Cloudera

Primarily open source but with Cloudera Manager & Director as value add components for managing and monitoring the Hadoop cluster.

At the time of writing they are the most popular distribution.

IBM

  • GPFS file system instead of HDFS
  • BigSQL fully ANSI SQL compliant
  • IBM bespoke management components and integration into other IBM products

MapR

  • Re-engineered Hadoop.
  • POSIX compliant file system instead of HDFS
  • Re-engineered HBase
  • Name node metadata stored on the data nodes for increased resilience.

MapR needs a special mention for their approach.  They made a couple of observations of open source Hadoop:

  • HDFS is a file system written in Java (therefore running in a JVM) that sits on top of the Linux file system.
  • HBase is written in Java (therefore running in a JVM) and sits on top of HDFS, which itself runs in a JVM on top of the Linux file system.

They decided that rewriting the base file system made most sense rather than having a file system sat on a file system.  Similarly their version of HBase sits directly on top of the file system.  This does give MapR a performance advantage.

Step Three:  Which Vendor?

If anything the choice between the distributions is made more confusing by the vendors and their partnership strategy.  When I started looking at partners of the non-IBM distributions each Hadoop partner had aligned themselves to a single distribution.  For example, Teradata originally aligned to Hortonworks but now aligns to both Cloudera and MapR.  SAS was aligned to Cloudera but now also aligns to Hortonworks.

When vendors had a partnership with a single distribution they were quite upfront as to why they chose the particular distribution. Now they are more reticent as they are tending to support more than one distribution.

In all honesty I think it boils down to the cultural fit between your organisation and the ecosystem provider.  Arguing on technical merit during your first 12-24 months is analogous to a couple of 17-year-olds who haven’t passed their driving test arguing about the merits of McLaren, Ferrari and Lamborghini.

Step Four:  Whether to choose an appliance

To me this is a bit of a bizarre choice.  The whole point of Hadoop is that you can add worker nodes as and when your requirements dictate.  If you have a short term need then you can temporarily add a set of cheap servers and then remove and repurpose them if the need evaporates.

Pros

  • Pre-configured, plug it in and go
  • Sized to be an operationally viable solution
  • Usually part of an integrated solution
  • Greater emphasis on manageability with additional tooling supplied

Cons

  • Big spend up front
  • Big spend to expand
  • Inflexible expansion
  • Uncertain strategy for adoption of new ecosystem members such as Apache Spark or Apache Atlas
  • Specific cabinet or rack required in the data centre

If your use case and business case are compelling then perhaps ease of deployment and maintenance may make the big upfront investment an acceptable option for your organisation.

Always remember to check with your infrastructure guys that it is physically possible to accommodate an appliance.  I know of a case where an appliance turned up as a result of a high six-figure spend and the infrastructure manager took the CEO into a very full server room and asked, “Where do you expect me to put that appliance, then?”

Step Five:  On premise or in the cloud?

Many organisations go through a cycle.  They start off on premise; then, as the cluster grows, they break out into the cloud; and then, when further growth takes place and the costs become prohibitive, they commit to on premise capability.

The beauty of the cloud is the promise of flexible capacity.  With Hadoop that isn’t quite as straightforward as you might think.  Flexing up is relatively straightforward.  Flexing down is something to be undertaken with great care.
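One reason flexing down needs care is that worker nodes in an HDFS-based cluster hold the data as well as doing the processing, so the nodes that remain must be able to store every replica. The following is a minimal sketch of that sanity check; the function name, the 25% headroom and the example figures are assumptions for illustration, and it ignores the time needed to move the data before nodes are released.

```python
def can_remove_nodes(nodes, remove, stored_tb, per_node_capacity_tb,
                     replication=3, headroom=0.25):
    """Return True if the remaining nodes can hold all replicas with headroom to spare."""
    remaining = nodes - remove
    if remaining < replication:
        return False  # fewer nodes than replicas: blocks cannot be fully replicated
    usable_capacity = remaining * per_node_capacity_tb * (1 - headroom)
    return stored_tb * replication <= usable_capacity


# Hypothetical cluster: 20 nodes of 2TB each holding 8TB of data (24TB with 3 replicas).
print(can_remove_nodes(20, 2, stored_tb=8, per_node_capacity_tb=2.0))  # True
print(can_remove_nodes(20, 5, stored_tb=8, per_node_capacity_tb=2.0))  # False
```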

If you are going to be running Hadoop in the cloud then there are a number of factors to consider.

  • Legal constraints
  • Security requirements
  • Dependent systems
  • Data connectivity requirements
  • Ongoing running costs
  • …etc

If your Hadoop cluster is in the cloud then how are you going to get the data up into the cloud in a secure manner?

If your consuming systems are not co-located with your Hadoop cluster then how are they going to communicate with that cluster in a way that is secure and performs well?

With the benefit of hindsight I would advise picking your first use case to be as simple as possible and one that can be satisfied by AWS EMR or Azure HDInsight.

Step Six:  Sizing and capacity

Back in 2013 I expressed the opinion that a Hadoop cluster needs to reach a certain size before it offers any advantages over alternative approaches.  I have since seen articles that offer the opinion that a 20-node system should be considered the minimum size for serious use.

At this stage I lack the experience to know if this is a realistic assessment or a FUD (fear, uncertainty and doubt) piece.  What is beyond argument is that the higher the number of nodes in your cluster the less impact the loss of a single node will have.  The following are factors you need to take into consideration:-

Factor

Why

Replication factor

The default replication factor in a Hadoop cluster is 3, therefore the amount of raw storage across your cluster should be divided by 3 to give the capacity actually available for data.

Compression

Hadoop does offer a range of compression formats.  You will want to experiment to find out how these compare to SQL Server compression.

You will also need to be aware of how the different Hadoop ecosystem tools handle compression.

Working storage

Data handling in Hadoop tends to carry out jobs in steps persisting data into files at each step.  Depending on your use case you are likely to need a great deal of working storage over and above your long term data storage requirements.

Sqoop in particular works by doing a smash and grab on your source RDBMS to land data onto HDFS with minimal impact on the source system.  Having done so it then copies the landed data into a new location in HDFS in the desired format.

I have seen a suggestion that Hadoop data processing be thought of as four stages:-

  1. Landing area
  2. Pre-processing, refining and marshalling area
  3. Long term store of refined data
  4. Data in a form best suited to the consuming applications

Data aggregation

Hadoop works well with large files.  It will mechanically work with lots of small files but the overhead of opening and closing those files will have an impact on performance.

You are likely to need to build processes to aggregate lots of small files into fewer larger ones.  This takes space.
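As an illustration of the kind of aggregation process described under Data aggregation, the sketch below concatenates small files from a local landing directory into fewer, larger files before they are loaded. It is a simplified local example, not a Hadoop API: the function name and the `part-NNNNN` naming are hypothetical, the 128MB target is an assumed value, and plain concatenation is only safe for line-oriented formats where every file ends with a newline.

```python
from pathlib import Path
import shutil


def merge_small_files(src_dir, dest_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate files from src_dir into batches of roughly target_bytes each."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    outputs = []
    out = None
    written = 0
    for path in sorted(p for p in src.iterdir() if p.is_file()):
        # Start a new output file once the current one has reached the target size.
        if out is None or written >= target_bytes:
            if out is not None:
                out.close()
            out_path = dest / f"part-{len(outputs):05d}"
            out = open(out_path, "wb")
            outputs.append(out_path)
            written = 0
        with open(path, "rb") as f:
            shutil.copyfileobj(f, out)
        written += path.stat().st_size
    if out is not None:
        out.close()
    return outputs
```

Note that the merged copies exist alongside the originals until the originals are deleted, which is exactly the extra working space the text warns about.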

Although the different vendors have been emphatic in recommending that individual node size should be kept below 2TB, they have also been quite forthright in stating that the 2TB limit is for using Hadoop as a data archiving store.  Realistically, for a 10TB Hadoop cluster you should probably be looking at 30 to 40 nodes, possibly more if your use case is intensive.

Forty cheap nodes (plus peripheral nodes for support services) add up to a considerable sum of money.
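The sizing factors above can be combined into a rough calculation. This is a minimal sketch under stated assumptions rather than a vendor formula: the function names are hypothetical, `working_multiplier` is an assumed allowance for landing, pre-processing and consumer-facing copies on top of the refined data, and the 2TB per node figure is treated here as the storage consumed on each node including replicas. Compression is ignored.

```python
import math


def usable_tb(raw_tb, replication=3):
    """Capacity left for data once every block is stored `replication` times."""
    return raw_tb / replication


def nodes_needed(refined_tb, replication=3, working_multiplier=2.0,
                 raw_per_node_tb=2.0):
    """Rough node count for a given volume of long term refined data."""
    raw_tb = refined_tb * working_multiplier * replication
    return math.ceil(raw_tb / raw_per_node_tb)


print(usable_tb(90))                             # 30.0
print(nodes_needed(10))                          # 30
print(nodes_needed(10, working_multiplier=2.5))  # 38
```

With these assumed multipliers a 10TB store lands in the 30 to 40 node range; different assumptions about working storage or per-node limits will move the answer considerably, which is why the use case needs to be pinned down first.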

Step Seven:  Acquiring Skills and Retaining Them

The Hadoop ecosystem is a complex beast.  Unfortunately at the time of writing (October 2015) it is a sellers’ market for skills with demand outstripping supply.

Realistically you are going to have to train up personnel and probably supplement them with contract resource.  The table below shows what I believe you need and why.

What

Why

Upfront training of permanent staff

Evaluation of the technology should be carried out with at least some hands-on experience by the team who will eventually run it.

Staff will have business domain knowledge and the context into which the new technology will be deployed.  They are likely to be able to give an accurate assessment as to whether the technology will be a white elephant or a golden goose.

£10K - £15K spend on training may sound like a lot but if it identifies the technology as a white elephant it is likely to save you money in the short to medium term.

You are going to have to train staff anyway and doing so early means that adoption is likely to be smoother and therefore ROI is delivered earlier.

External formal classroom courses

Although potentially higher cost the delegates (and therefore the organisation) will benefit in two ways:-

  • Training is more effective without distractions of normal day to day operation
  • Broader spectrum and cross pollination of ideas with delegates from other organisations.  Personally I find this to be of huge value

Learning lab facilities:-

  • Hands-on virtual machines
  • Repeatable tutorial materials
  • Desktop machines of sufficient power to cope with the VMs (if run locally).
  • Decent internet connection speed (if run in the cloud)

The complexity of the ecosystem is such that the evaluation team need to be able to repeat their lessons.

They also need to experiment, break stuff, try and fix it, learn from their mistakes in a safe environment.

If you pursue the MOOC (Massive Open Online Course) approach using facilities such as Udemy, Lynda and Coursera then a decent internet connection is essential.

Contract resource

If you have gone for one of the more expensive options then there will be pressure to show ROI early.

Contract resource can be focussed on achieving ROI and should help take the pressure off staff trying to get to grips with operating new technology in a production environment.

If you cannot afford the labs and the training then again I would question whether your business case stacks up or whether your organisation is ready to adopt the technology!

Given the sellers’ market you have to take a realistic look at staff turnover within your organisation.  It is inevitable that you are going to lose a few people if they acquire highly marketable skills but those skills are highly marketable for a reason!  They are perceived as offering competitive advantage.

Manager One:  What happens if we train our staff and they leave?

Manager Two:  What happens if we don’t train them and they stay?

It is quite possible that the most difficult aspect of adopting new technology is that it may demand significant organisational change in order to be successful.  Looking at Hadoop in particular the speed of evolution in the ecosystem is such that learning and skills acquisition need to be part of the day to day job.  It cannot be part of an often promised but seldom delivered training programme.

Concluding thoughts

There is a great deal to think about when adopting Hadoop and the majority of those thoughts are business and organisational issues rather than technical.

We can only influence first use cases if we have built up credibility with our business colleagues but my personal preference would be for a first use case with low risk that asks for periodic batch processing.  A set of low risk, low cost use cases should make the business more amenable to the idea that there is as much to learn from failed experiments as there is from those that succeed.

I found the Manchester Hadoop MeetUp group to be especially useful as there is a chance to discuss ideas with peers outside of the organisation.
