SQLServerCentral Article

A Quick Look at the MPP World

,

Introduction and overview of the MPP market place.

Over the past few months I have been looking at data warehouse appliances such as

Neteeza and Teradata. Teradata and

Neteeza have been around for a long time and are used by some of the largest

commercial companies in the world. Green Plum is a relatively new kid on the block but shows incredible promise and

as they have recently been acquired by EMC2 they look like an interesting

alternative. In fact the Gartner report,

Magic Quadrant for Data Warehouse Database Management Systems, is available

on the Green Plum web site.

Dataupia have an interesting proposition

in that their product parallelises customer existing platforms whether it is SQL

Server, Oracle, DB2, etc. DATAllegro was acquired by Microsoft which resulted in Project Madison which we

now know as SQL2008R2 Parallel Processing Edition.

So the Massive Parallel Processing market is looking very interesting at the moment.

Microsoft's acquisition of DATAllegro and the subsequent release of SQL2008R2

Parallel Processing Edition is potentially a very significant step in the data warehousing

world. The main thrust of development for SQL2008 was the BI stack and the SQL2008R2

MPP has to be viewed as a weighty foundation stone underpinning that stack.

What are Massive Parallel Processing appliances?

In a nutshell the end user sees a single database server but under the hood it is

many database servers operating in tandem (parallel) to deliver blistering performance

and massive scalability. Products like Kognitio WX2 and Neteeza are delivered as a populated rack of blade servers.

The different appliances use a variety of methods to achieve the same ends.

  • Some offer shared-nothing architecture where each "node" within the appliance

    is unaware of the other nodes participating in the appliance.

  • Some offer nodes that are fully aware of each other.
  • Others offer a system where there are master nodes that parcel out the work to other

    worker nodes.

In fact, when I first read about soft Numa (non uniform memory access) within SQL2008

my first reaction was that MPP appliances do this at the macro level. The appliance is not simply a bunch of servers co-operating with each other. They also manage their own storage so in some respects they act a bit like a SAN.

What are MPP appliances used for?

It would be a mistake to regard these devices as big boys toys for those organisations

with more money than sense. I think in the next 5 years we can expect these devices

to filter down into medium sized enterprises.

In a nutshell these appliance are aimed at heavy duty analytical work rather than

fast production of stock reports. They obviously can support such stock reports

but that would be like buying a Ferrari and only driving 5 miles to the shops.

Speed

Throughout my career a common bug-bear has been the user who fires off an horrendous

query against the database and while the DB server sinks to its knees that user

goes off on their lunch break. When they come back from said lunch break they take

one look at the results and cannot remember why they wrote the query in the first

place!

These people have been the bane of a DBA's existence. As a DBA I have been in

that position but now realise that it was me that was at fault and not the user

writing the query.

What that user was trying to do, and what MPP boxes make it possible to do, is to

fire off a train of thought query

  1. Query => I wonder what is going on in product line 'x'
  2. Answer leads to further query =>Ah so that mean that activity 'y' must

    be happening

  3. Answer leads to further query =>Not quite but this unexpected behaviour reveals

    a number of data hot spots

  4. Answer leads to further query =>Are their enough customers in those hots spots

    to represent a business opportunity?

  5. Answer leads to significant business insight, increase in profits, bonuses and job

    security all round!

The speed at which the MPP boxes bring back the answer to these queries allows these

question and answers to be chained together. In a traditional database the end user

might not get beyond stage 2.

Another point to bear in mind is that these train of thought queries are one-off

queries. They will probably not be run again which means the following

  • The queries will never be optimised because they are run only once
  • A cached execution plan is next to useless because they are run only once

 

Size

We have gone from the age of data being generated by human beings and is therefore

limited to the speed and capacity that humans generate data to an age where data

is generated from machines.

Any serious web site will be analysing its web traffic possibly in real time. At

the SQL Bits conference in London one of the presenters was demonstrating the use

of Stream Insight to produce cubes of data that are there to reveal the performance

of data cubes used by their business. Their business is incredibly time sensitive

so if the data cubes are not available millions of pounds can be lost in a few minutes.

This sort of telemetry represents a massive jump in the volumes of data so systems

that scale to the petabyte scale rapidly become necessary.

 

Analytics Integration

Teradata, Neteeza & HP Neoview have the facility to accept pass-thru queries

from SAS Analytics. This means that SAS can recognise that these platforms support

its core functionality and therefore leverage the intrinsic power of the MPP platform.

SAS can throw some rather nasty queries at SQL Server so formal support for SAS

pass-thru queries in the Microsoft stack is something I would very much like to

see.

Green Plum supports the R open source analytics

language.

What are the criteria for choosing an MPP appliance?

Requirements, requirements, requirements

You really need to be clear on what business problem you are trying to solve.

  • What data volumes do you have?
  • Is this data volume made up of well structured, high quality data or a mash of disparate

    data imported from legacy systems?

  • What is the projected growth of the data?
  • How up-to-date does the data need to be?
  • What query response times are needed and/or desired?

In short, know your business, know your data. Once you have a clear idea of this

you have a firm basis to get meaningful information out of the vendors.

 

Integration

Investing in an MPP platform is a major investment for the business in terms of

people, processes and technology.

My personal opinion is that, while such devices will eventually lead to investing

in new reporting, analytics and ETL tools they should initially be capable of integrating

with what the organisation already has.

MPP platforms do require a change of mindset and it is just as possible to implement

then badly as it is with any other piece of technology. Although most MPP vendors

are very good with regard to support if something doesn't work you do not want

to be caught between two vendors arguing that each other's technology is the

cause.

Strategic Partnerships

This could come under the banner of integration however it is worth asking if the

MPP Vendor has any strategic partnerships with toolset or analytics vendors. If you do find that you need to replace part of your BI stack then knowing that the vendor works well with a known range of tools is of major importance.

Physical footprint

These devices are in reality more than one physical server. One vendor highlighted

their biggest client as having a 192 blade appliance supporting 40PB of data! Questions

have to be asked as to the physical requirements for the chosen MPP.

  • Will your chosen appliance fit in your data centre?
  • What are its power requirements and can your data centre meet them?
  • What are its cooling requirements and can your cooling systems meet them?
  • What other connectivity is required for the appliance?

 

Security

The whole purpose of a data warehouse is to allow access to data but there are going

to be pools of data that are sensitive and need protecting. There are a number of

security questions that need to be asked.

  • Does the appliance support encryption?
  • Does the appliance integrate with LDAP, Active Directory etc or does it have its

    own security scheme?

  • Does it have object level security?
  • Does it support role based security?
  • Does it expose users passwords in any way? WARNING, some of them do expose users

    passwords to sys admins.

 

Auditing

There are a variety of system auditing tasks that I would expect an appliance to

perform

  • Query usage?
  • Security access?
  • System performance?
  • Dependency tracking?

 

Licensing

The different vendors follow different licensing models and you need to be clear on precisely what you are buying into. For some the appliance includes software and hardware and ANY change to the hardware requires an upgrade license. Others are more casual and allow upgrades to RAM and disk without licensing consequence. Others regard only the production environment for licensing purposes.

One other thing you might consider asking: does the vendor have a model that supports data warehousing as a service?

 

Programmability

Sooner or later you are going to want to do more involved that running straight

queries against the MPP. A few things you might want to know:

  • Does it have the equivalent of stored procedures? Not all of them do.
  • Does it have a scripting language?
  • Does it support any embedded language such as Java?
  • Does it have an API that allows extensibility?
  • Do you have the skills in house to use the API, scripting or embedded language?
  • Does it have any form of scheduler or something comparable to SQLSERVERAGENT?

 

Recoverability

It is never a nice experience to have to go through disaster recovery on a multi

terabyte system.

  • Does the appliance support a full backup strategy
    • Differential backups
    • Point in time recovery
    • Object level backups
  • How does the system react to hardware failures?
    • RAM
    • Disk
    • An entire blade going down?
  • Does it support hot swappable hardware?
  • Can the system operate in the event of hardware failure albeit in a reduced mode?
  • Does it have the equivalent of rollback/undo logs?

 

Support

  • Does the vendor offer a support contract or SLA agreement?
  • Is there a user group?
  • Are there user forums?
  • Does the vendor supply spares kit?

 

Scalability

Most appliances will operate into the high terabyte range and some well into the

petabyte range. The question is not "can they scale" but the nature of how they should scale.

  • What is the recommended ratio of RAM to disk?
  • Is there a case for adding a lot of slower RAM rather than a smaller amount of ultra-high

    speed RAM?

  • When is it more prudent to add blades rather than upgrade other blades?
  • How much disk should each blade address?
  • Are there bottlenecks in the appliance design? For example, work may be performed

    across many blades but if they are managed by a single blade then that single blade

    is the bottleneck.

  • What user concurrency can they support? Some solutions do not major on massive user

    concurrency.

One point I should like to make is that while in OLTP systems growth estimates tend

to be in the 2x or 3x over a planning cycle data warehouse growth tends to be in

the 10x or higher growth range.

 

Training

Teradata has a full certification scheme. Do not underestimate the training required for these appliances.

Given the cost of MPP appliances it is worth negotiating training as part of the

price. Also consider who needs training

  • DBAs?
  • Infrastructure specialists?
  • Analytics users?
  • Disaster recovery specialists?

 

Affordability

The purchase price is just the start although Dataupia have made "the price

you see is the price you pay" part of their selling proposition.

When talking to vendors don't make any assumptions about what the price includes.

Always be explicit in what is expected.

  • What are the ongoing licensing costs?
  • What are the maintenance/support costs?
  • Does it have any proprietary hardware requirements?
  • Are there additional modules or tools to be bought?

Some vendors will be remarkably flexible on the purchase price but ongoing costs

(total cost of ownership) is high.

 

Flexibility

Some MPP appliances are designed to be used a specific way.

  • Is the appliance suitable for near real-time data warehousing?
  • Can it run on commodity hardware or does it require specialist kit?
  • Can it run on heterogeneous hardware or does it require all hardware to be configured

    the same?

  • Can it handle online data loading or is loading and querying to be treated as two

    separate operations that must not coincide?

Flexibility can also include the way in which the appliance is used

  • If licensed on processor power is it possible to rent temporary license extensions?

    For example 11 months of the year the appliance is quite capable of coping with

    however many blades are in the appliance but for 2 weeks of the year it really needs

    another couple of blades.

  • How does the vendor treat the installation of any client tools? For example, does

    uninstalling a tool from one machine and installing it on another require any additional

    service and support?

 

Design considerations

One thing that is very important in the design of a database to be put onto an MPP

is the distribution of the data. Most appliances have some form of hashing algorithm

to distribute the data so that the maximum number of nodes can be brought to bear

on a given query.

One thing to watch out for is where you have data that has hotspots in it. For example,

if you have data for the past 12 months you do not want the most recent month to

be confined to a single blade otherwise that single blade will be attempting to

do the lion's share of the work!

My experience has been that new technologies require a new mindset in order to get

the best out of them. The vendor should be able to advise on any key considerations

that are required in order to get the best out of their appliance.

In conclusion

Before Microsoft bundled DTS in with SQL Server ETL tools were a specialist niche with a niche price to match. Potentially SQL2008R2 Parallel Processing edition could have the same effect although it has to be said that there is already healthy competition in the MPP market place.

From what I have seen so far each of these appliances have their strengths and weaknesses.

I have seen an appliance perform at "I still don't believe it" speeds

when configured and used properly and in other cases far worse than SQL Server 2000!

The next 12-24 months should prove to be very interesting for those of us interested

in the world of MPP platforms!

Rate

4.82 (17)

You rated this post out of 5. Change rating

Share

Share

Rate

4.82 (17)

You rated this post out of 5. Change rating