• David.Poole (8/27/2015)


    Hive and Pig both generate MapReduce code under the hood. Where optimisation is required, you can grab the generated code and adjust it to your needs.
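
    As a rough sketch of what that looks like (the table name here is hypothetical), Hive's EXPLAIN output shows the MapReduce stages it generates for a query, which is the usual starting point when you want to see what is being produced under the hood:

        -- Hypothetical table: inspect the MapReduce plan Hive generates for a query.
        EXPLAIN
        SELECT department, COUNT(*) AS headcount
        FROM employees
        GROUP BY department;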

    Hive has something called SerDes (Serializer/Deserializer). These tell it how to talk to different file formats. In theory you could write a SerDe that would enable you to use SQL to query a JPEG file for its EXIF information.

    I believe that a standard installation of Hive comes with SerDes for delimited files, XML and JSON.
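
    As a minimal sketch (the table name, columns and HDFS path are all hypothetical), this is roughly how a JSON SerDe lets you put a SQL face on a raw JSON file:

        -- Hypothetical example: expose newline-delimited JSON on HDFS as a Hive table
        -- using the JsonSerDe that ships in the hive-hcatalog package.
        CREATE EXTERNAL TABLE web_events (
            user_id STRING,
            action  STRING,
            ts      TIMESTAMP
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        STORED AS TEXTFILE
        LOCATION '/data/raw/web_events';

        -- Once the table exists, ordinary SQL works against the JSON.
        SELECT action, COUNT(*) AS hits
        FROM web_events
        GROUP BY action;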

    The use of Sqoop enables an RDBMS to be read and the data distributed on HDFS, with the structural metadata captured as Hive metadata. To all intents and purposes it provides a means of dumping data from traditional SQL sources while maintaining the structure and the familiar language.
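
    As a rough illustration (the table and column names are hypothetical), once a table has been pulled across with Sqoop's --hive-import option, its structure and data are immediately available through Hive:

        -- Hypothetical example: after something like
        --   sqoop import --connect jdbc:sqlserver://... --table dbo.Orders --hive-import
        -- the table's structure is reproduced as Hive metadata.
        DESCRIBE orders;    -- columns and types carried over from the source RDBMS

        SELECT customer_id, SUM(order_total) AS total_spend
        FROM orders
        GROUP BY customer_id;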

    SparkSQL is probably a better bet. I believe it can use the Hive metadata but does not rely on MapReduce. In effect it builds an execution plan and then chooses how to execute it, which gives a big performance boost. A lot of work has gone into SparkSQL to make its SQL ANSI-compliant.
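
    A minimal sketch, assuming the same hypothetical Hive table as above is visible to Spark through the shared metastore: run through the spark-sql shell, EXPLAIN shows the plan Spark builds and optimises instead of compiling the query down to a chain of MapReduce jobs.

        -- Hypothetical example: the same Hive table queried through the spark-sql shell.
        -- EXPLAIN shows the plan Spark SQL builds rather than generated MapReduce code.
        EXPLAIN
        SELECT action, COUNT(*) AS hits
        FROM web_events
        WHERE ts >= '2015-01-01'
        GROUP BY action;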

    That's exactly the hurdle I feel many RDBMS people face: having to write complex Java in order to absorb a JSON file into Hive. It's much the same as writing complex SQL to do the same job, but the point is, RDBMS people already know SQL. Having to learn an object-oriented language just to get data from point A to point B is a daunting task that requires retraining or new hires.

    I personally enjoy working with NoSQL (Hadoop). But coming from primarily SQL Server and jumping right into Hadoop has been a daunting task, due to all the components currently available and coming down the pike, as well as the expansion of knowledge you have to undertake just to leverage the toolset correctly.