• Hive and Pig both generate MapReduce code under the hood. Where optimisation is required you can inspect the generated plan and adjust the query or job settings to your needs.
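
    As a rough illustration, Hive's EXPLAIN statement shows the plan (including the map and reduce stages) that a query compiles into. This is only a sketch: it assumes a HiveServer2 running on localhost and the third-party PyHive client, and the sales table is made up for the example:

        # pip install 'pyhive[hive]'
        from pyhive import hive

        # Connect to HiveServer2 (host/port are assumptions for this sketch)
        conn = hive.connect(host='localhost', port=10000)
        cursor = conn.cursor()

        # EXPLAIN returns the plan Hive compiles the query into,
        # one line of plan text per row
        cursor.execute("EXPLAIN SELECT region, SUM(amount) FROM sales GROUP BY region")
        for (line,) in cursor.fetchall():
            print(line)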

    Hive has something called SerDes (Serializer/Deserializer). These tell it how to read and write different file formats. In theory you could write a SerDe that would enable you to use SQL to query a JPEG file for its EXIF information.

    I believe that a standard installation of Hive ships with SerDes for delimited text, CSV and JSON; XML support typically requires a third-party SerDe.
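
    As an illustration of what a SerDe declaration looks like, something like the following should work against a Hive client. The table, columns and HDFS path are hypothetical, and the JsonSerDe class ships in the hive-hcatalog-core jar, which may need adding to the session first:

        from pyhive import hive

        cursor = hive.connect(host='localhost', port=10000).cursor()

        # ROW FORMAT SERDE names the class that translates raw file bytes
        # into rows; here each line of the files under /data/events is
        # expected to be one JSON object
        cursor.execute("""
            CREATE EXTERNAL TABLE events (
                user_id STRING,
                action  STRING,
                ts      BIGINT
            )
            ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
            STORED AS TEXTFILE
            LOCATION '/data/events'
        """)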

    Sqoop enables an RDBMS to be read and the data distributed across HDFS, with the structural metadata captured as Hive metadata. To all intents and purposes it provides a means of dumping data from traditional SQL sources while keeping the structure and a familiar query language.
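
    Sqoop itself is a command-line tool, so as a sketch (the connection details are made up, and sqoop needs to be on the PATH) an import that lands the data on HDFS and registers the matching Hive table might look like this, here driven from Python:

        import subprocess

        # --hive-import copies the source table's structure into the
        # Hive metastore as well as loading the rows onto HDFS
        subprocess.run([
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source DB
            "--username", "etl_user",
            "--password-file", "/user/etl/.password",
            "--table", "customers",
            "--hive-import",
            "--hive-table", "customers",
            "-m", "4",                                  # four parallel map tasks
        ], check=True)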

    SparkSQL is probably a better bet. I believe it can use the Hive metastore but does not rely on MapReduce. In effect it builds an optimised execution plan before executing it, which gives a big performance boost. A lot of work has gone into making SparkSQL's dialect ANSI-compliant.
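
    A minimal PySpark sketch of that (it assumes the sales table already exists in the Hive metastore):

        from pyspark.sql import SparkSession

        # enableHiveSupport() lets Spark read the existing Hive metastore,
        # so the same tables are queryable without any MapReduce jobs
        spark = (SparkSession.builder
                 .appName("sparksql-demo")
                 .enableHiveSupport()
                 .getOrCreate())

        df = spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region")

        # Show the optimised execution plan built before anything runs
        df.explain()
        df.show()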