• One very important point to remember is that Hadoop is great with big files. It is not so great with lots of small files.

    Hi,

    Even if your data consists of many small files, you can still obtain good performance from your Hadoop system.

    Background: a well-known characteristic of Hadoop is that computing performance degrades significantly when data is stored as many small files in HDFS. This happens because a MapReduce job launches multiple tasks, one per input file, and every task incurs some fixed overhead (execution planning, scheduling, and coordination), so thousands of tiny files mean thousands of mostly-overhead tasks.

    To overcome this drawback you only need to consolidate the small files into fewer, larger ones. There are several options to accomplish this. The first is the low-level HDFS filesystem API, which in my opinion is the hardest way, although some Java developers may feel comfortable with it. Another option is the Pail library, a Java library that handles the low-level filesystem interaction for you.
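    To make the consolidation idea concrete, here is a minimal local sketch in plain Java (java.nio on the local filesystem, not the HDFS API or Pail; the class and file names are hypothetical). On a real cluster you would do the same thing against HDFS paths, but the principle is identical: merge many small inputs into one larger file so a job processes one input split instead of launching a task per file.

    ```java
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class Consolidate {

        // Append the contents of each small file, in order, into one target file.
        public static void consolidate(List<Path> smallFiles, Path target) throws IOException {
            try (var out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                for (Path p : smallFiles) {
                    for (String line : Files.readAllLines(p, StandardCharsets.UTF_8)) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // Create two tiny example files, then merge them into one.
            Path dir = Files.createTempDirectory("small-files");
            Path a = Files.writeString(dir.resolve("part-0"), "alpha\n");
            Path b = Files.writeString(dir.resolve("part-1"), "beta\n");
            Path merged = dir.resolve("merged");
            consolidate(List.of(a, b), merged);
            System.out.println(Files.readAllLines(merged)); // -> [alpha, beta]
        }
    }
    ```

    The same pattern applies with Hadoop's `FileSystem` API: list the small files in a directory, open an output stream for one consolidated file, and copy each input stream into it. Libraries like Pail simply wrap this bookkeeping for you.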

    Kind Regards,

    Paul Hernández