Good read on HDFS small file compaction:

With that decided, we then looked for options to aggregate and compact small files on Hadoop, identifying three possible solutions:
  1. filecrush - a highly configurable tool by Edward Capriolo to “crush” small files on HDFS. It supports a rich set of configuration arguments and is available as a jarfile ready to run on your cluster. It’s a sophisticated tool - for example, by default it won’t bother crushing a file that is already within 75% of the HDFS block size. Unfortunately, it does not yet work with Amazon’s s3:// paths, only hdfs:// paths - and our pull request to add this functionality is incomplete
  2. Consolidator - a Hadoop file consolidation tool from the dfs-datastores library, written by Nathan Marz. Documentation is scant - we could only find a single paragraph about it, in an email thread. It has fewer capabilities than filecrush, and could do with a CLI-like wrapper to invoke it (we started writing just such a wrapper, but then we found filecrush)
  3. S3DistCp - created by Amazon as an S3-friendly adaptation of Hadoop’s DistCp utility for HDFS. Don’t be fooled by the name - if you are running on Elastic MapReduce, this can deal with your small files problem using its --groupBy option for aggregating files (which the original DistCp seems to lack); see the sketch below for an example invocation
Source: Dealing with small files in Hadoop
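
For reference, here is a minimal sketch of the S3DistCp approach: submitting an s3-dist-cp step with --groupBy to an already-running EMR cluster via boto3. This is not from the article; the cluster ID, bucket paths, and grouping regex are hypothetical placeholders.

```python
# Sketch: add an S3DistCp compaction step to a running EMR cluster.
# Assumes EMR 4.x or later, where command-runner.jar is available.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "Compact small event files",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://my-bucket/raw-events/",         # hypothetical input prefix
                    "--dest", "s3://my-bucket/compacted-events/",  # hypothetical output prefix
                    # Files whose paths match the regex and share the same
                    # captured group are concatenated into one output file.
                    "--groupBy", r".*/(\d{4}-\d{2}-\d{2})/.*",
                    # Aim for aggregated output files of roughly 128 MB.
                    "--targetSize", "128",
                ],
            },
        }
    ],
)
```

With --groupBy, files whose paths match the regex are grouped by the captured value and concatenated; --targetSize (in MiB) sets the approximate size of the aggregated output files.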
