Good read on HDFS small file compaction:

With that decided, we then looked for options to aggregate and compact small files on Hadoop, identifying three possible solutions:

- filecrush - a highly configurable tool by Edward Capriolo to "crush" small files on HDFS. It supports a rich set of configuration arguments and is available as a jarfile (download it here) ready to run on your cluster. It's a sophisticated tool - for example, by default it won't bother crushing a file that is already within 75% of the HDFS block size. Unfortunately, it does not yet work with Amazon's s3:// paths, only hdfs:// paths - and our pull request to add this functionality is incomplete.
- Consolidator - a Hadoop file consolidation tool from the dfs-datastores library, written by Nathan Marz. There is scant documentation for this - we could only find one paragraph, in this email thread. It has fewer capabilities than filecrush, and could do with a CLI-like wrapper to invoke it (we started writing just such a wra…
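That 75% cutoff is worth a quick illustration: with a 128 MB HDFS block size (an assumption here, not stated in the post), filecrush's default would leave any file of 96 MB or more alone. A minimal sketch of that selection logic - the function name and structure are illustrative, not filecrush's actual code:

```python
# Sketch of filecrush-style candidate selection: only files smaller than
# a fraction of the HDFS block size are worth compacting.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size, in bytes
THRESHOLD = 0.75                # filecrush's default cutoff

def crush_candidates(files, block_size=BLOCK_SIZE, threshold=THRESHOLD):
    """Return the names of files small enough to be worth crushing."""
    cutoff = block_size * threshold  # 96 MB with the defaults above
    return [name for name, size in files if size < cutoff]

files = [
    ("part-00000", 4 * 1024 * 1024),    # 4 MB   -> crush
    ("part-00001", 100 * 1024 * 1024),  # 100 MB -> leave alone (>= 96 MB)
]
print(crush_candidates(files))  # ['part-00000']
```

The point of the cutoff is that rewriting a file already close to a full block buys almost nothing - the NameNode metadata overhead it saves is negligible relative to the I/O spent rewriting it.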