Showing posts from April, 2014
Good read on HDFS small file compaction:

With that decided, we then looked for options to aggregate and compact small files on Hadoop, identifying three possible solutions:

1. filecrush - a highly configurable tool by Edward Capriolo to "crush" small files on HDFS. It supports a rich set of configuration arguments and is available as a jarfile (download it here) ready to run on your cluster. It's a sophisticated tool - for example, by default it won't bother crushing a file which is already within 75% of the HDFS block size. Unfortunately, it does not yet work with Amazon's s3:// paths, only hdfs:// paths - and our pull request to add this functionality is incomplete.

2. Consolidator - a Hadoop file consolidation tool from the dfs-datastores library, written by Nathan Marz. There is scant documentation for this - we could only find one paragraph, in this email thread. It has fewer capabilities than filecrush, and could do with a CLI-like wrapper to invoke it (we started w
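The core compaction idea these tools share - pack many small files into outputs near the block size, skipping files that are already "big enough" - can be sketched in plain Python. This is an illustration against local files, not HDFS; the function names, the 75% skip threshold (borrowed from filecrush's default), and the batching strategy are all assumptions, not any tool's actual implementation.

```python
import os

def compact_small_files(src_dir, dst_dir, block_size=128 * 1024 * 1024,
                        skip_threshold=0.75):
    """Concatenate small files from src_dir into larger files in dst_dir.

    Files already within skip_threshold of block_size are left alone,
    mirroring filecrush's default behaviour. Returns the number of
    compacted output files written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    batch, batch_bytes, out_idx = [], 0, 0
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        size = os.path.getsize(path)
        if size >= skip_threshold * block_size:
            continue  # already close to block size - not worth crushing
        if batch and batch_bytes + size > block_size:
            _write_batch(batch, dst_dir, out_idx)  # flush the full batch
            out_idx += 1
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:  # flush any remaining partial batch
        _write_batch(batch, dst_dir, out_idx)
        out_idx += 1
    return out_idx

def _write_batch(paths, dst_dir, idx):
    # Concatenate one batch of small files into a single output file.
    with open(os.path.join(dst_dir, "crushed-%05d" % idx), "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())
```

For example, ten 100-byte files compacted with a 400-byte "block size" would be packed four-at-a-time into three output files. A real implementation would also need to preserve record boundaries for formats like SequenceFile rather than byte-concatenating, which is part of what makes filecrush more sophisticated than this sketch.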