Showing posts from April, 2014
Good read on HDFS small file compaction:

With that decided, we then looked for options to aggregate and compact small files on Hadoop, identifying three possible solutions:

1. filecrush - a highly configurable tool by Edward Capriolo to "crush" small files on HDFS. It supports a rich set of configuration arguments and is available as a jarfile (download it here) ready to run on your cluster. It's a sophisticated tool - for example, by default it won't bother crushing a file which is already within 75% of the HDFS block size. Unfortunately, it does not yet work with Amazon's s3:// paths, only hdfs:// paths - and our pull request to add this functionality is incomplete.

2. Consolidator - a Hadoop file consolidation tool from the dfs-datastores library, written by Nathan Marz. There is scant documentation for this - we could only find one paragraph, in this email thread. It has fewer capabilities than filecrush, and could do with a CLI-like wrapper to invoke it (we started w
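The core compaction idea these tools share - pack many small files into outputs near the block size, skipping files that are already "big enough" - can be sketched in plain Python. This is an illustration against local files, not HDFS; the function names, the 75% skip threshold (borrowed from filecrush's default), and the batching strategy are all assumptions, not any tool's actual implementation.

```python
import os

def compact_small_files(src_dir, dst_dir, block_size=128 * 1024 * 1024,
                        skip_threshold=0.75):
    """Concatenate small files from src_dir into larger files in dst_dir.

    Files already within skip_threshold of block_size are left alone,
    mirroring filecrush's default behaviour. Returns the number of
    compacted output files written.
    """
    os.makedirs(dst_dir, exist_ok=True)
    batch, batch_bytes, out_idx = [], 0, 0
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        size = os.path.getsize(path)
        if size >= skip_threshold * block_size:
            continue  # already close to block size - not worth crushing
        if batch and batch_bytes + size > block_size:
            _write_batch(batch, dst_dir, out_idx)  # flush the full batch
            out_idx += 1
            batch, batch_bytes = [], 0
        batch.append(path)
        batch_bytes += size
    if batch:  # flush any remaining partial batch
        _write_batch(batch, dst_dir, out_idx)
        out_idx += 1
    return out_idx

def _write_batch(paths, dst_dir, idx):
    # Concatenate one batch of small files into a single output file.
    with open(os.path.join(dst_dir, "crushed-%05d" % idx), "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                out.write(f.read())
```

For example, ten 100-byte files compacted with a 400-byte "block size" would be packed four-at-a-time into three output files. A real implementation would also need to preserve record boundaries for formats like SequenceFile rather than byte-concatenating, which is part of what makes filecrush more sophisticated than this sketch.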