
Showing posts from November, 2013

Hadoop1 and Hadoop2 cleanup files using s3 storage

An HDFS file system can be configured to move deleted files into a trash dir instead of removing them right away.  If you're just using it as a file store, you need to manually clean up all those files.  You can do this with the following command:

hadoop fs -Dfs.defaultFS=s3://myS3bucket -Dfs.trash.interval=0 -expunge
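If you want to see what is about to be purged first, here is a rough sketch (it assumes the same bucket name as above and the standard per-user .Trash location; adjust both for your cluster):

# List what is currently sitting in the trash (the /user/$USER/.Trash
# location is the Hadoop default; yours may differ)
hadoop fs -Dfs.defaultFS=s3://myS3bucket -ls /user/$USER/.Trash

# Then purge it immediately by setting the retention interval to zero
hadoop fs -Dfs.defaultFS=s3://myS3bucket -Dfs.trash.interval=0 -expunge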

Hadoop2 Commands

In Hadoop 1.X:
HADOOP_HOME=/usr/lib/hadoop
Basically you'll use $HADOOP_HOME/bin/hadoop for all your commands.

In Hadoop 2.X:
Basically you'll use $HADOOP_HOME/bin/hadoop for your commands, plus:
/usr/lib/hadoop-hdfs/bin/hdfs
/usr/lib/hadoop-mapreduce/bin/mapred
/usr/lib/hadoop-yarn/bin/yarn

Typical commands you'll want to use are (see the sketch below):
hdfs dfs -ls
hdfs balancer
mapred job -list
yarn jar
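As a rough side-by-side sketch of how the same day-to-day tasks move from the single Hadoop 1 wrapper to the split Hadoop 2 commands (the paths, jar, and class names below are just placeholders):

# Hadoop 1.X: everything went through the single hadoop wrapper
hadoop fs -ls /user/steve
hadoop balancer
hadoop job -list
hadoop jar my-job.jar com.example.MyJob

# Hadoop 2.X: the same tasks split across hdfs, mapred, and yarn
hdfs dfs -ls /user/steve
hdfs balancer
mapred job -list
yarn jar my-job.jar com.example.MyJob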

Install nginx in CentOS 64 using yum

###############
# Install nginx
###############

# Add the nginx yum repo
export NGINX_REPO_FILE=/etc/yum.repos.d/nginx.repo
touch $NGINX_REPO_FILE
chmod 644 $NGINX_REPO_FILE
chown root:root $NGINX_REPO_FILE


echo "[nginx]" > $NGINX_REPO_FILE
echo "name=nginx repo" >> $NGINX_REPO_FILE
echo 'baseurl=http://nginx.org/packages/centos/$releasever/$basearch/' >> $NGINX_REPO_FILE
echo "gpgcheck=0" >> $NGINX_REPO_FILE
echo "enabled=1" >> $NGINX_REPO_FILE

# Verify it worked
cat $NGINX_REPO_FILE
yum repolist

# Install nginx
yum -y install nginx.x86_64

# config: /etc/nginx/nginx.conf
# config: /etc/sysconfig/nginx
# pidfile: /var/run/nginx.pid
# User configs
# /etc/nginx/conf.d/*.conf;
# Log location
# /var/log/nginx/access.log
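To actually bring it up, the CentOS 6 package ships a SysV init script, so starting it and enabling it at boot looks roughly like this (the curl check just confirms the default welcome page is being served):

# Start nginx now and have it come back after a reboot
service nginx start
chkconfig nginx on

# Quick sanity check against the default server on port 80
curl -I http://localhost/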

Hadoop distcp s3 vs s3n use on cmdline and limits

Like most of my posts, this is short.
To use distcp with s3://key:secret@bucket/ you must have it set up as a file system configured with the NameNode.  So you'll basically always use it like s3://bucket-name/.

The s3 implementation here saves files as blocks on Hadoop and scrambles the names.  It can't be used standalone.

If you want to use S3 standalone, use s3n.

You can test with hadoop fs -ls s3://bucket-name/ and if you can access it, great, it works.

s3n, which stands for the S3 native protocol, has Amazon's 5 GB file size limitation.
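For reference, a typical copy between HDFS and a native S3 bucket looks roughly like this (the bucket name, NameNode host, and paths are placeholders):

# Push a directory out of HDFS into a native S3 bucket
hadoop distcp hdfs://namenode:8020/user/steve/logs s3n://my-bucket/backups/logs

# Pull it back the other way
hadoop distcp s3n://my-bucket/backups/logs hdfs://namenode:8020/user/steve/logs-restored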

That's the short of it.
-Steve

S3 and S3N Config in Hadoop2 where to put awsAccessKeyId and awsSecretAccessKey

Short answer is that in Hadoop2 both the S3 and S3N settings go in:

"core-site.xml"

# To Setup S3 Block Filesystem
<property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
</property>

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
# To Setup S3N Native Filesystem
<property>
  <name>fs.default.name</name>
  <value>s3n://BUCKET</value>
</property>

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
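A quick way to sanity-check that the keys were picked up (BUCKET is whatever bucket you put in the config):

# Should list the bucket contents if the keys in core-site.xml are correct
hadoop fs -ls s3://BUCKET/
hadoop fs -ls s3n://BUCKET/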