Meet Hadoop

Apache Hadoop 2.x Standalone / Pseudo-Distributed Mode

Hadoop Single-Node Pseudo-Distributed Setup (Ubuntu)

Configure passwordless SSH to localhost

$ sudo apt-get install ssh rsync openssh-server
$ ssh-keygen -t rsa -P ""
$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
$ ssh localhost    # answer yes at the first prompt; later logins will not ask again

hadoop@hadoop:~$ ssh localhost
Welcome to Ubuntu 14.10 (GNU/Linux 3.16.0-24-generic x86_64)
* Documentation: https://help.ubuntu.com/
67 packages can be updated.
31 updates are security updates.
Last login: Mon Nov 10 09:27:25 2014 from localhost
hadoop@hadoop:~$
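
If ~/.ssh/authorized_keys already exists on the machine, appending the key is safer than overwriting it with cp; a minimal sketch:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys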

Set environment variables in /etc/profile or .bashrc

export LD_LIBRARY_PATH=/usr/local/lib/
unset GNOME_DESKTOP_SESSION_ID
# Oracle
export ORACLE_HOME=/usr/lib/oracle/11.2/client64
export LD_LIBRARY_PATH=$ORACLE_HOME/lib:$LD_LIBRARY_PATH
export TNS_ADMIN=$ORACLE_HOME/network/admin
export PATH=$PATH:$ORACLE_HOME/bin
# Maven ANT IVY
export M2_HOME=/home/hadoop/soft/apache-maven-3.0.4
export MAVEN_OPTS="-Xms256m -Xmx512m"
export ANT_HOME=/home/hadoop/soft/apache-ant-1.8.4
export IVY_HOME=/home/hadoop/soft/apache-ivy-2.3.0
export ANT_LIB=$ANT_HOME/lib
# JAVA SCALA ERLANG GO
export JAVA_HOME=/home/hadoop/soft/jdk1.7.0_67
export SCALA_HOME=/home/hadoop/soft/scala-2.10.4
export GRADLE_HOME=/home/hadoop/soft/gradle-1.12
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export ERL_HOME=/home/hadoop/soft/otp_src_R16B03-1
export GOROOT=/home/hadoop/soft/go
export GOPATH=/home/hadoop/data/go
# HADOOP SPARK STORM
#export HADOOP_HOME=/home/hadoop/soft/hadoop-0.20.2
#export HADOOP_HOME=/home/hadoop/soft/hadoop-1.0.4
#export HADOOP_HOME=/home/hadoop/soft/hadoop-2.2.0
#export HBASE_HOME=/home/hadoop/soft/hbase-0.96.0-hadoop2
#export PIG_HOME=/home/hadoop/soft/pig-0.12.0
#export HIVE_HOME=/home/hadoop/soft/hive-0.12.0
#export ZK_HOME=/home/hadoop/soft/zookeeper-3.4.5
#export HADOOP_HOME=/home/hadoop/soft/cdh4.6.0/hadoop-2.0.0-cdh4.6.0
#export HBASE_HOME=/home/hadoop/soft/cdh4.6.0/hbase-0.94.15-cdh4.6.0
#export HIVE_HOME=/home/hadoop/soft/cdh4.6.0/hive-0.10.0-cdh4.6.0
#export ZK_HOME=/home/hadoop/soft/cdh4.6.0/zookeeper-3.4.5-cdh4.6.0
#export HADOOP_HOME=/home/hadoop/soft/cdh5.1.2/hadoop-2.3.0-cdh5.1.2
#export HBASE_HOME=/home/hadoop/soft/cdh5.1.2/hbase-0.98.1-cdh5.1.2
#export HIVE_HOME=/home/hadoop/soft/cdh5.1.2/hive-0.12.0-cdh5.1.2
#export ZK_HOME=/home/hadoop/soft/cdh5.1.2/zookeeper-3.4.5-cdh5.1.2
#export SPARK_HOME=/home/hadoop/soft/spark-1.1.0-bin-hadoop2.3
export HADOOP_HOME=/home/hadoop/soft/cdh5.2.0/hadoop-2.5.0-cdh5.2.0
export HBASE_HOME=/home/hadoop/soft/cdh5.2.0/hbase-0.98.6-cdh5.2.0
export HIVE_HOME=/home/hadoop/soft/cdh5.2.0/hive-0.13.1-cdh5.2.0
export ZK_HOME=/home/hadoop/soft/cdh5.2.0/zookeeper-3.4.5-cdh5.2.0
export SPARK_HOME=/home/hadoop/soft/cdh5.2.0/spark-1.1.0-cdh5.2.0
export FLUME_HOME=/home/hadoop/soft/cdh5.2.0/apache-flume-1.5.0-cdh5.2.0-bin
export CRUNCH_HOME=/home/hadoop/soft/cdh5.2.0/crunch-0.11.0-cdh5.2.0
export DATAFU_HOME=/home/hadoop/soft/cdh5.2.0/datafu-1.1.0
#export HBASE_INDEXER_HOME=/home/hadoop/soft/cdh5.2.0/hbase-solr-1.5-cdh5.2.0
export HUE_HOME=/home/hadoop/soft/cdh5.2.0/hue-3.6.0-cdh5.2.0
export KITE_HOME=/home/hadoop/soft/cdh5.2.0/kite-0.15.0-cdh5.2.0
export MAHOUT_HOME=/home/hadoop/soft/cdh5.2.0/mahout-0.9-cdh5.2.0
export OOZIE_HOME=/home/hadoop/soft/cdh5.2.0/oozie-4.0.0-cdh5.2.0
export PIG_HOME=/home/hadoop/soft/cdh5.2.0/pig-0.12.0-cdh5.2.0
export SOLR_HOME=/home/hadoop/soft/cdh5.2.0/solr-4.4.0-cdh5.2.0
export SQOOP2_HOME=/home/hadoop/soft/cdh5.2.0/sqoop2-1.99.3-cdh5.2.0
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HDFS_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export STORM_HOME=/home/hadoop/soft/apache-storm-0.9.1-incubating
export JSTORM_HOME=/home/hadoop/soft/jstorm-0.9.2
export KAFKA_HOME=/home/hadoop/soft/kafka_2.10-0.8.1
export PATH=$JAVA_HOME/bin:$PATH:$M2_HOME/bin:$ANT_HOME/bin
export PATH=$PATH:$SCALA_HOME/bin:$GRADLE_HOME/bin
export PATH=$PATH:$ERL_HOME/bin:$GOROOT/bin:$GOPATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HBASE_HOME/bin:$HIVE_HOME/bin:$ZK_HOME/bin
export PATH=$PATH:$STORM_HOME/bin:$JSTORM_HOME/bin:$KAFKA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PATH=$PATH:$FLUME_HOME/bin:$MAHOUT_HOME/bin:$OOZIE_HOME/bin:$PIG_HOME/bin:$SQOOP2_HOME/bin
export HTTP_CLIENT="wget --no-check-certificate -O"

Note: HADOOP_HOME/bin contains the hadoop and hdfs commands, and HADOOP_HOME/sbin contains the scripts that start and stop the cluster; both directories must be added to PATH.
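
To confirm the new PATH is in effect, a quick sketch:

$ source ~/.bashrc
$ which hadoop          # should resolve to $HADOOP_HOME/bin/hadoop
$ which start-dfs.sh    # should resolve to $HADOOP_HOME/sbin/start-dfs.sh
$ hadoop version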

Configuration files

hadoop-0.20.x | hadoop-1.x

conf/core-site.xml

<configuration> 
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

conf/hadoop-env.sh

export JAVA_HOME=/home/hadoop/soft/jdk1.7.0_67

conf/hdfs-site.xml

<configuration> 
<property>
<name>dfs.name.dir</name>
<value>/home/hadoop/data/hadoop1/nn</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/hadoop/data/hadoop1/dn</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

conf/mapred-site.xml

<configuration> 
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

Hadoop-2.x | cdh4.x | cdh5.x

$ cd $HADOOP_HOME/etc/hadoop

hadoop-env.sh

export JAVA_HOME=/usr/local/java/jdk1.7.0_67

core-site.xml

<configuration> 
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hadoop/data/cdh520/tmp</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>10080</value>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>10080</value>
</property>
</configuration>

Note 1: in fully distributed mode, replace localhost with the NameNode's address or hostname; the same applies to the configuration files below.
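
To double-check that a client actually sees these values, hdfs getconf reads the configuration without requiring any daemon to be running; a sketch:

$ hdfs getconf -confKey fs.defaultFS         # expect hdfs://localhost:9000
$ hdfs getconf -confKey fs.trash.interval    # expect 10080 (minutes, i.e. 7 days)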

hdfs-site.xml

<configuration> 
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/cdh520/nn</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/cdh520/dn</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>localhost:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>localhost:50090</value>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
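
Since dfs.webhdfs.enabled is true, the NameNode also serves the WebHDFS REST API on the dfs.namenode.http-address port; once HDFS is started (see below) it can be probed with curl, for example:

$ curl -s "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"
$ curl -s "http://localhost:50070/webhdfs/v1/?op=GETHOMEDIRECTORY"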

mapred-site.xml

<configuration> 
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
</configuration>
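
After the JobHistory server is started (see the startup steps below), the two addresses above can be sanity-checked from the command line; a sketch:

$ curl -s http://localhost:19888/ws/v1/history/info    # JobHistory server REST API
$ mapred job -list                                     # lists jobs via the configured framework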

yarn-site.xml

<configuration> 
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>localhost:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>localhost:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>localhost:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>localhost:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>localhost:8088</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>
$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*,
$HADOOP_HDFS_HOME/share/hadoop/hdfs/*,$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $YARN_HOME/share/hadoop/yarn/*,
$YARN_HOME/share/hadoop/yarn/lib/*, $YARN_HOME/share/hadoop/mapreduce/*,$YARN_HOME/share/hadoop/mapreduce/lib/*
</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/hadoop/data/cdh520/yarn/local</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/home/hadoop/data/cdh520/yarn/logs</value>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/home/hadoop/data/cdh520/yarn/logs</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/home/hadoop/data</value>
</property>
</configuration>

Note 2: on the vanilla Apache release the value of yarn.nodemanager.aux-services must likewise be changed to mapreduce_shuffle.
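
Once start-yarn.sh has been run (later in this guide), the ResourceManager addresses configured above can be verified like this; a sketch:

$ yarn node -list                                      # NodeManagers registered with the RM
$ curl -s http://localhost:8088/ws/v1/cluster/info     # ResourceManager web/REST port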

Create the directory layout:

/home/hadoop/data/cdh520/
|---dn
|---nn
|---tmp
|---yarn
    |---logs
    |---local
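
The whole layout can be created in one command (bash brace expansion):

$ mkdir -p /home/hadoop/data/cdh520/{nn,dn,tmp,yarn/local,yarn/logs}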

Format the NameNode and start Hadoop

$ hadoop namenode -format
$ start-dfs.sh
$ start-yarn.sh
$ mr-jobhistory-daemon.sh start historyserver

Note 3: the commands above can be run from any directory, because HADOOP_HOME/bin and HADOOP_HOME/sbin are on PATH; the script names can even be tab-completed.
Note 4: unlike hadoop-1.x, start-dfs.sh lives under sbin, so sbin also has to be added to PATH.
Note 5: the NameNode only needs to be formatted before the very first start. If you re-format it later, make sure the DataNode data directory is emptied first, otherwise the DataNode will fail to start.
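
If a re-format really is needed, a safe sequence looks roughly like this (a sketch using the directories configured above; adjust the paths if yours differ):

$ stop-yarn.sh && stop-dfs.sh
$ rm -rf /home/hadoop/data/cdh520/nn/* /home/hadoop/data/cdh520/dn/*
$ hdfs namenode -format
$ start-dfs.sh && start-yarn.sh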

zqhxuyuan@zqh:~$ jps
25191 DataNode
25518 SecondaryNameNode
26789 Jps
24924 NameNode
25992 NodeManager
25718 ResourceManager
360 JobHistoryServer

Stop Hadoop

$ stop-yarn.sh
$ stop-dfs.sh

Check in the browser

Cluster (YARN ResourceManager): http://127.0.0.1:8088/cluster
File system (HDFS NameNode): http://localhost:50070/dfshealth.jsp
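
The same information is available from the command line, which is handy when no browser is at hand:

$ hdfs dfsadmin -report     # live DataNodes, capacity, remaining space
$ yarn application -list    # applications known to the ResourceManager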

Test MapReduce

$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep file:///Users/zhengqh/Soft/hadoop-2.7.3/input output 'dfs[a-z.]+'
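
The paths in the command above come from a local (file://) test; on the pseudo-distributed cluster configured here the same example can be run against HDFS roughly as follows (a sketch; the examples jar name depends on the exact Hadoop/CDH version installed):

$ hdfs dfs -mkdir -p /user/hadoop/input
$ hdfs dfs -put $HADOOP_CONF_DIR/*.xml /user/hadoop/input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/input /user/hadoop/output 'dfs[a-z.]+'
$ hdfs dfs -cat /user/hadoop/output/part-r-00000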

Meet CDH-4.6.0

1. env

# set HADOOP_HOME, ZK_HOME, HBASE_HOME and HIVE_HOME in ~/.bashrc, then sync it to the other nodes
scp .bashrc h2:~/
scp .bashrc h3:~/
source .bashrc

Hadoop

Refer to 02. Hadoop Distributed Cluster (CentOS) Installation Guide, then start the cluster with start-dfs.sh and start-yarn.sh.
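
Before moving on to ZooKeeper, it is worth confirming that HDFS and YARN are really up on all nodes; a sketch:

$ hdfs dfsadmin -report    # should list the DataNodes
$ yarn node -list          # should list the NodeManagers
$ jps                      # NameNode/ResourceManager on the master, DataNode/NodeManager on the workers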

ZooKeeper

1). zoo.cfg

[hadoop@h1 ~]$ tar zxf zookeeper-3.4.5-cdh4.6.0.tar.gz 
[hadoop@h1 ~]$ cd zookeeper-3.4.5-cdh4.6.0/conf
[hadoop@h1 conf]$ vi zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/hadoop/data/zookeeper
clientPort=2181
server.1=h1:2888:3888
server.2=h2:2888:3888
server.3=h3:2888:3888

2). dataDir & myid

$ mkdir /home/hadoop/data/zookeeper
$ echo "1" > /home/hadoop/data/zookeeper/myid

3). Sync ZooKeeper to the other nodes

scp -r zookeeper-3.4.5-cdh4.6.0 h2:
scp -r zookeeper-3.4.5-cdh4.6.0 h3:

4). On h2 and h3, change the value in /home/hadoop/data/zookeeper/myid to 2 and 3 respectively.

5). Run zkServer.sh start on each node, check the process with jps, and use zkServer.sh status to see whether the node is the leader or a follower.

6). ZooKeeper client test

$ zkCli.sh
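
A few basic commands inside zkCli.sh are enough to confirm the ensemble is serving requests; a sketch:

$ zkCli.sh -server h1:2181
[zk: h1:2181(CONNECTED) 0] ls /
[zk: h1:2181(CONNECTED) 1] create /zktest hello
[zk: h1:2181(CONNECTED) 2] get /zktest
[zk: h1:2181(CONNECTED) 3] delete /zktest
[zk: h1:2181(CONNECTED) 4] quit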

HBase

1). hbase-env.sh

export JAVA_HOME=/usr/local/java/jdk1.7.0_04 
export HBASE_MANAGES_ZK=false

2). hbase-site.xml

<configuration> 
<property>
<name>hbase.rootdir</name>
<value>hdfs://h1:9000/hbase</value>
</property>
<property>
<name>hbase.master</name>
<value>h1:60000</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>h1,h2,h3</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/data/zookeeper</value>
</property>
</configuration>

$ vi regionservers

h1
h2
h3

3). Sync HBase to the other nodes

scp -r hbase-0.94.15-cdh4.6.0 h2:
scp -r hbase-0.94.15-cdh4.6.0 h3:

4). Start HBase on the master node h1

hbase $ start-hbase.sh

5). HBase shell test

$ hbase shell
> create 'table','cf'
> list
> put 'table','rowkey1','cf:column','value'
> scan 'table'
> get 'table','rowkey1'

5. Hive

1). Install MySQL

Create the Hive metastore database hive. Strictly speaking you could skip this (the JDBC URL below uses createDatabaseIfNotExist=true), but creating it manually lets you set the database character set.

$ mysql -u root -p    # enter the root password
mysql> create database hive default character set latin1;
mysql> create user 'hive'@'localhost' identified by 'hive';
mysql> grant all on hive.* to 'hive'@'localhost';
mysql> flush privileges;

Creating the hive user is optional. If you did create it, you can log in with mysql -u hive -p (password hive) and check that the hive database is visible. If the character set was not set when the database was created, it can be changed afterwards:

mysql> alter database hive character set latin1;
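
One thing to watch for: hive-site.xml below connects through jdbc:mysql://h1:3306, so MySQL sees the client host as h1 rather than localhost. If you connect as the hive user (or hit an access-denied error), an extra grant for that host may be needed; a sketch:

mysql> grant all on hive.* to 'hive'@'h1' identified by 'hive';
mysql> grant all on hive.* to 'hive'@'%' identified by 'hive';
mysql> flush privileges;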

2). Configure Hive on h1

$ cd $HIVE_HOME/conf 
$ vi hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.local</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://h1:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://h1:9083</value>
</property>
</configuration>

3). Start the metastore and hiveserver2 from $HIVE_HOME/bin

$ hive --service metastore &
$ hive --service hiveserver2 &

In fact, since $HIVE_HOME/bin is on PATH and both scripts above live in $HIVE_HOME/bin, the two services can be started from any directory.
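
To confirm that both services came up, check the listening ports: the metastore defaults to 9083 (matching hive.metastore.uris above) and hiveserver2 to 10000; a sketch:

$ jps                                   # each Hive service shows up as a RunJar process
$ netstat -tlnp | grep -E '9083|10000'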

4). Hive client test: $ hive

> show tables;
> create table records(year STRING, temp INT, quality INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
> LOAD DATA LOCAL INPATH 'hadoop-book-2e/input/ncdc/micro-tab/sample.txt' overwrite into table records;
> select * from records;
> select year, max(temp) from records where temp != 9999
> and (quality =0 or quality=1 or quality=4 or quality=5 or quality=9) group by year;

Note that LOAD DATA LOCAL INPATH resolves the path relative to the directory where the hive client was started, so hadoop-book-2e must sit in that directory; otherwise use an absolute or relative path that points to it.
Then check on h1:50070 that HDFS now contains hdfs://h1:9000/user/hive/warehouse/records/sample.txt.
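
The same check can be done from the command line; a sketch:

$ hadoop fs -ls /user/hive/warehouse/records
$ hadoop fs -cat /user/hive/warehouse/records/sample.txt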

5). Verify the Hive metadata
Check that the hive database now exists in MySQL; this is where Hive keeps its metadata.
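
A quick way to peek at the metastore tables (a sketch; table names such as TBLS and DBS come from Hive's metastore schema):

$ mysql -u root -p -e "use hive; show tables; select TBL_NAME, TBL_TYPE from TBLS;"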

6). Using beeline with hiveserver2
To use beeline, the Hive service must be started as hiveserver2:

hive --service hiveserver2
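
Then connect with beeline over JDBC (a sketch; the user name is just the OS account, since no authentication has been configured here):

$ beeline -u jdbc:hive2://h1:10000 -n hadoop
0: jdbc:hive2://h1:10000> show tables;
0: jdbc:hive2://h1:10000> select * from records limit 10;
0: jdbc:hive2://h1:10000> !quit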
