Spark on YARN

Apache Spark on YARN mode

Spark on YARN has two modes:

  • yarn-client: the driver runs outside the YARN cluster
  • yarn-cluster: the driver runs inside the YARN cluster

yarn-cluster

Note: once Hadoop is installed and started, Spark on YARN actually requires no extra configuration.

1. Upload the spark-assembly jar to a path on HDFS, here /lib:

[qihuang.zheng@dp0652 lib]$ /usr/install/hadoop/bin/hadoop fs -mkdir /lib
[qihuang.zheng@dp0652 lib]$ /usr/install/hadoop/bin/hadoop fs -put spark-assembly-1.4.0-hadoop2.6.0.jar /lib
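To double-check that the jar is in place:

# List the target HDFS directory and confirm the assembly jar is there
/usr/install/hadoop/bin/hadoop fs -ls /lib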

2. When submitting the job, the --master argument to spark-submit changes:

SPARK_JAR=hdfs://tdhdfs/lib/spark-assembly-1.4.0-hadoop2.6.0.jar \
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-cluster \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples*.jar \
10

Note: since this is now Spark on YARN, the submitted job shows up in the Hadoop YARN UI rather than in Spark's UI.
Also, because this is yarn-cluster mode, the terminal does not print the value of Pi; the yarn-client mode below, by contrast, does print Pi in the terminal.
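In yarn-cluster mode the driver runs inside the ApplicationMaster, so its stdout (including the Pi line) ends up in the container logs. With YARN log aggregation enabled, they can be fetched once the application finishes; the application ID below is a placeholder for the one printed by spark-submit:

# Fetch the aggregated container logs of a finished application
/usr/install/hadoop/bin/yarn logs -applicationId <applicationId> | grep "Pi is roughly"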

yarn-client

Same as yarn-cluster, except that --master changes from yarn-cluster to yarn-client:

SPARK_JAR=hdfs://tdhdfs/lib/spark-assembly-1.4.0-hadoop2.6.0.jar \
bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
lib/spark-examples-1.4.0-hadoop2.6.0.jar \
10

Terminal log output when submitting the application with yarn-client:

15/07/07 15:34:07 INFO SparkUI: Started SparkUI at http://192.168.6.52:4040
15/07/07 15:34:07 INFO SparkContext: Added JAR file:/usr/install/spark-1.4.0-bin-hadoop2.6/lib/spark-examples-1.4.0-hadoop2.6.0.jar at http://192.168.6.52:42366/jars/spark-examples-1.4.0-hadoop2.6.0.jar with timestamp 1436254447750
15/07/07 15:34:08 INFO ConfiguredRMFailoverProxyProvider: Failing over to rm2
15/07/07 15:34:08 INFO Client: Requesting a new application from cluster with 2 NodeManagers
15/07/07 15:34:08 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/07/07 15:34:08 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/07/07 15:34:08 INFO Client: Setting up container launch context for our AM
15/07/07 15:34:08 INFO Client: Preparing resources for our AM container
15/07/07 15:34:08 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
15/07/07 15:34:08 INFO Client: Source and destination file systems are the same. Not copying hdfs://tdhdfs/lib/spark-assembly-1.4.0-hadoop2.6.0.jar
15/07/07 15:34:09 INFO Client: Uploading resource file:/tmp/spark-c564fff3-dd94-4b52-81c9-6e6f7e4d4ceb/__hadoop_conf__2813971044505707826.zip -> hdfs://tdhdfs/user/qihuang.zheng/.sparkStaging/application_1436175086022_0003/__hadoop_conf__2813971044505707826.zip
15/07/07 15:34:09 INFO Client: Setting up the launch environment for our AM container
15/07/07 15:34:09 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
15/07/07 15:34:09 INFO Client: Submitting application 3 to ResourceManager
15/07/07 15:34:09 INFO YarnClientImpl: Submitted application application_1436175086022_0003
15/07/07 15:34:10 INFO Client: Application report for application_1436175086022_0003 (state: ACCEPTED)
15/07/07 15:34:11 INFO Client: Application report for application_1436175086022_0003 (state: ACCEPTED)
...
...
15/07/07 15:34:38 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/07/07 15:34:38 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/07/07 15:34:38 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)
15/07/07 15:34:38 INFO YarnScheduler: Adding task set 0.0 with 10 tasks
15/07/07 15:34:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, dp0653, PROCESS_LOCAL, 1446 bytes)
15/07/07 15:34:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, dp0652, PROCESS_LOCAL, 1446 bytes)
...
15/07/07 15:34:48 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on dp0653:43216 (size: 1202.0 B, free: 1060.3 MB)
15/07/07 15:34:49 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 10414 ms on dp0653 (10/10)
15/07/07 15:34:49 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:35) finished in 10.420 s
15/07/07 15:34:49 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/07/07 15:34:49 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 10.776165 s
Pi is roughly 3.144008

The first log line, Started SparkUI at http://192.168.6.52:4040, shows that in client mode the job can also be inspected in the Spark UI.

Note: the Spark UI here is the application's UI; every application has its own, each on a different port. Standalone mode additionally has a Master UI (8082 in this deployment).
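Each new application tries port 4040 first and falls back to 4041, 4042, and so on if it is taken; the port can also be pinned per application, e.g.:

# Pin the application UI port explicitly (4050 is an arbitrary choice)
bin/spark-submit --conf spark.ui.port=4050 --class org.apache.spark.examples.SparkPi \
--master yarn-client lib/spark-examples-1.4.0-hadoop2.6.0.jar 10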

(Figure, not reproduced here: the applications in the YARN UI; the first is yarn-client (7:38), the second yarn-cluster (7:36).)
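The same list can be pulled from the command line when the UI is not handy:

# List applications known to the ResourceManager, including finished ones
/usr/install/hadoop/bin/yarn application -list -appStates FINISHED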

Questions

When deploying a Spark application on YARN, there is no need to supply a URL as the master value the way Standalone and Mesos require,
because the Spark application can obtain the relevant information from Hadoop's configuration files;
simply passing yarn-cluster or yarn-client as the master is enough.
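Concretely, Spark reads the ResourceManager address from the files under HADOOP_CONF_DIR (or YARN_CONF_DIR); assuming that variable points at the Hadoop configuration directory, a quick way to see what it will pick up:

# The ResourceManager location Spark uses comes from yarn-site.xml
grep -A1 "yarn.resourcemanager" $HADOOP_CONF_DIR/yarn-site.xml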

YARN client: the Spark driver runs on the client machine and asks YARN to launch executors that run the tasks.
YARN cluster: the Spark driver is first started inside the YARN cluster as the ApplicationMaster,
and the ApplicationMaster then requests resources from the ResourceManager to launch executors that run the tasks.

With yarn-client, the driver runs on the client, so the program's results can be displayed there.
With yarn-cluster, the driver runs inside YARN, so the results cannot be displayed on the client;
it is best to save them to HDFS, while the client terminal shows only the progress of the YARN job.
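From the client, the final outcome of a yarn-cluster application can still be checked after the fact (the application ID is again a placeholder):

# Show the final status and tracking URL of an application
/usr/install/hadoop/bin/yarn application -status <applicationId>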

Newer versions

In newer versions (2.0+), the deploy mode is passed separately with --deploy-mode when submitting an application (an interactive spark-shell can only run in client mode):

bin/spark-submit --master yarn --deploy-mode cluster ...
bin/spark-submit --master yarn --deploy-mode client ...

Also, in newer versions the location of the Spark jars is given by the spark.yarn.jars configuration option.
For spark-1.6 a single spark-assembly.jar on an HDFS path is enough,
whereas spark-2.x requires uploading all the jars under the jars directory to HDFS.
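A minimal sketch for spark-2.x, with an arbitrarily chosen HDFS path:

# Upload the Spark 2.x jars to HDFS once...
hadoop fs -mkdir -p /spark/jars
hadoop fs -put jars/*.jar /spark/jars/
# ...then point spark.yarn.jars at them, e.g. in conf/spark-defaults.conf:
#   spark.yarn.jars hdfs:///spark/jars/*.jar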

Single-node Spark on YARN

UpdateDate: 20170720

Software:

  • hadoop-2.7.2
  • spark-2.1.1-bin-hadoop2.7

The official docs have a ten-minute single-node quickstart: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

HDFS UI: http://localhost:50070/
YARN UI: http://localhost:8088/cluster

Check the HDFS and YARN processes:

➜  hadoop-2.7.2 jps -lm
34976 org.apache.hadoop.yarn.server.nodemanager.NodeManager
34628 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
34443 org.apache.hadoop.hdfs.server.namenode.NameNode
34523 org.apache.hadoop.hdfs.server.datanode.DataNode
34891 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager

Add the following two settings to the environment variables. Per the Spark docs, it is enough for either HADOOP_CONF_DIR or YARN_CONF_DIR to point at the Hadoop configuration directory; there is no need to export $HADOOP_HOME/bin or the like.

export HADOOP_HOME=/Users/zhengqh/Downloads/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

source ~/.zshrc

Make sure the machine has enough memory, then submit:

bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn --deploy-mode client \
examples/jars/spark-examples_2.11-2.1.1.jar 10
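If the submission hangs in the ACCEPTED state or containers are killed for exceeding memory, the single-node YARN memory ceilings in yarn-site.xml are the usual culprits; the property names below are standard YARN settings:

# Per-container and per-node memory ceilings that gate container allocation
grep -A1 -e "maximum-allocation-mb" -e "resource.memory-mb" $HADOOP_CONF_DIR/yarn-site.xml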
