Setting Up Hive on Spark

Switch Hive's execution engine from MapReduce to Spark.
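
Once the configuration below is in place, the engine can also be switched per session from the Hive prompt (for example, to fall back to MapReduce for comparison):

hive> set hive.execution.engine=spark;
hive> set hive.execution.engine=mr;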

Configure Hive

Copy the required Spark JARs into Hive's lib directory

$ cp ~/spark/jars/scala-library-2.11.8.jar ~/hive/lib/
$ cp ~/spark/jars/spark-network-common_2.11-2.3.1.jar ~/hive/lib/
$ cp ~/spark/jars/spark-core_2.11-2.3.1.jar ~/hive/lib/
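
The exact file names depend on the Spark version installed; a quick way to confirm the copies landed (assuming the same paths as above):

$ ls ~/hive/lib | grep -E 'scala-library|spark-core|spark-network-common'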

Configure hive-site.xml

$ vi ~/hive/conf/hive-site.xml 

Configuration contents

<configuration>
  <!--jdbc-->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>shark</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>shark</value>
  </property>
  <property>
    <name>datanucleus.schema.autoCreateAll</name>
    <value>true</value>
  </property>
  <!--spark engine -->
  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
  </property>
  <property>
    <name>hive.enable.spark.execution.engine</name>
    <value>true</value>
  </property>
  <!--sparkcontext -->
  <property>
    <name>spark.master</name>
    <!--
    <value>yarn-cluster</value>
    -->
    <value>spark://hadoop1:7077</value>
  </property>
  <property>
    <name>spark.serializer</name>
    <value>org.apache.spark.serializer.KryoSerializer</value>
  </property>
</configuration>

When configuring spark.master, I tested yarn-cluster and found that the Spark job would sometimes finish while the YARN application stayed stuck in PENDING, so I point directly at the Spark master instead. Doing it this way, the hive directory has to be copied from hadoop1 to the hosts running the Spark Workers, because the Spark Workers need to reference Hive's lib.

$ scp -r ~/hive hadoop3:/home/hadoop
$ scp -r ~/hive hadoop4:/home/hadoop
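
Hive on Spark also reads Spark resource settings from hive-site.xml. A minimal sketch, added inside the same <configuration> block; the values here are placeholders and should be tuned to the cluster:

  <!-- Spark resources (example values) -->
  <property>
    <name>spark.executor.memory</name>
    <value>1g</value>
  </property>
  <property>
    <name>spark.executor.cores</name>
    <value>1</value>
  </property>
  <property>
    <name>spark.driver.memory</name>
    <value>1g</value>
  </property>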

Test Hive on Spark

Start Hive

$ ~/hive/bin/hive

Run a test query

hive> use demo;
hive> select count(*) from phone;
Query ID = hadoop_20180827103004_0fbfe6c7-f3c7-42ab-9161-9c2a06da2102
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Running with YARN Application = application_1534922925601_0006
Kill Command = /home/hadoop/hadoop/bin/yarn application -kill application_1534922925601_0006
Hive on Spark Session Web UI URL: http://hadoop6:34641

Query Hive on Spark job[0] stages: [0, 1]
Spark job[0] status = RUNNING
--------------------------------------------------------------------------------------
          STAGES   ATTEMPT        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  
--------------------------------------------------------------------------------------
Stage-0 ........         0      FINISHED      1          1        0        0       0  
Stage-1 ........         0      FINISHED      1          1        0        0       0  
--------------------------------------------------------------------------------------
STAGES: 02/02    [==========================>>] 100%  ELAPSED TIME: 10.19 s    
--------------------------------------------------------------------------------------
Spark job[0] finished successfully in 10.19 second(s)
OK
7
Time taken: 51.209 seconds, Fetched: 1 row(s)
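
To double-check which engine the session is using, the current value can be printed from the Hive prompt:

hive> set hive.execution.engine;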
