Local run: run the program directly in IDEA and read the result from the console output.
Cluster run: package the program into a jar locally and submit it to the cluster to run (the result is written to HDFS).
Table of Contents
I. Running the Spark program locally
II. Running the Spark program on the cluster
I. Running the Spark program locally
1. pom dependencies
Note: the dependencies and their versions must match the cluster environment.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.itcast</groupId>
    <artifactId>SparkDemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <repositories>
        <repository><id>aliyun</id><url>http://maven.aliyun.com/nexus/content/groups/public/</url></repository>
        <repository><id>apache</id><url>https://repository.apache.org/content/repositories/snapshots/</url></repository>
        <repository><id>cloudera</id><url>https://repository.cloudera.com/artifactory/cloudera-repos/</url></repository>
    </repositories>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.12.11</scala.version>
        <spark.version>3.0.1</spark.version>
        <hadoop.version>2.7.5</hadoop.version>
    </properties>

    <dependencies>
        <dependency><groupId>org.scala-lang</groupId><artifactId>scala-library</artifactId><version>${scala.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-streaming-kafka-0-10_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-sql_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-hive_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-hive-thriftserver_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-sql-kafka-0-10_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.spark</groupId><artifactId>spark-mllib_2.12</artifactId><version>${spark.version}</version></dependency>
        <dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-client</artifactId><version>2.7.5</version></dependency>
        <dependency><groupId>com.hankcs</groupId><artifactId>hanlp</artifactId><version>portable-1.7.7</version></dependency>
        <dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><version>8.0.23</version></dependency>
        <dependency><groupId>redis.clients</groupId><artifactId>jedis</artifactId><version>2.9.0</version></dependency>
        <dependency><groupId>com.alibaba</groupId><artifactId>fastjson</artifactId><version>1.2.47</version></dependency>
        <dependency><groupId>org.projectlombok</groupId><artifactId>lombok</artifactId><version>1.18.2</version><scope>provided</scope></dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <includes>
                                    <include>*:*</include>
                                </includes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
2. Sample data
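The input is just a plain-text file of space-separated words (the actual contents used in this article are shown in the screenshot); a small hypothetical data/input/words.txt might look like:

hello spark
hello hadoop
hello scala
spark streaming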
3. Code
package org.example.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object word {
  def main(args: Array[String]): Unit = {
    // set up the Spark environment (local mode, using all available cores)
    val conf = new SparkConf().setMaster("local[*]").setAppName("wordcount")
    val sc = new SparkContext(conf)
    // load the input file
    val rdd1: RDD[String] = sc.textFile("data/input/words.txt")
    // split each line into words
    val rdd2: RDD[String] = rdd1.flatMap(lp => lp.split(" "))
    // map each word to a (word, 1) pair
    val rdd3: RDD[(String, Int)] = rdd2.map(it => (it, 1))
    // sum the counts per word
    val rdd4: RDD[(String, Int)] = rdd3.reduceByKey((curr, agg) => curr + agg)
    // collect the result to the driver and print it
    val result: Array[(String, Int)] = rdd4.collect()
    result.foreach(i => println(i))
    // release the SparkContext
    sc.stop()
  }
}
4. Run locally
Note: the local result of the word-count example is shown in the screenshot.
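For the hypothetical words.txt sketched above, the console output would be one (word, count) tuple per line, in no particular order, e.g.:

(hello,3)
(spark,2)
(hadoop,1)
(scala,1)
(streaming,1)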
II. Running the Spark program on the cluster
1. Modify the code
val rdd1: RDD[String] = sc.textFile("hdfs:///input/wordcount.txt")
rdd4.saveAsTextFile("hdfs://192.168.231.247:8020/output/output1")
Note: for a cluster run the input path points at HDFS, i.e. each run reads its data from HDFS and writes its output back to HDFS; a full sketch of the modified program is given below.
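Putting the two changes together, a minimal sketch of the cluster version looks like the following (the HDFS paths and NameNode address are taken from the lines above; setMaster is left out because the master URL is passed to spark-submit, and the output directory must not already exist):

package org.example.spark

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object word {
  def main(args: Array[String]): Unit = {
    // the master URL is supplied by spark-submit, so it is not hard-coded here
    val conf = new SparkConf().setAppName("wordcount")
    val sc = new SparkContext(conf)
    // read the input from HDFS
    val rdd1: RDD[String] = sc.textFile("hdfs:///input/wordcount.txt")
    // split lines into words, map to (word, 1), and sum the counts per word
    val rdd4: RDD[(String, Int)] = rdd1
      .flatMap(_.split(" "))
      .map(w => (w, 1))
      .reduceByKey(_ + _)
    // write the result back to HDFS; the target directory must not already exist
    rdd4.saveAsTextFile("hdfs://192.168.231.247:8020/output/output1")
    sc.stop()
  }
}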
2. Package the jar
Note: double-click package in IDEA's Maven tool window; Maven will compile the project, run the tests, and package it into a jar under target/ (run clean first if you want to clear previous build output).
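The same thing can be done from the command line (assuming mvn is on the PATH); skipping the tests is optional:

mvn clean package
mvn clean package -DskipTests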
3. Locate the jar in the project's target directory
Note: the smallest jar (the original-*.jar) is the one without bundled dependencies; the cluster already provides the Spark dependencies, so uploading this minimal jar is enough.
4. Upload the jar to Linux
Note: Xftp is used here to transfer the jar.
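Any file-transfer tool works; as an alternative to Xftp, a plain scp from the project directory would look like this (the root user and the /input target directory are assumptions matching the submit command below):

scp target/original-SparkDemo-1.0-SNAPSHOT.jar root@192.168.231.247:/input/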
5. Start the Hadoop and Spark clusters
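Assuming HADOOP_HOME and SPARK_HOME are configured on the cluster (an assumption about this environment), a typical start-up looks like:

$HADOOP_HOME/sbin/start-dfs.sh    # start HDFS (NameNode and DataNodes)
$SPARK_HOME/sbin/start-all.sh     # start the Spark standalone master and workers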
6. From the Spark installation directory, submit the job:
bin/spark-submit --class org.example.spark.word --master spark://master:7077 /input/original-SparkDemo-1.0-SNAPSHOT.jar
Note: 7077 is the default port of the Spark standalone master (the HDFS NameNode port 8020 is only used in the hdfs:// paths). The cluster run of the word-count example is shown in the screenshot.
7. Check the output directory in the HDFS web UI
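The result can also be inspected from the shell instead of the web UI (paths as written by the program above):

hdfs dfs -ls /output/output1
hdfs dfs -cat /output/output1/part-*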



