1. Overview
2. Versions and environment
3. Basic environment preparation
4. Source preparation
5. IDEA setup
  5.1 Maven plugin settings and updates in IDEA
  5.2 Importing the Spark modules into IDEA
6. Building Spark and running the JavaWordCount example
  6.1 Building Spark for a specific version
  6.2 Handling the spark-version-info.properties file
  6.3 Adding the jars to the required classpath
  6.4 JavaWordCount run configuration
  6.5 Setting the original data file for the computation
Appendix
  1). Maven settings.xml with China mirrors
  2). Details of the "Could not find spark-version-info.properties" error
  3). Contents of spark-version-info.properties
  4). Contents of the original data file cnt.txt
References
1. Overview

2. Versions and environment

| Item | Version | Notes |
|---|---|---|
| os | win10 | |
| jdk | 1.8 | |
| scala | 2.11.12 | |
| spark | 2.4.8 | |
| maven | 3.8.1 | a version consistent with the source's requirement also works |
| sbt | 1.4 | no apparent use |
| idea | 2020.03 | |
3. Basic environment preparation

Please look up how to install the following components yourself:

- install the JDK locally on Win10
- install Scala locally on Win10
- install Maven locally on Win10
4. Source preparation

Fork the Spark source[1] into your personal GitHub repository, then configure GitHub in IDEA and clone it. This is simple enough that I won't write it up.
5. IDEA setup

5.1 Maven plugin settings and updates in IDEA
For the settings.xml with China mirrors, see Appendix 1.
6. Building Spark and running the JavaWordCount example

6.1 Building Spark for a specific version

In the IDEA terminal, go to the Spark source root directory and build, pinning the Hadoop and YARN versions:

mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

ps: hadoop 2.7 is more stable, but my local environment runs hadoop 2.6, so I kept it as is.
If the build succeeds, the output looks like the figure below.[2]
6.2 Handling the spark-version-info.properties file

Run build/spark-build-info from Git Bash as administrator to generate the spark-version-info.properties file:[3]

build/spark-build-info D:\<directory path>
Copy the generated spark-version-info.properties file into the root directory inside the spark-core_2.11-2.4.0-SNAPSHOT.jar. (Before copying, check whether spark-version-info.properties already exists at the jar root; copy it only if it does not.)
ps: without it, Spark fails with a "Could not find spark-version-info.properties" error[4]; see Appendix 2 for the details.
However, the build/spark-build-info command above failed when I tried it, so I opened the script and generated the spark-version-info.properties file by hand. The key shell code is:
echo_build_properties() {
  echo version=$1                                   # version number
  echo user=$USER                                   # user name
  echo revision=$(git rev-parse HEAD)               # full commit hash
  echo branch=$(git rev-parse --abbrev-ref HEAD)    # branch name
  echo date=$(date -u +%Y-%m-%dT%H:%M:%SZ)          # build date (UTC)
  echo url=$(git config --get remote.origin.url | sed 's|https://\(.*\)@\(.*\)|https://\2|')  # remote url with credentials stripped
}
Running it in Git Bash produces the output shown below.
The contents of the resulting spark-version-info.properties file are in Appendix 3.
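Before packing the file into the jar, it is worth confirming that all six keys written by build/spark-build-info are present. A quick sketch (the file content is the example from Appendix 3, with a placeholder account in the url):

```shell
# write the example file (content as in Appendix 3; the url account is a placeholder)
cat > spark-version-info.properties <<'EOF'
version=2.4.8
user=root
revision=4be566062defa249435c4d72xxxxxxxxxxxxxx
branch=branch-2.4
date=2022-02-16T09:58:54Z
url=https://github.com/your-account/spark.git
EOF
# check the six keys the build script writes
for key in version user revision branch date url; do
  grep -q "^$key=" spark-version-info.properties && echo "$key ok"
done
```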
6.3 Adding the jars to the required classpath

This article uses the JavaWordCount program as the success check for the source environment, so the relevant jars are added to the examples module.[5]
ps: the contents of the original data file cnt.txt are in Appendix 4.
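Since cnt.txt holds one token per line, the result JavaWordCount should print can be cross-checked with a shell one-liner (file content inlined from Appendix 4):

```shell
# count word occurrences the way JavaWordCount does, for the sample data:
# each line of cnt.txt is a single token, so sort | uniq -c is enough
printf 'a\nb\nc\na\nc\nb\na\n' | sort | uniq -c
# a appears 3 times, b and c twice each
```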
At this point, the Spark source debugging and reading environment is ready!
Appendix

1). Maven settings.xml with China mirrors

<?xml version="1.0" encoding="UTF-8"?>
<settings>
  <localRepository>D:\mvn_res</localRepository>
  <servers>
    <server>
      <id>AUTOHOME</id>
      <username>admin</username>
      <password>admin123</password>
    </server>
  </servers>
  <mirrors>
    <mirror>
      <id>alimaven</id>
      <mirrorOf>central</mirrorOf>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
    </mirror>
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>central</id>
      <name>Maven Repository Switchboard</name>
      <url>http://repo1.maven.org/maven2/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>repo2</id>
      <mirrorOf>central</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://repo2.maven.org/maven2/</url>
    </mirror>
    <mirror>
      <id>ibiblio</id>
      <mirrorOf>central</mirrorOf>
      <name>Human Readable Name for this Mirror.</name>
      <url>http://mirrors.ibiblio.org/pub/mirrors/maven2/</url>
    </mirror>
    <mirror>
      <id>jboss-public-repository-group</id>
      <mirrorOf>central</mirrorOf>
      <name>JBoss Public Repository Group</name>
      <url>http://repository.jboss.org/nexus/content/groups/public</url>
    </mirror>
    <mirror>
      <id>google-maven-central</id>
      <name>Google Maven Central</name>
      <url>https://maven-central.storage.googleapis.com</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>maven.net.cn</id>
      <name>oneof the central mirrors in china</name>
      <url>http://maven.net.cn/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
    <mirror>
      <id>repo2</id>
      <mirrorOf>central</mirrorOf>
      <url>http://repo2.maven.org/maven2/</url>
    </mirror>
  </mirrors>
  <profiles>
    <profile>
      <id>spark</id>
      <repositories>
        <repository>
          <id>cloudera</id>
          <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
        </repository>
        <repository>
          <id>alimaven</id>
          <name>Maven Aliyun Mirror</name>
          <url>http://maven.aliyun.com/nexus/content/repositories/central/</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>false</enabled></snapshots>
        </repository>
        <repository>
          <id>huaweicloudsdk</id>
          <url>https://repo.huaweicloud.com/repository/maven/huaweicloudsdk/</url>
          <releases><enabled>true</enabled></releases>
          <snapshots><enabled>true</enabled></snapshots>
        </repository>
      </repositories>
    </profile>
  </profiles>
  <activeProfiles>
    <activeProfile>alimaven</activeProfile>
    <activeProfile>spark</activeProfile>
  </activeProfiles>
</settings>

2). Details of the "Could not find spark-version-info.properties" error

Connected to the target VM, address: '127.0.0.1:60929', transport: 'socket'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" java.lang.ExceptionInInitializerError
	at org.apache.spark.package$.<init>(package.scala:93)
	at org.apache.spark.package$.<clinit>(package.scala)
	at org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:183)
	at org.apache.spark.SparkContext$$anonfun$3.apply(SparkContext.scala:183)
	at org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)
	at org.apache.spark.SparkContext.logInfo(SparkContext.scala:73)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:183)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2526)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:930)
	at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:921)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:921)
	at org.apache.spark.examples.JavaWordCount.main(JavaWordCount.java:44)
Caused by: org.apache.spark.SparkException: Could not find spark-version-info.properties
	at org.apache.spark.package$SparkBuildInfo$.<init>(package.scala:62)
	at org.apache.spark.package$SparkBuildInfo$.<clinit>(package.scala)
	... 13 more
Disconnected from the target VM, address: '127.0.0.1:60929', transport: 'socket'

Process finished with exit code 1

3). Contents of spark-version-info.properties
version=2.4.8
user=root
revision=4be566062defa249435c4d72xxxxxxxxxxxxxx
branch=branch-2.4
date=2022-02-16T09:58:54Z
url=https://github.com/<your-github-account>/spark.git

4). Contents of the original data file cnt.txt

a
b
c
a
c
b
a
References
https://github.com/apache/spark Official Spark source repository ↩︎
https://blog.csdn.net/qq_27667379/article/details/80251068 Analyzing the Spark source, step one: setting up a source-reading environment ↩︎
https://blog.csdn.net/u011055139/article/details/81611814 Setting up a Spark 2.4.0 source-reading environment on Windows 10 ↩︎
https://blog.csdn.net/ggz631047367/article/details/53811213 Spark 2.1 source analysis 1: setting up an IDEA source-reading environment on Win10 ↩︎
https://www.cnblogs.com/mracale/p/10493823.html Three ways to add jar packages in IntelliJ IDEA ↩︎



