That's how it goes in a big data development environment: you have barely climbed out of one pit before you fall into the next. I was running a simple Spark remote-connection example:
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// point the driver at the standalone master running on ss3
SparkConf sparkConf = new SparkConf()
        .setMaster("spark://ss3:7077")
        .setAppName("JavaSparkPi");

SparkSession spark = SparkSession
        .builder()
        .config(sparkConf)
        .getOrCreate();
The moment the SparkConf object was constructed, the program died with the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FSDataInputStream
    at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:75)
    at org.apache.spark.SparkConf.<init>(SparkConf.scala:70)
    at org.apache.spark.SparkConf.<init>(SparkConf.scala:59)
    at JavaWordCount.main(JavaWordCount.java:54)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 4 more
A bit of digging turned up the cause: since Spark 1.4, Spark's "Hadoop free" builds do not bundle Hadoop's classes on their classpath, so all of Hadoop's jars have to be made visible to Spark explicitly through spark-env.sh.
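Before changing anything, it is easy to confirm this diagnosis from the driver side. The small check below is only a sketch of mine (the class name ClasspathCheck is not from the original project): it asks the JVM whether the class named in the stack trace is resolvable at all and, if so, which jar provided it.

// Minimal sketch: probe for the class the NoClassDefFoundError complains about.
public class ClasspathCheck {
    public static void main(String[] args) {
        try {
            Class<?> c = Class.forName("org.apache.hadoop.fs.FSDataInputStream");
            // report which jar the class was loaded from
            System.out.println("Found " + c.getName() + " in "
                    + c.getProtectionDomain().getCodeSource().getLocation());
        } catch (ClassNotFoundException e) {
            // exactly the state the stack trace above describes
            System.out.println("FSDataInputStream is NOT on the classpath - add hadoop-common");
        }
    }
}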
Now that the cause is known, go to Spark and edit spark-env.sh, adding the line below (this assumes the hadoop command already works on that machine). Spark's launch scripts source spark-env.sh, so everything they start picks up Hadoop's jars from SPARK_DIST_CLASSPATH:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
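As a side note, and only as a sketch: if editing spark-env.sh on every node is inconvenient, the executor classpath can also be extended through Spark's standard spark.executor.extraClassPath property. The /opt/hadoop path below is a placeholder for your actual installation, and a driver launched from an IDE still needs hadoop-common on its own classpath, which is what the pom configuration further down provides.

import org.apache.spark.SparkConf;

// Sketch under assumptions: /opt/hadoop is a placeholder install path, and the
// driver's classpath already contains hadoop-common, otherwise new SparkConf()
// fails exactly as shown above.
public class ExtraClasspathSketch {
    public static void main(String[] args) {
        String hadoopJars = "/opt/hadoop/share/hadoop/common/*:"
                + "/opt/hadoop/share/hadoop/common/lib/*:"
                + "/opt/hadoop/share/hadoop/hdfs/*";

        SparkConf sparkConf = new SparkConf()
                .setMaster("spark://ss3:7077")
                .setAppName("JavaSparkPi")
                // hand Hadoop's jars to the executors without touching spark-env.sh
                .set("spark.executor.extraClassPath", hadoopJars);

        // print the effective configuration as a quick sanity check
        System.out.println(sparkConf.toDebugString());
    }
}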
You also need to check the pom configuration; the dependency setup that ended up working is shown below:
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${hadoop.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>${scala.version}</version>
    <exclusions>
        <exclusion>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.jboss.netty</groupId>
            <artifactId>netty</artifactId>
        </exclusion>
        <exclusion>
            <groupId>io.netty</groupId>
            <artifactId>netty</artifactId>
        </exclusion>
        <exclusion>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.8.1</version>
</dependency>
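With spark-env.sh and the pom in place, the snippet from the top of the post connects cleanly. For reference, here is a minimal end-to-end version; the master URL ss3:7077 is the one from the original snippet, while the class name and the trivial range() job are only illustrative.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// End-to-end sketch: the same SparkConf as above plus a trivial job, so there is
// something to actually run against the cluster.
public class JavaSparkPi {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("spark://ss3:7077")
                .setAppName("JavaSparkPi");

        SparkSession spark = SparkSession
                .builder()
                .config(sparkConf)
                .getOrCreate();

        // count 0..999 on the cluster; no user lambdas, so no extra jars to ship
        long count = spark.range(1000).count();
        System.out.println("count = " + count);

        spark.stop();
    }
}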



