I solved this by adding
--packages org.apache.hadoop:hadoop-aws:2.7.1 to the spark-submit command.
That flag pulls down the missing Hadoop packages, which let your Spark jobs talk to S3.
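For example, a full invocation could look like the following sketch (the script name my_job.py is a placeholder):

spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 my_job.py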
Then, in your job, you need to set your AWS credentials like this:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_key)

Another option for setting the credentials is to define them in spark/conf/spark-env.sh:
#!/usr/bin/env bash
AWS_ACCESS_KEY_ID='xxxx'
AWS_SECRET_ACCESS_KEY='xxxx'
SPARK_WORKER_CORES=1 # to set the number of cores to use on this machine
SPARK_WORKER_MEMORY=1g # to set how much total memory workers have to give executors (e.g. 1000m, 2g)
SPARK_EXECUTOR_INSTANCES=10 # to set the number of worker processes per node
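Putting it together, a minimal sketch of a job that sets the credentials and reads a file from S3 might look like this; the app name, bucket, path, and the environment-variable lookup are assumptions for illustration:

import os
from pyspark import SparkContext

sc = SparkContext(appName="s3-read-example")  # hypothetical app name

# Assumption: credentials are available as environment variables
aws_id = os.environ["AWS_ACCESS_KEY_ID"]
aws_key = os.environ["AWS_SECRET_ACCESS_KEY"]

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", aws_id)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", aws_key)

# s3n:// URLs match the fs.s3n.* settings above; bucket and path are placeholders
rdd = sc.textFile("s3n://my-bucket/path/to/input.txt")
print(rdd.count())

sc.stop()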
More info:
- How to Run PySpark on AWS
- AWS Credentials