The book focuses on Spark version 3.2
How PySpark works
Under the hood, it looks more like what's on the right: you have some workbenches that some workers are assigned to. The workbenches are like the computers in our Spark cluster: there is a fixed number of them.
The workers are called executors in Spark’s literature: they perform the actual work on the machines/nodes.
One of the little workers looks spiffier than the others. That top hat definitely makes him stand out from the crowd. In our data factory, he's the manager of the work floor. In Spark terms, we call this the master. The master here sits on one of the workbenches/machines, but it can also sit on a distinct machine (or even your computer!) depending on the cluster manager and deployment mode. The role of the master is crucial to the efficient execution of your program, so section 1.2.2 is dedicated to it.
Spark provides its own cluster manager, called Standalone, but it can also work with other cluster managers when used alongside Hadoop or another big data platform.
If you read about YARN, Mesos, or Kubernetes in the wild, know that they are used (as far as Spark is concerned) as cluster managers.
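As a rough, illustrative sketch (not code from the book), the choice of cluster manager surfaces in PySpark through the master URL you pass when building a session; the host names below are placeholders:

from pyspark.sql import SparkSession

# local[*] runs the driver and executors in a single process on your machine;
# the commented-out lines show the URL shapes for the other cluster managers.
spark = (
    SparkSession.builder
    .master("local[*]")                       # local mode (no cluster manager)
    # .master("spark://some-host:7077")       # Standalone cluster manager
    # .master("yarn")                         # YARN
    # .master("k8s://https://some-host:443")  # Kubernetes
    .getOrCreate()
)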
Transformations are pretty much everything else. Here are some examples of transformations (a short sketch follows this list):
Adding a column to a table
Performing an aggregation according to certain keys
Computing summary statistics
Training a machine learning model
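Here is a hedged sketch of the first three transformations in PySpark; the DataFrame and column names are made up for illustration:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny, made-up DataFrame to demonstrate on.
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

with_new_col = df.withColumn("value_doubled", F.col("value") * 2)  # add a column
by_key = df.groupBy("key").agg(F.sum("value").alias("total"))      # aggregate by key
summary = df.describe("value")                                      # summary statistics

Training a machine learning model follows the same lazy-transformation pattern through the pyspark.ml module, which is beyond this quick sketch.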
Once it receives the task and its operations, the driver program starts allocating data to what Spark calls executors. Executors are processes that run computations and store data for the application. Those executors sit on what are called worker nodes, the actual machines. In our factory analogy, the executors are the employees performing the work, while the worker nodes are the workbenches where many employees/executors can work.
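Purely as an assumed example of how the executor side can be sized (the configuration values below are illustrative placeholders, not values prescribed by the book):

from pyspark.sql import SparkSession

# Ask the cluster manager for a given number of executors and the memory
# each one gets; both numbers are placeholders.
spark = (
    SparkSession.builder
    .appName("data-factory")                  # hypothetical application name
    .config("spark.executor.instances", "4")  # how many executors/employees
    .config("spark.executor.memory", "2g")    # memory per executor
    .getOrCreate()
)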
Installing PySpark on your local machine (Appendix B—Installing PySpark). On macOS:
brew install apache-spark
If Homebrew did not set $SPARK_HOME when installing Spark on your machine (test
by restarting your terminal and typing echo $SPARK_HOME), you will need to add the
following to your ~/.zshrc:
export SPARK_HOME="/usr/local/Cellar/apache-spark/X.Y.Z/libexec"  # replace X.Y.Z with the installed version number
Configuring Spark to work seamlessly with Python (these options typically go in $SPARK_HOME/conf/spark-defaults.conf):
spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true"
Install Anaconda/Python
conda create -n pyspark python=3.8 pandas ipython pyspark=3.2.0
If it’s your first time using Anaconda, follow the instructions to register your shell.
WARNING Python 3.8+ is supported only as of Spark 3.0. If you use Spark 2.4.x or earlier, be sure to specify Python 3.7 in your environment creation.
Then, to select the newly created environment, just type conda activate pyspark in your terminal.
Launch an IPython REPL and start PySpark
Homebrew should have set the SPARK_HOME and PATH environment variables so that your Python shell (also called a REPL, or read-eval-print loop) has access to your local PySpark instance. You just have to type the following:
conda activate pyspark
ipython
Then, in the REPL, you can import PySpark and get going:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
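To confirm the session actually works, a quick smoke test such as the following (using the spark object created above) should print a small five-row DataFrame:

spark.range(5).show()  # should display the numbers 0 through 4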
(Optional) Install and run Jupyter to use Jupyter notebooks. In an Anaconda PowerShell window, install Jupyter with the following command:
conda install -c conda-forge notebook
You can now run the Jupyter notebook server with the following commands. Before doing so, use cd to move into the directory where your source code resides:
cd [WORKING DIRECTORY]
jupyter notebook



