1. Become familiar with the basic operations on Spark RDDs.
2. Become familiar with using RDD programming to solve concrete problems.
Write programs that output: the records of the first 3 students in the file, the per-subject average scores of the first 3 students, the per-subject highest scores of the first 3 students, the top 3 students by total score, the top 3 Scala scores, the top 3 Python scores, and the top 3 Java scores.
III. Experiment process: neither Scala nor the PySpark RDD API was used; each task was implemented in plain Python and the .py file was then run under Spark.
(1) Install Spark and verify that the installation succeeded
Start Spark
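One quick way to confirm the installation works (a minimal sketch; 'InstallCheck' is an arbitrary app name and 'local' runs Spark on the current machine) is to submit a tiny PySpark job:

from pyspark import SparkContext

sc = SparkContext('local', 'InstallCheck')   # arbitrary app name, local master
print(sc.parallelize(range(10)).sum())       # a working install prints 45
sc.stop()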
(2) Output the records of the first 3 students:
from pyspark import SparkContext  # imported for running under Spark; the RDD API itself is not used

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
for i in range(3):
    x = ls[i].strip('\n')
    print(x)
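Since the lab's stated goal is RDD practice, the same step can also be written with the RDD API; a sketch, assuming student.txt is tab-separated with one header line as the code above implies:

from pyspark import SparkContext

sc = SparkContext('local', 'First3Records')
lines = sc.textFile('student.txt')
header = lines.first()                       # the header line removed by del ls[0] above
for row in lines.filter(lambda l: l != header).take(3):
    print(row)
sc.stop()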
(3) Per-subject average scores of the first 3 students in the file
from pyspark import SparkContext

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
scala, python, java = [], [], []
for i in range(3):
    fields = ls[i].strip('\n').split('\t')  # columns 2..4 are Scala, Python, Java
    scala.append(int(fields[2].strip()))
    python.append(int(fields[3].strip()))
    java.append(int(fields[4].strip()))
scala_avg = sum(scala) / 3  # per-subject averages over the first three students
python_avg = sum(python) / 3
java_avg = sum(java) / 3
print('Scala average: ' + str(scala_avg))
print('Python average: ' + str(python_avg))
print('Java average: ' + str(java_avg))
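An RDD version of the same computation, again a sketch under the same file-layout assumption: parallelize the first three records, map each to its three scores, and reduce element-wise:

from pyspark import SparkContext

sc = SparkContext('local', 'First3Averages')
lines = sc.textFile('student.txt')
header = lines.first()
first3 = sc.parallelize(lines.filter(lambda l: l != header).take(3))
scores = first3.map(lambda l: [int(v) for v in l.split('\t')[2:5]])  # [Scala, Python, Java]
totals = scores.reduce(lambda a, b: [x + y for x, y in zip(a, b)])   # element-wise sum
for subject, total in zip(['Scala', 'Python', 'Java'], totals):
    print(subject + ' average: ' + str(total / 3))
sc.stop()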
(4) Per-subject highest scores of the first 3 students in the file
from pyspark import SparkContext

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
scala, python, java = [], [], []
for i in range(3):
    fields = ls[i].strip('\n').split('\t')
    scala.append(int(fields[2].strip()))
    python.append(int(fields[3].strip()))
    java.append(int(fields[4].strip()))
print('Scala highest: ' + str(max(scala)))
print('Python highest: ' + str(max(python)))
print('Java highest: ' + str(max(java)))
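The RDD sketch for the maxima is the same pipeline with an element-wise max in the reduce:

from pyspark import SparkContext

sc = SparkContext('local', 'First3Maxima')
lines = sc.textFile('student.txt')
header = lines.first()
first3 = sc.parallelize(lines.filter(lambda l: l != header).take(3))
scores = first3.map(lambda l: [int(v) for v in l.split('\t')[2:5]])
maxima = scores.reduce(lambda a, b: [max(x, y) for x, y in zip(a, b)])  # element-wise max
for subject, m in zip(['Scala', 'Python', 'Java'], maxima):
    print(subject + ' highest: ' + str(m))
sc.stop()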
(5) Top 3 students by total score
from pyspark import SparkContext

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
rank = {}  # name -> total score (assumes student names are unique)
for i in range(len(ls)):
    fields = ls[i].strip('\n').split('\t')
    total = int(fields[2].strip()) + int(fields[3].strip()) + int(fields[4].strip())
    rank[fields[1].strip()] = total
rank_list = sorted(rank.items(), key=lambda x: x[1], reverse=True)
print(rank_list[:3])
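With RDDs, the top 3 by total can be expressed as (name, total) pairs plus takeOrdered; a sketch under the same assumptions:

from pyspark import SparkContext

sc = SparkContext('local', 'Top3Total')
lines = sc.textFile('student.txt')
header = lines.first()
pairs = (lines.filter(lambda l: l != header)
              .map(lambda l: l.split('\t'))
              .map(lambda f: (f[1].strip(), sum(int(v) for v in f[2:5]))))
print(pairs.takeOrdered(3, key=lambda kv: -kv[1]))  # top 3 by descending total
sc.stop()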
(6) Top 3 Scala scores
from pyspark import SparkContext

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
rank = {}  # name -> Scala score
for i in range(len(ls)):
    fields = ls[i].strip('\n').split('\t')
    rank[fields[1].strip()] = int(fields[2].strip())
rank_list = sorted(rank.items(), key=lambda x: x[1], reverse=True)
print(rank_list[:3])
(7) Top 3 Python scores
from pyspark import SparkContext

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
rank = {}  # name -> Python score
for i in range(len(ls)):
    fields = ls[i].strip('\n').split('\t')
    rank[fields[1].strip()] = int(fields[3].strip())
rank_list = sorted(rank.items(), key=lambda x: x[1], reverse=True)
print(rank_list[:3])
(8) Top 3 Java scores
from pyspark import SparkContext

with open('student.txt', 'r', encoding='utf-8') as file:
    ls = list(file)
del ls[0]  # drop the header line
rank = {}  # name -> Java score
for i in range(len(ls)):
    fields = ls[i].strip('\n').split('\t')
    rank[fields[1].strip()] = int(fields[4].strip())
rank_list = sorted(rank.items(), key=lambda x: x[1], reverse=True)
print(rank_list[:3])
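Steps (6) through (8) differ only in the column index, so a single RDD sketch with the column as a parameter covers all three; subject_top3 is a hypothetical helper name:

from pyspark import SparkContext

def subject_top3(sc, column):
    # column index: 2 = Scala, 3 = Python, 4 = Java (matching the loops above)
    lines = sc.textFile('student.txt')
    header = lines.first()
    pairs = (lines.filter(lambda l: l != header)
                  .map(lambda l: l.split('\t'))
                  .map(lambda f: (f[1].strip(), int(f[column].strip()))))
    return pairs.takeOrdered(3, key=lambda kv: -kv[1])

sc = SparkContext('local', 'SubjectTop3')
for subject, col in [('Scala', 2), ('Python', 3), ('Java', 4)]:
    print(subject + ' top 3: ' + str(subject_top3(sc, col)))
sc.stop()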



