Olologin's answer is almost right, but I believe what you want is to group the RDD into 3-tuples, not to split the RDD into 3 groups of tuples. To do that, try the following:
rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
transformed = rdd.zipWithIndex() \
    .groupBy(lambda kv: kv[1] // 3) \
    .map(lambda kv: tuple(elem[0] for elem in kv[1]))

(Note: the original `lambda (_, i): ...` tuple-unpacking syntax only works on Python 2; the indexing form above works on both.)
When I run this in pyspark, I get the following:
>>> from __future__ import print_function
>>> rdd = sc.parallelize(["e1", "e2", "e3", "e4", "e5", "e6", "e7", "e8", "e9", "e10"])
>>> transformed = rdd.zipWithIndex().groupBy(lambda kv: kv[1] // 3).map(lambda kv: tuple(elem[0] for elem in kv[1]))
>>> transformed.foreach(print)
...
('e4', 'e5', 'e6')
('e10',)
('e7', 'e8', 'e9')
('e1', 'e2', 'e3')
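Note that the tuples come back in no particular order, since Spark's groupBy does not preserve ordering across partitions. For intuition, here is a minimal plain-Python sketch of the same chunking logic (the function name `chunk_into_tuples` is my own, not part of any API): pair each element with its index, group by index // 3, and emit each group as a tuple.

```python
from itertools import groupby

def chunk_into_tuples(elems, size=3):
    # Mirror rdd.zipWithIndex(): pairs of (element, index).
    indexed = [(e, i) for i, e in enumerate(elems)]
    # Mirror groupBy(lambda kv: kv[1] // size). itertools.groupby only
    # groups consecutive keys, which is fine here because indices ascend.
    return [tuple(e for e, _ in group)
            for _, group in groupby(indexed, key=lambda kv: kv[1] // size)]

print(chunk_into_tuples(["e1", "e2", "e3", "e4", "e5",
                         "e6", "e7", "e8", "e9", "e10"]))
# → [('e1', 'e2', 'e3'), ('e4', 'e5', 'e6'), ('e7', 'e8', 'e9'), ('e10',)]
```

Unlike the Spark version, this local sketch keeps the chunks in order; the trailing `('e10',)` shows that a final partial chunk is emitted as-is rather than padded.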


