Spark >= 2.4
You can use the `concat` function (SPARK-23736):
```python
from pyspark.sql.functions import col, concat

df.select(concat(col("tokens"), col("tokens_bigrams"))).show(truncate=False)

# +---------------------------------+
# |concat(tokens, tokens_bigrams)   |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |null                             |
# +---------------------------------+
```

To keep the data when one of the values is `NULL`, you can `coalesce` with an empty `array`:
```python
from pyspark.sql.functions import array, coalesce

df.select(concat(
    coalesce(col("tokens"), array()),
    coalesce(col("tokens_bigrams"), array())
)).show(truncate=False)

# +--------------------------------------------------------------------+
# |concat(coalesce(tokens, array()), coalesce(tokens_bigrams, array()))|
# +--------------------------------------------------------------------+
# |[one, two, two, one two, two two]                                   |
# |[three]                                                             |
# +--------------------------------------------------------------------+
```

Spark < 2.4
Unfortunately, to concatenate `array` columns in the general case you'll need a UDF, for example:
```python
from itertools import chain
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def concat(type):
    def concat_(*args):
        # Treat NULL (None) arguments as empty lists before flattening
        return list(chain.from_iterable(arg if arg else [] for arg in args))
    return udf(concat_, ArrayType(type))
```
which can be used as:
```python
df = spark.createDataFrame(
    [(["one", "two", "two"], ["one two", "two two"]), (["three"], None)],
    ("tokens", "tokens_bigrams")
)

concat_string_arrays = concat(StringType())
df.select(concat_string_arrays("tokens", "tokens_bigrams")).show(truncate=False)

# +---------------------------------+
# |concat_(tokens, tokens_bigrams)  |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |[three]                          |
# +---------------------------------+
```
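The null-handling inside the UDF is plain Python and can be sketched without Spark at all; this standalone helper (a hypothetical name, `concat_lists`, used only for illustration) mirrors the body of `concat_` above:

```python
from itertools import chain

def concat_lists(*args):
    # None (NULL) arguments are replaced by empty lists, then everything
    # is flattened into a single list - the same logic as concat_ above
    return list(chain.from_iterable(arg if arg else [] for arg in args))

print(concat_lists(["one", "two", "two"], ["one two", "two two"]))
# ['one', 'two', 'two', 'one two', 'two two']
print(concat_lists(["three"], None))
# ['three']
```

This is why the UDF version, unlike the plain `concat` in Spark < 2.4, yields `[three]` instead of `null` for the second row.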


