In Spark >= 1.5 you can use the `size` function:

```python
from pyspark.sql.functions import col, size

df = sqlContext.createDataFrame([
    (["L", "S", "Y", "S"], ),
    (["L", "V", "I", "S"], ),
    (["I", "A", "N", "A"], ),
    (["I", "L", "S", "A"], ),
    (["E", "N", "N", "Y"], ),
    (["E", "I", "M", "A"], ),
    (["O", "A", "N", "A"], ),
    (["S", "U", "S"], )
], ("tokens", ))

df.where(size(col("tokens")) <= 3).show()
## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
```

In Spark < 1.5 a UDF should do the trick:
```python
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

size_ = udf(lambda xs: len(xs), IntegerType())

df.where(size_(col("tokens")) <= 3).show()
## +---------+
## |   tokens|
## +---------+
## |[S, U, S]|
## +---------+
```

If you use a `HiveContext`, the `size` UDF with raw SQL should work with any version:
```python
df.registerTempTable("df")

sqlContext.sql("SELECT * FROM df WHERE size(tokens) <= 3").show()
## +--------------------+
## |              tokens|
## +--------------------+
## |ArrayBuffer(S, U, S)|
## +--------------------+
```

For string columns you can either use the `udf` defined above or the `length` function:
```python
from pyspark.sql.functions import length

df = sqlContext.createDataFrame([("fooo", ), ("bar", )], ("k", ))

df.where(length(col("k")) <= 3).show()
## +---+
## |  k|
## +---+
## |bar|
## +---+
```
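To sanity-check what the filters above return, here is a minimal plain-Python equivalent over the same sample data (no Spark required): the built-in `len` plays the role of both `size` (array element count) and `length` (string character count).

```python
# Same sample rows as the Spark examples above.
rows = [["L", "S", "Y", "S"], ["L", "V", "I", "S"], ["I", "A", "N", "A"],
        ["I", "L", "S", "A"], ["E", "N", "N", "Y"], ["E", "I", "M", "A"],
        ["O", "A", "N", "A"], ["S", "U", "S"]]

# Equivalent of df.where(size(col("tokens")) <= 3):
# keep rows whose array has at most three elements.
short = [xs for xs in rows if len(xs) <= 3]
print(short)  # [['S', 'U', 'S']]

# Equivalent of df.where(length(col("k")) <= 3):
# keep strings with at most three characters.
keys = ["fooo", "bar"]
print([k for k in keys if len(k) <= 3])  # ['bar']
```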


