You should consider using the pyspark.sql module's built-in functions instead of writing a UDF; there are several regexp-based functions:
First, let's start with a more complete sample DataFrame:
df = sc.parallelize([
        ["a", "b", "foo is tasty"],
        ["12", "34", "blah blahhh"],
        ["yeh", "0", "bar of yums"],
        ["haha", "1", "foobar none"],
        ["hehe", "2", "something bar else"]])\
    .toDF(["col1", "col2", "col_with_text"])
If you want to filter rows based on whether they contain one of the words in words_list, you can use rlike:
import pyspark.sql.functions as psf

words_list = ['foo', 'bar']
df.filter(psf.col('col_with_text').rlike('(^|\s)(' + '|'.join(words_list) + ')(\s|$)')).show()

+----+----+------------------+
|col1|col2|     col_with_text|
+----+----+------------------+
|   a|   b|      foo is tasty|
| yeh|   0|       bar of yums|
|hehe|   2|something bar else|
+----+----+------------------+

If you want to extract the strings that match the regular expression, you can use regexp_extract:
df.withColumn(
        'extracted_word',
        psf.regexp_extract('col_with_text', '(?=^|\s)(' + '|'.join(words_list) + ')(?=\s|$)', 0))\
    .show()

+----+----+------------------+--------------+
|col1|col2|     col_with_text|extracted_word|
+----+----+------------------+--------------+
|   a|   b|      foo is tasty|           foo|
|  12|  34|       blah blahhh|              |
| yeh|   0|       bar of yums|           bar|
|haha|   1|       foobar none|              |
|hehe|   2|something bar else|              |
+----+----+------------------+--------------+
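Note that the pattern above only extracts a word at the very start of the string: the leading lookahead (?=^|\s) is zero-width, so it cannot consume the space before a mid-string match, which is why the "something bar else" row comes back empty. One possible workaround, sketched here with Python's re module (Spark's regexp_extract uses Java's regex engine, but this part of the syntax is the same), is to consume the boundary with a non-capturing group and extract capture group 1 instead of group 0:

```python
import re

words_list = ['foo', 'bar']
# Consume the leading boundary with a non-capturing group (?:^|\s),
# keep the trailing boundary as a lookahead, and capture the word in group 1.
pattern = r'(?:^|\s)(' + '|'.join(words_list) + r')(?=\s|$)'

m = re.search(pattern, 'something bar else')
print(m.group(1) if m else '')   # -> bar

# Equivalent Spark call (df, psf as defined above), extracting group 1:
# df.withColumn('extracted_word',
#               psf.regexp_extract('col_with_text', pattern, 1)).show()
```

With group index 1, "foo is tasty" still yields foo and rows with no match yield an empty string, but mid-string words like the bar in "something bar else" are now extracted as well.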