IndexToString
class pyspark.ml.feature.IndexToString(inputCol=None, outputCol=None, labels=None)
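A transformer that maps a column of indices back to a new column of the corresponding string values. The index-to-string mapping comes either from the ML attributes of the input column or from user-supplied labels (which take precedence over the ML attributes). See StringIndexer for converting strings into indices.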
01. Initialization:
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.driver.host", "192.168.1.4")  # the key is spark.driver.host, all lowercase
    .config("spark.ui.showConsoleProgress", "false")
    .appName("IndexToString")
    .master("local[*]")
    .getOrCreate()
)
02. Create the data:
df = spark.createDataFrame(
    [(1, "hello"), (2, "world"), (3, "hello"), (4, "spark")],
    ["indexcol", "text"],
)
df.show()
Output:
+--------+-----+
|indexcol| text|
+--------+-----+
|       1|hello|
|       2|world|
|       3|hello|
|       4|spark|
+--------+-----+
03. Convert the strings to numeric indices with StringIndexer:
from pyspark.ml.feature import StringIndexer

stringIndexer = StringIndexer(inputCol="text", outputCol="stringindex")
model1 = stringIndexer.fit(df)    # learn the string-to-index mapping
temp1df = model1.transform(df)    # append the index column
temp1df.show()
Output:
+--------+-----+-----------+
|indexcol| text|stringindex|
+--------+-----+-----------+
|       1|hello|        0.0|
|       2|world|        2.0|
|       3|hello|        0.0|
|       4|spark|        1.0|
+--------+-----+-----------+
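The indices in this output can be reproduced by hand. The following pure-Python sketch is not Spark's implementation; it assumes the default ordering (`stringOrderType="frequencyDesc"`) and that strings with equal frequency are ordered alphabetically, as in Spark 3.x. The function name `frequency_desc_index` is made up for illustration.

```python
from collections import Counter

def frequency_desc_index(values):
    """Assign 0.0, 1.0, ... to distinct strings: most frequent first,
    ties broken alphabetically (assumed Spark 3.x behavior)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

mapping = frequency_desc_index(["hello", "world", "hello", "spark"])
print(mapping)
```

"hello" occurs twice, so it gets index 0.0; "spark" and "world" each occur once and are ordered alphabetically, which matches the 1.0 and 2.0 in the table.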
04. Inspect the schema:
temp1df.printSchema()
Output:
root
 |-- indexcol: long (nullable = true)
 |-- text: string (nullable = true)
 |-- stringindex: double (nullable = false)
05. Map the indices back to strings:
from pyspark.ml.feature import IndexToString

# No labels are passed, so the mapping is read from the ML attributes of "stringindex".
indexToString = IndexToString(inputCol="stringindex", outputCol="oritext")
oridf = indexToString.transform(temp1df)
oridf.show()
Output:
+--------+-----+-----------+-------+
|indexcol| text|stringindex|oritext|
+--------+-----+-----------+-------+
|       1|hello|        0.0|  hello|
|       2|world|        2.0|  world|
|       3|hello|        0.0|  hello|
|       4|spark|        1.0|  spark|
+--------+-----+-----------+-------+



