count() can be used inside agg(), since the groupBy expression is the same.
With Python:
```python
import pyspark.sql.functions as func

(new_log_df.cache()
    .withColumn("timePeriod", enpreUDF(new_log_df["START_TIME"]))
    .groupBy("timePeriod")
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    )
    .show(20, False))
```
See the pySpark SQL functions documentation.
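The snippets here assume enpreUDF is already defined (it comes from the question). For completeness, a minimal PySpark sketch of such a UDF, assuming START_TIME is an epoch timestamp in milliseconds and the goal is an hour-of-day bucket; both are assumptions, not part of the original:

```python
from datetime import datetime

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical stand-in for enpreUDF: bucket an epoch-millisecond
# timestamp into an hour-of-day label such as "14:00-15:00".
# The real UDF may encode time periods differently.
def to_time_period(start_time):
    hour = datetime.fromtimestamp(start_time / 1000).hour
    return "%02d:00-%02d:00" % (hour, (hour + 1) % 24)

enpreUDF = udf(to_time_period, StringType())
```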
With Scala:
```scala
import org.apache.spark.sql.functions._  // for count()

new_log_df.cache().withColumn("timePeriod", enpreUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)
```
Here count(lit(1)) simply counts the records in each group, which is equal to count("timePeriod").

With Java:
```java
import static org.apache.spark.sql.functions.*;

new_log_df.cache().withColumn("timePeriod", enpreUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);
```
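A quick sanity check on the count(lit(1)) / count("timePeriod") equivalence noted above: count on a column skips nulls, so the two agree whenever timePeriod is non-null within each group, which holds here because it is the grouping key. A self-contained PySpark sketch; the toy data is made up purely for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy frame standing in for new_log_df after the withColumn step.
df = spark.createDataFrame(
    [("08:00-09:00", 120), ("08:00-09:00", 340), ("09:00-10:00", 75)],
    ["timePeriod", "DOWNSTREAM_SIZE"],
)

# Both aggregates return the same per-group counts on this data.
df.groupBy("timePeriod").agg(
    func.count(func.lit(1)).alias("count_lit_1"),
    func.count("timePeriod").alias("count_col"),
).show()
```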


