
Pyspark: Add The Average As A New Column To Dataframe

I am computing the mean of a column in a DataFrame, but the resulting column contains only zeros. Can someone help me understand why this is happening? Following is the code and table before and afte

Solution 1:

You can compute the average for the whole column first, then use lit() to add it to your DataFrame as a constant column. There is no need for window functions:

from pyspark.sql.functions import lit

mean = df.groupBy().avg("dis_price_released").take(1)[0][0]
df.withColumn("test", lit(mean)).show()
+------------------+----+
|dis_price_released|test|
+------------------+----+
|               0.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               1.0| 2.5|
|               4.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
|               4.0| 2.5|
|               0.0| 2.5|
+------------------+----+

Solution 2:

Another way to solve the problem is to compute the average inline and pass it straight to lit():

from pyspark.sql.functions import avg, lit

df.withColumn("mean", lit(df.select(avg("dis_price_released").alias("temp")).first()["temp"])).show()
