
How To Retrieve A Column From A PySpark Dataframe And Insert It As A New Column Within An Existing PySpark Dataframe?

The problem is: I've got a PySpark dataframe df1 like this:

+-----+
|index|
+-----+
|  121|
|  122|
|  123|
|  124|
|  125|
|  121|
|  121|
|  126

Solution 1:

Hope this helps!

import pyspark.sql.functions as f

df1 = sc.parallelize([[121],[122],[123]]).toDF(["index"])
df2 = sc.parallelize([[2.4899928731985597,-0.19775025821959014],[1.029654847161142,1.4878188087911541],
                        [-2.253992428312965,0.29853121635739804]]).toDF(["fact1","fact2"])

# there is no common column between these two dataframes, so add a
# row_index column to both and join on it
df1 = df1.withColumn('row_index', f.monotonically_increasing_id())
df2 = df2.withColumn('row_index', f.monotonically_increasing_id())

df2 = df2.join(df1, on=["row_index"]).sort("row_index").drop("row_index")
df2.show()

Don't forget to let us know if it solved your problem :)
