Skip to content Skip to sidebar Skip to footer

Creating Multiple Pyspark Dataframes From A Single Dataframe

I need to dynamically create multiple dataframes in pyspark based on the values available in the python list My dataframe(df) has data: date gender balance 2018-01-01 M

Solution 1:

I am not sure if you can create the names of dataframes dynamically in PySpark. In Python, you cannot even dynamically assign the names of variables, let alone dataframes.

One way is to create a dictionary of the dataframes, where the key corresponds to each date and the value of that dictionary corresponds to the dataframe.

For Python: Refer to this link, where someone has asked a similar Q on name dynamism.

Here is a small PySpark implementation -

from pyspark.sql.functions import col
values = [('2018-01-01','M',100),('2018-02-01','F',100),('2018-03-01','M',100)]
df = sqlContext.createDataFrame(values,['date','gender','balance'])
df.show()
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-01-01|     M|    100|
|2018-02-01|     F|    100|
|2018-03-01|     M|    100|
+----------+------+-------+

# Creating a dictionary to store the dataframes.
# Key: It contains the date from my_list.
# Value: Contains the corresponding dataframe.
dictionary_df = {}  

my_list = ['2018-01-01', '2018-02-01', '2018-03-01']
for i in my_list:
    dictionary_df[i] = df.filter(col('date')==i)

for i in my_list:
    print('DF: '+i)
    dictionary_df[i].show() 

DF: 2018-01-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-01-01|     M|    100|
+----------+------+-------+

DF: 2018-02-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-02-01|     F|    100|
+----------+------+-------+

DF: 2018-03-01
+----------+------+-------+
|      date|gender|balance|
+----------+------+-------+
|2018-03-01|     M|    100|
+----------+------+-------+print(dictionary_df)
    {'2018-01-01': DataFrame[date: string, gender: string, balance: bigint], '2018-02-01': DataFrame[date: string, gender: string, balance: bigint], '2018-03-01': DataFrame[date: string, gender: string, balance: bigint]}

Post a Comment for "Creating Multiple Pyspark Dataframes From A Single Dataframe"