Creating Multiple Pyspark Dataframes From A Single Dataframe
I need to dynamically create multiple DataFrames in PySpark based on the values available in a Python list. My DataFrame (df) has data:

date        gender  balance
2018-01-01  M
Solution 1:
I am not sure you can create dataframe names dynamically in PySpark. In plain Python, dynamically creating variable names is technically possible (e.g. via globals() or exec) but generally discouraged, and the same goes for dataframes.
One way is to create a dictionary of dataframes, where each key is a date and the corresponding value is the dataframe filtered to that date.
For the plain-Python side of this, refer to this link, where someone has asked a similar question about dynamic variable names.
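As a quick illustration of that point (a minimal plain-Python sketch with made-up names, not from the original answer), compare injecting names into globals() with simply keying a dictionary:

# Plain-Python sketch (illustrative only): why a dictionary beats dynamic names.
dates = ['2018-01-01', '2018-02-01']

# Technically possible but discouraged: invent variable names at runtime.
for d in dates:
    globals()['data_' + d.replace('-', '_')] = 'payload for ' + d
print(data_2018_01_01)  # works, but the name is invisible to linters and readers

# Preferred: an explicit dictionary keyed by the date string.
data_by_date = {d: 'payload for ' + d for d in dates}
print(data_by_date['2018-01-01'])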
Here is a small PySpark implementation -
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

values = [('2018-01-01', 'M', 100), ('2018-02-01', 'F', 100), ('2018-03-01', 'M', 100)]
df = spark.createDataFrame(values, ['date', 'gender', 'balance'])
df.show()
+----------+------+-------+
| date|gender|balance|
+----------+------+-------+
|2018-01-01| M| 100|
|2018-02-01| F| 100|
|2018-03-01| M| 100|
+----------+------+-------+
# Creating a dictionary to store the dataframes.
# Key: the date from my_list.
# Value: the corresponding filtered dataframe.
dictionary_df = {}

my_list = ['2018-01-01', '2018-02-01', '2018-03-01']

for i in my_list:
    dictionary_df[i] = df.filter(col('date') == i)

for i in my_list:
    print('DF: ' + i)
    dictionary_df[i].show()
DF: 2018-01-01
+----------+------+-------+
| date|gender|balance|
+----------+------+-------+
|2018-01-01| M| 100|
+----------+------+-------+
DF: 2018-02-01
+----------+------+-------+
| date|gender|balance|
+----------+------+-------+
|2018-02-01| F| 100|
+----------+------+-------+
DF: 2018-03-01
+----------+------+-------+
| date|gender|balance|
+----------+------+-------+
|2018-03-01| M| 100|
+----------+------+-------+

print(dictionary_df)
{'2018-01-01': DataFrame[date: string, gender: string, balance: bigint], '2018-02-01': DataFrame[date: string, gender: string, balance: bigint], '2018-03-01': DataFrame[date: string, gender: string, balance: bigint]}
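From here, each dataframe can be pulled out of the dictionary by its date key for further processing. A small sketch (the aggregation and variable names below are just illustrative, not part of the original answer):

# Retrieve one dataframe by its date key and keep working with it.
df_jan = dictionary_df['2018-01-01']
df_jan.groupBy('gender').sum('balance').show()

# Or iterate over every (date, dataframe) pair, e.g. to count rows per date.
for date, frame in dictionary_df.items():
    print(date, frame.count())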