How To Get 1000 Records From Dataframe And Write Into A File Using Pyspark?
I have 100,000+ records in a dataframe. I want to create files dynamically and push 1000 records into each file. Can anyone help me solve this? Thanks in advance.
Solution 1:
You can use the maxRecordsPerFile option while writing the dataframe.
- maxRecordsPerFile applies per partition, so to guarantee 1000 records per file across the whole dataframe, first collapse it to a single partition with repartition(1) or coalesce(1), as shown below.
Example:
# 1000 records written per file in each partition
df.coalesce(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)
# 1000 records written per file; 100 files created for 100,000 records
df.repartition(1).write.option("maxRecordsPerFile", 1000).mode("overwrite").parquet(<path>)
# or set the config on the Spark session
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000)
# or
spark.sql("set spark.sql.files.maxRecordsPerFile=1000").show()
df.coalesce(1).write.mode("overwrite").parquet(<path>)
df.repartition(1).write.mode("overwrite").parquet(<path>)
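To confirm the cap took effect, one option is to read the written files back and count records per physical file. This is a minimal verification sketch; "/tmp/out" is a hypothetical output path, not from the answer above:

from pyspark.sql.functions import input_file_name, count

# read the output back and count records per physical file
spark.read.parquet("/tmp/out") \
    .withColumn("file", input_file_name()) \
    .groupBy("file") \
    .agg(count("*").alias("records")) \
    .show(truncate=False)

Each file should report at most 1000 records.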
Method-2:
Calculate the number of partitions, then repartition the dataframe:
from pyspark.sql.functions import spark_partition_id, col, count

df = spark.range(10000)

# calculate the number of partitions needed (repartition expects an int)
no_partitions = int(df.count() / 1000)
# repartition and check the number of records in each partition
df.repartition(no_partitions).\
    withColumn("partition_id", spark_partition_id()).\
    groupBy(col("partition_id")).\
    agg(count("*")).\
    show()
#+------------+--------+
#|partition_id|count(1)|
#+------------+--------+
#|           1|    1001|
#|           6|    1000|
#|           3|     999|
#|           5|    1000|
#|           9|    1000|
#|           4|     999|
#|           8|    1000|
#|           7|    1000|
#|           2|    1001|
#|           0|    1000|
#+------------+--------+
df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)
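One caveat: integer division floors the result, so when the row count is not an exact multiple of 1000, some partitions end up carrying more than 1000 rows. Rounding up avoids that; a small sketch of the same step with math.ceil (the only change from the code above):

import math

# round up so no partition carries more than ~1000 rows
no_partitions = math.ceil(df.count() / 1000)
df.repartition(no_partitions).write.mode("overwrite").parquet(<path>)

Note that repartition distributes rows round-robin, so counts are approximately even (999 to 1001 in the sample output above), not exact.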
Solution 2:
First, create a row number column:
from pyspark.sql import functions as F, Window

df = df.withColumn('row_num', F.row_number().over(Window.orderBy('any_column')))
Now, run a loop and keep saving the records.
# row_number() starts at 1, so each iteration takes rows i+1 .. i+1000
for i in range(0, df.count(), 1000):
    records = df.where(F.col("row_num").between(i + 1, i + 1000))
    records.toPandas().to_csv("file-{}.csv".format(i))
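Note that toPandas() collects each 1000-row slice onto the driver, which is fine at this size but avoidable. A hedged alternative sketch that keeps each slice in Spark and uses its own CSV writer instead (reusing df and row_num from above; the output directories file-0, file-1000, ... are hypothetical names):

from pyspark.sql import functions as F

for i in range(0, df.count(), 1000):
    chunk = df.where(F.col("row_num").between(i + 1, i + 1000))
    # each write produces a directory containing a single CSV part file
    chunk.coalesce(1).write.mode("overwrite").csv("file-{}".format(i), header=True)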