
Saving a DataFrame to Parquet Takes a Lot of Time

I have a Spark DataFrame with around 458 million rows. It was initially an RDD, which I converted to a DataFrame using sqlContext.createDataFrame. The first few rows of the RDD are …
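For reference, a minimal sketch of the setup being described, assuming PySpark 1.x with an existing SparkContext (sc) and SQLContext (sqlContext); the Row fields and output path are illustrative:

    from pyspark.sql import Row

    # Illustrative only: build an RDD of Rows, convert it to a DataFrame,
    # then write that DataFrame out as Parquet.
    rdd = sc.parallelize([Row(id=1, value="a"), Row(id=2, value="b")])
    df = sqlContext.createDataFrame(rdd)
    df.write.parquet("/tmp/example_output.parquet")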

Solution 1:

Two things I can think of to try:

  1. You might want to check the number of partitions you have. If you have too few partitions then you don't get the required parallelism.

  2. Spark evaluates transformations lazily. This means the write itself may be fast, while the computation needed to produce the data is slow. Try caching the DataFrame (and running an action such as count on it so it actually materializes), then write again. If the save is fast now, the problem is in the computation, not in the Parquet writing. A sketch of both checks follows this list.
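A rough PySpark sketch of both suggestions; the partition count and output path are illustrative, not recommendations:

    # 1. Check parallelism and raise it if the partition count is very low.
    print(df.rdd.getNumPartitions())
    df = df.repartition(200)

    # 2. Cache the DataFrame and force materialization with an action.
    df.cache()
    df.count()

    # If this write is fast now, the slow part was the upstream computation,
    # not the Parquet write itself.
    df.write.parquet("/tmp/example_output.parquet")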

Solution 2:

Also try increasing the number of cores if you have enough available. This is one of the main factors, since the cores per executor together with the number of executors determine how much parallel processing is possible.
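A sketch of setting these via SparkConf; the values are illustrative and should be tuned to your cluster (spark.executor.instances is honoured on YARN; the spark-submit equivalents are --executor-cores and --num-executors):

    from pyspark import SparkConf, SparkContext

    # Illustrative values only: total parallelism is roughly
    # cores per executor x number of executors.
    conf = (SparkConf()
            .set("spark.executor.cores", "4")        # cores per executor
            .set("spark.executor.instances", "8"))   # number of executors
    sc = SparkContext(conf=conf)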
