How To Guarantee Repartitioning In Spark Dataframe
I'm pretty new to Apache Spark and I'm trying to repartition a DataFrame by U.S. state. I then want to break each partition into its own RDD and save it to a specific location.
Solution 1:
There is nothing unexpected going on here. Spark distributes rows by taking the hash of the partitioning key modulo the number of partitions (adjusted to be non-negative), so with 50 partitions you'll get a significant number of collisions:
from pyspark.sql.functions import expr

# State codes used as the partitioning key
states = sc.parallelize([
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
])
states_df = states.map(lambda x: (x, )).toDF(["state"])

# How many distinct partition ids the keys actually hash to with 50 partitions
states_df.select(expr("pmod(hash(state), 50)")).distinct().count()
# 26
In other words, only 26 of the 50 partitions would receive any data, and several states would end up sharing a partition.

If you want separate files on write, it is better to use the partitionBy clause of DataFrameWriter. It creates a separate output directory per key value and doesn't require a shuffle. For example:
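A minimal sketch of that approach; df stands in for your DataFrame and the output path is a placeholder:

# Write one subdirectory per state, e.g. /output/by_state/state=CA/
(df.write
   .partitionBy("state")
   .mode("overwrite")
   .parquet("/output/by_state"))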
If you really want full repartitioning, you can use the RDD API, which allows you to plug in a custom partitioner, for example one that assigns each state its own partition (see the sketch below).
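A minimal sketch of that route, reusing the states RDD defined above; df and its "state" column are assumptions standing in for the questioner's DataFrame:

# Build an explicit key -> partition-id mapping so every state gets its own partition
state_index = {code: i for i, code in enumerate(sorted(states.collect()))}

partitioned = (df.rdd
    .map(lambda row: (row["state"], row))                      # key each row by state
    .partitionBy(len(state_index), lambda k: state_index[k])   # custom partitioner
    .values())                                                 # drop the key again

partitioned.getNumPartitions()  # == len(state_index), one partition per state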