How To Guarantee Repartitioning In Spark Dataframe
I'm pretty new to Apache Spark and I'm trying to repartition a DataFrame by U.S. state. I then want to break each partition into its own RDD and save it to a specific location.
Solution 1:
There is nothing unexpected going on here. Spark distributes rows by taking the hash of the partitioning key modulo the number of partitions (adjusted to be non-negative), so with 50 partitions you'll get a significant number of collisions:
from pyspark.sql.functions import expr

# State codes used as the partitioning key
states = sc.parallelize([
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"
])
states_df = states.map(lambda x: (x, )).toDF(["state"])

# How many distinct partition ids the keys actually hash to with 50 partitions
states_df.select(expr("pmod(hash(state), 50)")).distinct().count()
# 26
In other words, only 26 of the 50 partitions would receive any data, and several states would end up sharing a partition.

If you want separate files on write, it is better to use the partitionBy clause of DataFrameWriter. It creates a separate output directory per key value and doesn't require a shuffle. For example:
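A minimal sketch of that approach; df stands in for your DataFrame and the output path is a placeholder:

# Write one subdirectory per state, e.g. /output/by_state/state=CA/
(df.write
   .partitionBy("state")
   .mode("overwrite")
   .parquet("/output/by_state"))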
If you really want full repartitioning, you can use the RDD API, which allows you to plug in a custom partitioner, for example one that assigns each state its own partition (see the sketch below).
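A minimal sketch of that route, reusing the states RDD defined above; df and its "state" column are assumptions standing in for the questioner's DataFrame:

# Build an explicit key -> partition-id mapping so every state gets its own partition
state_index = {code: i for i, code in enumerate(sorted(states.collect()))}

partitioned = (df.rdd
    .map(lambda row: (row["state"], row))                      # key each row by state
    .partitionBy(len(state_index), lambda k: state_index[k])   # custom partitioner
    .values())                                                 # drop the key again

partitioned.getNumPartitions()  # == len(state_index), one partition per state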