Spark pandas_udf Is Not Faster
I'm facing a heavy data transformation. In a nutshell, I have columns of data, each containing strings which correspond to some ordinals. For example, HIGH, MID and LOW. My objective is to map these strings to integers that preserve their order, but the pandas_udf I wrote for it turned out not to be any faster.
Solution 1:
Why so slow? Because Spark runs in the JVM while pyspark code runs in a separate Python process, so every row a pandas_udf touches has to be serialized out of the JVM, shipped to the Python worker, and deserialized back into the JVM afterwards.
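For reference, the pandas_udf being benchmarked probably looks something like the sketch below (a minimal sketch; the function name, the mapping dictionary, and how it is applied are assumptions, not code from the question):

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

# Hypothetical pandas_udf doing the string-to-ordinal mapping. Every batch
# of rows is serialized out of the JVM into the Python worker and back,
# which is exactly the overhead described above.
@f.pandas_udf(IntegerType())
def map_feat1(s: pd.Series) -> pd.Series:
    return s.map({"HI": 1, "MID": 2, "LO": 3})

# Applied to a DataFrame like the one built below:
# df = df.withColumn("feat1_mapped", map_feat1(df.feat1))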
You can instead map the values with the when and otherwise functions, which are evaluated entirely inside the JVM. That avoids the serialize/deserialize round trip and increases performance.
import numpy as np
import pandas as pd
import pyspark.sql.functions as f
from pyspark.shell import spark

def fresh_df(n=100000, seed=None):
    # Build a random pandas DataFrame and convert it into a Spark DataFrame.
    np.random.seed(seed)
    feat1 = np.random.choice(["HI", "LO", "MID"], size=n)
    feat2 = np.random.choice(["SMALL", "MEDIUM", "LARGE"], size=n)
    pdf = pd.DataFrame({
        "feat1": feat1,
        "feat2": feat2
    })
    return spark.createDataFrame(pdf)
df = fresh_df()

# Map each string to its ordinal with native column expressions;
# the whole transformation stays inside the JVM.
df = df.withColumn('feat1_mapped', f
    .when(df.feat1 == f.lit('HI'), 1)
    .otherwise(f.when(df.feat1 == f.lit('MID'), 2).otherwise(3)))
df = df.withColumn('feat2_mapped', f
    .when(df.feat2 == f.lit('SMALL'), 0)
    .otherwise(f.when(df.feat2 == f.lit('MEDIUM'), 1).otherwise(2)))

df.show(n=20)
Output
+-----+------+------------+------------+
|feat1| feat2|feat1_mapped|feat2_mapped|
+-----+------+------------+------------+
|   LO| SMALL|           3|           0|
|   LO|MEDIUM|           3|           1|
|  MID|MEDIUM|           2|           1|
|  MID| SMALL|           2|           0|
|  MID| LARGE|           2|           2|
|  MID| SMALL|           2|           0|
|   LO| SMALL|           3|           0|
|  MID| LARGE|           2|           2|
|  MID| LARGE|           2|           2|
|  MID| SMALL|           2|           0|
|  MID|MEDIUM|           2|           1|
|   LO| LARGE|           3|           2|
|   HI|MEDIUM|           1|           1|
|   LO| SMALL|           3|           0|
|   HI|MEDIUM|           1|           1|
|  MID| SMALL|           2|           0|
|  MID|MEDIUM|           2|           1|
|   HI| SMALL|           1|           0|
|   HI| LARGE|           1|           2|
|  MID| LARGE|           2|           2|
+-----+------+------------+------------+
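To confirm the difference on your own data, you can time both variants with something like the sketch below (it assumes the hypothetical map_feat1 pandas_udf from earlier; the count() call is only there to force Spark to evaluate the whole column):

import time

df = fresh_df(n=1000000, seed=42)

# Time the pandas_udf path (uses the hypothetical map_feat1 above).
start = time.perf_counter()
df.withColumn("feat1_mapped", map_feat1(df.feat1)).count()
print("pandas_udf:    ", time.perf_counter() - start)

# Time the native when/otherwise path.
start = time.perf_counter()
df.withColumn("feat1_mapped", f
    .when(df.feat1 == f.lit("HI"), 1)
    .otherwise(f.when(df.feat1 == f.lit("MID"), 2).otherwise(3))).count()
print("when/otherwise:", time.perf_counter() - start)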