
Spark Pandas_udf Is Not Faster

I'm facing a heavy data transformation. In a nutshell, I have columns of data, each containing strings which correspond to some ordinals. For example, HIGH, MID and LOW. My objective is to map these strings to integers that preserve the ordering, but the pandas_udf I wrote for this turned out not to be any faster.

Solution 1:

Why so slow? Spark runs in the JVM, while pyspark user code runs in a separate Python process. To execute a (pandas_)udf, Spark has to serialize each batch of rows, ship it from the JVM to the Python worker, run your function, and deserialize the result back into the JVM. For a simple mapping, that round trip dominates the cost.
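The question's code isn't shown, but a pandas_udf for this kind of mapping presumably looks something like the sketch below (the function name and mapping dict are assumptions, using the Spark 3.x type-hinted pandas_udf API). Even though the data moves via Arrow, every batch still pays the JVM-to-Python serialization round trip:

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

# Hypothetical pandas_udf doing the same mapping. Each batch of rows is
# serialized out of the JVM, processed in a Python worker, and serialized back.
@f.pandas_udf(IntegerType())
def map_feat1(s: pd.Series) -> pd.Series:
    # assumes every value is one of the three keys; unmapped values would
    # become NaN and make the astype fail
    return s.map({"HI": 1, "MID": 2, "LO": 3}).astype("int32")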

You can instead map the values with the when and otherwise functions. These compile to native Spark SQL expressions that run entirely inside the JVM, so the serialize/deserialize round trip disappears and performance improves.

import numpy as np
import pandas as pd
import pyspark.sql.functions as f
from pyspark.shell import spark  # ready-made SparkSession, as in the interactive shell


def fresh_df(n=100000, seed=None):
    # Generate n random rows of the two categorical features.
    np.random.seed(seed)
    feat1 = np.random.choice(["HI", "LO", "MID"], size=n)
    feat2 = np.random.choice(["SMALL", "MEDIUM", "LARGE"], size=n)

    pdf = pd.DataFrame({
        "feat1": feat1,
        "feat2": feat2
    })
    return spark.createDataFrame(pdf)


df = fresh_df()
df = df.withColumn('feat1_mapped',
                   f.when(df.feat1 == 'HI', 1)
                    .when(df.feat1 == 'MID', 2)
                    .otherwise(3))

df = df.withColumn('feat2_mapped',
                   f.when(df.feat2 == 'SMALL', 0)
                    .when(df.feat2 == 'MEDIUM', 1)
                    .otherwise(2))
df.show(n=20)

Output

+-----+------+------------+------------+
|feat1| feat2|feat1_mapped|feat2_mapped|
+-----+------+------------+------------+
|   LO| SMALL|           3|           0|
|   LO|MEDIUM|           3|           1|
|  MID|MEDIUM|           2|           1|
|  MID| SMALL|           2|           0|
|  MID| LARGE|           2|           2|
|  MID| SMALL|           2|           0|
|   LO| SMALL|           3|           0|
|  MID| LARGE|           2|           2|
|  MID| LARGE|           2|           2|
|  MID| SMALL|           2|           0|
|  MID|MEDIUM|           2|           1|
|   LO| LARGE|           3|           2|
|   HI|MEDIUM|           1|           1|
|   LO| SMALL|           3|           0|
|   HI|MEDIUM|           1|           1|
|  MID| SMALL|           2|           0|
|  MID|MEDIUM|           2|           1|
|   HI| SMALL|           1|           0|
|   HI| LARGE|           1|           2|
|  MID| LARGE|           2|           2|
+-----+------+------------+------------+
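If the number of categories grows, chaining when clauses gets verbose. One alternative that also stays entirely in the JVM (not from the original answer, just a sketch over the same data) is to build a literal map column from a Python dict and look each value up with element_at:

from itertools import chain

# Build a literal MapType column from a plain dict; the lookup compiles to a
# native Spark expression, so no rows are shipped to a Python worker.
mapping = {"HI": 1, "MID": 2, "LO": 3}
mapping_col = f.create_map(*chain.from_iterable(
    (f.lit(k), f.lit(v)) for k, v in mapping.items()
))
df = df.withColumn("feat1_mapped", f.element_at(mapping_col, df.feat1))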
