
Spark Pandas_udf Is Not Faster

I'm facing a heavy data transformation. In a nutshell, I have columns of data, each containing strings which correspond to some ordinals. For example, HIGH, MID and LOW. My objective is to map these strings to integers that preserve the ordering, but the pandas_udf I wrote for this turned out not to be any faster.

Solution 1:

Why so slow? Spark runs in the JVM, while pyspark user code runs in a separate Python process. To execute a (pandas_)udf, Spark has to serialize each batch of rows, ship it from the JVM to the Python worker, run your function, and deserialize the result back into the JVM. For a simple mapping, that round trip dominates the cost.
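The question's code isn't shown, but a pandas_udf for this kind of mapping presumably looks something like the sketch below (the function name and mapping dict are assumptions, using the Spark 3.x type-hinted pandas_udf API). Even though the data moves via Arrow, every batch still pays the JVM-to-Python serialization round trip:

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import IntegerType

# Hypothetical pandas_udf doing the same mapping. Each batch of rows is
# serialized out of the JVM, processed in a Python worker, and serialized back.
@f.pandas_udf(IntegerType())
def map_feat1(s: pd.Series) -> pd.Series:
    # assumes every value is one of the three keys; unmapped values would
    # become NaN and make the astype fail
    return s.map({"HI": 1, "MID": 2, "LO": 3}).astype("int32")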

You can instead map the values with the when and otherwise functions. These compile to native Spark SQL expressions that run entirely inside the JVM, so the serialize/deserialize round trip disappears and performance improves.

import numpy as np
import pandas as pd
import pyspark.sql.functions as f
from pyspark.shell import spark  # ready-made SparkSession, as in the interactive shell


def fresh_df(n=100000, seed=None):
    # Generate n random rows of the two categorical features.
    np.random.seed(seed)
    feat1 = np.random.choice(["HI", "LO", "MID"], size=n)
    feat2 = np.random.choice(["SMALL", "MEDIUM", "LARGE"], size=n)

    pdf = pd.DataFrame({
        "feat1": feat1,
        "feat2": feat2
    })
    return spark.createDataFrame(pdf)


df = fresh_df()
df = df.withColumn('feat1_mapped',
                   f.when(df.feat1 == 'HI', 1)
                    .when(df.feat1 == 'MID', 2)
                    .otherwise(3))

df = df.withColumn('feat2_mapped',
                   f.when(df.feat2 == 'SMALL', 0)
                    .when(df.feat2 == 'MEDIUM', 1)
                    .otherwise(2))
df.show(n=20)

Output

+-----+------+------------+------------+
|feat1| feat2|feat1_mapped|feat2_mapped|
+-----+------+------------+------------+
|   LO| SMALL|           3|           0|
|   LO|MEDIUM|           3|           1|
|  MID|MEDIUM|           2|           1|
|  MID| SMALL|           2|           0|
|  MID| LARGE|           2|           2|
|  MID| SMALL|           2|           0|
|   LO| SMALL|           3|           0|
|  MID| LARGE|           2|           2|
|  MID| LARGE|           2|           2|
|  MID| SMALL|           2|           0|
|  MID|MEDIUM|           2|           1|
|   LO| LARGE|           3|           2|
|   HI|MEDIUM|           1|           1|
|   LO| SMALL|           3|           0|
|   HI|MEDIUM|           1|           1|
|  MID| SMALL|           2|           0|
|  MID|MEDIUM|           2|           1|
|   HI| SMALL|           1|           0|
|   HI| LARGE|           1|           2|
|  MID| LARGE|           2|           2|
+-----+------+------------+------------+
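If the number of categories grows, chaining when clauses gets verbose. One alternative that also stays entirely in the JVM (not from the original answer, just a sketch over the same data) is to build a literal map column from a Python dict and look each value up with element_at:

from itertools import chain

# Build a literal MapType column from a plain dict; the lookup compiles to a
# native Spark expression, so no rows are shipped to a Python worker.
mapping = {"HI": 1, "MID": 2, "LO": 3}
mapping_col = f.create_map(*chain.from_iterable(
    (f.lit(k), f.lit(v)) for k, v in mapping.items()
))
df = df.withColumn("feat1_mapped", f.element_at(mapping_col, df.feat1))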
