Probnorm Function Equivalent In Pyspark

PROBNORM: explanation. The PROBNORM function in SAS returns the probability that an observation from the standard normal distribution is less than or equal to x. Is there any equiv

Solution 1:

I'm afraid PySpark has no built-in method for this. However, you can use a Pandas UDF to define your own function with ordinary Python packages. Here we use norm.cdf from scipy.stats to get cumulative probabilities from a standard normal distribution.

Versions I'm using:

  • Spark 3.1.1
  • pandas 1.1.5
  • scipy 1.5.2

Example code

import pandas as pd
from scipy.stats import norm
import pyspark.sql.functions as F
from pyspark.sql.functions import pandas_udf


# create sample data
df = spark.createDataFrame([
    (1, 0.00),
    (2, -1.23),
    (3, 4.56),
], ['id', 'value'])


# define your custom Pandas UDF
@pandas_udf('double')
def probnorm(s: pd.Series) -> pd.Series:
    return pd.Series(norm.cdf(s))


# create a new column using the Pandas UDF
df = df.withColumn('pnorm', probnorm(F.col('value')))


df.show()

+---+-----+-------------------+
| id|value|              pnorm|
+---+-----+-------------------+
|  1|  0.0|                0.5|
|  2|-1.23|0.10934855242569191|
|  3| 4.56| 0.9999974423189606|
+---+-----+-------------------+

Edit

If scipy is not installed on your workers as well, you can use Python's built-in math module and a little bit of statistics knowledge: the normal CDF can be written in terms of the error function, math.erf.

import math
import pyspark.sql.functions as F
from pyspark.sql.functions import udf


def normal_cdf(x, mu=0, sigma=1):
    """
    Cumulative distribution function for the normal distribution
    with mean `mu` and standard deviation `sigma`
    """
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2


# declare the return type; a plain udf defaults to string
my_udf = udf(normal_cdf, 'double')

df = df.withColumn('pnorm', my_udf(F.col('value')))

df.show()

+---+-----+-------------------+
| id|value|              pnorm|
+---+-----+-------------------+
|  1|  0.0|                0.5|
|  2|-1.23|0.10934855242569197|
|  3| 4.56| 0.9999974423189606|
+---+-----+-------------------+

The results are the same up to floating-point rounding: only the last digit of the second value differs.
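
If you want to sanity-check the math-based formula without a Spark session, here is a minimal sketch that compares it against `statistics.NormalDist` from the Python standard library. It assumes Python 3.8 or later. `NormalDist` is not part of the original answer; it is used here only as an independent reference, and `std_normal` and `values` are names made up for this illustration.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF written with the error function, as in the answer above."""
    return (1 + math.erf((x - mu) / (sigma * math.sqrt(2)))) / 2


# Same sample values as the DataFrame in the answer.
values = [0.00, -1.23, 4.56]

# Independent reference: standard normal (mean 0, standard deviation 1).
std_normal = NormalDist()

for x in values:
    via_erf = normal_cdf(x)
    via_stdlib = std_normal.cdf(x)
    # The two should agree to within floating-point rounding.
    agree = math.isclose(via_erf, via_stdlib, rel_tol=1e-12)
    print(x, via_erf, via_stdlib, agree)
```

Since this runs in plain Python, it is a quick way to test the function body before wrapping it in a UDF.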
