Is It Possible To Scale Data By Group In Spark?
I want to scale data with StandardScaler (from pyspark.mllib.feature import StandardScaler). Right now I can do it by passing the values of the RDD to the transform function, but the problem is that this standardizes everything together, and I want to scale the values separately for each key/group.
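For reference, a minimal sketch of the global approach described above, assuming rdd holds (key, vector) pairs like the one built in the answer below; the variable names here are only illustrative:

from pyspark.mllib.feature import StandardScaler

# Fit and transform on the values alone; this ignores the grouping key,
# so every row is standardized with the global mean and stdev.
values_rdd = rdd.values()
model = StandardScaler(withMean=True, withStd=True).fit(values_rdd)
scaled_all = model.transform(values_rdd)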
Solution 1:
Not exactly a pretty solution, but you can adjust my answer to the similar Scala question. Let's start with some example data:
import numpy as np
np.random.seed(323)
keys = ["foo"] * 50 + ["bar"] * 50
values = (
np.vstack([np.repeat(-10, 500), np.repeat(10, 500)]).reshape(100, -1) +
np.random.rand(100, 10)
)
rdd = sc.parallelize(zip(keys, values))
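Just to make the layout concrete (a small addition to the original example), each record pairs a group key with a length-10 NumPy vector:

k, v = rdd.first()
print(k, v.shape)   # foo (10,)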
Unfortunately MultivariateStatisticalSummary is just a wrapper around a JVM model and is not really Python friendly. Luckily, since the values are NumPy arrays, we can use the standard StatCounter to compute statistics by key:
from pyspark.statcounter import StatCounter

def compute_stats(rdd):
    # Build one StatCounter per key and collect the result as a {key: StatCounter} dict.
    return rdd.aggregateByKey(
        StatCounter(), StatCounter.merge, StatCounter.mergeStats
    ).collectAsMap()
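For illustration (not part of the original answer), the result is a plain Python dict keyed by group, and because the values are NumPy arrays the per-key statistics are element-wise, so mean() and stdev() return length-10 arrays here:

stats = compute_stats(rdd)
print(sorted(stats))               # ['bar', 'foo']
print(stats["foo"].mean().shape)   # (10,)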
Finally we can map to normalize:
def scale(rdd, stats):
    def scale_(kv):
        k, v = kv
        # Standardize each vector with its own group's mean and stdev.
        return (v - stats[k].mean()) / stats[k].stdev()
    return rdd.map(scale_)
scaled = scale(rdd, compute_stats(rdd))
scaled.first()
## array([ 1.59879188, -1.66816084,  1.38546532,  1.76122047,  1.48132643,
##         0.01512487,  1.49336769,  0.47765982, -1.04271866,  1.55288814])
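As an optional sanity check (a sketch added on top of the answer, not from the original), re-keying the scaled vectors with the original keys and reusing compute_stats should give per-group means close to 0 and standard deviations close to 1:

# Both RDDs are maps of the same parent, so zip pairs them element by element.
check = compute_stats(rdd.keys().zip(scaled))
print(np.round(check["foo"].mean(), 6))    # ~0 everywhere
print(np.round(check["foo"].stdev(), 6))   # ~1 everywhere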