
How Can I Use reduceByKey Instead Of groupByKey To Construct A List?

My RDD is made of many items, each of which is a tuple as follows: (key1, (val1_key1, val2_key1)) (key2, (val1_key2, val2_key2)) (key1, (val1_again_key1, val2_again_key1)) ... and

Solution 1:

The answer is that you cannot, or at least not in a straightforward and Pythonic way without abusing the language's dynamism. reduceByKey expects a function that takes two values and returns a value of the same type. Here the input values are single tuples while the desired result is a list of tuples, so reduce is not a valid function. You could use combineByKey or aggregateByKey instead, for example like this:

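from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # reuses the shell's context if one exists

rdd = sc.parallelize([
    ("key1", ("val1_key1", "val2_key1")),
    ("key2", ("val1_key2", "val2_key2"))])

# zeroValue=[], seqFunc appends one value, combFunc merges two partial lists
rdd.aggregateByKey([], lambda acc, x: acc + [x], lambda acc1, acc2: acc1 + acc2)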

However, this is just a less efficient version of groupByKey. See also: "Is groupByKey ever preferred over reduceByKey".
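To make the type argument concrete, here is a minimal pure-Python sketch of how an aggregateByKey-style operation treats its three arguments. It is only a local model, not Spark code: the function name `aggregate_by_key_local` and the explicit list of partitions are assumptions made for illustration. The point is that the sequence function has the shape (list, tuple) -> list and the combine function has the shape (list, list) -> list, whereas reduceByKey would need a single function of the shape (tuple, tuple) -> tuple.

```python
import copy


def aggregate_by_key_local(partitions, zero, seq_op, comb_op):
    """Hypothetical local model of aggregateByKey; not part of PySpark."""
    # Step 1: fold each partition on its own with seq_op: (U, V) -> U.
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            # Every key starts from its own copy of the zero value.
            start = acc[key] if key in acc else copy.deepcopy(zero)
            acc[key] = seq_op(start, value)
        per_partition.append(acc)

    # Step 2: merge the partial results with comb_op: (U, U) -> U.
    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            if key in merged:
                merged[key] = comb_op(merged[key], partial)
            else:
                merged[key] = partial
    return merged


# Two simulated partitions; "key1" appears in both of them.
partitions = [
    [("key1", ("val1_key1", "val2_key1")),
     ("key2", ("val1_key2", "val2_key2"))],
    [("key1", ("val1_again_key1", "val2_again_key1"))],
]

result = aggregate_by_key_local(
    partitions,
    [],
    lambda acc, x: acc + [x],
    lambda acc1, acc2: acc1 + acc2,
)
```

In this model, `result["key1"]` holds both tuples for key1 in a single list, which is the same shape groupByKey would produce for that key.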
