
Loading A Defaultdict In Hadoop Using Pickle And Sys.stdin

I posted a similar question about an hour ago, but have since deleted it after realising I was asking the wrong question. I have the following pickled defaultdict (quoted here truncated; "ccollections def" is the start of the raw protocol-0 pickle stream for a collections.defaultdict):

ccollections def

Solution 1:

Why is your input data in the pickle format? Where does it come from? One of the goals of Hadoop/MapReduce is to process data that is too large to fit into the memory of a single machine. Reading the whole input into memory and then deserializing it therefore runs contrary to the MapReduce design paradigm, and it most likely won't even work on production-scale data sets.

The solution is to reformat your input data as, for example, a TSV text file with exactly one key/value pair of your dictionary per row. You can then process each pair on its own, e.g.:

import sys

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)  # one key/value pair per line
    key, value = process(key, value)               # process() stands in for your per-record logic
    print(key, value, sep="\t")                    # emit in Hadoop Streaming's tab-separated format
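To produce that TSV in the first place, you can unpickle the existing defaultdict once, offline, and write the pairs out as text. A minimal sketch, assuming hypothetical file names counts.pkl and counts.tsv:

import pickle

# One-off, offline conversion: load the pickled defaultdict and write one
# tab-separated key/value pair per line. File names are placeholders.
with open("counts.pkl", "rb") as f:
    data = pickle.load(f)

with open("counts.tsv", "w") as out:
    for key, value in data.items():
        out.write(f"{key}\t{value}\n")

The resulting file can then be uploaded to HDFS and fed to the streaming job like any other text input.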

Solution 2:

If you do read the data in completely, I believe you can use pickle.loads(). Note that pickled data is binary, so under Python 3 you have to read from the underlying byte buffer of sys.stdin:

import pickle, sys

myDict = pickle.loads(sys.stdin.buffer.read())  # .buffer yields the bytes pickle expects
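Keep in mind that this round trip only works if the defaultdict's default_factory is itself picklable: a named callable such as int or list is fine, while a lambda will fail to pickle in the first place. A small sanity check, independent of Hadoop:

import pickle
from collections import defaultdict

d = defaultdict(int)   # int is a picklable default_factory; a lambda would raise on dumps()
d["spam"] += 1

restored = pickle.loads(pickle.dumps(d))
assert restored == d and restored["new key"] == 0  # default_factory survives the round trip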
