
Mrjob And Python - .csv File Output For Reducer?

I'm using the MRJob module for Python 2.7. I have created a class that inherits from MRJob and have correctly mapped everything using the inherited mapper function. The problem is, I

Solution 1:

To manage input and output formats in mrjob, you need to use protocols.

Luckily, there is an existing package which implements a CSV protocol that you could use - https://pypi.python.org/pypi/mr3px
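For context, an mrjob protocol is just an object with a read(line) method that returns a (key, value) pair and a write(key, value) method that returns one encoded output line. The class below is a minimal sketch of that idea, written for Python 3 with the standard csv module. SimpleCsvProtocol is a hypothetical name used only for illustration; it is not the mr3px implementation, and it assumes the key can be ignored on output.

```python
import csv
import io


class SimpleCsvProtocol(object):
    """Hypothetical minimal CSV protocol (illustration only, not mr3px)."""

    def read(self, line):
        # Recent mrjob versions pass raw bytes; accept text as well.
        if isinstance(line, bytes):
            line = line.decode('utf-8')
        row = next(csv.reader([line]))
        # CSV lines carry no key, so return None as the key.
        return None, row

    def write(self, key, value):
        # The key is ignored (an assumption of this sketch);
        # value is a list or tuple of fields.
        buf = io.StringIO()
        csv.writer(buf, lineterminator='\n').writerow(value)
        # Drop the trailing newline; the framework adds its own line ending.
        return buf.getvalue()[:-1].encode('utf-8')
```

Using the csv module rather than a plain ','.join() means fields that contain commas or quotes are escaped correctly.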

Import the package in your job script

from mr3px.csvprotocol import CsvProtocol

Specify the protocol in your job class

class CsvOutputJob(MRJob):
    ...
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV

And then just yield your list (or tuple) of fields

def reducer(self, geo_key, info_list):
    for row in info_list:
        yield (None, row)

Note that you cannot reliably add a header row to this output because Hadoop will use several reducers to generate the output in parallel.
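If a header is still needed, one workaround is to add it after the job has finished, when the per-reducer part files are merged into a single file. The sketch below assumes the output has already been copied to a local directory, that the files follow Hadoop's usual part-* naming, and that each part ends with a newline. merge_parts_with_header and its parameters are hypothetical names, not part of mrjob or mr3px.

```python
import glob
import os
import shutil


def merge_parts_with_header(output_dir, merged_path, header_fields):
    """Concatenate part-* files into one CSV file with a header line.

    Returns the number of part files that were merged.
    """
    part_paths = sorted(glob.glob(os.path.join(output_dir, 'part-*')))
    with open(merged_path, 'wb') as merged:
        # Assumes header names contain no commas or quotes.
        merged.write((','.join(header_fields) + '\n').encode('utf-8'))
        for path in part_paths:
            with open(path, 'rb') as part:
                # Copy the raw bytes so rows are not re-parsed or altered.
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

For example, merge_parts_with_header('output', 'result.csv', ['geo_key', 'value']) would write the header once and then append every part file in name order; the column names here are placeholders.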

To use this package on EMR, you'll need to install it during the instance bootstrap phase by adding an item to the bootstrap section of your config.

runners:
  emr:
    ...
    bootstrap:
    - sudo apt-get install -y python-setuptools
    - sudo easy_install pip
    - sudo pip install mr3px

Disclaimer: I am the maintainer of the mr3px package, which is forked from mr3po.
