
Managing Dependencies With Hadoop Streaming?

I have a quick Hadoop Streaming question. If I'm using Python streaming and I have Python packages that my mappers/reducers require but aren't installed by default do I need to ins

Solution 1:

If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar \
    -mapper mapper.py \
    -reducer reducer.py \
    -input input/foo \
    -output output \
    -file /tmp/foo.py \
    -file /tmp/lib.zip

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596
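
In case the zip does not get unpacked on the task node, one workaround is to import straight from the archive: Python can load pure-Python modules from a zip file that is on sys.path. Here is a minimal sketch, assuming lib.zip was shipped with -file and lands in the task's working directory. The names add_zip_to_path, map_lines, and the commented-out mylib import are hypothetical, chosen for illustration, and not part of Hadoop:

```python
#!/usr/bin/env python
"""Sketch of helpers for a streaming mapper that imports from a shipped zip."""
import os
import sys

# Assumption: lib.zip was sent with -file and sits in the task's working directory.
LIB_ZIP = "lib.zip"


def add_zip_to_path(zip_path=LIB_ZIP):
    """Put the archive on sys.path so Python's zip import support can use it.

    Returns True if the archive exists and was added (or was already present),
    False if no such file was found.
    """
    if not os.path.isfile(zip_path):
        return False
    abs_path = os.path.abspath(zip_path)
    if abs_path not in sys.path:
        sys.path.insert(0, abs_path)
    return True


def map_lines(lines):
    """Yield tab-separated "word<TAB>1" records, one per whitespace-split token."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


# In mapper.py itself you would finish with something like:
#
#     add_zip_to_path()
#     import mylib  # hypothetical module packaged inside lib.zip
#     for record in map_lines(sys.stdin):
#         print(record)
```

Call add_zip_to_path() before importing anything that lives in the archive. This only helps with pure-Python code; compiled extension modules cannot be imported from a zip.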

Solution 2:

If you use Dumbo you can use -libegg to distribute egg files and auto-configure the Python runtime:

https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars

https://github.com/klbostee/dumbo/wiki/Configuration-files
