
Using Hadoop To Run A Jar File - Python

I have an existing Python program with a sequence of operations that goes something like this: connect to a MySQL DB and retrieve files into the local FS, then run a program X that operates on those files…

Solution 1:

You absolutely can use the Hadoop MapReduce framework to complete your work, but whether it's a good idea depends on the number and sizes of the files you want to process.

Keep in mind that HDFS is not very good at dealing with small files; it can be a disaster for the NameNode if you have a large number (say, 10 million) of small files (less than 1 KB each). On the other hand, if the files are large but there are only a few of them, it is not a good idea to just wrap step #2 directly in a mapper, because the job won't be spread widely and evenly. In that situation the key-value pair can only be "file no. - file content" or "file name - file content", given that you mentioned X can't be changed in any way; "line no. - line" would be more suitable. A sketch of such a wrapping mapper follows.
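For illustration, here is a minimal sketch of a streaming-style mapper in Python that wraps an unmodifiable program around each file. The program name X, the mapper file name, and the assumption that each input line is a local file path are my own placeholders, not details from the original question:

    #!/usr/bin/env python
    # mapper.py - a minimal sketch. Assumes each input line is the path to a
    # file already present on the local FS of every node, and that the
    # unmodifiable program X (hypothetical name) takes a file path as its
    # only argument and writes its result to stdout.
    import subprocess
    import sys

    for line in sys.stdin:
        path = line.strip()
        if not path:
            continue
        # Run X on the file without modifying X itself.
        result = subprocess.run(["./X", path], capture_output=True, text=True)
        # Emit "file name <TAB> output of X" as the key-value pair.
        sys.stdout.write("%s\t%s\n" % (path, result.stdout.strip()))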

BTW, there are two ways to use the Hadoop MapReduce framework. One is to write the mapper/reducer in Java, compile them into a jar, and run the MapReduce job with hadoop jar your_job.jar. The other is Hadoop Streaming, which lets you write the mapper/reducer in Python.
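For example, assuming the mapper sketched above and a map-only job (no reducer is needed when each file is processed independently), a streaming job is launched roughly like this; the exact jar path varies by Hadoop version and distribution:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -files mapper.py \
        -mapper mapper.py \
        -numReduceTasks 0 \
        -input /user/you/file-list.txt \
        -output /user/you/output

Here /user/you/file-list.txt is a hypothetical HDFS file listing one file path per line, and /user/you/output must not exist before the job runs.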
