Big Data Series, Part 2: Program Hadoop MapReduce in your favourite language.


Apache Hadoop gives you the option to program your mapper and reducer in
your favourite language. If you wonder how that is possible, you will
know it by yourself by the end of this blog. Since Python got into my favourite-language list recently,
let me try with it. Python already has a module, pydoop, which
provides you with an API to program MapReduce. But this time,
we will program without using pydoop, so that you get
an idea of how you can achieve the same in your preferred programming
language. Apache Hadoop comes with a streaming jar which takes as
parameters your mapper program, your reducer program, the input file and the
output directory. It then streams the data in your input file to the
stdin (if I am going a bit too technical here, refer to standard streams)
of the mapper program. Your mapper program is supposed to read from stdin,
process the data and write it to stdout as key-value pairs; it's completely
up to you to choose the separator between key and value, since you are
going to get those key-value pairs back. The stdout of the mapper is then
taken by the hadoop-streaming jar (fyi: the mapper program is executed on
all the data nodes, where the input file is chunked and stored), which
sorts the data from all the mapper executions based on the key and
writes it to the stdin of the reducer program. Your reducer program should
read a key-value pair per line from stdin
and do the necessary processing to print out the final processed data.
Now, I will give you a feel of how things are going to work out with a
character-count program, implemented in Python, which will give you
the count of all alphabets in the input data.

Now let me show you my mapper and reducer code:
charcount_mapper.py:
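A minimal sketch of such a mapper, assuming a tab as the separator between key and value:

#!/usr/bin/env python
# charcount_mapper.py -- a minimal sketch, assuming tab as the key-value separator
import sys

for line in sys.stdin:              # hadoop-streaming feeds the input data to stdin
    for ch in line:                 # go through the line character by character
        if ch.isalpha():            # consider only alphabets
            print("%s\t1" % ch)     # emit "<character><tab>1" to stdout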


In charcount_mapper.py, I read line by line from stdin, go through
each line character by character and check if it's an alphabet. If it is,
I print it to stdout in the format "<character><tab-space>1". This
tells the reducer program that the character appeared one time. Here
the key is <character> and the value is '1' (one). There can be multiple
occurrences of the same "<character><tab-space>1" depending on the input data.

charcount_reducer.py:
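And a minimal sketch of such a reducer, assuming the mapper's tab separator:

#!/usr/bin/env python
# charcount_reducer.py -- a minimal sketch, assuming the mapper's tab separator
import sys

current_key = None
count = 0

for line in sys.stdin:                       # every line is "<character><tab>1", sorted on the key
    line = line.strip()
    if not line:                             # skip empty lines, just in case
        continue
    key, value = line.split('\t', 1)
    if key == current_key:
        count += int(value)                  # same key: keep incrementing the counter
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))   # new key found: print the old key and its count
        current_key = key                    # reset the counter for the new key
        count = int(value)

if current_key is not None:
    print("%s\t%d" % (current_key, count))   # print the last key and its count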
Now let's analyze charcount_reducer.py. In this, I read line by line
from stdin; since it's what I wrote into stdout from the mapper program, I
can foretell that every line will be of the form
"<character><tab-space>1". The only difference between what I wrote
into stdout from the mapper program and what I get from stdin in the reducer
program is that the input will be sorted based on keys when I read it in the
reducer program. This is helpful for constructing the logic of the
reducer program. Now I just need to see if a new key is encountered;
till then, I keep on incrementing the counter. Once a new key is found,
the old key along with the counter value is printed out and the counter is
reset. In the for loop, the key along with the counter value is printed out
only when a new key is encountered. Therefore I add one more print
statement at the end of the program to print out the last key and count
(in case you were confused about the use of the last print statement outside
the for-loop).
Since it will be hard to debug programs in Hadoop, I will first ensure the
functionality of my programs locally, using a small sample input file
(say, sample.txt) with a few lines of text.
You can easily predict the output our program should give for such an input;
let's see if we get the same from the program.
Keep an eye on the command used in each step.
I use the 'cat' command to print out the contents of the input file
and pipe it to the mapper program.
As I said, this will be the output of the mapper program, and to feed it into
the reducer program we will, for now, have to sort it explicitly.
Then it is ready to be fed to the reducer program.
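Put together, the local test amounts to commands along these lines (assuming the input file is named sample.txt and both scripts are in the current directory):

# print the contents of the input file
cat sample.txt

# pipe it to the mapper program
cat sample.txt | python charcount_mapper.py

# sort the mapper's output explicitly, as hadoop-streaming would do for us
cat sample.txt | python charcount_mapper.py | sort

# feed the sorted key-value pairs to the reducer program
cat sample.txt | python charcount_mapper.py | sort | python charcount_reducer.py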
The output is as expected, isn't it!
Now let's run the same program in the Hadoop setup to see it succeed.
For that, start Hadoop by running the start-all.sh script (refer to
part 1 of the Big Data series in case of any confusion).
Then we need to copy our sample input file into the HDFS file system. Know
the command to do it? Let me help you.
Before that, I will create a directory for our use:
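A command along these lines does it (the exact syntax can vary slightly across Hadoop versions):

hadoop fs -mkdir /user/thinker/charcount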
Now we have a directory "charcount" under the path /user/thinker/ in the
Hadoop file-system. Let's copy our input file from the local file-system
to the Hadoop file-system.
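For example, assuming the input file is the sample.txt we tested with locally:

hadoop fs -put sample.txt /user/thinker/charcount/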
Let's ensure the file's existence and check its content:
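Something like the following, again assuming the file name sample.txt:

hadoop fs -ls /user/thinker/charcount/
hadoop fs -cat /user/thinker/charcount/sample.txt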
Now we are sure about our input. Let's go further with the execution.
For that, the command is:
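A sketch of the invocation; the actual path of the streaming jar depends on your Hadoop version and installation (see the next command for locating it), and sample.txt is our assumed input file name:

hadoop jar /path/to/hadoop-streaming.jar \
    -file charcount_mapper.py \
    -file charcount_reducer.py \
    -mapper charcount_mapper.py \
    -reducer charcount_reducer.py \
    -input /user/thinker/charcount/sample.txt \
    -output /user/thinker/charcount/sample_output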

To ensure the availability of the hadoop-streaming jar, run the command:
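For instance, assuming the environment variable HADOOP_HOME points to your Hadoop installation:

find $HADOOP_HOME -name "hadoop-streaming*.jar"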
This is the jar which does the job of reading the contents of the input file
and feeding them to the mapper program and ... (the rest you already know).
The "-file" option mentions the files which have your programs; the programs need
not be copied to HDFS. I used two "-file" options to mention my mapper and
reducer files. "-mapper" mentions the mapper program's file name (only the
file name and not the entire path). "-reducer" mentions the reducer program's
file name. "-input" is used to mention the HDFS absolute path of the input
file, and "-output" mentions the output directory to which the output will
be written.
On a successful execution, the above command prints the job's progress and completion status to the console.
Listing the contents of /user/thinker/charcount/:
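The same listing command as before:

hadoop fs -ls /user/thinker/charcount/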
We can see a new directory with the name: sample_output.
Note: the command will fail to execute if you give an already existing
file/folder name with the '-output' option.
Let's list the sample_output directory:
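Again with the ls sub-command:

hadoop fs -ls /user/thinker/charcount/sample_output/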
Our output will be in 'part-*' files. Since the output size is very
small in this particular case, we have only one file. The number of
files increases as the output size grows.
Let's print the generated output for final verification.
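The part files can be printed with the cat sub-command:

hadoop fs -cat /user/thinker/charcount/sample_output/part-*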
Even though the output differs in order, the counts are correct. You
can run the program with some other input by changing the file given
with the '-input' option. You can use any language to
program your mapper and reducer. Points to be noted are:
For scripting languages, the interpreter (or JVM) must be installed on all the datanodes.
In the case of compiled languages like C or C++, you will have to compile the program first, and
the executable file needs to be mentioned with '-file', '-mapper' and '-reducer'.
With that, I think I covered almost everything needed for you to kickstart your MapReduce
programming in your preferred language. See you in the next part..

Big Data is a Big Deal.. :)

