Apache hadoop gives you option to program your mapper and reducer in
your favourite language.If you wonder about its possibility, you will
know it by yourself by going through this blog. Since python got into my favourite-language list recently,
let me try with it. Python already has a module: pydoop which
provides you with API to program map reduce. But this time,
we will program without using pydoop, thereby you will get
an idea how you can achieve the same in your preferred programming
language. Apache hadoop comes with a streaming jar which takes as
parameters: your mapper program, your reducer program, input file and
output file. It then streams the data, in your input file, to the
stdin (if I am going a bit too technical here, refer standard streams)
of mapper program. Your mapper program is supposed to read from stdin,
process the data and write to stdout as key-value pairs; its completely
upto you to choose the separator for key and value, since you are
going to get back those key-value data. The stdout of mapper is then
taken by the hadoop-streaming-jar, sorts the data from all mapper's
execution(fyi: mapper program is executed in all the data nodes, the
input file is chunked and stored), sorts that data based on key and
writes to the stdin of reducer program. Your reducer program should be
in such a way that it should read from stdin a key-value pair per line
and do the necessary processing to print out the final processed data.
Now, I will give you a feel of how things are going to work-out with a
character-count program, implemented in python, which will give your
the count of all alphabets in the input data.
Now let me show you my mapper and reducer code:
In charcount_mapper.py, I read line by line from stdin and go through
each line character by character and check if it's an alphabet. If it's
an alphabet, I print it to stdout in the format: "<character><tab-space>1". This
means to reducer program that the character appeared one time. Here
the key is <character> and value is '1'(one). There can be multiple
occurrence of same "<character><tab-space>1" depending on the input data.
Now lets analyze charcount_reducer.py. In this, I read line by line
from stdin, since its what I wrote into stdout from mapper program, I
can foretell that every line will be of the form:
"<character><tab-space>1". The only difference between what I wrote
into stdout from mapper program and what I get from stdin in reducer
program is that input will be sorted based on keys when I read in
reducer program. This will be helpful for me to construct logic for
reducer program. Now I just need to see if a new key is encountered.
Till then I keep on incrementing the counter. Once a new key is found,
the old key along with the counter value is printed out and counter is
reset. In the for loop, key along with counter value is printed out
only when a new key is encountered. Therefore I add one more print
statement at the end of the program to print out the last key and count.
(In case you are confused about the use of last print statement outside
Since it will be hard to debug programs in hadoop I will ensure the
functionality of my program locally. I will use a sample input:
You can easily predict the output our program should give out. Lets
see if we can get the same from the program.
Keep eyes on the command used in each screen-shot.
I will use the 'cat' command to print out the contents of the input file.
It is then piped to mapper program.
As I said this will the output of mapper program and to feed into
reducer program, for now, we will have to explicitly sort it.
Now its ready to be fed to the reducer program:
The output is as expected, isn't it!
Now lets run the same program in hadoop setup to see its success.
For that, start the hadoop running the start-all.sh script. (Refer the
part 1 of Big Data series in case of any confusion)
Then we need to copy our sample input file into HDFS file system. Know
the command to do it? Let me help you..
Before that I will create a directory for our use:
Now we have a directory "charcount" in the path /user/thinker/ in the
Hadoop file-system. Lets copy our input file from my local file-system
to hadoop file-system.
Lets ensure that the file's existence and its content:
Now we are sure about our input. Lets further with the execution.
For that, the command is:
To ensure the availability of the hadoop-streaming jar, run the command:
This is the jar which does the job of read the contents of input file
and feeding it to mapper program and ... (rest you already know)
the "-file" says the files which has your programs. The programs need
not be copies to HDFS. I used two "-file" to mention my mapper and
reducer files. "-mapper" mentions the mapper program's file name(only
file-name and not entire path). "-reducer" mentions the reducer program's
file-name. "-input" is used to mention the hdfs-absolute-path of input
file and "-output" mentions the output directory to which the output will
be written to.
Lets see the output of successful execution of the above command:
Listing the contents of /user/thinker/charcount/:
We can see a new directory with the name: sample_output.
Note: The command will fail to execute if you give an already existing
file/folder name with '-output' option.
Lets list the sample_output:
Our output will be in 'part-*' files. Since the output size is very
small in this particular case, we have only one file. The number of
files increases with increase in output size.
Lets print the generated output for final verification.
Even though the output differs in order, the counts are correct. You
can run the program with some other input by changing the file given
with '-input' option. You can use any language to
program your mapper and reducer. Points to be noted are:
Scripting languages with its jvm/interpreter installed in all the datanodes is a must.
In case of compiled languages like C or C++, you will have to compile it first and
the executable file need to be mentioned with '-file', '-mapper' and '-reducer'.
With that, I think I covered almost everything needed for you to kickstart your mapreduce
programming in your preferred language. See you in next part..
Big Data is a Big Deal..
The reference given at the bottom most of this page can give you a detailed description on setup of Hadoop. I will take you through my experience in setting it up in Ubuntu.
You should have a linux/unix system with jvm installed and password-less ssh enabled.
Download the latest release of hadoop FROM
I prefer *.tar.gz to other installable packages because once you setup hadoop with installable packages, it will be hard for you to find the configuration files for any editing(from my experience; I removed it and installed with *.tar.gz).
Assuming that your browser downloaded the hadoop tar file to Downloads folder.
I chose /app folder to setup hadoop. So move the tar file to /app
Unzip and un-tar the file there:
You will need to edit the hadoop-env.sh file to set the JAVA_HOME environment variable.
If you try to start hadoop without this modification, hadoop will fail to start throwing the below error:
gedit is a text editor I am using. You can prefer your favourite(vi/vim/textedit/…)
location of hadoop-env.sh (hadoop<version>/conf)
You will find below lines in hadoop-env.sh
Either edit the already existing line of add a new line as I did:
You can know about your specific location with following commands:
As you can see, I highlighted /usr/lib/jvm/java-7-oracle/jre/bin/java. hadoop expects us to specify the path till java-7-oracle ie. “/usr/lib/jvm/java-7-oracle”
This will be enough to kick-start your hadoop in stand-alone mode.
Since I plan to install Apache Pig for scripting, I will setup hadoop in pseudo Distributed mode. For that I need to edit three files: core-site.xml, hdfs-site.xml and mapred-site.xml which can be found in “hadoop<version>/conf/” directory. The same information can be found in the reference as well.
Now the recipe is ready. Before I can start hadoop there is this one final thing to be done: formatting of name-node. Assuming that you are in the hadoop main directory, run the command: “bin/hadoop namenode -format”
Done with the waiting part. Run the command “bin/start-all.sh” to run NameNode, Secondary NameNode, Data Node, Task Tracker and Job Tracker as back-end processes.
To ensure that all five services are running, use jps command. If you see the below output, “ALL IS WELL..”
Out of my experience in setting it up in different linux and unix variants including Mac, I can say, the same steps can be repeated in any *nix variants.
Big Data is a Big Deal.. 🙂