The reference at the bottom of this page gives a detailed description of how to set up Hadoop. Here I will take you through my experience of setting it up on Ubuntu.
You should have a Linux/Unix system with a JVM installed and password-less SSH enabled.
Download the latest release of hadoop FROM
I prefer the *.tar.gz to other installable packages because, once you set up Hadoop from an installable package, it is hard to find the configuration files when you need to edit them (from my experience: I removed the package and installed again from the *.tar.gz).
I assume that your browser downloaded the Hadoop tar file to the Downloads folder.
I chose the /app folder to set up Hadoop, so move the tar file to /app.
Unzip and un-tar the file there:
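The move-and-extract step can be sketched as below. The tarball name in the usage comment is a placeholder for whichever release you downloaded, and install_hadoop is a helper name I made up for this post, not something Hadoop ships. tar -xzf does the unzip and the un-tar in one go.

```shell
# Sketch: move a downloaded Hadoop tarball into an install directory and unpack it.
# install_hadoop is a hypothetical helper name, not part of Hadoop.
install_hadoop() {
    tarball="$1"      # e.g. ~/Downloads/hadoop-<version>.tar.gz
    install_dir="$2"  # e.g. /app
    mkdir -p "$install_dir" || return 1
    mv "$tarball" "$install_dir"/ || return 1
    # -x extract, -z gunzip, -f archive file, -C unpack into install_dir
    tar -xzf "$install_dir/$(basename "$tarball")" -C "$install_dir"
}

# Usage (needs write permission on /app):
#   install_hadoop ~/Downloads/hadoop-<version>.tar.gz /app
```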
You will need to edit the hadoop-env.sh file to set the JAVA_HOME environment variable.
If you try to start Hadoop without this modification, it will fail to start and throw the error below:
gedit is the text editor I use. You can use your favourite (vi/vim/textedit/…).
Location of hadoop-env.sh: hadoop<version>/conf
You will find the lines below in hadoop-env.sh:
Either edit the existing line or add a new line, as I did:
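For illustration, here is one way to append that line from the shell instead of an editor. The JDK path in the usage comment is the one on my machine (yours may differ), and set_java_home is a made-up helper name.

```shell
# Sketch: append a JAVA_HOME export to hadoop-env.sh.
# set_java_home is a hypothetical helper; it adds a new line and leaves
# the existing (commented-out) JAVA_HOME line untouched.
set_java_home() {
    env_file="$1"   # e.g. hadoop<version>/conf/hadoop-env.sh
    java_home="$2"  # e.g. /usr/lib/jvm/java-7-oracle
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
}

# Usage, from the Hadoop main directory:
#   set_java_home conf/hadoop-env.sh /usr/lib/jvm/java-7-oracle
```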
You can find the Java location on your own system with the following commands:
As you can see, I highlighted /usr/lib/jvm/java-7-oracle/jre/bin/java. Hadoop expects the path only up to java-7-oracle, i.e. “/usr/lib/jvm/java-7-oracle”.
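The path trimming can also be done in the shell, roughly like this. readlink -f (GNU coreutils) resolves the symlinks behind the java command; java_home_from_binary is my own helper name, and it assumes the binary sits under jre/bin/java or bin/java, as on my machine.

```shell
# Sketch: derive JAVA_HOME from the resolved path of the java binary.
# java_home_from_binary is a hypothetical helper name.
java_home_from_binary() {
    java_bin="$1"
    home="${java_bin%/bin/java}"   # drop the trailing /bin/java
    home="${home%/jre}"            # drop a trailing /jre, if present
    printf '%s\n' "$home"
}

# Usage: resolve the real java binary first, then trim it.
#   java_home_from_binary "$(readlink -f "$(which java)")"
java_home_from_binary /usr/lib/jvm/java-7-oracle/jre/bin/java
# prints /usr/lib/jvm/java-7-oracle
```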
This is enough to kick-start Hadoop in stand-alone mode.
Since I plan to install Apache Pig for scripting, I will set up Hadoop in pseudo-distributed mode. For that I need to edit three files: core-site.xml, hdfs-site.xml and mapred-site.xml, which can be found in the “hadoop<version>/conf/” directory. The same information can be found in the reference as well.
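As a sketch, a minimal pseudo-distributed configuration can be written from the shell like this. The property names and values (fs.default.name, dfs.replication, mapred.job.tracker, ports 9000 and 9001) are the ones I know from the Hadoop 1.x single-node setup guide; treat them as assumptions and check them against the reference for your version. write_site_file and write_pseudo_conf are made-up helper names.

```shell
# Sketch: write minimal pseudo-distributed configs into a Hadoop 1.x conf directory.
# Overwrites the three files, so back them up first if you have edited them.
write_site_file() {
    # $1 = file, $2 = property name, $3 = property value
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

write_pseudo_conf() {
    conf_dir="$1"   # e.g. hadoop<version>/conf
    # Assumed values: HDFS on localhost:9000, one replica, JobTracker on localhost:9001.
    write_site_file "$conf_dir/core-site.xml"   fs.default.name    hdfs://localhost:9000
    write_site_file "$conf_dir/hdfs-site.xml"   dfs.replication    1
    write_site_file "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
}

# Usage, from the Hadoop main directory:
#   write_pseudo_conf conf
```

The replication factor is 1 because a single machine has only one DataNode to hold each block.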
Now the recipe is ready. Before I can start Hadoop there is one final thing to do: format the NameNode. Assuming that you are in the Hadoop main directory, run the command: “bin/hadoop namenode -format”
You will see logs like the ones below:
Done with the waiting part. Run the command “bin/start-all.sh” to start the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker as background processes.
To ensure that all five services are running, use the jps command. If you see the output below, “ALL IS WELL..”
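If you would rather script this check, here is a small sketch. It assumes jps prints one line per JVM in the form "<pid> <ClassName>" and lists any of the five daemons that is missing. missing_daemons is a hypothetical helper name.

```shell
# Sketch: list which of the five Hadoop 1.x daemons are absent from jps output.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    jps_output="$1"
    for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
        # Compare the second column exactly, so NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# Usage: prints nothing when all five services are up.
#   missing_daemons "$(jps)"
```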
From my experience of setting it up on different Linux and Unix variants, including Mac, I can say that the same steps can be repeated on any *nix variant.
Big Data is a Big Deal.. 🙂