Linux Device Driver: 1 of n

Let's try a “hello world” Linux driver this time. Yeah, I know there are a heck of a lot of tutorials dealing with the same thing, but since I am trying it out myself, I would like to share the steps I followed and the experience I gained from it.
Drivers, not just in Linux but in all operating systems, lie directly on top of the hardware, abstracting the device-specific operations. This reduces the operating system's dependency on the hardware and thereby makes it easy for the operating system to work with different hardware.
Didn’t get that? Let me try once more. Before that, let me tell you one fact: devices are categorized into three classes in Linux (this part is specific to Linux and Unix):

  1. Character Devices    – devices which are read and written one byte at a time.
  2. Block Devices        – devices which are read and written a block (usually 512 bytes) at a time.
  3. Network Devices      – devices which enable the transfer (to be specific, sending and receiving) of packets over a network.

By categorizing devices, Linux can expect the drivers of each category to expose a standard interface. Now what is the use of that? Suppose two devices dev1 and dev2 have their own specific drivers driv1 and driv2. If both drivers have a common interface (functions through which they give out their services), say both implement the functions f1() and f2(), then a program using driv1 will call f1() and f2() to use dev1. Suppose at some point the user wants to replace dev1 with dev2; since the driver driv2 also follows the standard and has implementations of f1() and f2(), there will be no change in the calling program and it need not be recompiled. Thereby we can isolate the changes to a very small section. This is good programming practice even outside driver programming.
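To make that concrete, here is a tiny illustrative sketch (the names are made up; this is not real kernel code) of a common interface built as a table of function pointers. In real Linux drivers this role is played by structures such as file_operations.

/* A "standard interface": every driver fills in the same set of operations. */
struct dev_ops {
    int (*f1)(void);
    int (*f2)(void);
};

static int driv1_f1(void) { return 1; }   /* driver for dev1 */
static int driv1_f2(void) { return 2; }

static int driv2_f1(void) { return 10; }  /* driver for dev2 */
static int driv2_f2(void) { return 20; }

static const struct dev_ops driv1 = { .f1 = driv1_f1, .f2 = driv1_f2 };
static const struct dev_ops driv2 = { .f1 = driv2_f1, .f2 = driv2_f2 };

/* The caller only knows the interface; swapping driv1 for driv2 changes nothing here. */
static int use_device(const struct dev_ops *ops)
{
    return ops->f1() + ops->f2();
}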
In this post, we will try to write our first driver which can be used with Linux. Driver programming is a little different from our usual programming: program execution in everyday programming starts in the 'main' function, but in driver programming there is no main function. Let's see our hello-world without any further ado.

#include <linux/init.h>
#include <linux/module.h>
MODULE_LICENSE("Dual BSD/GPL");

static int hello_kernel_init(void)
{
    printk(KERN_ALERT "Hello Kernel, at your service..\n");
    return 0;
}

static void hello_kernel_exit(void)
{
    printk(KERN_ALERT "Goodbye Kernel, I am no longer at your service..\n");
}

module_init(hello_kernel_init);
module_exit(hello_kernel_exit);

Don’t worry about the weird function call printk; and I hope you already got the idea behind module_init and module_exit: they register our functions (hello_kernel_init and hello_kernel_exit) as the functions to be called at module initialization and at module exit. Good question: “when does that happen?” A driver module is loaded when some software requests a service from the hardware it drives. The request from the application goes to the operating system, and the o/s (short form for operating system) loads the driver module specific to that hardware to serve the application. Let's leave those topics for the time being; we will cover them in the coming chapters. printk is pretty much a ditto of the printf function in C; the difference is the identifier KERN_ALERT at the start of the first argument, which tells the kernel the priority of the message printed by that printk call. There are other similar identifiers like KERN_INFO, KERN_WARNING, etc. Driver modules are not supposed to use any library functions other than those implemented in the kernel. This is because we won't link our code with libraries at compile time; linking and loading is the job of the kernel. So when the kernel links our module, if it sees a reference to any function it does not implement, our module will fail to load, and in rare cases it could even bring the whole system down.
Forgot to tell you that my development environment is Ubuntu with kernel version 3.2.0.
Now I need a makefile to ease my compilation. This simple code should of course have a simple makefile as well, and here it goes:
obj-m := hello_kernel.o
Yes, just one line and the kernel build system will take care of the rest.
At the time of ‘making’ we should give the path of our kernel source. To find that, we should first find the kernel version we are currently running: run ‘uname -r‘ in your terminal.


You will get something like “3.2.0-40-generic”; don't worry if yours differs a bit. The kernel source tree path will then be “/lib/modules/3.2.0-40-generic/build”. This is a soft link to the original path, and it will do for us.
Now we have our source code in “hello_kernel.c” and our makefile, “Makefile”.
Now run the command:
make -C /lib/modules/3.2.0-40-generic/build/ M=`pwd` modules
This command starts by changing directory to the one given with the -C option, which is your kernel source directory, and finds the kernel's top-level makefile there. The M= option helps the make utility traverse back to our current directory to build the target: modules.
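As an aside, if you would rather type plain 'make', the usual trick is to fold that command into the Makefile itself. A sketch of the common pattern (recipe lines must be indented with a tab; the one-liner above plus the explicit command works just as well):

# Invoked from the kernel build system: KERNELRELEASE is set, so just name the object.
ifneq ($(KERNELRELEASE),)
obj-m := hello_kernel.o
else
# Invoked from the command line: re-run make against the kernel source tree.
KERNELDIR ?= /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)

default:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KERNELDIR) M=$(PWD) clean
endif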

Now our module is ready to be loaded. Run the below commands.

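They look like this (hello_kernel.ko is the module file produced by the build, named after the obj-m line in the Makefile):

sudo insmod hello_kernel.ko
lsmod | grep hello_kernel
sudo rmmod hello_kernel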

You need sudo privileges since we are messing with the kernel. insmod loads the driver module, lsmod lists all the loaded modules (we grep for our module in its output), and rmmod removes the driver.

Hope you noticed that we had a few printk statements in our code, yet nothing appeared on the console. That is because all kernel logs go to the “/var/log/syslog” file in Ubuntu; likewise, other Linux distros will have their own specific file. You will see whatever we printed from our driver code in that file. Just run the below command in our case:

cat /var/log/syslog | tail


There ends our first step towards mastering Linux device drivers. Wait for the next chapter for more advanced topics..

Big Data Series; Part 2: Program hadoop mapreduce in your favourite language.


Apache Hadoop gives you the option to program your mapper and reducer in your favourite language. If you wonder how that is possible, you will see it for yourself by going through this post. Since Python got into my favourite-language list recently, let me try with it. Python already has a module, pydoop, which provides an API to program map-reduce. But this time we will program without using pydoop, so that you get an idea of how you can achieve the same thing in your preferred programming language. Apache Hadoop ships with a streaming jar which takes as parameters your mapper program, your reducer program, the input file and the output location. It then streams the data in your input file to the stdin (if I am getting a bit too technical here, refer to standard streams) of your mapper program. Your mapper program is supposed to read from stdin, process the data and write it to stdout as key-value pairs; it is completely up to you to choose the separator between key and value, since you are the one who gets those key-value pairs back. The stdout of the mapper is then taken by the hadoop-streaming jar (fyi: the mapper program is executed on all the data nodes, across which the input file is chunked and stored), which sorts the data from all the mapper executions based on the key and writes it to the stdin of the reducer program. Your reducer program should read one key-value pair per line from stdin and do the necessary processing to print out the final processed data. Now I will give you a feel of how things work out with a character-count program, implemented in Python, which gives you the count of every alphabet in the input data.

Now let me show you my mapper and reducer code:
charcount_mapper.py:
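A minimal sketch of such a mapper, using a tab as the separator between key and value:

#!/usr/bin/env python
# charcount_mapper.py: emit "<character><tab>1" for every alphabet seen on stdin.
import sys

for line in sys.stdin:
    for ch in line:
        if ch.isalpha():
            # key = the character, value = 1
            print("%s\t%s" % (ch, 1))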


In charcount_mapper.py, I read from stdin line by line, go through each line character by character and check whether the character is an alphabet. If it is, I print it to stdout in the format "<character><tab-space>1". To the reducer program this means that the character appeared one time. Here the key is <character> and the value is '1' (one). There can be multiple occurrences of the same "<character><tab-space>1" line, depending on the input data.

charcount_reducer.py:
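A matching reducer sketch, assuming the same tab separator:

#!/usr/bin/env python
# charcount_reducer.py: sum up the 1s for each key arriving in sorted order on stdin.
import sys

current_key = None
count = 0

for line in sys.stdin:
    key, value = line.strip().split("\t", 1)
    if key == current_key:
        count += int(value)
    else:
        # a new key: the previous key is complete, so print it and reset the counter
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key = key
        count = int(value)

# the loop prints a key only when the next one shows up,
# so the last key has to be printed separately
if current_key is not None:
    print("%s\t%d" % (current_key, count))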
Now let's analyze charcount_reducer.py. Here I read line by line from stdin; since this is what I wrote to stdout from the mapper program, I can foretell that every line will be of the form "<character><tab-space>1". The only difference between what I wrote to stdout from the mapper and what I read from stdin in the reducer is that the input is sorted based on the keys by the time the reducer sees it. That is what makes the reducer's logic easy to construct: I just need to watch for a new key, and until one appears I keep incrementing the counter. Once a new key is found, the old key along with the counter value is printed out and the counter is reset. Inside the for loop, a key and its count are printed only when the next key is encountered, therefore I add one more print statement at the end of the program to print out the last key and its count (in case you were confused about the purpose of the last print statement outside the for loop).
Since it will be hard to debug programs inside Hadoop, I will first verify the functionality of my programs locally against a small sample input file. You can easily predict the output our program should give for it, so let's see whether we actually get that from the program.
Pay attention to the command used at each step. I use the 'cat' command to print the contents of the input file, which is then piped to the mapper program. As I said, that gives us the mapper's output; before feeding it into the reducer we have to, for now, sort it explicitly ourselves. The sorted stream is then ready to be fed to the reducer program.
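Put together, the dry run is a single shell pipeline, built up step by step (sample_input.txt is just the name I use for the sample file):

cat sample_input.txt
cat sample_input.txt | python charcount_mapper.py
cat sample_input.txt | python charcount_mapper.py | sort
cat sample_input.txt | python charcount_mapper.py | sort | python charcount_reducer.py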
The output is as expected, isn't it!
Now let's run the same program on the Hadoop setup to see it succeed. For that, start Hadoop by running the start-all.sh script (refer to part 1 of the Big Data series in case of any confusion). Then we need to copy our sample input file into the HDFS file system. Know the command to do it? Let me help you. Before that, I will create a directory for our use:
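Run from the hadoop directory (the fs sub-commands work much like their Unix namesakes):

bin/hadoop fs -mkdir /user/thinker/charcount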
Now we have a directory "charcount" in the path /user/thinker/ in the Hadoop file system. Let's copy our input file from my local file system to the Hadoop file system:
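Assuming the sample file is called sample_input.txt and sits in the current directory:

bin/hadoop fs -copyFromLocal sample_input.txt /user/thinker/charcount/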
Let's make sure of the file's existence and check its content:
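For example (again with my assumed file name):

bin/hadoop fs -ls /user/thinker/charcount
bin/hadoop fs -cat /user/thinker/charcount/sample_input.txt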
Now we are sure about our input; let's go ahead with the execution.
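With the file names I have been assuming, and run from the hadoop directory, the command looks something like this (the streaming-jar name carries your Hadoop version; the scripts should be executable, or be given as "python charcount_mapper.py" instead):

bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
    -file charcount_mapper.py -file charcount_reducer.py \
    -mapper charcount_mapper.py -reducer charcount_reducer.py \
    -input /user/thinker/charcount/sample_input.txt \
    -output /user/thinker/charcount/sample_output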

To ensure the availability of the hadoop-streaming jar, you can simply list it.
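From the hadoop directory, something like this will show it (the jar name includes the version number):

ls contrib/streaming/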
This is the jar which does the job of reading the contents of the input file, feeding them to the mapper program and ... (the rest you already know). The "-file" option names a file containing one of your programs; the programs need not be copied to HDFS. I used two "-file" options to mention my mapper and reducer files. "-mapper" mentions the mapper program's file name (only the file name, not the entire path) and "-reducer" mentions the reducer program's file name. "-input" is used to mention the HDFS absolute path of the input file, and "-output" mentions the output directory to which the output will be written.
After a successful execution of the above command, listing the contents of /user/thinker/charcount/ again shows a new directory named sample_output.
Note: the command will fail if you give an already existing file/folder name with the '-output' option.
Let's list sample_output:
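Again with the hadoop fs shell:

bin/hadoop fs -ls /user/thinker/charcount/sample_output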
Our output will be in 'part-*' files. Since the output is very small in this particular case, we have only one such file; the number of files increases as the output size grows. Let's print the generated output for final verification:
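With the single part file (typically named part-00000):

bin/hadoop fs -cat /user/thinker/charcount/sample_output/part-00000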
Even though the output differs in order from the local run, the counts are correct. You can run the program with some other input by changing the file given with the '-input' option. You can use any language to program your mapper and reducer. Points to be noted:

  1. For scripting languages, the interpreter/JVM must be installed on all the datanodes.
  2. For compiled languages like C or C++, you have to compile the program first, and the executable file is what you mention with '-file', '-mapper' and '-reducer'.

With that, I think I have covered almost everything you need to kickstart your mapreduce programming in your preferred language. See you in the next part..

Big Data is a Big Deal.. :)

Big Data Series; Part 1: Set up Hadoop in ubuntu


The reference given at the bottom of this page gives a detailed description of the Hadoop setup. I will take you through my experience of setting it up on Ubuntu.

You should have a Linux/Unix system with a JVM installed and password-less ssh enabled.

Download the latest release of Hadoop from the Apache Hadoop site.

I prefer the *.tar.gz archive to the other installable packages because once you set up Hadoop from installable packages, it is hard to find the configuration files when you need to edit them (from my experience; I removed that install and set it up again from *.tar.gz).

Assuming that your browser downloaded the Hadoop tar file to the Downloads folder.

I chose the /app folder to set up Hadoop. So move the tar file to /app, then unzip and un-tar it there.
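The steps look roughly like this (replace <version> with the release you downloaded; writing to /app may need sudo):

sudo mkdir -p /app
sudo mv ~/Downloads/hadoop-<version>.tar.gz /app/
cd /app
sudo tar -xzf hadoop-<version>.tar.gz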

You will need to edit the hadoop-env.sh file to set the JAVA_HOME environment variable.

If you try to start Hadoop without this modification, it will fail to start, complaining that JAVA_HOME is not set.

gedit is the text editor I am using; you can use your favourite (vi/vim/textedit/…).

Location of hadoop-env.sh: hadoop<version>/conf

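For example (adjust the path to wherever you extracted Hadoop):

gedit /app/hadoop-<version>/conf/hadoop-env.sh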

You will find an existing JAVA_HOME line (commented out) in hadoop-env.sh.

Either edit that existing line or add a new one, as I did; the exact line is shown a little further below, once we find the right path.

You can find your specific location by resolving the java symlink. In my case it resolves to /usr/lib/jvm/java-7-oracle/jre/bin/java; Hadoop expects us to specify the path only up to java-7-oracle, i.e. “/usr/lib/jvm/java-7-oracle”.
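One way to resolve it, and the line that then goes into hadoop-env.sh (readlink is just one option; use whatever you like):

readlink -f $(which java)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle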

This will be enough to kick-start your hadoop in stand-alone mode.

Since I plan to install Apache Pig for scripting, I will set up Hadoop in pseudo-distributed mode. For that I need to edit three files: core-site.xml, hdfs-site.xml and mapred-site.xml, which can be found in the “hadoop<version>/conf/” directory. The same information can be found in the reference as well.

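The minimal pseudo-distributed configuration, as given in the single-node setup guide referenced at the bottom, is:

conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>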

 

 

Now the recipe is ready. Before I can start Hadoop, there is one final thing to be done: formatting the name-node. Assuming that you are in the Hadoop main directory, run the command “bin/hadoop namenode -format”, and you will see logs confirming that the name-node has been formatted.

 

Done with the waiting part. Run the command “bin/start-all.sh” to start the NameNode, Secondary NameNode, DataNode, TaskTracker and JobTracker as background processes.

To ensure that all five services are running, use the jps command. If all five show up in its output, “ALL IS WELL..”

From my experience setting it up on different Linux and Unix variants, including Mac, I can say the same steps can be repeated on any *nix variant.

Big Data is a Big Deal.. 🙂

Reference:

http://hadoop.apache.org/docs/r1.1.1/single_node_setup.html