Frustratingly enough, at some point in our lives as big data gurus, we will have no clue what is going wrong under the hood of Hadoop (yep, it is like a black box), even after spending hours in the sea of log files on every node. A real-world example: you are manually building a Hadoop cluster and, let's say, the hadoop-yarn-resourcemanager on the master node will not start and the logs give no useful hints. When that happens, wouldn't it be nice to add some debug statements, tweak a few mysterious methods in the code, then build it, deploy it to the cluster and watch? Is that possible? Well, yes, it is! In this tutorial, I will show how to build Cloudera's Distribution for Apache Hadoop (CDH) manually and inspect what is going on when it runs on the cluster.
Assumptions I am making:
a) a Linux (Ubuntu) box is used to build the source
b) the box already has Maven 3.x installed
c) the Hadoop version is hadoop-2.6.0-cdh5.4.3
Step 1: Go to the Cloudera archive and find the source tar.gz. I am grabbing hadoop-2.6.0-cdh5.4.3-src.tar.gz.
Step 2: Download it and unpack it to a local directory (I am assuming you know how to use wget, gunzip and tar; if not… this tutorial might be too much for now and I would advise you to stop reading here), then build it from the top-level source directory with
mvn clean package -DskipTests=true
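Steps 1 and 2 might look roughly like the sketch below. The archive URL is an assumption on my part (it is where the CDH5 source tarballs used to be published and it may have moved), and fetch_and_build is just a hypothetical helper name, so adjust both to your setup.

```shell
# Assumed values: adjust to the version and archive location you actually use.
VERSION="hadoop-2.6.0-cdh5.4.3"
TARBALL="${VERSION}-src.tar.gz"
SRC_DIR="${VERSION}"   # directory the tarball unpacks to
ARCHIVE_URL="http://archive.cloudera.com/cdh5/cdh/5/${TARBALL}"   # assumption, may have moved

# Download the tarball, unpack it and build it (hypothetical helper).
# The build runs in a subshell so the caller's working directory is unchanged.
fetch_and_build() {
    wget "$ARCHIVE_URL" &&
    tar xzf "$TARBALL" &&
    ( cd "$SRC_DIR" && mvn clean package -DskipTests=true )
}
```

Calling fetch_and_build from an empty working directory runs the whole sequence and stops at the first step that fails.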
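At first, the build will fail with an error. That is because Protocol Buffers (the protoc compiler) is not installed on the box. To install it properly, download protobuf-2.5.0.tar.gz and follow these steps:
tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure && make
sudo make install && sudo ldconfig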
Now run “mvn clean package -DskipTests=true” again. The build should be successful.
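Before kicking off the long Maven build again, it can save time to confirm that protoc is really on the PATH and reports the expected version. The sketch below assumes the build wants exactly 2.5.0 (the version used in this tutorial); protoc_version_ok is a hypothetical helper name.

```shell
REQUIRED_PROTOC="2.5.0"   # version used in this tutorial

# Returns 0 when the given `protoc --version` output matches the required version.
protoc_version_ok() {
    [ "$1" = "libprotoc ${REQUIRED_PROTOC}" ]
}

if command -v protoc >/dev/null 2>&1; then
    if protoc_version_ok "$(protoc --version)"; then
        echo "protoc ${REQUIRED_PROTOC} found"
    else
        echo "unexpected protoc version: $(protoc --version)" >&2
    fi
else
    echo "protoc is not on the PATH" >&2
fi
```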
Since I am debugging the ResourceManager class in org.apache.hadoop.yarn.server.resourcemanager, I searched the built jars with the Linux find and jar commands and found that it is located in ./hadoop-2.6.0-cdh5.4.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-yarn-server-resourcemanager-2.6.0-cdh5.4.3.jar
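The find-and-jar search can be sketched like this. It is a minimal version that assumes the JDK's jar tool is on the PATH; class_to_entry and find_jar_for_class are hypothetical helper names, not something that ships with Hadoop.

```shell
# Convert a fully qualified class name into its entry path inside a jar.
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print every jar under directory $1 that contains class $2.
find_jar_for_class() {
    entry="$(class_to_entry "$2")"
    find "$1" -type f -name '*.jar' | while IFS= read -r jar; do
        # `jar tf` lists the entries of the archive; -F matches the path literally.
        if jar tf "$jar" 2>/dev/null | grep -q -F "$entry"; then
            printf '%s\n' "$jar"
        fi
    done
}
```

For example, find_jar_for_class ./hadoop-2.6.0-cdh5.4.3 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager prints the jars that contain the class. It may list more than one, for example a sources or test jar built next to the main one.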
I can now freely modify the source code, rebuild the project and deploy this jar file to the cluster. The class behind my problem is ResourceManager.java, so I added some debug statements to it (yeah… old-school ways, lol) and built the project again.
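After touching a single class, a full rebuild of the whole tree is not strictly needed. One possible shortcut, sketched below, uses Maven's -pl (project list) and -am (also make) options to rebuild only the ResourceManager module plus the modules it depends on. The module path is taken from the jar location above; rebuild_rm is a hypothetical helper name, and I have not timed how much this saves on this particular source tree.

```shell
# Module path, relative to the top-level source directory.
RM_MODULE="hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager"

# Rebuild the ResourceManager module and the modules it depends on.
# Run this from the top-level source directory.
rebuild_rm() {
    mvn package -DskipTests=true -pl "$RM_MODULE" -am
}
```

The rebuilt jar lands in the target directory of that module, the same place as after the full build.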
After building it again with Maven and deploying the jar to the cluster, I ran the YARN ResourceManager on the cluster again and was able to see my custom debug statements in the log file, yay!
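The deploy-and-check step could look something like the sketch below. swap_jar and show_debug_lines are hypothetical helpers, and where the jar and the log live on your nodes depends on how the cluster was installed, so the paths in the usage note are assumptions.

```shell
# Copy the jar $1 into directory $2, keeping a one-time backup of the jar it replaces.
swap_jar() {
    new_jar="$1"
    lib_dir="$2"
    target="${lib_dir}/$(basename "$new_jar")"
    if [ -f "$target" ] && [ ! -f "${target}.orig" ]; then
        cp "$target" "${target}.orig"   # keep the stock jar for rollback
    fi
    cp "$new_jar" "$target"
}

# Print the lines (with line numbers) of log file $1 that contain the marker $2.
show_debug_lines() {
    grep -n -F -- "$2" "$1"
}
```

On a package-based CDH node I would expect the library directory to be something like /usr/lib/hadoop-yarn and the log to sit under /var/log/hadoop-yarn, but check your own install. Run swap_jar as a user that can write to that directory, restart the ResourceManager, then call show_debug_lines with the log file and whatever marker text you put in your debug statements.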
I hope this tutorial shows you how to debug a huge Hadoop distribution. Of course, this technique applies to any open source Java project. Have fun debugging 🙂