How to debug and build your own Cloudera Hadoop CDH libraries

Frustratingly enough, at some point in our lives as big data gurus, we will have no clue what is going wrong under the hood of Hadoop (yep, it is like a black box), even after spending hours wading through the sea of log files on every node.  A real-world example: you are manually building a Hadoop cluster and, let’s say, the hadoop-yarn-resource-manager fails to start on the master node while the logs give no useful hints.  If that happens, wouldn’t it be nice to add some more debug statements, tweak a few mysterious methods in the code, then build it, deploy it to the cluster, and watch?  Is it possible to do that? Well, yes, it is!  In this tutorial, I will show how to build Cloudera’s Distribution for Apache Hadoop (CDH) manually and inspect what is going on when it runs on the cluster.

Assumptions I am making:
a) a Linux (Ubuntu) box is used to build the source
b) the box already has Maven 3.x installed
c) the Hadoop version is hadoop-2.6.0-cdh5.4.3

Step 1: Go to the Cloudera archive and find the source tar.gz. I am using hadoop-2.6.0-cdh5.4.3-src.tar.gz

Step 2: Download and extract it to a local directory (I am assuming you know how to use the wget, gunzip and tar commands; if not, this tutorial might be too much for you and I would advise you to stop reading for now), then build it with

mvn clean package -DskipTests=true

At first, the build will fail with an error. This happens because Protocol Buffers is not installed on the box.  To install it properly, follow these steps:

wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure
make
sudo make install
sudo ldconfig

Now run “mvn clean package -DskipTests=true” again.  The build should be successful.
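Before kicking off the long Maven build again, it can save time to confirm that the protoc binary is really on the PATH and reports the version installed above. Here is a minimal sketch in Python, assuming `protoc --version` prints a line such as `libprotoc 2.5.0`; the helper names `parse_protoc_version` and `installed_protoc_version` are my own and not part of any Hadoop or protobuf tooling.

```python
import re
import shutil
import subprocess

REQUIRED_PROTOC = "2.5.0"  # the version installed in the steps above


def parse_protoc_version(output):
    """Extract the version from text such as 'libprotoc 2.5.0'; None if absent."""
    match = re.search(r"libprotoc\s+(\d+(?:\.\d+)*)", output)
    return match.group(1) if match else None


def installed_protoc_version():
    """Return the version reported by `protoc --version`, or None if unavailable."""
    protoc = shutil.which("protoc")
    if protoc is None:
        return None  # protoc is not on the PATH at all
    result = subprocess.run([protoc, "--version"], capture_output=True, text=True)
    return parse_protoc_version(result.stdout + result.stderr)


if __name__ == "__main__":
    version = installed_protoc_version()
    if version == REQUIRED_PROTOC:
        print("protoc", version, "found; ready to run the Maven build")
    else:
        print("expected protoc", REQUIRED_PROTOC, "but found", version)
```

If the script reports None, protoc is either missing from the PATH or failed to start, so it is worth re-checking the install steps before blaming Maven.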

Since I am debugging the ResourceManager class in the org.apache.hadoop.yarn.server.resourcemanager package, I used the Linux find and jar commands to track down the jar that contains it. It is located at ./hadoop-2.6.0-cdh5.4.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-yarn-server-resourcemanager-2.6.0-cdh5.4.3.jar
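The same search can be scripted. This is a rough sketch rather than the exact commands I ran: it walks a build tree and lists every jar that contains a given class, relying on the fact that a jar is just a zip archive. The function name `find_jars_with_class` is hypothetical, and the root directory assumes the source was extracted into the current directory.

```python
import os
import zipfile


def find_jars_with_class(root_dir, class_name):
    """Return sorted paths of .jar files under root_dir that contain class_name."""
    # A fully qualified class name maps to a path inside the jar.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for filename in filenames:
            if not filename.endswith(".jar"):
                continue
            jar_path = os.path.join(dirpath, filename)
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if entry in jar.namelist():
                        matches.append(jar_path)
            except zipfile.BadZipFile:
                continue  # skip files that are not valid archives
    return sorted(matches)


if __name__ == "__main__":
    for path in find_jars_with_class(
        "./hadoop-2.6.0-cdh5.4.3",
        "org.apache.hadoop.yarn.server.resourcemanager.ResourceManager",
    ):
        print(path)
```

More than one jar may show up if the build also copies the artifact elsewhere in the tree, so pick the one under the module's target directory.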

I can now freely modify the source code, rebuild the project, and deploy this jar file to the cluster.   The class giving me trouble is ResourceManager.java.  I would like to add some debug statements to it (yeah, the old-school way, lol) and build the project again.

After I build it again with Maven and deploy it to the cluster, I start the YARN ResourceManager on the cluster again, and I am now able to see my custom debug statements in the log file. Yay!

I hope this tutorial shows you how to debug the huge Hadoop distribution.  Of course, this technique can be applied to any open source Java project.  Have fun debugging 🙂

Tutorial: Setting up a python debugging environment in Eclipse

For a savvy programmer, it is vital to have a debugger at hand during development. Today I spent two-plus hours configuring the Python debugger in the Eclipse IDE. (Yes, I felt the need to set it up, as I just started dealing with massive Python scripts at work.) After nailing it down, I would like to share how I set it up.

Prerequisites:

1) Eclipse IDE installed. (At the time of writing, I am using Eclipse Luna 4.4)
2) PyDev Eclipse plugin installed. (http://pydev.org/download.html)

Then, depending on what we want to debug, we can do one of the following:
–To debug a remote program
1) Inside Eclipse, start the remote debug server. If we don’t find it in the toolbar, we can enable it via Window > Customize perspective > Command groups availability > PyDev debug

2) In the external Python script, put these two lines at the beginning:

import sys; sys.path.append(r'/Users/ken/eclipse/plugins/org.python.pydev_3.0.0.1388187472/pysrc')  # assuming this is the PyDev installation path
import pydevd

3) In the external Python script, put this line wherever you want the program to pause in the debugger:
pydevd.settrace()

4) Inside Eclipse, go to the Debug perspective

5) There you go: you should be able to pause the execution at the point where you put the statement in step 3) above
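The snippets that go into the external script can also be wrapped in a small helper, so the script still runs on a machine where PyDev is not installed. This is only a sketch under a few assumptions: the helper name `attach_debugger` and its `importer` parameter are my own invention (the latter exists purely to make the function easy to test), the path is the same example PyDev installation path as above, and `localhost` with port 5678 is assumed to be where the Eclipse debug server listens.

```python
import importlib
import sys

# Assumed PyDev installation path; adjust it to your own Eclipse plugins folder.
PYSRC = r'/Users/ken/eclipse/plugins/org.python.pydev_3.0.0.1388187472/pysrc'


def attach_debugger(pysrc_path, host="localhost", port=5678,
                    importer=importlib.import_module):
    """Pause in the PyDev debugger if pydevd can be imported; return whether it was."""
    if pysrc_path not in sys.path:
        sys.path.append(pysrc_path)
    try:
        pydevd = importer("pydevd")
    except ImportError:
        return False  # PyDev is not available here; keep running normally
    # The remote debug server in Eclipse must already be started at this point.
    pydevd.settrace(host=host, port=port, suspend=True)
    return True
```

Calling `attach_debugger(PYSRC)` at the spot where you want to pause then behaves like the plain `pydevd.settrace()` line, and returns False instead of crashing when the import fails.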

–To debug a program inside Eclipse
This is much easier than debugging a remote program. It is pretty much like debugging a Java program in Eclipse.

1) Create a debug configuration: go to Run -> Debug Configurations -> Python Run and create a profile accordingly

2) Hit Debug