How to debug and build your own Cloudera Hadoop CDH libraries

Frustratingly enough, at some points in our life as a big data guru, we will have no fcuking clues what is still going wrong under the hook of the hadoop (yep, it is like a black box) after you have exhausted all your hours on the sea of log files in every node.  A real world example would be you are manually building up a hadoop cluster and, let say, you have a problem in starting the hadoop-yarn-resource-manager on the master node in which gives nothing useful hints in the logs.  If that happens, wouldn’t it be nice to put some more debug statements and twist around some mystery methods in the code then build it out and deploy to the cluster and watch?  Is it possible to do that? Well, Yes, it is!  In this tutorial, i will show how we can build the Cloudera’s Distribution (CDH) for Apache Hadoop manually and inspect what is going on when it is run on the cluster.

Assumptions I am making:
a) the linux ubuntu box to build the source
b) the box has maven 3.x installed already
c) the hadoop-2.6.0-cdh5.4.3 version

Step 1: Go to the Cloudera archive to grab the source tar.gz. I am now grabbing the hadoop-2.6.0-cdh5.4.3-src.tar.gz

Step 2: Download and unzip it to the local directory (i am assuming you know how to use the wget to download and gunzip and tar commands, if not…this tutorial might be too much to you i will advice you stop reading for now) and build it with

mvn clean package -DskipTests=true

At first, you will see this error:

It was because the protocol buffers was not installed on the box.  To install it properly, follow these steps:

tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
sudo make install
sudo ldconfig

and now run the “mvn clean package -DskipTests=true” again.  The build should be sucessful.

Since I am debugging the ResourceManager in the org.apache.hadoop.yarn.server.resourcemanager, after using the linux find jar command, i know it is located in the ./hadoop-2.6.0-cdh5.4.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-yarn-server-resourcemanager-2.6.0-cdh5.4.3.jar

i can now go freely modify the source code and re-build the project again and deploy this jar file to the cluster.   In this problem i am having is the  I would like to add some debug statements (yea…old school ways lol) and build the project again as follow:

After i build it again with maven and deploy to the cluster, I run the hadoop yarn manager again on the cluster now and i am able to see my customized debug statements in the log file, yah~!

I hope this tutorial does show you guys how to debug the huge hadoop distributions.  Of course, this technique can apply to every open source java projects.  Have fun in debugging 🙂

Fixed the ‘new line’ character inside double-quote causing the csv parsing failure

The nature of my work, as being a big data architect, is to deal with lot of  huge amount of consumer data.  I guess one of the very big challenges in this field (i.e. big data processing) is that it has become extremely difficult to deal with a situation that at midnight your data processing pipeline broke for no reasons and the hadoop/spark library’s console do not give any useful hints to you neither.  This certainly will kill our sweet night if we are talking about few ten gigabytes of input data to be processed.  If the input file size is small like few kilobytes we could probably just download it to our laptop and play around with it and ultimately we will be able to solve the problem.  However, with size of tens of gigabyte, we probably would still have not much idea what the hell was going on after staying up till next day morning and spending all of your sleep hours digging out these giant but otherwise impossible-to-deal-with input files.   Recently, I encountered one of these instances…Yea..oh god…Crazy right?  To keep the story short, I figured out i could not eye-ball these big files.  Instead, I managed to write a handy script that can detect any bad lines out of all other normal lines in any given input file and once the problematic line(s) are found i then had to think of how to solve the problem.  With some lucks mixed of skills, the root clause i found was that there are some lines inside a CSV file containing some new line characters inside a double quote element.  For example, consider the following simple csv data which has just two lines:

Iphone,”{ ItemName : Cheez-It
21 Ounce}”,

It is supposed to be treated as just one single line since inside a double quote everything should be treated as a character (i.e. a content) but not anything else which has special meaning.  However, the current implementation of the Hadoop Pig library’s  getNext() fails to recognize this and it sees two lines of data.  Experience is telling me that this is certainly a bug.  So i forked off the apache-pig repo and fixed it (well, i just added the outer while loop in the whole logic technically) and then submitted a pull request to the project.

Here is the pull request link:

It goes without saying that, as being a professional software engineer, every code changes require a good unit test.  For this, i also created a new unit-test, testQuotedQuotesWithNewLines().  The attached is the screen shot of showing all existing and this new unit-test being run successfully.


Glad that i could nail this problem and also contribute back to the open-source community.  So here the July 4 long weekend here I come yayyyyy! Now May I freely glad a drink 🙂 ?