Building a multi-node Spark cluster on your local machine using vagrant and puppet

I can hardly believe it has been 8 months already since my last post on building a Hadoop cluster.  Anyway, have you ever gone through the pain of setting up even a single standalone Spark instance on your local machine and dealing with lots of configuration issues?  If so, the multi-node Spark version is even more of a hassle!  No more, I can assure you: as the Spark big data framework has been gaining popularity, I have prepared a Vagrant-Puppet repo for Spark to demonstrate how to build a multi-node Spark cluster locally on your workstation (in this tutorial, I will build one Spark master node and two Spark worker nodes).  Needless to say, having a test cluster locally can speed up software development and testing cycles.

For details, please go to:


ServerSynchronizer – the software that synchronizes all the files from a local server to all remote server(s)

Have you ever felt tired of changing just one line of a source file and then deploying it to all remote servers manually – copying the source, ssh-ing into the remote host, opening the file in vi, pasting, and saving?  This operation usually involves at least 6-10 keystrokes, and in the long run it does not increase productivity at all – only fatigue and frustration.  With this software, whenever you edit source or config files on your local work machine, all those files get synchronized (i.e. updated) to the remote host(s) AUTOMATICALLY.  Yes, I know every human being likes the word ‘automatically’!!! 🙂

To heal this pain, I wrote the ServerSynchronizer to help.

For the curious: internally, this software uses the Java NIO library, which employs a push model (rather than polling, which is less efficient) to detect any file changes within the root directories on the local machine.

As of today, folder/file removal is not implemented yet.  I will add it when I have time.

How to debug and build your own Cloudera Hadoop CDH libraries

Frustratingly enough, at some point in our lives as big data gurus, we will have no clue what is going wrong under the hood of Hadoop (yep, it is like a black box) even after we have exhausted hours on the sea of log files on every node.  A real-world example: you are manually building up a Hadoop cluster and, let's say, you have a problem starting the hadoop-yarn-resource-manager on the master node, and the logs give no useful hints.  If that happens, wouldn't it be nice to add some debug statements, tweak some mysterious methods in the code, then build it, deploy it to the cluster, and watch?  Is that possible?  Well, yes, it is!  In this tutorial, I will show how to build Cloudera's Distribution (CDH) for Apache Hadoop manually and inspect what is going on when it runs on the cluster.

Assumptions I am making:
a) a Linux Ubuntu box to build the source on
b) Maven 3.x already installed on the box
c) the hadoop-2.6.0-cdh5.4.3 version

Step 1: Go to the Cloudera archive to grab the source tar.gz. I am now grabbing the hadoop-2.6.0-cdh5.4.3-src.tar.gz

Step 2: Download and unzip it to a local directory (I assume you know how to use the wget, gunzip and tar commands; if not, this tutorial might be too much for you and I would advise you to stop reading for now) and build it with

mvn clean package -DskipTests=true

At first, you will see this error:

That was because Protocol Buffers was not installed on the box.  To install it properly (grab protobuf-2.5.0.tar.gz first), follow these steps:

tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure
make
sudo make install
sudo ldconfig
and now run the “mvn clean package -DskipTests=true” again.  The build should be successful.

Since I am debugging the ResourceManager in org.apache.hadoop.yarn.server.resourcemanager, after hunting through the jars with the Linux find command, I know it is located in ./hadoop-2.6.0-cdh5.4.3/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/hadoop-yarn-server-resourcemanager-2.6.0-cdh5.4.3.jar
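Since jar files are zip archives and zip stores entry names as plain bytes, one quick way to locate the jar holding a given class (a sketch of the hunt described above; the class name is just the one from this example) is a plain find + grep:

```shell
# grep -l prints only the names of matching files, so we can search the
# jars' raw bytes for the class entry without dumping binary output.
find . -name '*.jar' -exec grep -l 'ResourceManager\.class' {} +
```

This avoids unzipping each jar just to list its contents.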

I can now freely modify the source code, rebuild the project, and deploy this jar file to the cluster.   For the problem I am having, I would like to add some debug statements (yea… the old school way, lol) and build the project again as follows:

After I build it again with Maven and deploy it to the cluster, I run the Hadoop YARN manager again and I am able to see my customized debug statements in the log file, yah~!

I hope this tutorial shows you how to debug the huge Hadoop distributions.  Of course, this technique can be applied to any open-source Java project.  Have fun debugging 🙂

A little taste of a small accomplishment

Today, my patch to Apache Pig got recognized and merged into the Pig trunk repo! Yeah!

The issue was that Pig failed to recognize and parse a particular VALID csv file format, causing tons of csv files from our data partners to fail processing.  I had to fix it to keep our big data pipeline running, otherwise we would lose big bucks.  For more technical details of the fix, please refer to:

Fixed the ‘new line’ character inside double-quote causing the csv parsing failure

The nature of my work, as a big data architect, is to deal with huge amounts of consumer data.  I guess one of the very big challenges in this field (i.e. big data processing) is the situation where, at midnight, your data processing pipeline breaks for no apparent reason and the Hadoop/Spark console does not give any useful hints either.  This will certainly kill our sweet night if we are talking about a few tens of gigabytes of input data to be processed.  If the input file were small, like a few kilobytes, we could probably just download it to our laptop, play around with it, and ultimately solve the problem.  With tens of gigabytes, however, we could stay up until the next morning, spending all of our sleep hours digging through these giant but otherwise impossible-to-deal-with input files, and still have little idea what the hell was going on.   Recently, I encountered one of these instances… yea… oh god… crazy, right?  To keep the story short, I figured out I could not eyeball these big files.  Instead, I managed to write a handy script that can detect any bad lines among all the normal lines in a given input file; once the problematic line(s) were found, I then had to think about how to solve the problem.  With some luck mixed with skill, the root cause I found was that some lines inside a CSV file contained newline characters inside a double-quoted element.  For example, consider the following simple csv data, which spans just two physical lines:

Iphone,"{ ItemName : Cheez-It
21 Ounce}",

It is supposed to be treated as just one single record, since inside double quotes everything should be treated as a character (i.e. content) and nothing should carry special meaning.  However, the current implementation of getNext() in the Hadoop Pig library fails to recognize this and sees two lines of data.  Experience told me this was certainly a bug.  So I forked the apache-pig repo and fixed it (technically, I just wrapped the whole logic in an outer while loop) and then submitted a pull request to the project.
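The bad-line detector I mentioned can be sketched roughly like this (a simplification of the actual script; the sample file name is a placeholder): a physical line that was split inside a quoted field is left with an odd number of double quotes, so we flag exactly those lines.

```shell
# The two physical lines from the example above, written to a sample file.
printf 'Iphone,"{ ItemName : Cheez-It\n21 Ounce}",\n' > sample.csv

# Print every physical line whose double-quote count is odd; such a line
# was broken inside a quoted field and is a candidate bad line.
awk '{ n = gsub(/"/, "\""); if (n % 2 == 1) print NR ": " $0 }' sample.csv
```

Run against the sample above, it flags both physical lines, which together form the one logical record.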

Here is the pull request link:

It goes without saying that, as a professional software engineer, every code change requires a good unit test.  For this, I also created a new unit test, testQuotedQuotesWithNewLines().  Attached is a screenshot showing all the existing tests and this new unit test running successfully.


Glad that I could nail this problem and also contribute back to the open-source community.  So here comes the July 4 long weekend, yayyyyy! Now may I freely grab a drink 🙂 ?

How to change default Python version in Ubuntu using VirtualEnv?

Lately, I have been doing a lot of lower-level system administration tasks. I have come to realize that it is very painful to set up more than one version of the same software on a single system, while you cannot just say no to these kinds of requests. A good example is Python. As we all know, there are at least three popular versions of Python actively in use as of today: 2.7.6, 2.7.9 and 3.4.3. It is kind of funny that some Python software is only guaranteed to work on one version but not the others. Therefore, one big headache is installing different versions of Python on the same box and making them all available to different developers/applications. In this tutorial, I will outline how to set this up using VirtualEnv on a brand-new AWS EC2 instance running Linux Ubuntu, just for demonstration.

I am assuming the instance already has Python 2.7.6 installed as the default.

Let's say we would like to install 2.7.9 as well; here are the steps to install the 2.7.9 version first:

1. Make sure all dependencies are available by executing the following apt-get command

sudo apt-get install -y gcc-multilib g++-multilib libffi-dev libffi6 libffi6-dbg python-crypto python-mox3 python-pil python-ply libssl-dev zlib1g-dev libbz2-dev libexpat1-dev libbluetooth-dev libgdbm-dev dpkg-dev quilt autotools-dev libreadline-dev libtinfo-dev libncursesw5-dev tk-dev blt-dev libsqlite3-dev libgpm2 mime-support netbase net-tools bzip2

2. Download the Python 2.7.9 sources, then compile and build them:

tar xfz Python-2.7.9.tgz
cd Python-2.7.9/
./configure --prefix=/usr/local/lib/python2.7.9
make
sudo make install


Now the 2.7.9 version has been installed on the system silently.  Keep in mind that the default python version is still 2.7.6, which we can verify by typing python -V:

python -V

Now we need some means to switch between these two versions. VirtualEnv can do that perfectly! In the following steps, I will discuss how to install and use VirtualEnv:

3) Make sure easy_install is installed on the system first

wget -O - | sudo python

4) Install the virtualenv using easy_install

sudo easy_install virtualenv

5) Create a virtual environment with python pointing at the newly installed path

virtualenv --python=/usr/local/lib/python2.7.9/bin/python myPython2.7.9

6) Load the newly created environment

source myPython2.7.9/bin/activate

7) Now verify that Python 2.7.9 does load in this virtual environment

python -V


8) Good, you are now on Python 2.7.9. When you are done and want to leave the virtual environment, type ‘deactivate’ to quit and return to the normal terminal


9) Verify that we have Python 2.7.6 back in the terminal

python -V


Congrats! You have now learned how to use VirtualEnv to easily switch between different versions of Python on a single machine.
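As a side note for readers on Python 3: the same activate/deactivate dance works with the built-in venv module, no easy_install needed. A quick sketch (this assumes a python3 binary on the PATH; --without-pip simply skips the pip bootstrap to keep it fast):

```shell
# Create and enter a throwaway Python 3 environment, then leave it.
python3 -m venv --without-pip myenv
. myenv/bin/activate
command -v python3    # while active, resolves inside myenv/bin
deactivate
command -v python3    # back to the system interpreter
```

The switching trick is the same: activate just puts the environment's bin directory at the front of your PATH, and deactivate restores it.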

Setting up a multi-node virtual cluster and running hadoop on top of it using vagrant and puppet

Hello, everyone! Happy Apr 4! Finally we got out of the terrible winter, right? lol.  Time for some slightly more advanced geeky stuff: this tutorial is a continuation of the previous post.  The goal of this tutorial is to show you how to set up the Hadoop CDH5 Cloudera distribution running on a few virtual boxes on your machine (in this tutorial, I created 1 master node and 2 slave nodes).  I have written a simple Vagrant/Puppet script and uploaded it to my GitHub account (i.e. my vagrant-hadoop-cluster repo), so you can download the code; please follow the instructions carefully to install it and get it running.  Assuming you have followed my instructions, you should be able to see the cluster running like this:


And if you do see something like this, congrats! You have just successfully built a multi-node Hadoop cluster on top of your machine.

Note that, unlike other tech tutorials, I really hate explaining every piece of a language's syntax, since nowadays it is so easy to find more complete references online; there is no point in explaining it all again (I am assuming all readers here are technically able enough to figure out the syntax themselves).   My style, therefore, is to provide complete, out-of-the-box working code so you can figure out the syntax and style of Puppet yourselves.  If you are curious, here is the Puppet reference that I saved to my bookmarks for whenever I get stuck on syntax issues.

Setting up a virtual machine locally using vagrant and puppet – Day 1

As a sophisticated software developer, have you ever needed a blank new environment to test some software settings/installations/configurations before rolling them out to production?  A real-world example: let's say a MacBook is your main development machine, but you need to work on your big data pipeline and test a specific Hadoop Cloudera version, say CDH5, for compatibility issues on a Linux Ubuntu operating system before deploying your branch to the production environment.  It would be too much to ask your boss for a new Linux machine just for testing that.  With the help of Vagrant and Puppet, we can download any blank pre-built operating system (in this case, Linux Ubuntu) and simulate it on your local machine (for the sake of simplicity, I will call it a workstation from now on)!   In this tutorial, I am going to show you how to do this easily:

To start with,

  1. download VirtualBox first at: and install it.
  2. download a version of Vagrant that suits your machine at here: and install it.
  3. To save time, I have created a simple vagrant script and puppet script to get you started quickly: create a directory ~/vagrant-dev-test in your home directory and grab the two files, Vagrantfile and manifests/site.pp, from  The directory structure on your local machine should now look like this:
  4. Inside the directory ~/vagrant-dev-test, type this command to bring up the virtual box:

    vagrant up

             It should look something like this:


    Sit back and relax for 15-20 minutes as the new box is built and the dependencies are downloaded.

  5. Now the box should be built.  To ssh into the new box, type:

    vagrant ssh kenbox1


Congrats! You have now built a new Linux Ubuntu box, with help from a Puppet script, on top of your Mac workstation! In the next post, coming next week, I will explain more of the ins and outs of Vagrant and Puppet and some tricks for them. Stay tuned!

Typing less and less by reducing the amount of directory navigations

For all the lazy Unix/Mac coders: if you are like me, a super lazy Unix shell user, your first enemy is usually wrist and hand fatigue from having to type lots of keystrokes every day.  To stay away from this natural enemy and save us some precious time navigating back to parent directories, I wrote a small script that lets the bash shell quickly jump back to a specific parent directory, or N levels up, instead of typing cd ../../.. repeatedly.

On a side note, why use bash-back and not pushd/popd all the time?
Here is the reason: pushd and popd are definitely handy in general, regardless of where you are traversing in the file system. However, if you are traversing back to a parent directory, bash-back will outperform pushd/popd. For example, say you are in the aa/bb/cc/dd/ee/ff folder and you want to go back to the aa/bb/cc folder: it only requires one operation (i.e. ‘b cc’, or even simpler ‘b c’, as the script does a regular-expression search rather than an exact match on the parent directory names). With pushd/popd, two operations are involved (i.e. pushd in the aa/bb/cc folder earlier, then popd).
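The core idea can be sketched as a tiny shell function (a simplification of the real script: this version does a plain substring match on parent directory names, where the actual bash-back does a regular-expression search):

```shell
# b: jump to the nearest parent directory whose name contains the argument.
b() {
  local dir=$PWD
  while [ "$dir" != "/" ]; do
    dir=${dir%/*}                       # drop the last path component
    [ -n "$dir" ] || dir=/
    case ${dir##*/} in
      *"$1"*) cd "$dir" && return 0 ;;  # first (nearest) match wins
    esac
  done
  echo "b: no parent matching '$1'" >&2
  return 1
}
```

From aa/bb/cc/dd/ee/ff, `b cc` (or just `b c`) lands you in aa/bb/cc in one step.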

Anyway, for details, installation and usage, please refer to

Enjoy! Yay, less and less typing for life!

Tutorial: How to bulk import Wikipedia data into the Neo4j graph database programmatically

These days I have been getting my hands dirty with Neo4j, the world's leading graph database. In order not to screw up production data, I thought it would be nice to import a huge sample dataset into Neo4j and play around with it before mastering how Neo4j works. Therefore, I wrote a simple tool, Neo4jDataImport, which first downloads Wikipedia sample data (whether to choose a small or big file is up to you; the average size of the wiki dataset is around 10 GB uncompressed), digests it, and imports it into the Neo4j database. For details on how to build and run the program, please refer to the README file inside.

After running the program to import the Wikipedia data into the Neo4j database, we have lots of data to play with. Personally, I really love the Neo4j browser, which can be used to query and visualise the imported graph. We have to know the basic syntax of the Cypher query language in order to communicate with the heart of Neo4j. For example, I ran this very simple Cypher query to get the node Konitineniti from the sample Wikipedia dataset I had just imported:

MATCH (p0:Page {title:'Konitineniti'}) -[Link]- (p:Page) RETURN p0, p


One of the very cool features is that we can visually see how this node is linked to others


Of course, I have only scratched the surface of all the features Neo4j provides and of its Cypher query language. To learn more about the query syntax, we can refer to

The Neo4jDataImport tool can be downloaded and built from: