Building a multi-node Spark cluster on your local machine using vagrant and puppet

I cannot believe it has already been 8 months since my last post on building a Hadoop cluster.  Anyway, have you ever gone through the pain of setting up a single standalone Spark instance on your local machine and dealing with a lot of configuration issues?  If yes, the multi-node Spark version is going to be even more of a hassle!  No more, I can assure you. As the Spark big data framework has been gaining popularity, I am going to prepare a Vagrant-Puppet repo for Spark in order to demonstrate how to build a multi-node Spark cluster locally on your workstation (in this tutorial, I will build one Spark master node and two Spark worker nodes). Needless to say, having a test cluster locally can speed up software development and testing cycles.

For details, please go to:


Bit taste of a small accomplishment

Today, my patch for Apache Pig was accepted and merged into the Pig trunk repo! Yeah!

The bug was that Pig failed to recognize and parse a particular VALID CSV file format. It caused tons of CSV files from our data partners to fail processing. I had to fix it in order to keep our big data pipeline running; otherwise we would lose big bucks.  For more technical details of the fix, please refer to:

How to change default Python version in Ubuntu using VirtualEnv?

Lately, I have been doing a lot of lower-level system administration tasks. I have come to realize that it is very painful to set up more than one version of the same software on a single system, yet you cannot just say no to these kinds of requests. One good example is Python. As we all know, at least three popular versions of Python are actively being used as of today: 2.7.6, 2.7.9 and 3.4.3. It is kind of funny that some Python software is only guaranteed to work with one version but not the others. Therefore, one of the big headaches is installing different versions of Python on the same box and making them all available to different developers/applications. In this tutorial, I will outline how to set this up using VirtualEnv on a brand-new AWS EC2 instance running Ubuntu Linux, just for demonstration.

Assume the instance already has Python 2.7.6 installed as the default.

Let's say we would like to install 2.7.9 as well. Here are the steps to install version 2.7.9 first:

1. Make sure all dependencies are available by executing the following apt-get command

sudo apt-get install -y gcc-multilib g++-multilib libffi-dev libffi6 libffi6-dbg python-crypto python-mox3 python-pil python-ply libssl-dev zlib1g-dev libbz2-dev libexpat1-dev libbluetooth-dev libgdbm-dev dpkg-dev quilt autotools-dev libreadline-dev libtinfo-dev libncursesw5-dev tk-dev blt-dev libsqlite3-dev libgpm2 mime-support netbase net-tools bzip2

2. Download the Python 2.7.9 sources, then compile and build them

tar xfz Python-2.7.9.tgz
cd Python-2.7.9/
./configure --prefix=/usr/local/lib/python2.7.9
sudo make
sudo make install


Now version 2.7.9 has been installed on the system quietly.  Keep in mind that the default Python version is still 2.7.6; we can verify this by typing python -V:

python -V

Now, we need some means to switch between these two versions. VirtualEnv does that perfectly! In the subsequent steps, I will discuss how to install VirtualEnv and use it:

3. Make sure easy_install is installed in the system first

wget -O - | sudo python

4. Install virtualenv using easy_install

sudo easy_install virtualenv

5. Create a virtual environment with its Python set to the newly installed path

virtualenv --python=/usr/local/lib/python2.7.9/bin/python myPython2.7.9

6. Load the newly created environment

source myPython2.7.9/bin/activate

7. Now verify that Python 2.7.9 does load up in this virtual environment

python -V

8. Good. You are now under Python 2.7.9. Let's say you are done and want to quit the virtual environment; you can type 'deactivate' to quit and go back to the normal terminal

deactivate

9. Verify that we have Python 2.7.6 back in the terminal

python -V

Congrats! You have now learned how to use VirtualEnv to switch easily between different versions of Python on a single machine.
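If you want to script the python -V checks instead of eyeballing them, here is a minimal sketch. The function name python_version_is is made up for illustration, and it assumes awk is available. One detail worth knowing: Python 2 prints its version to stderr while Python 3.4+ prints it to stdout, so the sketch captures both streams.

```shell
# Hypothetical helper (not part of virtualenv): succeeds when the given
# interpreter reports exactly the expected version string.
python_version_is() {
    expected="$1"
    interpreter="${2:-python}"   # defaults to whichever python is first on PATH
    # "python -V" prints "Python X.Y.Z"; Python 2 writes it to stderr,
    # so merge stderr into stdout before extracting the second field.
    actual="$("$interpreter" -V 2>&1 | awk '{print $2}')"
    [ "$actual" = "$expected" ]
}
```

For example, after activating the environment you could run: python_version_is 2.7.9 && echo "virtualenv is active". You can also pass the interpreter path directly, e.g. python_version_is 2.7.9 /usr/local/lib/python2.7.9/bin/python.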

Setting up a multiple virtual nodes cluster and running hadoop on top of it using vagrant and puppet

Hello, everyone! Happy Apr 4! Finally we got out of the terrible winter, right? lol.  Time for some slightly more advanced geeky stuff: this tutorial is a continuation of the previous post.  The goal of this tutorial is to show you how to set up Hadoop (the Cloudera CDH5 version) running on a few virtual boxes on your machine (in this tutorial, I created 1 master node and 2 slave nodes).  I have written a simple Vagrant/Puppet script and uploaded it to my GitHub account (i.e. my vagrant-hadoop-cluster repo) so you guys can download the code; please follow the instructions carefully to install it and get it running.  Assuming you have followed my instructions, you should be able to get the cluster running like this:


And if you do see something like this, congrats! You have just successfully built a multi-node Hadoop cluster on your machine.

Notice that, unlike other tech tutorials, I really hate to explain every piece of a language's syntax in my tutorials. Nowadays it is so easy to find more complete references online that there is no point in explaining it all again here (assuming all of the readers here are technical enough to figure out the syntax themselves).   Therefore, my style is to provide complete, working out-of-the-box code so you guys can figure out the syntax and style of Puppet yourselves.  If you are more curious, here is the Puppet reference that I saved to my bookmarks for whenever I get stuck on syntax issues.

Tutorial – How to debug the Neo4j graph database locally using Eclipse

This past weekend I spent most of my time getting my dev environment set up and my knowledge polished up. One of the things I did was build the Neo4j source code on my machine (Neo4j is by far the most popular and leading graph database) and study how it works internally. This is mostly because I would like to prepare myself for the projects at work that deal extensively with, and analyze, billions of data nodes currently in our production environments. And of course, if I have time, I would like to contribute some bug fixes and enhancements to its repository. Well, anyway, as always, the best way to get a feel for how a thing works is to watch a piece of data coming in and going out. Yes, the debugger again! In this article I will give a tutorial on how to set this up for the Neo4j server community version on my local machine.

Note that in order to get the most out of this tutorial, it is assumed that you already have some basic idea of using Neo4j and its WebAdmin tools. I strongly encourage you to spend some time taking a look at their developer site and learn it now:

Step 1) Download the Neo4j Community Edition (2.1.5 as of today) and follow the instructions to extract it.

Step 2) Assuming it is extracted to /usr/local/neo4j-community-2.1.5, open the conf/neo4j-wrapper.conf file and add the following line: -Xnoagent -Xrunjdwp:transport=dt_socket,address=5005,server=y,suspend=y
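If you prefer to script this edit, here is a minimal sketch. It assumes the wrapper config takes JVM flags as wrapper.java.additional= entries; that prefix is an assumption here, so check the existing lines in your own conf file before relying on it. The function name enable_remote_debug is also made up for illustration. Note that suspend=y makes the JVM wait at startup until a debugger attaches.

```shell
# Hypothetical helper: append the JDWP debug flags to a wrapper config file.
# ASSUMPTION: JVM flags are written as "wrapper.java.additional=<flag>" lines.
enable_remote_debug() {
    conf_file="$1"
    port="${2:-5005}"   # 5005 is the port used in this tutorial
    line="wrapper.java.additional=-Xnoagent -Xrunjdwp:transport=dt_socket,address=${port},server=y,suspend=y"
    # Append only if the exact line is not already present,
    # so running this twice does not duplicate the setting.
    grep -qxF -- "$line" "$conf_file" 2>/dev/null || printf '%s\n' "$line" >> "$conf_file"
}
```

With the install path from this step, the call would look like: enable_remote_debug /usr/local/neo4j-community-2.1.5/conf/neo4j-wrapper.conf 5005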

Step 3) Download the Neo4j source code from its GitHub account at

git clone

Step 4) (Recommended) Since the Neo4j version I downloaded above was 2.1.5, I need to switch the working branch to 2.1-maint as well, so that the source code is consistent with what I downloaded in Step 1. This is done by issuing the following command:

git checkout -b 2.1-maint origin/2.1-maint

Note that the 2.1-maint branch comes from their GitHub site:

Step 5) Now import all the Neo4j source code (these are all Maven projects) into Eclipse

Step 6) In Eclipse, create a remote debug configuration and have it connect to port 5005, which we specified in Step 2


Step 7) Run the configuration that you just created in Step 6)

Step 8) Now we can set a breakpoint anywhere. Let's say we set it at the getBean() method and then go to http://localhost:7474/webadmin


We should see execution paused at the breakpoint as below: