I can't believe it has already been eight months since my last post on building a Hadoop cluster. Anyway, have you ever gone through the pain of setting up a single standalone Spark instance on your local machine and dealing with all of its configuration issues? If so, the multi-node version of Spark is even more of a hassle! No more, I can assure you: since the Spark big data framework has been gaining popularity, I am going to prepare a Vagrant-Puppet repo for Spark to demonstrate how to build a multi-node Spark cluster locally on your workstation (in this tutorial, I will build one Spark master node and two Spark worker nodes). Needless to say, having a test cluster locally speeds up software development and testing cycles.
It failed to recognize and parse a particular valid CSV file format, which caused tons of CSV files from our data partners to fail processing. I had to fix it to keep our big data pipeline running; otherwise we would lose big bucks. For more technical details of the fix, please refer to:
Lately, I have been doing a lot of lower-level system administration tasks. I have come to realize that it is very painful to set up more than one version of the same software on a single system, yet you cannot just say no to these kinds of requests. One good example is Python. As we all know, there are at least three popular versions of Python actively in use as of today: 2.7.6, 2.7.9, and 3.4.3. It is kind of funny that some Python software is only guaranteed to work on one version and not the others. Therefore, one of the big headaches is installing the different versions of Python on the same box and making them all available to different developers/applications. In this tutorial, I will outline how to set this up with VirtualEnv on a brand new AWS EC2 instance running Ubuntu Linux, just for demonstration.
Assume the instance already has Python 2.7.6 installed as the default. Let's say we would also like to install 2.7.9; here are the steps to install the 2.7.9 version first:
1. Make sure all build dependencies are available by executing the following apt-get command:
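Since the post doesn't reproduce the exact commands, here is a minimal sketch of the whole sequence, assuming Ubuntu-era package names; the heavyweight commands are left commented so you can review them before running with sudo, and the install prefix is my own illustrative choice:

```shell
# Sketch: installing Python 2.7.9 alongside the system 2.7.6 on Ubuntu.
# Package names below are typical for Ubuntu 14.04; adjust as needed.

# 1. Build dependencies (run with sudo):
#    apt-get update
#    apt-get install -y build-essential libssl-dev zlib1g-dev \
#        libbz2-dev libreadline-dev libsqlite3-dev wget

# 2. Build 2.7.9 from source into its own prefix so 2.7.6 stays the default:
PY_VERSION=2.7.9
PY_PREFIX=/usr/local/python-${PY_VERSION}
#    wget https://www.python.org/ftp/python/${PY_VERSION}/Python-${PY_VERSION}.tgz
#    tar xzf Python-${PY_VERSION}.tgz && cd Python-${PY_VERSION}
#    ./configure --prefix=${PY_PREFIX}
#    make && sudo make altinstall   # altinstall avoids clobbering /usr/bin/python

# 3. Give each developer/application its own environment pinned to that interpreter:
#    virtualenv -p ${PY_PREFIX}/bin/python2.7 ~/envs/myapp-279
#    source ~/envs/myapp-279/bin/activate

echo "target prefix: ${PY_PREFIX}"
```

The key trick is `make altinstall` rather than `make install`: it installs a versioned binary without overwriting the system default, and `virtualenv -p` then lets each project pick the interpreter it needs.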
Hello, everyone! Happy April 4! We finally got out of that terrible winter, right? lol. Time for some slightly more advanced geeky stuff: this tutorial is a continuation of the previous post. The goal of this tutorial is to show you how to set up a Cloudera CDH5 Hadoop cluster running on a few virtual boxes on your machine (in this tutorial, I created 1 master node and 2 slave nodes). I have written a simple Vagrant/Puppet script and uploaded it to my GitHub account (i.e. my vagrant-hadoop-cluster repo), so you can download the code; please follow the instructions carefully to install it and get it running. Assuming you have followed the instructions, you should be able to get the cluster running like this:
If you do see something like this, congrats! You have just successfully built a multi-node Hadoop cluster on top of your machine.
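For reference, the bring-up and sanity check can be sketched as the shell session below; the node name and daemon list are my assumptions about a typical one-master/two-slave CDH5 layout, so adjust them to match the repo's Vagrantfile:

```shell
# Sketch: bringing the cluster up and sanity-checking it.
# Assumes a Vagrantfile defining one master and two slaves;
# the node name "master" and the daemon names are illustrative.

# vagrant up              # provision all three VMs via Puppet
# vagrant status          # every node should report "running"
# vagrant ssh master      # log into the master node

# On the master, the Hadoop daemons should show up in jps, e.g.:
EXPECTED_MASTER_DAEMONS="NameNode ResourceManager"
# jps | grep -E "NameNode|ResourceManager"

echo "expecting on master: ${EXPECTED_MASTER_DAEMONS}"
```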
Note that, unlike other tech tutorials, I really hate explaining every bit of a language's syntax, since nowadays it is so easy to find more complete references online that there is no point repeating them here (I assume all readers are technical enough to figure out the syntax themselves). Therefore, my style is to provide complete, out-of-the-box working code and let you figure out the syntax and style of Puppet yourself. If you are curious, here is the Puppet reference that I keep in my bookmarks for whenever I get stuck on a syntax issue.
This past weekend I spent most of my time getting my dev environment set up and my knowledge polished up. One of the tasks was building the Neo4j source code on my machine (it is by far the most popular and leading graph database) and studying how it works internally, mostly because I would like to prepare myself for projects at work that extensively deal with and analyze billions of data nodes in production; of course, if I have time, I would also like to contribute some bug fixes and enhancements to its repository. Anyway, as always, the best way to get a feel for how something works is to watch a piece of data flow in and out -> yes, the debugger again! In this article I will give a tutorial on how to set this up for the Neo4j server community edition on my local machine.
(Note that to get the most out of this tutorial, it is assumed you already have some basic familiarity with Neo4j and its WebAdmin tools. I strongly encourage you to spend some time on their developer site and learn it now: http://neo4j.com/developer/guide-neo4j-browser/)
Step 1) Download the Neo4j Community Edition (2.1.5 as of today) at http://neo4j.com/download/ and follow the instructions to extract it.
Step 2) Assuming it is extracted to /usr/local/neo4j-community-2.1.5, open the conf/neo4j-wrapper.conf file and add the following line:
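The post doesn't reproduce the line itself, but for remote debugging a Neo4j 2.x server the JDWP entry in conf/neo4j-wrapper.conf typically looks like the following (my assumption of the intended line; the address matches the port 5005 used in the later Eclipse steps):

```
wrapper.java.additional=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
```

With `suspend=n` the server starts normally and simply accepts a debugger connection whenever one attaches; use `suspend=y` instead if you need the JVM to wait for the debugger so you can step through startup code.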
Step 4) (Recommended) Since the Neo4j version I downloaded above was 2.1.5, to keep the source code consistent with the download from Step 1, I need to switch the working branch to 2.1-maint as well by issuing the following command:
git checkout -b 2.1-maint origin/2.1-maint
Note that the 2.1-maint branch name comes from their GitHub site:
Step 5) Now import all of the Neo4j source code (it is actually a set of Maven projects) into Eclipse.
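Before the Eclipse import, it usually helps to build the checkout once from the command line so all Maven dependencies are already resolved; a sketch (the goals below are standard Maven, nothing Neo4j-specific):

```shell
# Run inside the neo4j checkout from Step 4.
# Build everything, skipping tests to save time (the full suite is slow):
# mvn clean install -DskipTests
MVN_GOALS="clean install -DskipTests"

# Then in Eclipse: File > Import > Maven > Existing Maven Projects,
# pointing the root directory at the checkout.
echo "mvn ${MVN_GOALS}"
```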
Step 6) In Eclipse, create a remote debug configuration and have it connect to port 5005, which we specified in Step 2.
Step 7) Run the debug configuration you just created in Step 6).
Step 8) Now we can set breakpoints anywhere. Let's say we set one in JmxService.java's getBean() method; if we then go to http://localhost:7474/webadmin, the debugger should pause at that breakpoint.