Fixing the newline character inside double quotes that caused a CSV parsing failure

As a big data architect, I deal with huge amounts of consumer data.  I guess one of the biggest challenges in this field (i.e. big data processing) is the situation where your data processing pipeline breaks at midnight for no apparent reason and the Hadoop/Spark console output gives you no useful hints either.  This will certainly kill a sweet night when there are tens of gigabytes of input data to process.  If the input file is small, say a few kilobytes, we can just download it to a laptop, play around with it, and eventually solve the problem.  With tens of gigabytes, however, we can stay up until the next morning digging through these giant input files and still have little idea what the hell was going on.

Recently, I ran into one of these cases. Yeah, oh god, crazy, right?  To keep the story short, I figured out that I could not eyeball files this big.  Instead, I wrote a handy script that picks the bad lines out from all the normal lines in any given input file; once the problematic lines were found, I had to work out how to fix the problem.  With some luck mixed with skill, I found the root cause: some lines in a CSV file contained newline characters inside a double-quoted element.  For example, consider the following simple CSV data, which spans just two lines:

Iphone,"{ ItemName : Cheez-It
21 Ounce}",

This is supposed to be treated as a single record, since everything inside double quotes should be treated as literal content and not as anything with a special meaning.  However, the current implementation of getNext() in the Hadoop Pig library fails to recognize this and sees two lines of data.  Experience tells me that this is certainly a bug.  So I forked the apache/pig repo and fixed it (well, technically I just added an outer while loop around the whole logic) and then submitted a pull request to the project.
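The same idea can be sketched in a few lines of Python.  This is only an illustration, not the actual Pig patch (which is Java) and not the detection script mentioned earlier; the function name read_logical_records is made up for this example.  It assumes that quotes inside a field are escaped by doubling them (""), so an odd number of double quotes on a physical line means a quoted field opens or closes on that line.

```python
def read_logical_records(lines):
    """Yield (start_line, physical_line_count, text) for each logical CSV record.

    Physical lines are joined for as long as a double quote is still open.
    Assumption: embedded quotes are escaped as "", so they come in pairs
    and do not change the open/closed state.
    """
    buffer = []
    start = 0
    in_quotes = False
    for number, line in enumerate(lines, start=1):
        if not buffer:
            start = number  # first physical line of this record
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes  # a quoted field opened or closed here
        buffer.append(line)
        if not in_quotes:
            yield start, len(buffer), "".join(buffer)
            buffer = []
    if buffer:
        # Unterminated quote at end of input: report it rather than drop it.
        yield start, len(buffer), "".join(buffer)


sample = [
    'Iphone,"{ ItemName : Cheez-It\n',
    '21 Ounce}",\n',
]
for start, count, record in read_logical_records(sample):
    if count > 1:
        print(f"record starting at line {start} spans {count} physical lines")
```

Because it reads one line at a time, a loop like this can be pointed at an open file handle and will flag multi-line records without loading tens of gigabytes into memory.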

Here is the pull request link: https://github.com/apache/pig/pull/20

It goes without saying that, for a professional software engineer, every code change requires a good unit test.  For this, I also created a new unit test, testQuotedQuotesWithNewLines().  The screenshot below shows all the existing tests and this new one running successfully.

[Screenshot: unit-test]
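For readers who do not have the Pig source at hand, here is a rough Python analogue of what such a test asserts.  It is not testQuotedQuotesWithNewLines() itself; it uses Python's standard csv module as a reference parser, and the names parse_records and QuotedNewlineTest are invented for this sketch.

```python
import csv
import io
import unittest


def parse_records(text):
    """Parse CSV text with the standard library and return a list of rows."""
    # newline="" leaves line endings untouched so the csv module can handle
    # newlines that appear inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))


class QuotedNewlineTest(unittest.TestCase):
    def test_quoted_field_with_newline_is_one_record(self):
        text = 'Iphone,"{ ItemName : Cheez-It\n21 Ounce}",\n'
        records = parse_records(text)
        # Two physical lines, but only one logical record.
        self.assertEqual(len(records), 1)
        # The newline survives as part of the quoted field's content.
        self.assertEqual(records[0][1], "{ ItemName : Cheez-It\n21 Ounce}")
```

Saved to a file, this can be run with python -m unittest.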

Glad that I could nail this problem and also contribute back to the open-source community.  So, July 4 long weekend, here I come, yayyyyy! Now may I freely grab a drink 🙂 ?

Setting up a virtual machine locally using Vagrant and Puppet – Day 1

As a sophisticated software developer, have you ever needed a fresh, blank environment to test some software settings, installations, or configurations before rolling them out to production?  A real-world example: say a MacBook is your main development machine, but you need to work on your big data pipeline and test a specific Cloudera Hadoop version, say CDH5, for compatibility issues on Ubuntu Linux before deploying your branch to the production environment.  It would be too much to ask your boss for a new Linux machine just to test that.  With the help of Vagrant and Puppet, we can download a blank pre-built operating system image (in this case, Ubuntu Linux) and run it as a virtual machine on your local machine (for the sake of simplicity, I will call it a workstation from now on)!   In this tutorial, I am going to show you how to do this easily:

To start with,

  1. Download VirtualBox from https://www.virtualbox.org/wiki/Downloads and install it.
  2. Download a version of Vagrant that suits your machine from https://www.vagrantup.com/downloads.html and install it.
  3. To save time, I have created a simple Vagrant script and Puppet script to get you started quickly: create a directory ~/vagrant-dev-test in your home directory and grab the two files, Vagrantfile and manifests/site.pp, from https://github.com/wwken/Misc_programs/tree/master/Vagrant-Puppet/vagrant-dev-test.  The directory structure on your local machine should now look like this:
    [Screenshot: directory structure]
  4. Inside the directory ~/vagrant-dev-test, type this command to bring up the virtual box:

    vagrant up

    It should look something like this in your terminal:

    [Screenshot: screen-vagrant-up]

    Sit back and relax for 15-20 minutes while the new box is built and its dependencies are downloaded.

  5. Now the box should be built.  To ssh into the new box, type:

    vagrant ssh kenbox1

    [Screenshot: screen-inside-virtual-box]
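If vagrant up fails right away, a quick sanity check of the prerequisites from steps 1 to 3 can save some head-scratching.  The following is a small Python sketch, not part of the linked repository; the helper names are made up, and it assumes that the VirtualBox and Vagrant installers put VBoxManage and vagrant on your PATH, which may not hold on every system.

```python
import shutil
from pathlib import Path

# The two files step 3 asks you to download, relative to the project directory.
REQUIRED_FILES = ("Vagrantfile", "manifests/site.pp")
# Command-line tools expected after steps 1 and 2 (assumed to be on PATH).
REQUIRED_TOOLS = ("vagrant", "VBoxManage")


def missing_files(project_dir):
    """Return the required files that are not present under project_dir."""
    root = Path(project_dir).expanduser()
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def missing_tools():
    """Return the required command-line tools that cannot be found on PATH."""
    return [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]


if __name__ == "__main__":
    for name in missing_files("~/vagrant-dev-test"):
        print(f"missing file: {name}")
    for tool in missing_tools():
        print(f"not on PATH: {tool}")
```

If the script prints nothing, both files and both tools were found, and the problem is more likely inside the Vagrantfile or the Puppet manifest.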

Congrats! You have just built a new Ubuntu Linux box, with help from a Puppet script, on top of your Mac workstation! In the next part, coming next week, I will explain more of the ins and outs of Vagrant and Puppet and share some tricks for using them. Stay tuned!