Fixing the newline character inside double quotes that caused a CSV parsing failure

As a big data architect, my work is to deal with huge amounts of consumer data. I guess one of the biggest challenges in this field (i.e. big data processing) is the situation where your data processing pipeline breaks at midnight for no apparent reason, and the Hadoop/Spark console gives you no useful hints either. That will certainly kill a sweet night when we are talking about a few tens of gigabytes of input data. If the input file is small, say a few kilobytes, we could just download it to a laptop, play around with it, and eventually solve the problem. With tens of gigabytes, however, we would probably still have little idea what the hell was going on after staying up until the next morning and spending all our sleep hours digging through these giant, otherwise impossible-to-deal-with input files.

Recently, I ran into one of these cases… Yeah… oh god… crazy, right? To keep the story short, I figured out that I could not eyeball these big files. Instead, I wrote a handy script that detects the bad lines among all the normal lines in any given input file; once the problematic lines were found, I then had to think about how to solve the problem. With some luck mixed with skill, the root cause I found was that some lines inside a CSV file contained newline characters inside a double-quoted element. For example, consider the following simple CSV data, which spans just two lines:

Iphone,"{ ItemName : Cheez-It
21 Ounce}",
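
For illustration, here is a minimal Python sketch of the kind of bad-line detection mentioned earlier. It is not the original script; it simply assumes that a physical line with an odd number of double quotes opens or closes a multi-line field. The name find_unbalanced_lines is hypothetical.

```python
def find_unbalanced_lines(lines, quote='"'):
    """Yield (line_number, text) for physical lines with an odd number of quotes.

    Assumption: an odd quote count means a quoted field starts or ends on
    this line, i.e. the record spans more than one physical line.
    """
    for number, text in enumerate(lines, start=1):
        if text.count(quote) % 2 == 1:
            yield number, text.rstrip("\n")


# The two-line sample from above: both physical lines get flagged.
sample = ['Iphone,"{ ItemName : Cheez-It\n', '21 Ounce}",\n']
for number, text in find_unbalanced_lines(sample):
    print(number, text)
```

Because the function only iterates over lines, it can be fed an open file handle directly, so a file of tens of gigabytes never has to fit in memory.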

That sample is supposed to be treated as a single record, since everything inside double quotes should be treated as plain characters (i.e. content) with no special meaning. However, the current implementation of getNext() in the Hadoop Pig library fails to recognize this and sees two lines of data. Experience tells me that this is certainly a bug. So I forked the apache-pig repo and fixed it (well, technically I just added an outer while loop around the whole logic) and then submitted a pull request to the project.
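
The idea of looping until the quotes are balanced can be sketched in Python as follows. This is not the actual Pig patch (that is Java and lives in the pull request); it is a simplified illustration that assumes quotes are only ever escaped by doubling them, so an odd quote count means a quoted field is still open. The name read_records is hypothetical.

```python
import io


def read_records(stream, quote='"'):
    """Yield logical CSV records, joining physical lines while a quote is open."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        record = line
        # Keep pulling physical lines until the quotes in the record balance.
        while record.count(quote) % 2 == 1:
            more = stream.readline()
            if not more:
                break  # unterminated quote at end of input; emit what we have
            record += more
        yield record.rstrip("\n")


data = io.StringIO('Iphone,"{ ItemName : Cheez-It\n21 Ounce}",\nAndroid,plain,\n')
for record in read_records(data):
    print(repr(record))
```

The first record comes back as one string with the newline still embedded inside the quoted field, which is what a downstream field splitter needs to see.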

Here is the pull request link: https://github.com/apache/pig/pull/20

It goes without saying that, for a professional software engineer, every code change requires a good unit test. So I also created a new unit test, testQuotedQuotesWithNewLines(). The attached screenshot shows all the existing tests and this new one running successfully.

[Screenshot: unit test run]
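
For readers who want to check the expected behaviour without building Pig, here is a hedged, analogous test in Python. It does not reproduce testQuotedQuotesWithNewLines(); it only uses Python's standard csv module to confirm that the sample is one row with three fields. The names parse_rows and QuotedNewlineTest are hypothetical.

```python
import csv
import io
import unittest


def parse_rows(text):
    """Parse CSV text with the standard library reader, keeping newlines intact."""
    return list(csv.reader(io.StringIO(text, newline="")))


class QuotedNewlineTest(unittest.TestCase):
    def test_quoted_field_with_newline(self):
        text = 'Iphone,"{ ItemName : Cheez-It\n21 Ounce}",\n'
        rows = parse_rows(text)
        # One logical row: the newline stays inside the second field.
        self.assertEqual(rows, [["Iphone", "{ ItemName : Cheez-It\n21 Ounce}", ""]])


if __name__ == "__main__":
    unittest.main()
```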

Glad that I could nail this problem and also contribute back to the open-source community. July 4 long weekend, here I come, yayyyyy! Now may I freely grab a drink 🙂 ?