Fixed the newline character inside double quotes that caused CSV parsing to fail

As a big data architect, my work involves dealing with huge amounts of consumer data. I think one of the biggest challenges in this field (i.e. big data processing) is the situation where your data processing pipeline breaks at midnight for no apparent reason and the Hadoop/Spark console output does not give you any useful hints either. This will certainly kill a sweet night when there are tens of gigabytes of input data to be processed. If the input file is small, say a few kilobytes, we can just download it to a laptop, play around with it and eventually solve the problem. With tens of gigabytes, however, we may still have little idea what the hell is going on after staying up until the next morning and spending all our sleeping hours digging through these giant, otherwise impossible-to-deal-with input files.

Recently, I ran into one of these instances… Yeah.. oh god… Crazy, right? To keep the story short, I figured out I could not eyeball these big files. Instead, I wrote a handy script that detects the bad lines among all the normal lines in any given input file. Once the problematic line(s) were found, I then had to think about how to solve the problem. With some luck mixed with skill, the root cause I found was that some lines inside a CSV file contained newline characters inside a double-quoted element. For example, consider the following simple CSV data, which spans just two lines:

Iphone,"{ ItemName : Cheez-It
21 Ounce}",

It is supposed to be treated as one single record, since everything inside double quotes should be treated as literal content and not as anything with a special meaning. However, the current implementation of getNext() in the Hadoop Pig library fails to recognize this and sees two lines of data. Experience tells me that this is certainly a bug. So I forked the apache-pig repo, fixed it (well, technically I just added an outer while loop around the whole logic) and then submitted a pull request to the project.
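To make the idea concrete, here is a minimal Python sketch of both steps: flagging suspicious lines and joining them back into one record. It is not my original script and not the Pig patch; the function names find_suspect_lines and read_logical_records are made up for illustration, and it assumes that an odd number of double quotes on a physical line means a quoted field is still open (escaped quotes written as "" keep the count even).

```python
def find_suspect_lines(lines):
    """Yield (line_number, line) for physical lines with an odd number of double quotes."""
    for number, line in enumerate(lines, start=1):
        if line.count('"') % 2 == 1:
            yield number, line.rstrip("\n")


def read_logical_records(lines):
    """Yield records, joining physical lines while a double quote is still open."""
    it = iter(lines)
    for line in it:
        record = line
        # Outer loop: keep pulling lines while we are inside a quoted field.
        while record.count('"') % 2 == 1:
            continuation = next(it, None)
            if continuation is None:  # input ended with an unterminated quote
                break
            record += continuation
        yield record.rstrip("\n")


# The two-line sample from above, as it would appear in a file.
sample = 'Iphone,"{ ItemName : Cheez-It\n21 Ounce}",\n'
physical_lines = sample.splitlines(keepends=True)

suspects = list(find_suspect_lines(physical_lines))
records = list(read_logical_records(physical_lines))
print(len(physical_lines), "physical lines,", len(records), "logical record(s)")
```

Counting quotes is a cheap heuristic that works line by line, which matters when the file is tens of gigabytes; a full CSV parser would be more precise but slower to point at the offending line.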

Here is the pull request link: https://github.com/apache/pig/pull/20

It goes without saying that, for a professional software engineer, every code change requires a good unit test. So I also created a new unit test, testQuotedQuotesWithNewLines(). The attached screenshot shows all the existing unit tests and this new one running successfully.

[Screenshot: unit-test run]
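For readers who do not work with Pig, the same kind of check can be sketched in Python with the standard csv module. This is only an analogue of the idea behind testQuotedQuotesWithNewLines(), not the test from the pull request; the class and method names below are hypothetical.

```python
import csv
import io
import unittest


class TestQuotedNewlines(unittest.TestCase):
    def test_quoted_field_with_newline(self):
        data = 'Iphone,"{ ItemName : Cheez-It\n21 Ounce}",\n'
        # newline="" leaves line endings untouched so the csv module sees them.
        rows = list(csv.reader(io.StringIO(data, newline="")))
        # Two physical lines, but only one record.
        self.assertEqual(len(rows), 1)
        # The trailing comma produces an empty last field.
        self.assertEqual(
            rows[0], ["Iphone", "{ ItemName : Cheez-It\n21 Ounce}", ""]
        )
```

Saved in a file, this can be run with python -m unittest followed by the module name.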

Glad that I could nail this problem and also contribute back to the open-source community. So, July 4 long weekend, here I come, yayyyyy! Now may I freely grab a drink 🙂 ?


Tutorial – Python unit test with Eclipse (1)

In this article, I will walk through how to set up and run a Python unit test with Eclipse.

Prerequisite: PyDev has been installed in Eclipse. If not, open Eclipse, go to Help -> Eclipse Marketplace, search for ‘PyDev’ and install it as shown below.

[Screenshot 1: installing PyDev from the Eclipse Marketplace]

Now we are ready to create a Python unit test. To start with, let’s create a new PyDev Project to hold the project source and the unit tests.

1) Go to File -> New and, in the New window, choose PyDev Project as shown below

[Screenshot 2: choosing PyDev Project in the New window]

and give the project a name, such as TestPython.

2) In the TestPython project, create a new Python module that will hold all our unit tests

[Screenshot 3: creating a new PyDev module]

and set the package to test and the name to testCalculator.

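3) We now have the generated package, test, with two files created. In principle, we can put as many unit tests as we like in this package. In testCalculator.py, put the following code:

    import unittest


    class TestCalc(unittest.TestCase):
        """Test class; inherits the assertion API from unittest.TestCase."""

        def testAdd(self):
            print("it is a test")
            result = True
            # The third argument is the message shown if the assertion fails.
            self.assertEqual(result, True, "Ohno")

[Screenshot 4: testCalculator.py in the editor]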

Basically, in the above code, we created a test class, TestCalc, which extends unittest.TestCase as its base class, so TestCalc has the whole testing API available, such as self.assertEqual, etc. For more information on the unittest API, please refer to: https://docs.python.org/3/library/unittest.html
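To give a feel for a few more of those assertion methods, here is a small sketch. The add function is a hypothetical stand-in for real calculator code (this tutorial does not define one), and the class name is chosen so that it does not clash with TestCalc.

```python
import unittest


def add(a, b):
    """Hypothetical function under test; not part of the tutorial project."""
    return a + b


class TestCalcAssertions(unittest.TestCase):
    def test_add_integers(self):
        self.assertEqual(add(2, 3), 5, "2 + 3 should be 5")

    def test_add_floats(self):
        # Floats are compared approximately to avoid rounding surprises.
        self.assertAlmostEqual(add(0.1, 0.2), 0.3)

    def test_add_rejects_mixed_types(self):
        # Adding an int and a str raises TypeError in Python.
        with self.assertRaises(TypeError):
            add(1, "2")
```

Each method whose name starts with test is picked up as a separate test case, so one failing assertion does not hide the results of the others.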

4) The last step is to right-click testCalculator.py in the Package Explorer and choose Run As -> Python unit-test. If the setup is correct, we should see something like this:

[Screenshot 5: result of the unit test run]

And yes, congrats! You have just created a basic Python unit test!

Stay tuned, more to come later (such as running it from the command line instead)!
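In the meantime, here is a rough sketch of running the same test outside Eclipse, using only the standard unittest loader and runner. The test class is repeated so that the snippet is self-contained; in the real project you would import it from the test package instead.

```python
import unittest


class TestCalc(unittest.TestCase):
    def testAdd(self):
        result = True
        self.assertEqual(result, True, "Ohno")


# Collect every test method of the class and run it with a text runner.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCalc)
outcome = unittest.TextTestRunner(verbosity=2).run(suite)
print("tests run:", outcome.testsRun, "successful:", outcome.wasSuccessful())
```

Assuming the project layout from this tutorial, running python -m unittest test.testCalculator from the project root should achieve the same thing without any extra code.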