Tutorial: How to bulk import Wikipedia data into the Neo4j graph database programmatically

Lately I have been getting my hands dirty with Neo4j, the world's leading graph database. In order not to mess up production data, I thought it would be nice to import a large sample dataset into Neo4j and play around with it before mastering how Neo4j works. Therefore, I wrote a simple tool, Neo4jDataImport, which first downloads a Wikipedia sample dataset (whether to pick a small or a big file is up to you; the wiki dataset averages around 10 GB uncompressed), digests it, and imports it into a Neo4j database. For details on how to build and run the program, please refer to the README file inside.
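To give a feel for the "digest" step, here is a minimal sketch of how one might pull page titles and [[wikilinks]] out of a MediaWiki XML export. This is purely illustrative; the actual Neo4jDataImport tool may parse the dump differently, and the function and regex names here are my own.

```python
import re
import xml.etree.ElementTree as ET

# Matches the target of a [[wikilink]], stopping at ']', '|' (display text)
# or '#' (section anchor).
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def extract_pages(xml_text):
    """Return {page_title: [linked_titles]} from a wiki export string.

    Note: real Wikipedia dumps declare an XML namespace on <mediawiki>,
    which this simplified sketch ignores.
    """
    pages = {}
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        text = page.findtext("revision/text") or ""
        pages[title] = WIKILINK.findall(text)
    return pages

sample = """<mediawiki>
  <page>
    <title>Konitineniti</title>
    <revision><text>See [[Africa]] and [[Continent]].</text></revision>
  </page>
</mediawiki>"""

print(extract_pages(sample))
# {'Konitineniti': ['Africa', 'Continent']}
```

Each page then becomes a node, and each extracted link becomes a relationship between two page nodes in the graph.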

After running the program to import the Wikipedia data into the Neo4j database, we will have plenty of data to play around with. Personally, I really love the Neo4j Browser, which can be used to query and visualise the imported graph. We have to know the basic syntax of the Cypher query language in order to communicate with the heart of Neo4j. For example, I ran this very simple Cypher query to get the node Konitineniti in the Wikipedia sample dataset that I had just imported:

MATCH (p0:Page {title:'Konitineniti'})-[:Link]-(p:Page) RETURN p0, p
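In plain terms, the query finds the Page node titled 'Konitineniti' and returns every Page node connected to it by a Link relationship, in either direction. Here is a plain-Python analogue of that undirected neighbour lookup, just to illustrate what the pattern means (Neo4j itself answers this via an index lookup plus relationship traversal, not a scan like this):

```python
def neighbours(links, title):
    """links: iterable of (page_a, page_b) Link pairs, treated as undirected.
    Returns the set of pages directly linked to `title`."""
    out = set()
    for a, b in links:
        if a == title:
            out.add(b)
        elif b == title:
            out.add(a)
    return out

# Toy link pairs; the real imported graph has millions of these.
links = [("Konitineniti", "Africa"),
         ("Continent", "Konitineniti"),
         ("Africa", "Europe")]

print(sorted(neighbours(links, "Konitineniti")))
# ['Africa', 'Continent']
```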


One of the very cool features is that we can see visually how this node is linked to the others.


Of course, I have only scratched the surface of the features Neo4j provides and of its Cypher query language. To learn more about the query syntax, refer to http://neo4j.com/docs/stable/cypher-query-lang.html

The Neo4jDataImport tool can be downloaded and built from: