
An overview of Neo4j - graph your data

For some decades, storing data was synonymous with putting it in a relational database. In fact, many people were not even aware that other types of databases existed.

But triggered by research, new business demands, the availability of large amounts of data and huge processing power, new ways of storing data began to make sense.
One of these ways is derived from mathematical graph theory. One of the products that successfully implements this theory is Neo4j.

Curious whether it would be a useful addition to the toolbox, next to tried-and-tested relational databases such as MySQL, Oracle and SQL Server, I set out to discover how well it works in application development.

In this article I share some of my findings and thoughts. Since there is a lot of development going on, be aware that some of these findings might get outdated rather soon.

At this point you probably want an introduction to the concept of graph databases and to Neo4j. Luckily, the company behind Neo4j (which, in the best Oracle tradition, is also called Neo4j) does an excellent job here by offering a freely available O'Reilly book.
You can download it here: Ian Robinson, Jim Webber & Emil Eifrem, Graph Databases
I will limit myself to some basic concepts and suggest that you download and read the book when you want to know more.

Basic concepts

A graph database consists of nodes and the relations between those nodes. Neo4j is schemaless, which means that there are no model definitions. At any time you can add any node with any kind of properties, and define any relation you want.
A relation has a direction and a label, which Neo4j calls its type. A relation can also possess properties, in which case it is called a 'qualified relation'. An example is a person who has an influence of 0.43 on a decision.
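To make these concepts concrete, here is a minimal in-memory sketch in plain Python. It does not use Neo4j at all; the class names `Node` and `Relation` and the sample data are purely illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a name and free-form properties (no schema required)."""
    name: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relation:
    """A directed, labelled relation from `start` to `end` with its own properties."""
    start: Node
    label: str
    end: Node
    properties: dict = field(default_factory=dict)


# Hypothetical sample data: a person with an influence of 0.43 on a decision.
alice = Node("Alice", {"role": "manager"})
decision = Node("Budget decision", {"status": "open"})
influences = Relation(alice, "INFLUENCES", decision, {"weight": 0.43})

print(influences.start.name, influences.label, influences.end.name,
      influences.properties["weight"])
```

Note that nothing had to be declared beforehand: the properties are ordinary dictionaries, so every node and relation can carry different ones.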

Where graph databases shine is in working with relations. Traditional relational databases, in contrast, are strong in storing data in tables but, despite their name, not so strong in the relations between the data in those tables. When you define relations in a relational database, you need to define extra tables (for m:n relationships), key fields, indices, reference fields and so on.
You cannot do that on the fly; you have to determine beforehand which relation types are possible and define them in a schema.
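As a rough illustration of that bookkeeping, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table and column names (`person`, `friendship`) and the sample rows are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The m:n 'friend' relation needs its own table, keys and references,
# all declared in the schema before any data can be stored.
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friendship (
        person_id INTEGER NOT NULL REFERENCES person(id),
        friend_id INTEGER NOT NULL REFERENCES person(id),
        PRIMARY KEY (person_id, friend_id)
    );
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cas');
    INSERT INTO friendship VALUES (1, 2), (2, 3);
""")

# Friends of friends: every extra level requires another self-join.
rows = conn.execute("""
    SELECT p3.name
    FROM person p1
    JOIN friendship f1 ON f1.person_id = p1.id
    JOIN friendship f2 ON f2.person_id = f1.friend_id
    JOIN person p3 ON p3.id = f2.friend_id
    WHERE p1.name = 'Ann'
""").fetchall()

print(rows)
```

A new kind of relation, say 'colleague', would mean another table or an extra column, and therefore a schema change.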

As mentioned, in a graph database you just throw in nodes and relations whenever you want. A question like 'show me the relations of the relations of your friends that qualify on certain aspects' is difficult in a relational database and a lot easier in a graph database. If you are not convinced, imagine a few more levels and unknown relation types, and try to write some SQL for it.
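The idea of following relations over several levels can be sketched with a small breadth-first traversal in plain Python. This is not how Neo4j works internally; the function name `within_hops` and the sample graph are assumptions for illustration only.

```python
from collections import deque

# Hypothetical sample data: who lists whom as a friend (directed).
FRIENDS = {
    "Ann": ["Bob", "Cas"],
    "Bob": ["Dee"],
    "Cas": ["Dee", "Eve"],
    "Dee": ["Fay"],
    "Eve": [],
    "Fay": [],
}


def within_hops(graph, start, max_hops):
    """Return {name: hops} for everyone reachable in 1..max_hops steps."""
    found = {}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not look beyond the requested depth
        for friend in graph.get(person, []):
            if friend != start and friend not in found:
                found[friend] = hops + 1
                queue.append((friend, hops + 1))
    return found


print(within_hops(FRIENDS, "Ann", 2))
```

Going one level deeper only means changing `max_hops`; the shape of the code stays the same, whereas the SQL equivalent grows with every level.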

Querying data

Storing data is one part, retrieving data the other. For data retrieval in relational databases there is the ubiquitous SQL - structured query language. In Neo4j, without tables and without a schema, there is no role for SQL. Instead, Neo4j invented its own query language, called Cypher.

Cypher queries are declarative, just like SQL. Cypher will look familiar if you already know SQL, although its syntax is more object-oriented. Cypher has a number of advanced tricks to query a web of relations and get results in only one or two lines of code.
Cypher is a proprietary technology of Neo4j, which is a disadvantage when you compare it to the open, standardized status of SQL - although SQL syntax also varies between products. However, Neo4j has started an 'openCypher' initiative to encourage other parties to adopt Cypher as well.
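To give an impression of how compact such a query can be, the sketch below builds Cypher text for a 'friends within a few hops' question. The query is only held in a Python string and is not executed. The `Person` label, the `FRIEND` relation type, the `name` property and the helper `friends_query` are assumptions for this example. The `$name` parameter notation is the one used by recent Neo4j versions; older releases wrote `{name}` instead.

```python
def friends_query(max_hops):
    """Build Cypher text for 'people within max_hops FRIEND steps of $name'."""
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")
    return (
        "MATCH (me:Person {name: $name})"
        f"-[:FRIEND*1..{max_hops}]->(other:Person) "
        "RETURN DISTINCT other.name AS name"
    )


print(friends_query(2))
```

The `*1..2` part is Cypher's variable-length path notation: it replaces the chain of joins that SQL would need for the same question.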

SPARQL, another query language for graph databases, is not supported in Neo4j. As a side note, there is - as these things go - an attempt on GitHub to write a library for this.

Design choices

The 'j' in Neo4j means, of course, Java. You can deploy Neo4j as a standalone server or you can embed its database engine directly inside an application.
Embedding is possible when your application is written in Java or in any other language that runs on the JVM (Java virtual machine). 

Most usage and development activity seems to take place in Java. Scala follows easily because of its strong ties with Java. Then there is Python: as it is a popular language among people involved in data science and machine learning, using it with Neo4j makes sense.

Embedding Neo4j

The main advantage of embedding Neo4j directly in your application is speed. Another advantage is that it results in a single application that is easy to deploy. This also makes it suitable for relatively light hardware such as network devices.

The performance advantage also makes embedding very tempting in scenarios with lots of data and a high level of complexity. However, embedding has severe limitations. The main disadvantages are that scalability will be difficult and that it does not adhere to a separation-of-concerns philosophy. You cannot easily share the data with other applications, and you cannot extend your application across multiple servers.

In most enterprise environments these are simply no-gos.

A Neo4j Server

If you deploy Neo4j as a server, you communicate with it through a REST API. That is easy, standard and well understood. What you gain is flexibility, scalability and a free choice of programming environment. Of course, networking and the translation to and from REST introduce some overhead and latency.
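What such a REST call can look like is sketched below with only the Python standard library. The request is built but deliberately not sent, so the code runs without a server. The endpoint URL is an assumption: it follows the transactional Cypher endpoint of older Neo4j server versions on the default HTTP port, and the path differs in newer versions. The helper names and the sample response are made up for illustration.

```python
import json
from urllib.request import Request

# Assumption: transactional endpoint of an older Neo4j server on localhost.
ENDPOINT = "http://localhost:7474/db/data/transaction/commit"


def build_cypher_request(url, statement, parameters):
    """Wrap one Cypher statement and its parameters in a JSON POST request."""
    body = json.dumps(
        {"statements": [{"statement": statement, "parameters": parameters}]}
    )
    return Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def rows_from_response(text):
    """Extract the result rows from a JSON response, raising on reported errors."""
    payload = json.loads(text)
    if payload["errors"]:
        raise RuntimeError(payload["errors"])
    return [item["row"] for result in payload["results"] for item in result["data"]]


request = build_cypher_request(
    ENDPOINT, "MATCH (p:Person) RETURN p.name LIMIT $n", {"n": 2}
)

# Hypothetical response body, shaped like the server's JSON answer.
sample = '{"results": [{"columns": ["p.name"], "data": [{"row": ["Ann"]}, {"row": ["Bob"]}]}], "errors": []}'
print(request.get_method(), rows_from_response(sample))
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, plus whatever authentication your server requires.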

In the context of an enterprise, using Neo4j as a server is the preferable approach, as the advantages of a sound architecture far outweigh the disadvantage of a performance penalty. Several Neo4j servers can operate together in a cluster to guarantee high availability and data reliability.

Integration in a Java application

Moving objects to and from the database is a housekeeping task in most applications. This 'persistence boundary' often requires a lot of effort to support the basic operations known as CRUD (create, read, update, delete). Alternatively, you can choose to use an 'object mapper', which can automate all these actions to a large extent.
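The core idea of an object mapper can be shown in a few lines of plain Python. This is a deliberately tiny sketch and not the API of any real mapper; the `Person` class and the functions `to_node` and `from_node` are hypothetical.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Person:
    name: str
    age: int


def to_node(entity):
    """Map an object to a (label, properties) pair that could be stored as a node."""
    return type(entity).__name__, asdict(entity)


def from_node(cls, properties):
    """Rebuild an object from stored properties, ignoring unknown keys."""
    known = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in properties.items() if key in known})


label, properties = to_node(Person("Ann", 34))
print(label, properties)
print(from_node(Person, {"name": "Ann", "age": 34, "city": "Utrecht"}))
```

A real mapper adds the parts that take the effort: generating the queries, handling relations between objects and tracking what has changed.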

Even more advanced environments like Java Enterprise Edition (EE) go a step further by managing an entity completely. In such an environment you don't need to save updates, and there is a caching mechanism, so every time you need a particular object, you just ask for it again.

This leads to tidy, functional code, since there is no need to 'carry objects around' for performance reasons. To gain access to Neo4j, several technologies are available, ranging from traditional, proven ones to advanced new arrivals. Let me start with the new ones.

Neo4j-ogm

This is a brand-new library with the ability to do object-graph mapping. It was the first thing I was looking for. It will not do entity management. It is meant to be fast and utilises the Cypher query language. Sources are available on GitHub; you can compile them into a jar library file yourself or have Maven do it for you. The available documentation is of high quality but limited in size. Some parts are still unfinished, such as the handling of user credentials for the database server.

It would take some courage to use it in a 'real' project right now. Having the sources available is a great help in solving the problems you run into, but doing so can cost a lot of time, and your own adaptations will make future upgrades difficult.

Overall, it is an interesting development that has a large potential to become successful. 

Spring-ogm

The Neo4j-ogm is a branch of the Spring-ogm project - or is it vice versa? Either way, it is in more or less the same stage as the Neo4j-ogm, although I get the impression that the Spring version gets slightly more attention. I implemented the Neo4j-ogm version but skipped the Spring version.

Java EE

Somewhat surprisingly, there is a library on GitHub with several examples of EE support for Neo4j. It can be done! It looks promising, but at this time it should be considered a proof of concept and is certainly not ready for production. Its status is also unclear: it does not seem to be a project officially supported by Neo4j, which would be important to keep it up to date with Neo4j development.

Besides these new developments, there are the proven ways to get access to the Neo4j database.

REST

Using a simple REST protocol, you communicate with your database server, whether it is on the same machine or at the other end of the world. Cypher is the query language. Although working with REST is very straightforward nowadays, creating the endpoints for this approach still involves some overhead. In most cases this is the way to go.

Core Java API

When you use the embedded version, you can talk directly to the database engine using a Java API. There is no need for a high-level query language, and it is presumably this low-level approach that makes Neo4j very fast.
If you see performance comparisons, keep in mind that they could have been measured using this approach.

Usage

Will we abandon our traditional relational databases in the future in favor of graph databases like Neo4j? I don't see a reason for that. If you are in an environment where a stable data model will do, a relational database will suit you fine. The average webshop is served well by a relational database.

But when you have to deal with unpredictability or a large number of relations, Neo4j is the preferable choice. One of the areas where this applies is big data and data science.

When the object graph library becomes more mature, and no doubt it will, Neo4j becomes an even more attractive candidate for enterprise solutions.

Reproduction only with prior permission.