What is a graph database? A better way to store connected data


Key-value, document-oriented, column family, graph, relational… Today we seem to have as many kinds of databases as there are kinds of data. While this may make choosing a database harder, it makes choosing the right database easier. Of course, that does require doing your homework. You’ve got to know your databases. 

One of the least-understood types of databases out there is the graph database. Designed for working with highly interconnected data, a graph database might be described as more “relational” than a relational database. Graph databases shine when the goal is to capture complex relationships in vast webs of information. 

Here is a closer look at what graph databases are, why they’re unlike other databases, and what kinds of data problems they’re built to solve.

Graph database vs. relational database

In a traditional relational or SQL database, the data is organized into tables. Each table records data in a specific format with a fixed number of columns, each column with its own data type (integer, time/date, freeform text, etc.).


Again, a social network is a useful example. Graph databases reduce the amount of work needed to construct and display the data views found in social networks, such as activity feeds, or determining whether or not you might know a given person due to their proximity to other friends you have in the network.
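To make the "people you may know" idea concrete, here is a minimal sketch in plain Python rather than in any particular graph database. The `friends` data and the `people_you_may_know` function are hypothetical names made up for illustration. The sketch ranks people who are not yet your friends by how many friends you have in common, which is the kind of two-hop traversal a graph database is built to make cheap.

```python
from collections import Counter

# Hypothetical friendship graph: each person maps to the set of their friends.
friends = {
    "Scott": {"Ana", "Bo"},
    "Ana": {"Scott", "Cy", "Dee"},
    "Bo": {"Scott", "Cy"},
    "Cy": {"Ana", "Bo"},
    "Dee": {"Ana"},
}


def people_you_may_know(graph, person):
    """Rank non-friends of `person` by the number of mutual friends."""
    direct = graph.get(person, set())
    mutual_counts = Counter()
    for friend in direct:
        # Walk one more hop: the friends of each direct friend.
        for candidate in graph.get(friend, set()):
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    # most_common() orders candidates by mutual-friend count, highest first.
    return mutual_counts.most_common()


suggestions = people_you_may_know(friends, "Scott")
```

In this toy data, Cy shares two friends with Scott and Dee shares one, so Cy ranks first. In a relational schema the same question typically needs a self-join on a friendship table for each hop.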

Another application for graph databases is finding patterns of connection in graph data that would be difficult to tease out via other data representations. Fraud detection systems use graph databases to bring to light relationships between entities that might otherwise have been hard to notice. 
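As a rough illustration of the fraud-detection case, the sketch below links accounts that share an attribute such as a phone number or a payment card, then walks the graph to find every account connected to a starting account. All data, the `acct` naming convention, and the function names are assumptions made for the example, not taken from any real fraud system.

```python
from collections import defaultdict, deque

# Hypothetical (account, attribute) pairs, e.g. from sign-up records.
records = [
    ("acct1", "phone:555-0100"),
    ("acct2", "phone:555-0100"),
    ("acct2", "card:4111"),
    ("acct3", "card:4111"),
    ("acct4", "phone:555-0199"),
]


def build_graph(pairs):
    """Build an undirected graph connecting accounts to their attributes."""
    graph = defaultdict(set)
    for account, attribute in pairs:
        graph[account].add(attribute)
        graph[attribute].add(account)
    return graph


def linked_accounts(graph, start):
    """Breadth-first search for accounts reachable via shared attributes."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    # Keep only account nodes, excluding the starting account itself.
    return sorted(n for n in seen if n.startswith("acct") and n != start)


graph = build_graph(records)
ring = linked_accounts(graph, "acct1")
```

Here acct1 and acct3 never share an attribute directly, yet they end up in the same group because acct2 connects them. That indirect link is the sort of relationship that is easy to miss when the data sits in separate tables.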

One common graph query language is Cypher, originally developed for the Neo4j graph database. Since late 2015, Cypher has been developed as a separate open source project, and a number of other vendors have adopted it as a query system for their products (e.g., SAP HANA).

Here is an example of a Cypher query that returns a search result for everyone who is a friend of Scott:

MATCH (a:Person {name:'Scott'})-[:FRIENDOF]->(b)
RETURN b

The arrow symbol (->) is used in Cypher queries to represent a directed relationship in the graph.
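To show why the direction of a relationship matters, here is a small in-memory model in Python. It is only a sketch of the idea, not how Cypher or Neo4j is implemented. The edge list and the helper names `match_outgoing` and `match_incoming` are hypothetical.

```python
# Hypothetical directed edges: (source, relationship_type, target).
edges = [
    ("Scott", "FRIENDOF", "Ana"),
    ("Scott", "FRIENDOF", "Bo"),
    ("Cy", "FRIENDOF", "Scott"),
    ("Scott", "WORKSWITH", "Dee"),
]


def match_outgoing(edge_list, source, rel_type):
    """Targets of edges that start at `source`, like (source)-[:TYPE]->(b)."""
    return [t for s, r, t in edge_list if s == source and r == rel_type]


def match_incoming(edge_list, target, rel_type):
    """Sources of edges that end at `target`, like (target)<-[:TYPE]-(b)."""
    return [s for s, r, t in edge_list if t == target and r == rel_type]


outgoing = match_outgoing(edges, "Scott", "FRIENDOF")
incoming = match_incoming(edges, "Scott", "FRIENDOF")
```

Following the arrow outward from Scott finds Ana and Bo, while following it inward finds only Cy. The WORKSWITH edge is ignored in both cases because its type does not match.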

Another common graph query language, Gremlin, was devised for the Apache TinkerPop graph computing framework. Gremlin syntax is similar to that used by some languages’ ORM database access libraries.

Here is an example of a “friends of Scott” query in Gremlin:

g.V().has('name', 'Scott').out('FRIENDOF')

Many graph databases have support for Gremlin by way of a library, either built-in or third-party.
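The method-chaining style that Gremlin shares with ORM libraries can be illustrated with a toy traversal class in Python. This is a sketch for intuition only: the `Traversal` class and its `has`, `out`, and `values` methods are written here from scratch and are not the TinkerPop API, although the step names are chosen to resemble Gremlin steps.

```python
class Traversal:
    """Toy chainable graph traversal (illustrative, not a real library)."""

    def __init__(self, vertices, edges, current=None):
        self.vertices = vertices  # {vertex_id: {property: value}}
        self.edges = edges        # [(source_id, label, target_id)]
        # Start from every vertex unless a narrowed set is passed in.
        self.current = list(vertices) if current is None else current

    def has(self, key, value):
        """Keep only the current vertices whose property matches."""
        kept = [v for v in self.current if self.vertices[v].get(key) == value]
        return Traversal(self.vertices, self.edges, kept)

    def out(self, label):
        """Move along outgoing edges with the given label."""
        reached = [
            target
            for v in self.current
            for source, edge_label, target in self.edges
            if source == v and edge_label == label
        ]
        return Traversal(self.vertices, self.edges, reached)

    def values(self, key):
        """Return one property from each current vertex."""
        return [self.vertices[v][key] for v in self.current]


vertices = {1: {"name": "Scott"}, 2: {"name": "Ana"}, 3: {"name": "Bo"}}
edges = [(1, "FRIENDOF", 2), (1, "FRIENDOF", 3), (2, "FRIENDOF", 3)]

names = Traversal(vertices, edges).has("name", "Scott").out("FRIENDOF").values("name")
```

Each step returns a new traversal, so a query reads left to right as a pipeline: start with all vertices, filter to Scott, follow FRIENDOF edges, then pull out the names.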

Yet another query language is SPARQL. It was originally developed by the W3C to query data stored in the Resource Description Framework (RDF) format for metadata. In other words, SPARQL wasn’t devised for graph database searches, but can be used for them. On the whole, Cypher and Gremlin have been more broadly adopted.

SPARQL queries have some elements reminiscent of SQL, namely SELECT and WHERE clauses, but the rest of the syntax is radically dissimilar. Don’t think of SPARQL as being related to SQL at all, or for that matter to other graph query languages.
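RDF data is stored as subject-predicate-object triples, and a SPARQL WHERE clause is built from triple patterns in which some positions are variables. The Python sketch below imitates matching a single triple pattern against a small in-memory list. The data, the `select` function, and the `?`-prefixed variable convention are assumptions for illustration, and the code is not a SPARQL engine.

```python
# Hypothetical RDF-style triples: (subject, predicate, object).
triples = [
    ("Scott", "friendOf", "Ana"),
    ("Scott", "friendOf", "Bo"),
    ("Ana", "worksAt", "Acme"),
]


def select(pattern, data):
    """Match one triple pattern; terms starting with '?' are variables."""
    results = []
    for triple in data:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                # A constant term must match exactly.
                break
        else:
            # Reached only when no term failed to match.
            results.append(binding)
    return results


friends = select(("Scott", "friendOf", "?friend"), triples)
```

Each result is a dictionary of variable bindings, loosely analogous to one row of a SPARQL SELECT result.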

Popular graph databases

Because graph databases serve a relatively niche use case, there aren’t nearly as many of them as there are relational databases. On the plus side, that makes the standout products easier to identify and discuss.


Neo4j

Neo4j is easily the most mature (11 years and counting) and best-known of the graph databases for general use. Unlike previous graph database products, it doesn’t use a SQL back end. Neo4j is a native graph database that was engineered from the inside out to support large graph structures, such as queries that return hundreds of thousands of relations or more.

Neo4j comes in both free open-source and for-pay enterprise editions, with the latter having no restrictions on the size of a dataset. You can also experiment with Neo4j online by way of its sandbox, which includes some sample datasets to practice with.


Microsoft Azure Cosmos DB

The Azure Cosmos DB cloud database is an ambitious project. It’s intended to emulate multiple kinds of databases (conventional tables, document-oriented, column family, and graph), all through a single, unified service with a consistent set of APIs.

To that end, a graph database is just one of the various modes Cosmos DB can operate in. It uses the Gremlin query language and API for graph-type queries, and supports the Gremlin console created for Apache TinkerPop as another interface.

Another big selling point of Cosmos DB is that indexing, scaling, and geo-replication are handled automatically in the Azure cloud, without any knob-twiddling on your end. It isn’t clear yet how Microsoft’s all-in-one architecture measures up to native graph databases in terms of performance, but Cosmos DB certainly offers a useful combination of flexibility and scale.



JanusGraph

JanusGraph is now under the governance of the Linux Foundation. It uses any of a number of supported back ends (Apache Cassandra, Apache HBase, Google Cloud Bigtable, Oracle BerkeleyDB) to store graph data, supports the Gremlin query language (as well as other elements from the Apache TinkerPop stack), and can also incorporate full-text search by way of the Apache Solr, Apache Lucene, or Elasticsearch projects.

IBM, one of the JanusGraph project’s supporters, offers a hosted version of JanusGraph on IBM Cloud, called Compose for JanusGraph. Like Azure Cosmos DB, Compose for JanusGraph provides autoscaling and high availability, with pricing based on resource usage.