Friday, March 28, 2008

Graphs: A Better Database Abstraction

I wrote about the power of abstractions in January, but based on some of the recent database discussion I think its time to discuss the subject again.

I love abstractions.

The essence of the concept of an abstraction is a framework that simplifies how you think about and work in a given domain. Abstractions can be (and often are) argued against by suggesting that you don’t really need them. In computer programming, we didn’t need C because we had assembly language. We didn’t need C++ because we had C. We didn’t need Java because we had C++. To me these arguments (which people really made) were silly. I abandoned assembly language in the 90's.

The point is none of our existing abstractions are *needed*. But our human brains can only manage a certain amount of complexity at a time. Complexity is fine but only in bite size chunks. Abstractions are, in essence, a really generalized form of user interface. I think the reason I have written so much about user interface is that I think it is so important to figure out how to map complex things into representational models in such a way that more people can access them. Abstractions allow us to do just that with any process we are engaged in. Abstractions allow us to encapsulate complexity so that we don’t have to think about it and we can achieve greater and greater levels of complexity in an efficient way allowing us to keep more of the model of a given system in our heads.

And so the anti-abstraction argument rears its head in the RDBMS vs graph database debate. One of the arguments that the pro RDBMS folks make for why there is no need for the graph database model is that you can do everything that can be expressed in a graph database in a relational database. And there is some truth to this.

But there are two problems with this argument. The first is that this is only true in theory. It is not possible to build a graph database of scale using pure SQL – at least with the SQL tools that we currently have to choose from.

One reason for the scale problem is the only way to do it is to do what are called self-joins. This is when you join a table to itself. Conceptually seem just fine. But the problem is that it is impossible for the database engine to do anything other than brute force un-optimized traversals of the graph when confronted with a chain of self-joins. In other words, using this technique will not yield a useful database that is query-able at any kind of scale. Handling certain aspects of providing a graph database model requires some very specific and different kind of thinking and optimizations from those that go into designing an SQL database.

Another problem is that one giant table using self joins for traversal means a huge write bottleneck. Yes, you can avoid that with sharding depending on your design, but it is definitely not part of the SQL model, and so you can’t say SQL is helping you here.

The second and I believe more important argument against implementing a graph data model using SQL is that even if SQL could do a good job of representing a graph model, building your graph system in SQL is not a very good abstraction. The truth is that most of the kinds of things we want to do in app development look more like graph than relational structures. Graphs are elemental to computer science because most interesting algorithms and in fact real world data models can be very naturally thought of as a graphs. Graph theory is (if things are as they were when I was in school) the first thing you learn when you begin studying computer science, and there is very good reason for this. The fact that Facebook was able to anchor the idea of what they were building as a “social graph” is an incredible testament to the innately natural characteristics of the graph concept.

So if you are representing a graph, you really want an API that reflects the unique and useful characteristics of a graph. In other words, you want an abstraction that reflects how you really think about the data and not some jury-rigged representational model continuously intruding itself into your thought process. And so, having a data store that allows us to express our data in a way that is much more similar to how we actually think is enormously helpful.

And such is the case with attempting to implement a graph database using SQL. You can do it, but it is unlikely to work very well, and because you don’t have the benefit of the abstraction, it actually adds to the complexity of the design instead of simplifying it.

The bottom line is that graphs are a better representational model when the structure of your system will change frequently. Relational is a better model when the structure will be static. Today, I think most of us are not building applications that are ideally structurally static.

Because most applications today have a much more dynamic nature, graphs are, for most people, under most circumstances, a far better abstraction. And to me, there is little in this world more powerful and satisfying than a great abstraction.

19 comments:

  1. Yep. The thing that naysayers don't get is that working at a higher level of abstraction can make complex tasks much easier to do.

    Yes, you can code a GUI in assembler, or C, or C++. But it's much easier and more pleasant to do in Java, since you don't have to deal with machine instructions, or manual memory management, or manually painting textboxes and drop-shadows on the screen. Java does all those things for you under the hood, so that you can just deal with the simple abstraction of "put a text box here".

    And similarly, since Java is just a general-purpose programming language with an add-on library that handles the GUI abstractions (Swing), no doubt there's an even better language that could be conceived for writing GUI's that would be at a higher level of abstraction than Java code, and would therefore be easier to code in, more concise, and more powerful.

    As this pertains to graph databases, the best support I can see for this is in a couple of the comments to your "Death" article over on Hacker News (http://news.ycombinator.com/item?id=147124):


    3 points by staticshock 1 day ago | link

    The problem with RDBMS approaches is that the good ones assume you can pack your complex logic into a monster query or stored procedure and let the query optimizer do its thing. But if you're implementing an attribute-value system or graph traversal on top of an SQL database, you end up generating a ginormous number of queries just to do some basic traversal.


    1 point by staticshock 1 day ago | link

    Now, I may be pretty naive here, but if you're doing full on graph traversal, why not just extract the full graph from the database and traverse it in memory on your own terms instead of leaving it to large unoptimized traversal queries?


    2 points by wheels 1 day ago | link

    For the latest data set that I'm working on there are 5 million nodes and 50 million edges, and each one has some meta-data associated with it. :-)



    Translation:

    Can't you traverse a graph using just SQL? Yes, but it's a huge pain, since you need to use "a ginormous" number of queries to do it.

    Can't you just use a workaround that makes it easier?

    Not any workaround that scales, no.


    I have to say, Hank, I'm curious to see what implementation you're going to come up with that's going to address these issues.

    ReplyDelete
  2. Yes, where do you get this magical graph database today? Wikipedia articles say its an "older model" but you dont really see any software packages for them.

    ReplyDelete
  3. I apologize if I'm touting my own horn too much here. Hank: please let me know if you feel I'm hijacking your comments to plug my own technology and I'll shut up.

    But my company's product is a graph database. It's available for download as open source (AGPL) here. It's written in Java. It scales to billions of nodes, yet is a small Java library (jar file <500k). It's fully transactional (two-phase commits, transaction recovery, the whole shebang). It's natively graph -- filesystem layout, storage manager, etc, all custom-written for high performance graph traversals. We think it's pretty cool and will launch it commercially very soon.

    We have three core abstractions for representation: node, relationship (connects two nodes) and property (key-value pair that can be attached to both nodes and relationships). We have a few more for traversal. But that's it. The javadocs are online here.

    Would be very interesting to hear some comments. (We do know that the documentation is in a poor shape. Working on it tho!)

    Emil Eifrem
    http://neotechnology.com

    ReplyDelete
  4. @anonymous

    There is a whole industry of object databases and has been for well over a decade. There are also several open source products.

    http://www.odbms.org/

    ReplyDelete
  5. Graph theory as the basis for databases is really a pretty old idea that only looks new because it's been buried for a long time. If my grandmother found a Mac Plus at a garage sale she might find it pretty cool and amazing, but no one who has used a computer since 1984 would be impressed.

    I won't refer you to Chris Date again, except to say that he has covered this topic at length. You may not be aware of parallel work in this regard that was prompted by XML and attempts to make XML-based "databases." Fabian Pascal has some cross words here on his Database Debunkings site (which Chris Date and Hugh Darwen contribute to).

    ReplyDelete
  6. Greg,

    You seem intent on proving things by making references to irrelevant writings. XML is indeed a graph, because in fact almost every data structure is a graph. However all graphs are not represented as a tree (which is what XML is).

    More generally, graphs are not a fad any more than water is. The concept of framing a graph as such is silly. But then thats not what the article says. But I dont want to get into defending something that has nothing to do with what I am talking about. By the way, and object database is not a generalized graph database either, though it is indeed navigational. It certainly has no concept of "edges" vs "nodes" for example. In fact neither would an XML database, at least in a generalized sense. It appears that you dont really understand the arguements that you are making but seem satisfied to quote others without understanding what either they or I am saying - or perhaps just not reading carefully enough.

    ReplyDelete
  7. Ah, finally someone agrees with me. =) I thought I was the only relational database hater out there, because everyone seems to be content with MySQL and all those.

    Anyway, about your post, I agree with the fact that graphs represent most data these days much better than a relational database ever could. That being said, I disagree with the fact that we can use abstraction to work around it. You can use an abstraction framework all you want, and it will surely make your life simpler, but the database at the heart of the application will still be relational, and I think that's where the problem lies.

    Personally, I think no matter how you abstract it, the problem is still there. The problem, at least in my opinion, is in scalability. Abstraction can help with the coding, but it isn't going to make your database run faster. A relational database using things like self-joins simply can't scale as well as a navigational database that was meant for the task.

    Now, there are lots of navigational databases out there, but then the problem is that they don't follow that graph pattern. So no matter what, you're stuck having to create an abstraction layer over the database.

    Basically, my point is that, while abstraction is a great thing, having a database that removes the need for abstraction would be even better. I'd much rather see an actual graph database that does everything I need it to (and well, for that matter) than an abstraction framework that just turns the relational data into a graph.

    I hope that makes at least some sense.

    ReplyDelete
  8. great, great, great article - you laid down what i've been thinking for some time. I want to underline a question presented earlier - can you recommend a good graph data store - i know there's alot of object databases out there, and maybe one of those will do the trick - the closest thing i've found thus far is graphd http://blog.freebase.com/2008/04/09/a-brief-tour-of-graphd/ from freebase - i'd love to find something similar - preferably opensource and really simple

    ReplyDelete
  9. Good article. I too love abstractions. While I've programmed in all the languages you've mentioned: asm, c, c++ and java... you will find that software is always becoming more and more abstract.

    As they say, 4 out of 5 mathematicians perfer graphs over tables.

    ReplyDelete
  10. I think you've missed the point of abstraction in database design. In a business environment that is constantly changing, full of people looking for new ways to utilize the data resource, an abstract model provides a foundation that does not pre-suppose the eventual use of the data.

    This means the use of the data can change over time without affecting the database design or the current clients of the database. It also means new functionality (required to meet those changing business needs) can be easily be supported, again without changing the database design or affecting current clients.

    A statement such as:

    The bottom line is that graphs are a better representational model when the structure of your system will change frequently. Relational is a better model when the structure will be static

    is just silly. You cannot possibly prove a statement like that, and in fact, my own experience has proven a properly designed and executed abstract relational data model can easily support change.

    Randy

    ReplyDelete
  11. Randy,

    "my own experience has proven a properly designed and executed abstract relational data model can easily support change."

    No offense, but if you really believe this then your own experience has not served you well. Schema migration is the bane of all database design.

    ReplyDelete
  12. "... none of our existing abstractions are *needed*. But our human brains can only manage a certain amount of complexity at a time. Complexity is fine but only in bite size chunks."

    Wonderfully said :)

    About the graph database debate... Maybe somebody is still at the "relational database" step :)

    ReplyDelete
  13. This is a good post and I thought many of the comments were good as well. Personally, I'm not ready to give up on the relational model. I just wish current implementations weren't so bad a handling same-type hierarchies. And as you point, they really are... in addition to the other problems mentioned one also needs a programmer, rather than a typical database guy, to make deep use of these self-join structures. You also mentioned a write bottleneck, which is an argument I wish I had thought of in a past debate.

    My understanding is that many graph databases avoid schema altogether, and I don't look at this as an advantage. I'd think that if graph databases were to impose schema, it wouldn't be particularly easier to upgrade them.

    ReplyDelete
  14. Is there a difference I am missing between the NoSQL movement and the graph database concept?

    I am beginning to embrace the schema-less database concept, especially since it would allow new types of data to be stored and linked without changing the structure of the database or the code that accesses the database. However, I am having trouble figuring out how to build certain known structures using a schema-less tools. Such as a filesystem folder hierarchy or a Bill of Material hierarchy (there is a difference between these two hierarchies) Also efficient handling of a 2D table. Does anyone have a link to a good reference of how different structures are built (represented) in a schema-less database?

    I see three kinds of links. The first, I call vertical links. (parent/child) I guess these are called directed links in graph parlance. The second, I call horizontal links. I think of these as synonyms, or things to be included in a set, but not necessarily things that are parents or children. The third, I call temporal links. These allow links through time as a piece of data is modified. These links let you work forward to the "latest" version of a piece of data, or copy of a file.

    I also think there is a need for something I call an "intersection". This is useful when more than one piece of data needs to be joined to form a useful key. I work in a manufacturing environment, so I deal with part numbers and revision letters. I have a need to attach certain pieces of data, and files, to an intersection of a part number and a revision letter. later, I need to attach different data and files to the intersection of the same part number and a newer revision letter.

    ReplyDelete
  15. No offense, but for a professed lover of complexity, this suffers massive reductive oversimplification.

    ReplyDelete
  16. So, Hank.

    It's been a few years since this post, and I still don't see an implementation of a graph database available that makes me want to hop out of my chair.

    I still google on a weekly basis (at least) to keep taps on what's happening, but there have been no major shockers for me.

    It seems like there was a lot of fuss and hype for a couple of years, but nothing much happened. A few implementations of schemaless document databases, sure. But no real breakthrough for graph databases.

    What's your view of the state of affairs for graph databases?

    ReplyDelete
  17. mindplay.dk,

    from what I have heard neo4j is pretty good. We are also working on something pretty cool that is approaching completion. We are focused on designing something really useful for mainstream web applications.

    ReplyDelete
  18. I'm one month late but anyway I would like to add that there are currently some graph databases in the market that give pretty good results. As Hank says Neo4j, but also DEX, OrientDB, InfiniteGraph or Sones. You can find here these implementations http://www.graph-database.org/implementations/. If you are interested in more info on DEX, you can contact me and I could give you more details.

    Damaris Coll
    http://sparsity-technologies.com

    ReplyDelete