I wrote about the power of abstractions in January, but based on some of the recent database discussion I think its time to discuss the subject again.
I love abstractions.
The essence of the concept of an abstraction is a framework that simplifies how you think about and work in a given domain. Abstractions can be (and often are) argued against by suggesting that you don’t really need them. In computer programming, we didn’t need C because we had assembly language. We didn’t need C++ because we had C. We didn’t need Java because we had C++. To me these arguments (which people really made) were silly. I abandoned assembly language in the 90's.
The point is none of our existing abstractions are *needed*. But our human brains can only manage a certain amount of complexity at a time. Complexity is fine but only in bite size chunks. Abstractions are, in essence, a really generalized form of user interface. I think the reason I have written so much about user interface is that I think it is so important to figure out how to map complex things into representational models in such a way that more people can access them. Abstractions allow us to do just that with any process we are engaged in. Abstractions allow us to encapsulate complexity so that we don’t have to think about it and we can achieve greater and greater levels of complexity in an efficient way allowing us to keep more of the model of a given system in our heads.
And so the anti-abstraction argument rears its head in the RDBMS vs graph database debate. One of the arguments that the pro RDBMS folks make for why there is no need for the graph database model is that you can do everything that can be expressed in a graph database in a relational database. And there is some truth to this.
But there are two problems with this argument. The first is that this is only true in theory. It is not possible to build a graph database of scale using pure SQL – at least with the SQL tools that we currently have to choose from.
One reason for the scale problem is the only way to do it is to do what are called self-joins. This is when you join a table to itself. Conceptually seem just fine. But the problem is that it is impossible for the database engine to do anything other than brute force un-optimized traversals of the graph when confronted with a chain of self-joins. In other words, using this technique will not yield a useful database that is query-able at any kind of scale. Handling certain aspects of providing a graph database model requires some very specific and different kind of thinking and optimizations from those that go into designing an SQL database.
Another problem is that one giant table using self joins for traversal means a huge write bottleneck. Yes, you can avoid that with sharding depending on your design, but it is definitely not part of the SQL model, and so you can’t say SQL is helping you here.
The second and I believe more important argument against implementing a graph data model using SQL is that even if SQL could do a good job of representing a graph model, building your graph system in SQL is not a very good abstraction. The truth is that most of the kinds of things we want to do in app development look more like graph than relational structures. Graphs are elemental to computer science because most interesting algorithms and in fact real world data models can be very naturally thought of as a graphs. Graph theory is (if things are as they were when I was in school) the first thing you learn when you begin studying computer science, and there is very good reason for this. The fact that Facebook was able to anchor the idea of what they were building as a “social graph” is an incredible testament to the innately natural characteristics of the graph concept.
So if you are representing a graph, you really want an API that reflects the unique and useful characteristics of a graph. In other words, you want an abstraction that reflects how you really think about the data and not some jury-rigged representational model continuously intruding itself into your thought process. And so, having a data store that allows us to express our data in a way that is much more similar to how we actually think is enormously helpful.
And such is the case with attempting to implement a graph database using SQL. You can do it, but it is unlikely to work very well, and because you don’t have the benefit of the abstraction, it actually adds to the complexity of the design instead of simplifying it.
The bottom line is that graphs are a better representational model when the structure of your system will change frequently. Relational is a better model when the structure will be static. Today, I think most of us are not building applications that are ideally structurally static.
Because most applications today have a much more dynamic nature, graphs are, for most people, under most circumstances, a far better abstraction. And to me, there is little in this world more powerful and satisfying than a great abstraction.
Friday, March 28, 2008
Subscribe to:
Post Comments (Atom)

8 comments:
Yep. The thing that naysayers don't get is that working at a higher level of abstraction can make complex tasks much easier to do.
Yes, you can code a GUI in assembler, or C, or C++. But it's much easier and more pleasant to do in Java, since you don't have to deal with machine instructions, or manual memory management, or manually painting textboxes and drop-shadows on the screen. Java does all those things for you under the hood, so that you can just deal with the simple abstraction of "put a text box here".
And similarly, since Java is just a general-purpose programming language with an add-on library that handles the GUI abstractions (Swing), no doubt there's an even better language that could be conceived for writing GUI's that would be at a higher level of abstraction than Java code, and would therefore be easier to code in, more concise, and more powerful.
As this pertains to graph databases, the best support I can see for this is in a couple of the comments to your "Death" article over on Hacker News (http://news.ycombinator.com/item?id=147124):
3 points by staticshock 1 day ago | link
The problem with RDBMS approaches is that the good ones assume you can pack your complex logic into a monster query or stored procedure and let the query optimizer do its thing. But if you're implementing an attribute-value system or graph traversal on top of an SQL database, you end up generating a ginormous number of queries just to do some basic traversal.
1 point by staticshock 1 day ago | link
Now, I may be pretty naive here, but if you're doing full on graph traversal, why not just extract the full graph from the database and traverse it in memory on your own terms instead of leaving it to large unoptimized traversal queries?
2 points by wheels 1 day ago | link
For the latest data set that I'm working on there are 5 million nodes and 50 million edges, and each one has some meta-data associated with it. :-)
Translation:
Can't you traverse a graph using just SQL? Yes, but it's a huge pain, since you need to use "a ginormous" number of queries to do it.
Can't you just use a workaround that makes it easier?
Not any workaround that scales, no.
I have to say, Hank, I'm curious to see what implementation you're going to come up with that's going to address these issues.
Yes, where do you get this magical graph database today? Wikipedia articles say its an "older model" but you dont really see any software packages for them.
I apologize if I'm touting my own horn too much here. Hank: please let me know if you feel I'm hijacking your comments to plug my own technology and I'll shut up.
But my company's product is a graph database. It's available for download as open source (AGPL) here. It's written in Java. It scales to billions of nodes, yet is a small Java library (jar file <500k). It's fully transactional (two-phase commits, transaction recovery, the whole shebang). It's natively graph -- filesystem layout, storage manager, etc, all custom-written for high performance graph traversals. We think it's pretty cool and will launch it commercially very soon.
We have three core abstractions for representation: node, relationship (connects two nodes) and property (key-value pair that can be attached to both nodes and relationships). We have a few more for traversal. But that's it. The javadocs are online here.
Would be very interesting to hear some comments. (We do know that the documentation is in a poor shape. Working on it tho!)
Emil Eifrem
http://neotechnology.com
@anonymous
There is a whole industry of object databases and has been for well over a decade. There are also several open source products.
http://www.odbms.org/
Graph theory as the basis for databases is really a pretty old idea that only looks new because it's been buried for a long time. If my grandmother found a Mac Plus at a garage sale she might find it pretty cool and amazing, but no one who has used a computer since 1984 would be impressed.
I won't refer you to Chris Date again, except to say that he has covered this topic at length. You may not be aware of parallel work in this regard that was prompted by XML and attempts to make XML-based "databases." Fabian Pascal has some cross words here on his Database Debunkings site (which Chris Date and Hugh Darwen contribute to).
Greg,
You seem intent on proving things by making references to irrelevant writings. XML is indeed a graph, because in fact almost every data structure is a graph. However all graphs are not represented as a tree (which is what XML is).
More generally, graphs are not a fad any more than water is. The concept of framing a graph as such is silly. But then thats not what the article says. But I dont want to get into defending something that has nothing to do with what I am talking about. By the way, and object database is not a generalized graph database either, though it is indeed navigational. It certainly has no concept of "edges" vs "nodes" for example. In fact neither would an XML database, at least in a generalized sense. It appears that you dont really understand the arguements that you are making but seem satisfied to quote others without understanding what either they or I am saying - or perhaps just not reading carefully enough.
Ah, finally someone agrees with me. =) I thought I was the only relational database hater out there, because everyone seems to be content with MySQL and all those.
Anyway, about your post, I agree with the fact that graphs represent most data these days much better than a relational database ever could. That being said, I disagree with the fact that we can use abstraction to work around it. You can use an abstraction framework all you want, and it will surely make your life simpler, but the database at the heart of the application will still be relational, and I think that's where the problem lies.
Personally, I think no matter how you abstract it, the problem is still there. The problem, at least in my opinion, is in scalability. Abstraction can help with the coding, but it isn't going to make your database run faster. A relational database using things like self-joins simply can't scale as well as a navigational database that was meant for the task.
Now, there are lots of navigational databases out there, but then the problem is that they don't follow that graph pattern. So no matter what, you're stuck having to create an abstraction layer over the database.
Basically, my point is that, while abstraction is a great thing, having a database that removes the need for abstraction would be even better. I'd much rather see an actual graph database that does everything I need it to (and well, for that matter) than an abstraction framework that just turns the relational data into a graph.
I hope that makes at least some sense.
great, great, great article - you laid down what i've been thinking for some time. I want to underline a question presented earlier - can you recommend a good graph data store - i know there's alot of object databases out there, and maybe one of those will do the trick - the closest thing i've found thus far is graphd http://blog.freebase.com/2008/04/09/a-brief-tour-of-graphd/ from freebase - i'd love to find something similar - preferably opensource and really simple
Post a Comment