Tuesday, February 5, 2008

The Death of the Relational Database

The relational database is becoming increasingly less useful in a web 2.0 world. The reason for this is that, while the relational database model is great for storing information, it is horrible for storing knowledge. By knowledge I mean information that has value beyond the narrow current conception of the given application. I mean information that can have enduring value. In this context, one might say knowledge is information in non-disposable form.

The reason the relational database doesn’t represent knowledge very well is that the relational database is only good at storing objects and relationships between them when one fully understands exactly what objects and what relationships will be managed upfront. When you need to represent some new type of relationship between the objects in a relational database, it tends to fail, or be very difficult. In fact, the relational database isn’t even particularly good at adding new types of objects to the database. Most relational databases actually have an upper limit on the types of objects, typically referred to as tables, which can be handled. Too many tables in a database schema is considered bad design.

The way I usually describe the situation is to say that the relational database is brittle but strong. As long as you don’t want to radically change or expand the scope of what you are doing, relational databases are great. But knowledge is an ever-expanding universe of objects and relationships between them. The relational database doesn’t handle that use case very well.

Storing the relationships between objects *in* the objects is a problem.

The essence of why the relational model doesn’t handle the more dynamic model of knowledge as opposed to information, is that relational databases are built around the idea that the relationship between objects is *built into* the objects. For example, invoices are typically stored as one type of object in a database. Customers are a different type of object. An invoice knows *as part of its structure*, who the customer is. That pointer to the customer is stored *in* the invoice.

This is bad.

The reason it is bad is because it means that in order to create new relationships between different object types we need to modify those object types. For example, if the developer decides to allow payment records to be connected to invoices, either the structure of the payment record or of the invoice must change. So, with a relational model, you really want to make all the decisions about the valid types of relationships between objects right from the very beginning because you don't want to have to modify the structure later.
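To make the invoice example concrete, here is a minimal sketch using SQLite through Python's built-in sqlite3 module. The table and column names are assumptions made up for illustration, not taken from any real system. It shows the customer pointer living inside the invoice, and the structural change needed when payments are later connected to invoices.

```python
import sqlite3

# Hypothetical schema for illustration; all table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    -- The relationship to the customer is a column of the invoice itself.
    CREATE TABLE invoice (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(id),
        total       REAL
    );
    -- Payments start out with no connection to invoices.
    CREATE TABLE payment (id INTEGER PRIMARY KEY, amount REAL);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Mrs. Smith')")
conn.execute("INSERT INTO invoice VALUES (100, 1, 250.0)")

# Later the developer decides payments should be connected to invoices.
# One of the two structures has to carry the link, so a table must change.
conn.execute("ALTER TABLE payment ADD COLUMN invoice_id INTEGER REFERENCES invoice(id)")
conn.execute("INSERT INTO payment VALUES (1, 250.0, 100)")

# Column names of the payment table after the change.
payment_columns = [row[1] for row in conn.execute("PRAGMA table_info(payment)")]
print(payment_columns)
```

In this sketch the link could equally have been added to the invoice table instead; either way one of the two object types is modified.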

“Excuse me Mrs. Smith. We require you to decide on all of your child’s friends for life before you go into labor.”

Think about this.

Needing to know your database structure upfront is like needing to make a list of all of your unborn child’s potential friends. Forever. This list must even include future friends that have not been born yet, because once the child’s friends list is built, adding to it requires major surgery.

This rigidity prevents most developers from trying to build knowledge. They just capture information. Data is stored in separate unchangeable relational silos. Every time we think of a new way to represent or expand information we just make a new silo, because adding to or modifying an existing silo is way too difficult.

The societal implications of this fracturing and splintering of information are profound. And yet, the converse is incredibly empowering. How great would it be if when we thought of a new piece of information that we want to capture, we could simply add it to our existing database? Or perhaps if we can add things this easily it is more like a knowledge base than a database. Such flexibility would mean that we would have the benefit of leveraging the new information in the context of the existing information, building newly accessible insights along the way.

For example, imagine starting out with a contact list. Some months later, you add a restaurants list. Some months later again, you decide it would be great to be able to capture, for each contact, what their favorite restaurants are. Ideally one would want to just establish a “favorite” relationship between a restaurant and a contact without changing the restaurant structure or the contact structure. This is a simple example, but the bigger point is that relationships between pieces of information will always grow more complex tomorrow than they are today. Capturing and leveraging new types of information to increase knowledge should be a key design goal of modern databases.
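As a rough sketch of the contact and restaurant example, assuming plain in-memory Python structures (the identifiers, names, and the relate/related helpers are all hypothetical), the "favorite" relationship can be kept in its own collection so that neither the contact records nor the restaurant records change shape.

```python
# Hypothetical in-memory sketch; every name here is an illustrative assumption.
contacts = {"c1": {"name": "Alice"}, "c2": {"name": "Bob"}}
restaurants = {"r1": {"name": "Luigi's"}, "r2": {"name": "Sushi Go"}}

# Relationships live in their own collection as (source, kind, target) tuples.
edges = set()

def relate(source, kind, target):
    """Record a relationship without touching either object."""
    edges.add((source, kind, target))

def related(source, kind):
    """Return the sorted targets linked from `source` by relationship `kind`."""
    return sorted(t for s, k, t in edges if s == source and k == kind)

# Months later: capture favorites without altering contacts or restaurants.
relate("c1", "favorite", "r1")
relate("c1", "favorite", "r2")
relate("c2", "favorite", "r2")

alice_favorites = [restaurants[r]["name"] for r in related("c1", "favorite")]
print(alice_favorites)
```

A new kind of relationship, say "recommended_by", would only need new tuples in this sketch, not a new field on either record type.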

Those computer science guys are on to something with that graph stuff.

The concept of having relationships between objects be separate from the objects themselves is the core concept behind what is known in computer science as graph theory. Graphs are collections of pieces of information and the connection between those pieces of information. In a graph, the pieces of information are called “nodes,” and the connections between nodes are called “edges.” Computer scientists like graphs because they are a universal way of expressing literally almost any type of information.
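One possible way to picture nodes and edges in code is the small sketch below. The Graph class, its method names, and the sample data are assumptions for illustration only, not a reference to any particular graph library.

```python
from collections import defaultdict

class Graph:
    """A tiny directed, labeled graph: nodes hold data, edges are stored separately."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties of that node
        self.edges = defaultdict(list)  # source id -> [(label, target id), ...]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges[source].append((label, target))

    def neighbors(self, node_id, label=None):
        """Targets reachable from node_id, optionally filtered by edge label."""
        outgoing = self.edges.get(node_id, [])
        return [target for edge_label, target in outgoing
                if label is None or edge_label == label]

g = Graph()
g.add_node("invoice-100", total=250.0)
g.add_node("customer-1", name="Mrs. Smith")
g.add_node("payment-1", amount=250.0)
g.add_edge("invoice-100", "billed_to", "customer-1")
g.add_edge("payment-1", "pays", "invoice-100")

print(g.neighbors("invoice-100"))
```

Note that the invoice node carries only its own properties; who it is billed to is recorded as an edge.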

Too many computer scientists spoil the semantic web stew.

The graph is the underlying model of a new highly discussed but rarely used data storage concept called the semantic web. The semantic web is really, in its simplest form, the idea that information on the web should be stored in databases structured like graphs. This would allow information on the web to be much more intelligently accessible and expandable in a way that relational database systems are not.

Sounds good.

Unfortunately, the semantic web is proof that while a little geek is good, too much geek is, well, too much geek. The problem is that the people that created the semantic web were just way too smart. In fact, if you read even the watered-down Wikipedia description of the semantic web, it sounds like useless abstract gobbledygook. As a result, the semantic web is too great a leap from the tried and true relational database. In fact, it doesn't even feel like relational database users were a target audience for the semantic web architects. But whether they were aggressively targeting mainstream database developers or not, the gap between the two methodologies is far too great, not only because the semantic web is hard, but because relational tools are being greatly simplified, which just increases the gap.

Specifically, newer technologies like the ActiveRecord system in Ruby on Rails have done a great job of abstracting much of the mind-numbing complexity out of the relational model. Now, along comes the semantic web just in time to make us all feel really dumb again. The semantic web makes the relational database model feel positively Fisher-Price. The semantic web is, and will be, for most developers, a non-starter.

Hey man, I just wanna build a little web app!

But the biggest issue with the semantic web is that it is really conceived to solve problems that your average everyday web developer just doesn’t care about. It is an ivory tower solution. Ironically, the concept of a graph representation of information is totally relevant to someone building a web 2.0 application. But the tools, languages, and methodologies of the semantic web do not have the scrappy, agile, PHP web developer in mind. And so, for most such web developers, the semantic web is irrelevant.

And so, the relational database is old and ill-suited to the modern data management world. The graph model is much easier and more appropriate for typical web tasks. But it needs to be productized in a way that makes it easy for developers to fit it into their workflow.

Of course, once you start thinking of information as a graph, all sorts of interesting things become possible. There is much more to talk about, but for now this should be sufficient food for thought.

54 comments:

Alex said...

Very interesting discussion. How about a compromise? How about taking all the foreign keys out of object tables in a relational database and creating separate tables that just represent relationships - so you have tables that are object (nodes) and tables that are relationships (edges). Would that make you happy?

Of course more joins in a database usually means worse performance, but that's what caching is for.
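A minimal sketch of the layout Alex describes, assuming SQLite via Python's sqlite3 module and hypothetical table names: the object tables carry no foreign keys, a separate table holds the relationship, and reading the data back costs the extra joins he mentions.

```python
import sqlite3

# Hypothetical tables for illustration; names and data are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Node tables: only the objects' own attributes, no foreign keys.
    CREATE TABLE invoice (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE payment (id INTEGER PRIMARY KEY, amount REAL);
    -- Edge table: one row per relationship between two nodes.
    CREATE TABLE invoice_payment (
        invoice_id INTEGER NOT NULL,
        payment_id INTEGER NOT NULL,
        PRIMARY KEY (invoice_id, payment_id)
    );
    INSERT INTO invoice VALUES (100, 250.0), (101, 80.0);
    INSERT INTO payment VALUES (1, 100.0), (2, 150.0), (3, 80.0);
    INSERT INTO invoice_payment VALUES (100, 1), (100, 2), (101, 3);
""")

# Total paid per invoice: two joins, one through the edge table.
paid_per_invoice = conn.execute("""
    SELECT i.id, SUM(p.amount)
    FROM invoice AS i
    JOIN invoice_payment AS ip ON ip.invoice_id = i.id
    JOIN payment AS p ON p.id = ip.payment_id
    GROUP BY i.id
    ORDER BY i.id
""").fetchall()
print(paid_per_invoice)
```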

Marc said...

Good point alex, the exact schema I'm working on for a new project of mine. However, forget about the joins and just duplicate the data in your relationship tables.

adamo said...

"For example, invoices are typically stored as one type of object in a database. Customers are a different type of object. An invoice knows *as part of its structure*, who the customer is. That pointer to the customer is stored *in* the invoice."

No the invoice does not know. However when implementing the database most people choose to eliminate the extra (two-column) relationship table to avoid the extra joins.

Other than that, it is an excellent write-up of why RDBMSs cannot support other non-predefined data types. I recommend reading The Lowell Database Research Self Assessment right after your blog post.

Hank Williams said...

@Alex

Well removing foreign keys is a *big* part of what I am ultimately talking about here... this is a multi parter believe me!

A big part of this is figuring out how to scale horizontally. Just to let you in on a little secret, this issue is a big part of what I am working on right now.

@marc

Certainly you have to be careful but I definitely believe we are living in an era where we will begin to agree that normalization is not the be all and end all of database design, and that it is in fact, at times a *big* disadvantage.

@adamo

Of course you are right that one doesn't *have to* implement an invoice system with a foreign key pointing to a customer. But it is a very clean example, and the truth is that most people do it that way because it is easiest and involves fewer tables. I could have used an intermediate table in the example but it would have made things much harder to explain. As it was I was concerned this post will lose some of my audience :)

Anyway, thanks much for your comment, and the microsoft link. I am going to check it out now.

adamo said...

@hank:
"A big part of this is figuring out how to scale horizontally."

Then you might want to check out C-store.

michael said...

The semantic web is not the answer.

The RDB is not dead and won't die because it is a good way to organize information in a well-known situation. You are exactly right, however, to point out that an RDB is a specific and rigid knowledge structure. As such, what you can ask is determined by the knowledge structure that was implemented when the DB was designed. This is why good DB designers are worth more than they are paid. So RDBs won't die, but people need to understand they cannot be extended (because of their nature) to capture and store _usable_ knowledge to do the kind of new work needed by knowledge-directed ventures.

Much of the Web2.0 (oh how I hate that term) thrust is in the direction of helping people do something useful rather than simply transferring information. Not surprisingly, doing something useful often involves appropriating knowledge to solve a problem or combining it with personal knowledge to create something new. So we really want mechanisms to describe and transfer knowledge.

Unfortunately the semantic web, as now conceived, cannot do either. Graphs are a powerful representation of relationships, but the crucial conceptual leap to semantics is not enabled with RDF (or KIF etc.) because these are all just ways of writing down logic relationships between things and attributes. It's still just syntax. The semantics is the hard part, and that bit is just hand waving for the semantic web initiative.

This idea that knowledge can be reduced to logic has been around since the Greeks and it has been a basic theme in modern AI. There are well known and fundamental problems with this approach. Unsurprisingly, progress in this area has been very slow (look at the glacial pace and low impact of the Cyc project). Step back and think about how you use knowledge to gain more knowledge or solve problems. It really misses the mark to say it is just simple deduction (or induction) using rules on information.

Consider the problem of conveying knowledge to someone else in such a way that they can understand in what circumstances the information you told them can be used and the situations in which it cannot. How many rules need to be transmitted at the same time? We rely on context to do that - for example having the conversation in a business strategy meeting. That type of problem illustrates a fundamental need to solve the problem of the relationship between context, knowledge, and a person's situation when they consume or use that knowledge. The meaning and usefulness of transferring information and knowledge somehow arises out of an interaction. That is, the meaning (semantics) is constructed in a particular situation - not frozen into the words that are the 'surface' expression of that construction. To be more concrete, the words on different web pages may look related at the surface but be the result of profoundly different construction processes. The semantic web will happily tell you about this emergent 'knowledge' because they can be composed into a graph.

The semantic web, as now conceived, amounts to no more than a meta-tagging system focused on text elements within web pages - not much more than XML. The semantic web is a leap of faith that attaching RDF structures to web content will result in knowledge automagically emerging as the subgraphs are joined. That's still syntax, not semantics. Librarians have known about this problem for a thousand years or so. Web2.0 projects interested in knowledge-based models will need to rely on human knowledge and person-to-person transfers as the secret sauce even if enhanced schemas help to enable that.

Hank Williams said...

"This idea that knowledge can be reduced to logic has been around since the Greeks and it has been a basic theme in modern AI. There are well known and fundamental problems with this approach. Unsurprisingly, progress in this area has been very slow (look at the glacial pace and low impact of the Cyc project). Step back and think about how you use knowledge to gain more knowledge or solve problems. It really misses the mark to say it is just simple deduction (or induction) using rules on information."

Suffice it to say, I am not a big believer in AI, or in anything done with a computer that could really be categorized as "intelligent", or true human knowledge.

The semantic web fails in part because it tries to be more than it is reasonably possible to be. My view is that a dumbed down, simple version of RDF (or really graph representation) can move us forward. Accepting our limitations and working within them, rather than trying to be "too smart", is always the win. I see this as being similar to HTML's success over SGML, its predecessor. I believe in taking simple things and building on them. So when I use the word "knowledge", I am not using it in the broad human sense, but in the narrow way I defined it at the beginning of my piece.

In essence, what I am suggesting is that by making the modern database a graph model instead of a relational model, we are going to be able to create interoperability and classes of applications that have not been possible before. Am I suggesting HAL? No f'ing way! Just something that is a substantial increment better than the way we do it now. Some of what I am suggesting is a bit vague and will be covered in another piece. I am sure concrete examples will help.

michael said...

I think we are in agreement here about RDF and RDBs. My emphasis is to say a graph structure is not really a DB anymore -- we are moving to entirely another way to organize information and use it. The DB exists in this new system - but it is only a filing system, it no longer usefully contributes to computing solutions.

An object database can accomplish the same thing and is more flexible than even graph structures because it can accommodate fuzzy (probabilistic) relationships and participate in ensembles of modeled relationships between the individual objects. The work, however, is being done in the objects and in a computational process over those objects, not in the DB system per se.

Nonetheless, I like graph structures because they can be used to model things in ways that can be usefully manipulated by ordinary people. For new ventures, the attraction is that these models can be constructed incrementally using contributions from people and information sources. Again, using RDF or some other system (I'd say RDF is already dumb) is a reasonable way to describe the pieces that go into this construction.

adamo said...

@hank:

I think that what you are trying to do with graphs has already been done in the Network Data Model.

Hank Williams said...

@adamo

Yes, Network databases, Object databases, graph databases, etc. are all part of the broad family of what used to be derisively called navigational databases. Nothing is ever new. It's about how to apply things to make them more relevant.

@michael
I think we agree about RDF being dumb. I really am saying that I think the way the semantic web folks do it does not resonate with the typical web developer. When a web developer checks out the "semantic web" package, their eyes glaze over. It needs to feel like a small increment from what they are doing already. Part of the magic here is making things "not too hard" for people to get into right away.

michael said...

"When a web developer checks out the "semantic web" package, their eyes glaze over. It needs to feel like a small increment from what they are doing already. Part of the magic here is making things "not too hard" for people to get into right away."

Sure, that's ideal but the change here is not incremental. To make it "feel" incremental will probably require an abstraction framework that accepts DB like talk combined with some semantic assertions and produces/updates a model that is managed with a navigational DB. A hard project for the general case, but tractable if the situation domain (or the task domain) is well specified.

Hank Williams said...

@Michael,

Stay tuned :)

michael said...

Hank,

Re: Web dev activities and graph-structured interactions

A bit off topic perhaps .....

Note that continuations can be seen as nodes (with incomplete slots) in an evolving graph, so a website need not be designed to preserve state in the usual sense. Think of it as more of an ongoing interaction with an object (i.e. the website) where the pages seen by a user are a composition of process(es) available through the object interface. Of course the object itself (==your website) can be evolving.

Hank Williams said...

Michael,

Wow dude, way off topic...:) but yeah we are actually using lisp for some of our web stuff and a continuations based framework called weblocks.

It's interesting that you express this idea in the form of an evolving graph. I don't think it maps exactly to the graph model for persistent data, but I will need to ponder that.

My first take is that I think that there is great value to separating the model from the view. And what you are talking about is a graph of the view state, which generally won't map to the model state, at least exactly, since the view will contain ephemeral states that the model will not need.

Marc said...

@hank
"As it was I was concerned this post will lose some of my audience :)"

You gained one from me, and I'm sure others. I'm eagerly awaiting your db abstraction framework.

michael said...

To continue the off topic discussion....

1. Scheme _is_ great for this type of stuff. (And, btw, Smalltalk is terrific for building objects in a distributed environment.)
2. To really push this idea, I'd argue Model-View separation is critical because the View is tuned to the representation (experience) desired by the user, which depends on their situation. The model can keep two representations - one is the internal state of the object and the other is the evolving state of interaction with the user (this is the website experience as seen by the web site).

So the idea of traditional 'pages' (even dynamic pages) just goes away - the code handed to the user's browser is the representation (in html, proprietary code that runs on a plug-in, or whatever) of what the system thinks the user wants. This results from the previous interaction (where the system has been working to understand something about the user's intention) and doing computation on its internal representation and the representation of the user interaction. Of course these two internal representations may have relationships to one another that constrain the computation.

Douglas Thiel said...

I like the general thrust of your argument - RDBMS are basically too rigid to help us manage our knowledge. OTOH, if you separate associations from objects then somewhere along the line you will end up in a hopeless situation of trying to add a new type of object and then link it to the thousands of objects you already have. Clearly some kind of automated semantic analysis must take place to make this satisfactory.

Pawel Lubczonok said...

Hi Hank,

Great text!!! I agree with you totally regarding the RDBMS. In fact, development of RDBMS is one of those things that have created unnecessary complexity in IT. (Maybe, one needs to err in order to discover the right path.) It is amazing to see how IT directors cling to the RDBMS model and a plethora of standards (that change and get superseded rapidly, thus contradicting the word itself) as a security blanket, and it actually makes life very difficult for them.

Anyway, the idea of a graph is much better but not good enough, as information has a structure.

Knowledge has structure too that is different from information. The construction of RDBMS is a result of NOT finding this structure to information - all is just tables. Similarly, all the stuff in the Semantic web RDF, OWL etc. etc. is like RDBMS for semantics.

I am in total agreement with you regarding inadequacy of currently proposed structures for the semantic web. They are no good. Too complicated and academic. They must be simple and direct so that people can express themselves simply and then be able to read it!!!

We have been working for 10 years on replacing RDBMS and introducing semantic forms into IT. This has resulted in our soon to go live WEB offerings under ThoughtExpress.Com.
Here one will be able to organise one's life and run an enterprise (we already run 4 large insurance companies entirely on semantics and a new kind of information store in an on-site model).

For your information, by not having RDBMS and RDF etc, we are able to use domain experts that do not have any knowledge of IT to configure/express semantics in our system

Pawel Lubczonok

davy boy said...

This probably won't ping its way to you... but your DB thinking seems pretty much in line with the paper that i'm writing... here are some of the blogposts that they come from. http://davecormier.com/edblog/category/rhizomes/

the knowledge/information distinction as it applies to 'new knowledge' is key. would love to have a chat about that sometime.

Anonymous said...

Dude, you are so far off base. I work with relational databases every day, and they are far from inflexible.

If I want to add a column to a table, I just add it. It doesn't cause any problems at all.

If I don't want to change the table itself, or if I need a many-to-many relationship, I add another table...in your example, a customer_invoice table, with customerid and invoiceid. No changes to customer or invoice at all.

But for a lot of types of relationships, I don't need to add anything, given the flexibility of SQL queries.
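The two claims above, that a column can simply be added and that many relationships need no schema change at all, can be sketched as follows. This assumes SQLite via Python's sqlite3 module; the customer table, the city column, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical data for illustration; names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy');
    -- Adding a column to a populated table; existing rows get NULL.
    ALTER TABLE customer ADD COLUMN city TEXT;
    UPDATE customer SET city = 'Boston' WHERE id IN (1, 3);
    UPDATE customer SET city = 'Austin' WHERE id = 2;
""")

# A "same city" relationship derived purely by a query (a self-join).
# Nothing is stored for it and no table is altered.
same_city = conn.execute("""
    SELECT a.name, b.name
    FROM customer AS a
    JOIN customer AS b ON a.city = b.city AND a.id < b.id
    ORDER BY a.id, b.id
""").fetchall()
print(same_city)
```

The sketch only covers relationships that can be computed from values already stored; a relationship with no shared attribute would still need somewhere to live, such as the customer_invoice style table described above.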

I spent the first five years of my career working on a web startup, built around a relational database, gradually adding new features over the whole five years. Building the web code was the slow part. Adding new relationships in the database was trivial. That's sort of the whole point of the relational model.

And "mind-numbing complexity?" SQL was the first thing I learned, along with VBScript and HTML. We have people at work who will never be much at general-purpose programming, but get by in SQL just fine. It's just not that hard.

I don't see why so many programmers have such a mental block about it.

I agree that RDF isn't likely to set the world on fire.

Anonymous said...

"Data Structures and RDF" -Neal Deakin

http://www.xulplanet.com/ndeakin/article/133/

Greg Jorgensen said...

The "why relational databases suck" topic is pretty well beaten to death by people who don't get RDBMSs. Some of your arguments are faux-philosophical, like information vs. knowledge. Some, such as the comments on foreign keys and normalization, are just repeating received wisdom about RDBMSs.

Let me take on your first example:

"For example, invoices are typically stored as one type of object in a database. Customers are a different type of object. An invoice knows *as part of its structure*, who the customer is. That pointer to the customer is stored *in* the invoice.

"This is bad.

"The reason it is bad is because it means that in order to create new relationships between different object types we need to modify those object types. For example, if the developer decides to allow payment records to be connected to invoices, either the structure of the payment record or of the invoice must change."


That's right; to link payments to an invoice you would probably store the invoice key in the payment table. That would be part of the design of the database. A non-relational solution would be a separate table of invoice identifiers and payment identifiers.

But then you go off into the weeds:

"Needing to know your database structure upfront is like needing to make a list of all of your unborn child’s potential friends. Forever. This list must even include future friends that have not been born yet, because once the child’s friends list is built, adding to it requires major surgery."

Actually your own example is more like having to know your own species (payment on an invoice) and your parents (the specific invoice) in advance, which are pre-requisites of human reproduction, not bothersome restrictions. And adding more invoices to a customer, or more payments to an invoice, only means creating additional instances (rows), not any structure change akin to major surgery.

In this simple example, exactly what "knowledge" is being lost in the process of reducing customers, invoices, and payments to mere information in a table structure? The knowledge of the relationships among the entities is clearly and unambiguously represented in the data, not in hidden metadata or pointers.

As for graph theory, that sounds cool and all that, but think for a second how graphs are represented. Either the nodes contain pointers to other nodes, or there's a list of edges defined by the two endpoint nodes. In either representation the structure is embedded in the data. It should go without saying that the types of graphs you describe are easily represented in a relational database schema.
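The two representations mentioned here, nodes that point at other nodes and a plain list of edges, can be converted into each other, as the small sketch below shows. The function names and sample data are assumptions for illustration. The edge-list form is the one that maps directly onto a two-column table.

```python
# Hypothetical sample graph; the names are illustrative assumptions.
edge_list = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]

def to_adjacency(edges):
    """Edge list -> each node holding pointers to the nodes it links to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
        adjacency.setdefault(target, [])  # make sure every node appears
    return adjacency

def to_edge_list(adjacency):
    """Pointers stored in the nodes -> flat list of (source, target) rows."""
    return [(source, target)
            for source, targets in adjacency.items()
            for target in targets]

adjacency = to_adjacency(edge_list)
print(adjacency)
print(to_edge_list(adjacency) == edge_list)
```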

As for ORMs such as ActiveRecord, it's not really a "new technology" invented by the Rails people; it's a design pattern described by Martin Fowler and it predates Rails. ActiveRecord has its place for simple database operations (what the Rails camp calls CRUD), but for more complex -- relational -- work it starts to break down and becomes more complicated to work around. It's only a dumbed-down layer on top of the RDBMs (actually on top of SQL), not a replacement.

Let me recommend Chris Date's excellent book "Database in Depth: Relational Theory for Practitioners." That book is probably the best explanation of relational theory for working programmers, and may help you and some of your readers get over the common misconceptions about RDBMSs exhibited in the article and the comments.

Hank Williams said...

"Actually your own example is more like having to know your own species (payment on an invoice) and your parents (the specific invoice) in advance, which are pre-requisites of human reproduction, not bothersome restrictions. And adding more invoices to a customer, or more payments to an invoice, only means creating additional instances (rows), not any structure change akin to major surgery."

Heavy sigh.

The point here is that while the structure of an object may be fixed, like an invoice, the ability to relate an invoice to something that you conceive of later, that may not even exist when you created the concept of invoices, is something that requires modification of your schema. It is not a lightweight action. It is not something that can be done by the data user. It must be done by the person responsible for structuring the system. This is bad in a world where the relationships that are possible far outstrip our ability to predict those relationships upfront. The web world is a perfect example of this.

"In this simple example, exactly what "knowledge" is being lost in the process of reducing customers, invoices, and payments to mere information in a table structure? "

Nothing is being lost by storing information in tables. The problem is when someone wants to connect something unexpected to an invoice. Or to a person. Or to an event. Or to a check. In the RDBMS world, these are sophisticated concepts far removed from the person that really understands the relationships, i.e. the user or domain expert. As such, new data types can't be linked into existing data types without restructuring and schema modification.

And finally, as an alternative to reading any books you might suggest, I would strongly suggest the converse, which is that you check back here every now and then so that you will be aware of the work we are doing in this regard so that you may come to understand what is really possible once you free yourself from the rigid thinking that binds you.

As an FYI, there is an entire area of computer science focused on the issues that I raise in this piece. And while I do not at all like the design of the semantic web, RDF and RDF triple stores are in large part designed to resolve the problems I am discussing here. So when you suggest that these problems don't exist, or that we need to be educated by those of you with "superior understanding", perhaps you should start by sending Chris Date's book to Tim Berners-Lee.

adamo said...

@hank:
"[...] is something that requires modification of your schema. It is not a lightweight action. It is not something that can be done by the data user. It must be done by the person responsible for structuring the system."

Which to me means that you have either worked with DBAs that do not like to provide such facilities to the end user, or do not know how to provide that.

As to the fact of connecting the unexpected, before resorting to graphs (which eventually you are either going to store in an RDBMS or reimplement one) this has already been done a number of times on both relational and not relational systems. IBM EAS (previously N.O.R.A.) is an example.

And while we are at it, bringing Tim BL to the discussion this way does not strengthen your argument.

Hank Williams said...

Adamo,

"Other than that, it is an excellent write-up of why RDBMSs cannot support other non-predefined data types."

I guess you've changed your mind. Perhaps you should decide what you really think before writing.

"Which to me means that you have either worked with DBAs that do not like to provide such facilities to the end user, or do not know how to provide that."

The irony of that statement is almost hard to capture. But rather than being sarcastic I will just try to address the core issue. Needing a DBA to manage your data is like needing a personal chef to eat.

"As to the fact of connecting the unexpected, before resorting to graphs (which eventually you are either going to store in an RDBMS or reimplement one) this has already been done a number of times on both relational and not relational systems. IBM EAS (previously N.O.R.A.) is an example."

And the fact that what I am talking about has been done before means what exactly? Do you think the point of this is that ideas that *don't* exist are going to kill databases? That would be kind of difficult, wouldn't it?

"And while we are at it, bringing Tim BL to the discussion this way does not strengthen your argument."

Sure dude. Whatever. My point is that there are a lot of people who think this *is* a problem. He's not the only one, and as you know I don't much like his recent work. But he is the most famous, whether we like it or not. But to suggest that I need to read some book on databases to disabuse me of the belief that there is a problem with relational databases that new technologies can solve is stupid.

Synchro said...

I'm guessing that:
a) You never used a Newton
b) You never used Valentina

Newton allowed you to link pretty much anything to anything, and every "record" could have different fields. Of course it's entirely feasible to model this structure using a simple name-value pair bunch of linked tables, but it certainly won't scale.
Valentina (paradigmasoft.com) can deliver sometimes several orders of magnitude more performance than the likes of MySQL and SQLServer, does many:many joins without intermediate tables, and has some interesting OO concepts such as inheritance all while maintaining a standard SQL interface. Good stuff.
RDBMSs are a compromise, but the payoff comes in the form of massive performance. It's easy to produce a system that allows vague, arbitrary, inconsistent relations between random items, but getting the performance up to match is a much, much bigger problem.
"Needing a DBA to manage your data is like needing a personal chef to eat."
Hm. I wouldn't say that. I'd turn it around - managing a database without a DBA is like letting children drive trucks. They can probably do it, but I would fear for the safety of anyone nearby. If computers were at the point where any search on a vaguely defined terabyte+ dataset could be accomplished instantly without any thought or insight, your plan might hold water. Before that happens, we need much bigger innovations than graphs, and at least a dozen or so iterations of Moore's law.

Hank Williams said...

"I'm guessing that:
a) You never used a Newton"
Incorrect. I was one of the first developers invited by Apple to develop for the Newton. I was working with it almost a year before launch and knew many of the developers and most of the executives on the project.

"b) You never used Valentina"
Correct. Ya can't use everything.

Why in God's name does any of this matter? I have no idea what your point is, or what my having used product a or b or c has to do with the theoretical issues at play here.

"'Needing a DBA to manage your data is like needing a personal chef to eat.'
Hmm... I wouldn't say that. I'd turn it around - managing a database without a DBA is like letting children drive trucks."

Well, I have been writing software for 30 years, and that argument always comes from people that seek to protect what it is that they do. It's the same argument that the command line folks made about the Mac. It's the same argument that graphic designers made about desktop publishing. Nothing should be too easy. It is "dangerous". All that data is going to really hurt people, eh...

Bull.

adamo said...

I think we have to attribute the fact that you think I am contradicting myself to English not being my native language. Indeed, when you have a data universe where you need to define data types all the time, it is quite difficult to work in an RDBMS environment. This is not the same as finding unexpected relationships (which you can find even when your data types do not change).

It is not that I have changed my mind. It is that I like graphs. I really really like graphs. And in your discussion of your graph data model I have not seen the math of it. Which leaves me wondering as to what exactly is the problem that you are trying to solve:

1. Are you trying to solve data presentation / visualization? (Item 3.11 of the Lowell report).
2. Or are you trying to solve data storage?

If the latter, are you aiming for a mostly read only system? What about updates and concurrency? Where is the theory behind that and how is it linked to graphs?

I brought up the subject of the DBA. Maybe I should have asked first: What kind of data size are we talking about? The kind that needs the DBA or a personal user database? Or are we talking about the possibility of eliminating the need for a DBA?

The fact that you are trying things that have been tried before makes me want to know what is new or different in your approach.

The fact that I am responding although we are in disagreement shows that I find value in your writing.

"to suggest that I need to read some book on databases to disabuse me of the belief that there is a problem with relational databases that new technologies can solve is stupid."

Basically, it is simply wrong to use a tool (any tool) for a purpose it cannot fulfill and then blame the tool.

Anonymous said...

DBAs are good when you have a large corporate database system that needs to be securely backed up, replicated, tuned, etc etc. On the other hand, you don't need a DBA to embed SQLite in a desktop app.

SQLite is like eating dinner. The giant corporate database is like running a large restaurant. For the latter, it kinda helps to have a chef.

Hank Williams said...

Adamo,

Fair enough on all points, particularly the English translation issue.

You also have asked some very good questions here. Let me try to go through and answer some of them.

"1. Are you trying to solve data presentation / visualization? (Item 3.11 of the Lowell report)."

I don't have the Lowell report in front of me, but we have solved some important visualization problems.

"2. Or are you trying to solve data storage?"

We are also solving *some* storage problems. They relate to what we think is the most common use case. But I would be remiss to suggest that every conceivable data storage problem is resolvable with our technique.

"If the later, are you aiming for a mostly read only system? What about updates and concurrency? Where is the theory behind that and how is it linked to graphs?"

Well, some of these are questions that I am definitely not ready to answer yet. We haven't even really publicly announced the name of the company :)

But I will say that read only systems, or primarily read systems are really easy to scale. One of the things that we had to figure out how to do was to scale writes *really* well. A big part of the answer is indeed in concurrency and how you shard your data. And deciding early on that normalization is not your friend. Being willing to break some of the old rules is key to new understandings. This allows for concurrency which is typically not available with relational databases on writes because you end up with the write bottleneck. I will have to decide how much more than that I want to talk about.

Regarding the DBA issue, we are dealing with both personal databases and databases of a type that map pretty closely to web application type data. I don't think we will be appropriate for running a transactional accounting system. But this is part of the problem. Web apps are using tools that are not optimized for them. For these kinds of apps a web developer will not need a DBA.

"Basically, it is simply wrong to use a tool (any tool) for a purpose it cannot fulfill and then blame the tool."

Yes!!!

Hank Williams said...

"Sqlite is like eating dinner. The giant corporate database is like running a large restaurant. For the latter, it kinda helps to have a chef."

Yes!!!

Greg Jorgensen said...

"The point here is that while the structure of an object may be fixed, like an invoice, the ability to relate an invoice to something that you conceive of later, that may not even exist when you created the concept of invoices, is something that requires modification of your schema."

That probably depends more on how well-designed the schema is than any inherent problem with RDBMSs in general. To extend your example, suppose I need to add credits to my customers/orders/invoices/payments schema. I create a new table to describe the credit, include a column for the invoice number, and I'm done (at least as far as the database goes). No change was needed to anything else in the schema to support this new type.
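As a minimal sketch of the kind of change described above, assuming a hypothetical SQLite schema (every table and column name here is invented for illustration), a credits table can be added alongside an existing invoices table without altering the invoices definition:

```python
import sqlite3

# Hypothetical schema for illustration only; these names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO invoices VALUES (100, 1, 5000);
""")


def invoices_ddl(connection):
    """Return the stored CREATE TABLE text for the invoices table."""
    return connection.execute(
        "SELECT sql FROM sqlite_master WHERE name = 'invoices'"
    ).fetchone()[0]


ddl_before = invoices_ddl(conn)

# Later on: credits arrive as a new table that points at invoices.
conn.executescript("""
    CREATE TABLE credits (
        id INTEGER PRIMARY KEY,
        invoice_id INTEGER NOT NULL REFERENCES invoices(id),
        amount_cents INTEGER NOT NULL
    );
    INSERT INTO credits VALUES (1, 100, 1500);
""")

ddl_after = invoices_ddl(conn)

# Outstanding balance on invoice 100 once credits are taken into account.
balance = conn.execute("""
    SELECT i.amount_cents - COALESCE(SUM(c.amount_cents), 0)
    FROM invoices AS i
    LEFT JOIN credits AS c ON c.invoice_id = i.id
    WHERE i.id = 100
    GROUP BY i.id
""").fetchone()[0]
```

The invoices definition is untouched, but the `CREATE TABLE credits` statement is still a schema change that someone with DDL rights has to make, which is the step the post objects to.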

"It is not a lightweight action. It is not something that can be done by the data user. It must be done by the person responsible for structuring the system."

For lots of data it's more important to maintain data integrity than it is to make the database friendly to users. I routinely work with people who can't master Excel; it wouldn't make a lot of sense to turn them loose on a more powerful tool with customer data at risk. As another poster wrote, that's like letting children drive a truck.

You've been programming for thirty years so you probably know that SQL was originally intended for end-user use. That isn't the same as a naive user, though. Lots of people with no formal training maintain and extend relational databases; it's just like any other skill. I would wager that it's easier to teach someone how to properly maintain a relational database than to teach them how to maintain data integrity purely in their Ruby code.

"This is bad in a world where the relationships that are possible far outstrip our ability to predict those relationships upfront. The web world is a perfect example of this."

And that's why the web is not a big RDBMS. Your argument is pure straw man. Yes, lots of web sites have RDBMS back-ends, but the web itself is not an RDBMS. Nothing about RDBMSs forces you to predict every eventuality in advance. Nothing prevents you from putting a user-friendly face on an RDBMS; there's no more need to expose end users to relational theory than there is to expose them to TCP/IP.

"The problem is when someone wants to connect something unexpected to an invoice. Or to a person. Or to an event. Or to a check. In the RDBMS world, these are sophisticated concepts far removed from the person that really understands the relationships, i.e. the user or domain expert. As such, new data types can't be linked into existing data types without restructuring and schema modification."

You've persuaded me that RDBMSs are sophisticated tools that take time and intelligence to master. You haven't shown any alternative that would be both friendlier to non-experts and more powerful.

The web itself gets part of the way there, but it's a different and complementary way to organize information, not an either/or better/worse dichotomy. Would you feel better if your bank kept your accounts in a bunch of linked web pages, or in an Oracle database managed by trained DBAs?

"... once you free yourself from the rigid thinking that binds you."

I suggest a book that explains RDBMSs clearly, and I get an ad hominem attack on my "thinking" in return. Thanks!

"... perhaps you should start by sending Chris Date's book to Tim Berners Lee."

I'll wager that Sir Tim is familiar with Chris Date's work.

Hank Williams said...

Greg,

I suspect you will never comprehend the issues we are raising and the technology we are developing. That's OK. There are still people that use Yahoo search too.

But I do want to address one issue:

... once you free yourself from the rigid thinking that binds you.

"I suggest a book that explains RDBMSs clearly, and I get an ad hominem attack on my "thinking" in return. Thanks!"

Yes. Suggesting I read some intro book on databases is offensive. It is the act of an obnoxious twit in an argument to say "hey, you don't know what you are talking about, read this book".

I don't know who you are, but I *strongly* suggest that neither your experience nor your background provides a basis for you to talk down to me, and so I suggest that if you want to continue in this vein you be well armored for further "ad hominem" attacks that match your passive-aggressive comments.

"'... perhaps you should start by send Chris Date's book to Tim Berners Lee.

I'll wager that Sir Tim is familiar with Chris Date's work."

Hmmm... Now I see why you don't understand anything here. The concept of simple sarcasm is getting right by you.

Synchro said...

You really didn't need to be so rude to Greg. He sounded pretty sane to me. How about you explain this mysterious concept better instead? So far we've only seen lots of hand-waving.

You said: "the relational database is only good at storing objects and relationships between them when one fully understands exactly what objects and what relationships will be managed upfront"

That may be true, but to say that this makes RDBMSs useless for anything at all seems a bit baby-and-bathwater. There are an awful lot of problem domains that fit into it very well. I don't dispute that there are many that don't, but that doesn't render it useless. Placing restrictions on the structure is a positive thing for implementation.

The semantic web, RDF et al are only about representation - which has little to do with implementation.

In describing your contact list, every concept and operation you describe is handled trivially by a relational structure. The only effective reason you give for not doing it that way is that it "is considered bad design". Why? Isn't forcing such flexibility what you're after? It may be that doing that doesn't fit some implementations very well, but that's not to say that it might not be an easier route than starting from scratch. As the very first comment said - an edge/node model can be represented very easily in a relational model, so it might make a perfectly reasonable implementation. If you have some radical plan for implementing edge/node representations directly without such overhead, I'm all ears.
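To make the edge/node point concrete, here is a minimal sketch, assuming a hypothetical two-table SQLite layout (the `nodes` and `edges` tables, the sample rows, and the `reachable` helper are all invented for illustration). A new kind of relationship is just another row in `edges`, and traversal uses a recursive query:

```python
import sqlite3

# Hypothetical node/edge layout; names and data are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nodes (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        label TEXT NOT NULL
    );
    CREATE TABLE edges (
        src INTEGER NOT NULL REFERENCES nodes(id),
        dst INTEGER NOT NULL REFERENCES nodes(id),
        rel TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", [
    (1, "person", "Alice"),
    (2, "person", "Bob"),
    (3, "invoice", "INV-100"),
    (4, "event", "Conference"),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (1, 2, "knows"),
    (2, 3, "billed_on"),
    (3, 4, "relates_to"),  # a relationship type nobody planned for: still just a row
])


def reachable(connection, start_id):
    """Return labels of every node reachable from start_id by following edges."""
    rows = connection.execute("""
        WITH RECURSIVE walk(id) AS (
            SELECT ?
            UNION
            SELECT e.dst FROM edges AS e JOIN walk AS w ON e.src = w.id
        )
        SELECT n.label
        FROM walk JOIN nodes AS n ON n.id = walk.id
        WHERE n.id != ?
        ORDER BY n.id
    """, (start_id, start_id)).fetchall()
    return [label for (label,) in rows]


labels = reachable(conn, 1)
```

The sketch says nothing about performance at scale, which is the other half of the disagreement in this thread; `UNION` (rather than `UNION ALL`) is used so that cycles in the graph do not loop forever.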

I mentioned Valentina because it approaches the same domain as "traditional" RDBMSs with some radically different implementation ideas, for example solving your typing limits by allowing very large numbers of tables with no particular penalties (no more than adding rows to a table), and to group them via inheritance, so a new relationship could simply be a specialisation/subclass of an existing one. Since that's pretty close to what you were apparently talking about I thought it might be of interest, but you chose to be dismissive instead. If you want us to pay attention to what you have to say, how about you listen to us too? Some of us have been in this at least as long as you.

chuck said...


"For lots of data it's more important to maintain data integrity than it is to make the database friendly to users. I routinely work with people who can't master Excel; it wouldn't make a lot of sense to turn them loose on a more powerful tool with customer data at risk. As another poster wrote, that's like letting children drive a truck."


You know, it makes sense to have a professional truck driver drive an 18-wheeler. And I certainly wouldn't allow children to drive much more than a bicycle or maybe a go-kart. But it seems to me that the problem Hank is talking about solving is that right now, we don't have a good solution for letting 20-somethings with normal driver's licenses drive anything.

Hank Williams said...

Synchro,

I have no problem answering questions or even debating technical merits. But I have a real problem with people attempting to lecture me about stuff they don't understand. And the old "I've got a (subject matter 101) book you should read" in these types of discussions is incredibly offensive.

I also would appreciate questions rather than jumping to conclusions about stuff one doesn't understand. But I am *happy* to start again and have a peer to peer respectful discussion, even if we disagree.

First, I was not being dismissive of Valentina. I honestly don't know anything about it and I will not pretend to know stuff I don't. While I have been around a long time, I am smart enough to know that the list of things I don't know will always be far larger than the list of things I do know.

Regarding hand waving, there is some truth to that. This is not a product announcement or even a white paper. I have decided to talk about some of the ideas that are related to some of the work we are doing. I haven't even announced the product or company name. And part of the reason for that is that this is as much about concepts that may be implemented by others as it is about the specific implementation concepts that we have.

Now, regarding graph vs relational, it is indeed true that you can implement graph systems in a relational model. I don't hate relational technology. It can be very useful. But programming is all about abstractions. And providing an abstraction that actually makes it easier for programmers and users to conceptualize and visualize what their data looks like and how to access it is really the first issue. The second issue is doing that in a high performance manner. In my response to Adamo above I touched on some of these issues, but I am not going to do a white paper about our technology in the comments here.

Regarding my examples, it is true, almost by definition, that anything that is simple enough to explain in a short blog post is simple enough to do in a relational database. The idea, which most people seem to get, is about how these ideas scale. It is hard to add new object types and create new relationships in existing structures. And the more relationships you try to add, the harder it gets. If you stop thinking about typical database scenarios, which are inherently limiting, and start thinking more about the infinite possibilities associated with the real world (and about something like an RDF store, though we are not building a semantic web system), you may begin to at least be able to imagine where a relational model is not sufficiently abstracted to allow for free assignment of relationships.

I guess in conclusion I will just say that my goal with this piece was not to introduce a product, but to help identify a problem. I was curious if the way we view the world would resonate with people. Despite a few skeptical responses such as yours, the answer is a resounding yes. Having said that, I will probably take additional steps as we get closer to launch, to talk about the technology and the specific types of applications where we think that we are far better suited than relational databases. I hope you will stick around as I become more able to tell more of the pieces of the story.

Anonymous said...

Chris Date's book is hardly a beginner's text. I've been working with databases for 10 years, and I've got his book, but I haven't worked up the ambition to read it yet.

Date actually believes that SQL databases are very poor implementations of the pure relational model. The book is about the pure model, and how SQL falls short.

Anybody interested in reinventing databases would do well to read it, I would think. He's also got a much bigger and more rigorous book on the subject. Date was one of the originals - he worked with Codd and helped invent the relational model in the first place.

I have read some of his articles and interviews on the web, and he has a lot of very interesting ideas, quite a bit different than standard SQL databases. He's also working on an implementation, though it's unfortunately proprietary.

darose said...

I was just about to chime in with what Anonymous just said: Chris Date's books are not Databases 101. Although he worked with Codd (IIRC) when they were being developed, he largely writes about how SQL does not properly implement Codd's ideas for what the relational model actually is (we can all thank Larry Ellison for that), and that therefore all modern SQL databases, well, "suck". He proposes - on paper - an ideal database language (called Tutorial D) which would properly implement the relational concepts. (He provides no implementation; others have since made attempts.)

I don't personally agree with a lot of what Date says. (I largely find his opinions to be that of an ivory-tower academic who's just dogmatically pooh-pooh'ing the inevitable and necessary more practical and user-friendly implementation of the technology. "They're not *really* relational".) And, in fact, as you know, I'm currently hacking on a database project that explicitly *rejects* several of his key assertions.

Nevertheless, I think you very much misunderstood Greg's suggestion and came down way too hard on him as a result. Even if you eventually reject Date's ideas, it's probably in your best interests to take the time to understand them well first.

There's a lot of info that can be found on Date and Tutorial D through Wikipedia and Googling.

Anonymous said...

Anyone seriously proposing opinions without understanding/recognizing/citing Chris Date is probably ill-informed to the point of not having a useful contribution. If you stack-ranked all the books and papers on database theory, and especially relational theory, you wouldn't get deeper than 3-5 books without hitting Date.

It's interesting that the author asserts that not knowing a software package is not a big deal, but that recommending a book that would help him avoid continued public embarrassment is somehow offensive.

And if you didn't get the relevance of the Newton comment, you're either lying about having any developer-level contact, or don't really understand what you're talking about anyway.

Anonymous said...

Hank, you are trying to create a solution for a problem that doesn't exist. You will end up on the long list of shovelware producers who arose in the putrid "Web 2.0" wasteland.

ekzept said...

strip away his flamboyant claims and language, and IMO this is precisely the problem icon Ted Nelson is trying to solve with his ZigZag framework. you want the benefits of a relational database, but you need to be able to have unanticipated exceptions, without everything unravelling and without anything breaking.

agreed.

Eric Gonzalez said...

Interesting stuff Hank. Normalization beyond 3rd norm was always iffy in the real world I found, but seems that bar is being lowered, so to speak.
Your comment on removing foreign keys is well taken, but the problem is wider than that - the issue is that the crux of the RDB system is that the schema is fixed. So it's not just the FK that needs rethinking, but in fact the PK in some data sets needs changing. For example, will social nets ostensibly using PKs for labeling individual users need to change considering most millennials are discarding email as their primary communication medium?

This is just a stream of consciousness comment, mind you. Good food for thought, thanks.

Emil Eifrem said...

I hope this is not seen as a product pitch, but there's a commercially-backed open source project that implements a transactional graph database system in Java: http://neo4j.org. (Warning: Currently sparse documentation. Will be substantially improved over the next month or so.)

It is a from-scratch implementation (custom on-disk representation for high-speed traversals, graph-specific storage manager, etc) so there's no relational legacy (for good and for bad). It scales to billions of nodes on single-machine hardware.

I outlined some more thoughts in this comment.

Feel free to check it out! Would love to have a constructive discussion about its strengths / weaknesses as one particular incarnation of the graph database model.

Emil Eifrem
http://neotechnology.com

A Man said...

The thing is, a relational database just gives you a technology (and paradigm) to work with. You can build your graphs on top of it.

Consider linker.to, one of my pet projects. It's supposed to be an embodiment of Tim Berners-Lee's vision of the semantic web. It will actually let you ask questions like, "which actors STARRING IN films GROSSING OVER $200 million ARE FROM Australia?" (By the way, I think it's quite a few.)

It is able to do that because you just define the relational database schema in a really good way. It took me several hours of on-and-off thinking, over the course of a few weeks, to get it right. It's in a very nice state right now.

One thing I realized through all of this is that WE CAN DO BETTER THAN TAGS. Currently, all social bookmarking sites use tags like it's the new black. Here's a big problem, though: tags do not carry semantic meaning. When someone is tagged Spain, does that mean they are FROM Spain, they live IN Spain, or they write ABOUT Spain? Hemingway could be tagged Spain.

I decided to come up with a different design. I wanted to make sure it would be extensible to all sorts of things. It's not the be-all and end-all, but it's kind of unique. I wound up with something similar to object-oriented programming, though more "relational":

Tags represent set membership (Billy Joel is a Singer and a Songwriter and a Pianist, as well as a few other things. Singers, Songwriters and Pianists are all Musicians, who are also People. Songwriters might also be considered writers, to some extent.)

Actually I have three different types of tags: keywords (for searching & synonym purposes), categories (set membership) and flags (letting users manage the actions).

Now, for each category, items would get a set of Attributes that people could add. For example, a Person could have a mother, father, sisters, brothers, etc. They could have a height, a date of birth, and so forth. On the other hand, a certain camcorder would have a bunch of technical specs as attributes.

I had started out with several types of attributes, including amounts (ratings, prices), wiki (descriptions of things) and relationships (to other items, e.g. mother, father).

There's more, but this db schema lets you model pretty much all the items and ask meaningful questions. By the way, the "GREW UP IN" and "STARRING IN" I mentioned above are relationship attributes on items.
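A rough sketch of how a schema along these lines might answer the example question, assuming a hypothetical SQLite layout with items, category membership, relationship attributes, and amount attributes. The table names, the `gross_usd` attribute, the `IS FROM` predicate, and the placeholder rows are all assumptions, not the actual linker.to design:

```python
import sqlite3

# Hypothetical layout loosely following the description above; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE categories (item_id INTEGER NOT NULL, category TEXT NOT NULL);
    CREATE TABLE relationships (
        subject_id INTEGER NOT NULL,
        predicate TEXT NOT NULL,
        object_id INTEGER NOT NULL
    );
    CREATE TABLE amounts (
        item_id INTEGER NOT NULL,
        attribute TEXT NOT NULL,
        value REAL NOT NULL
    );
""")
# Placeholder data, not real people or films.
conn.executemany("INSERT INTO items VALUES (?, ?)", [
    (1, "Actor A"), (2, "Actor B"), (3, "Film X"), (4, "Film Y"),
    (5, "Australia"), (6, "Spain"), (7, "Actor C"),
])
conn.executemany("INSERT INTO categories VALUES (?, ?)", [
    (1, "Actor"), (2, "Actor"), (7, "Actor"),
    (3, "Film"), (4, "Film"), (5, "Country"), (6, "Country"),
])
conn.executemany("INSERT INTO relationships VALUES (?, ?, ?)", [
    (1, "STARRING IN", 3), (1, "STARRING IN", 4), (1, "IS FROM", 5),
    (2, "STARRING IN", 3), (2, "IS FROM", 6),
    (7, "STARRING IN", 4), (7, "IS FROM", 5),
])
conn.executemany("INSERT INTO amounts VALUES (?, ?, ?)", [
    (3, "gross_usd", 250_000_000), (4, "gross_usd", 50_000_000),
])


def actors_from(connection, country, min_gross):
    """Actors from `country` who star in at least one film grossing over `min_gross`."""
    rows = connection.execute("""
        SELECT DISTINCT actor.name
        FROM items AS actor
        JOIN categories AS cat
          ON cat.item_id = actor.id AND cat.category = 'Actor'
        JOIN relationships AS stars
          ON stars.subject_id = actor.id AND stars.predicate = 'STARRING IN'
        JOIN amounts AS gross
          ON gross.item_id = stars.object_id
         AND gross.attribute = 'gross_usd' AND gross.value > ?
        JOIN relationships AS origin
          ON origin.subject_id = actor.id AND origin.predicate = 'IS FROM'
        JOIN items AS place
          ON place.id = origin.object_id AND place.name = ?
        ORDER BY actor.name
    """, (min_gross, country)).fetchall()
    return [name for (name,) in rows]


result = actors_from(conn, "Australia", 200_000_000)
```

Because the predicate is stored as data, a tag like "Spain" is replaced by an explicit `IS FROM` row, which is what lets the query distinguish "from Spain" from "writes about Spain".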

If someone wants to join me in building out this pet project, by the way, contact me: gregory$gregory&(net.

mattrepl said...

You might be interested in column-oriented databases (like the aforementioned C-store) that employ bitmap indices.

There's a Wikipedia entry on bitmap indexing and I recently wrote a bit about the need for this type of persistent storage when working with knowledge here.
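For readers unfamiliar with the idea, a bitmap index can be sketched in a few lines. This is a toy illustration using Python integers as bitsets, with made-up rows and helper names; it is not how C-store or any particular column store implements it:

```python
from collections import defaultdict


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int whose bit i is set when row i has that value."""
    index = defaultdict(int)
    for row_number, row in enumerate(rows):
        index[row[column]] |= 1 << row_number
    return index


def matching_rows(bitmap):
    """Return the row numbers whose bits are set in `bitmap`."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Made-up rows for illustration.
rows = [
    {"kind": "person", "country": "ES"},
    {"kind": "person", "country": "AU"},
    {"kind": "film", "country": "AU"},
    {"kind": "person", "country": "AU"},
]
kind_index = build_bitmap_index(rows, "kind")
country_index = build_bitmap_index(rows, "country")

# Combining two predicates is a single bitwise AND over the bitmaps.
hits = matching_rows(kind_index["person"] & country_index["AU"])
```

The appeal for loosely structured data is that each attribute is indexed independently, so predicates over any combination of attributes can be answered by combining bitmaps rather than by planning joins in advance.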

John said...

From the point of view of an agile PHP developer... =)

I've been developing PHP applications for almost a decade now, and recently got started with Pylons (Python MVC), Ruby on Rails, and CakePHP (PHP MVC).

I don't think it suffices to say that if we "dumb down" the semantic-ness then we will see better adoption rates. Instead I think there are a few things I would want to see as benefits as a web developer:


It's open source - I don't want to buy a license to get it. I'm agile, I use existing open source frameworks to get the job done. In a startup environment I don't have budget for anything with a large footprint. I need to bootstrap and save all the money I can for food, marketing, growth.

Examples - The system could have all the benefits I'd want, but I don't want (or may not be able to) go through the source code to figure out how specific methods work.

Show me the benefit - semantics, RDF, microformats, blah blah. They are thrown around a lot, but how can they benefit my startup? Well that's exactly what you need to show them in a simple way. Mashups are no longer uncommon, and open APIs are becoming increasingly available. I think we can all agree that we'd like to move to a web of infinite interoperability. We won't get there overnight, but it will sure take a lot longer if future apps are being developed in silos. I want to expose my data and leverage/link together with yours to give my users more value.

Hosting support - Unless you are running your own servers, you can only use what is provided by your web hosting company. While this doesn't limit all developers, it will certainly hamper the growth process initially. However, just like Ruby on Rails was once uncommon for the average hosting company to support (without throwing them extra $$$), once it picks up a community hosting companies will be sure to jump on the wagon. So imho this bullet will fall towards the bottom of the list.

etc, etc...


Until something fulfills this list I will continue to use my RDB, but please let me be the first to know when I can start to leverage your new project.

Noah Slater said...

Have you checked out CouchDB:

http://en.wikipedia.org/wiki/CouchDB

It might be exactly what you're looking for. ;)

John Brennan said...

thanks for the suggestion Noah. i'm gonna play with it this weekend!

Tony Rogerson said...

That's why web developers shouldn't design schemas - you need a database designer (DBA).

I don't see it as bad practice to have many database tables - if the entity being modelled requires it then so be it.

Databases store facts, SQL is used to answer questions so again you are down to the fact if you want to get the most out of your information you need a good database guy.

The market for business intelligence is huge and growing fast - that is the area where people are extracting "knowledge" from the database, the "knowledge" being the trend relationships between invoices and customers etc...

Tony.

Prateek said...

In addition to CouchDB, something that is not document oriented but instead still stores everything as memes:

Brainwave Poseidon Database

We've released v1.1 with v1.2 on the way with bug-fixes and performance improvements.

Anonymous said...

Or you could learn how to implement a database.

Rolf Veen said...

I'm happy to see that, finally, the idea comes through. Some years ago, when trying to fit a company's data into tables, I had this same thought; the relational model is obsolete, at least for this type of application.

If you look for gknowledge in Google you will find the Spanish company that, now without me, continues to exploit the idea of a graph database called G.

From those days I retain a project called OGDL (ogdl.org), an XML alternative for representing textual and binary graphs, that could be the underlying data format of a graph database, but I've no plans to build one (again), so I'm just waiting for a good open source initiative. I wish you success with yours.

Christian Busch said...

"all of your child’s friends for life"

can be perfectly stored in SQL. :)

Anonymous said...

Your article got me thinking. I love writing parsers, and what we really need is a better semantic query language. Relational databases became popular IMHO due to the relative ease with which it was possible to construct SQL statements.

We need a semantic query language that is easy to use.

You can contact me at steve at integrityintegrators dot net.
