Monday, May 12, 2008

Media Darling Powerset vs. Non-Media Darling Hakia

Over the weekend, the web was abuzz with discussion about Microsoft considering the acquisition of natural language search company Powerset. Today there is yet more coverage. Some time ago I had heard a rumor that someone was looking at Powerset, but I was relatively uninterested. Hearing that the potential acquirer is Microsoft certainly makes it more interesting, but I have to say the concept leaves me more than a bit incredulous.

From skeptic to user

I became familiar with Powerset's only competitor, Hakia, initially because they are a New York company. I became intrigued with Hakia because several months ago I tried their search engine, and it worked – really well. This was a surprising result for me since I have always been a skeptic regarding all things relating to artificial intelligence, speech recognition, natural language processing, and other such fuzzy technologies.

At least in the area of natural language processing, Hakia has changed my mind. In fact, it has become common for me to use the Hakia search engine when Google does not deliver sufficient results.

Hakia and Powerset are part of the same general area of natural language search. The idea with both services is that you can actually ask specific questions and get answers. But there are critical differences between Hakia and Powerset. And those differences bring me back to my incredulity at the idea that Microsoft is taking a serious look at Powerset.

Powerset indexes 750 times slower than Hakia!

I have no expertise in natural language processing or semantic search, or any type of full text search for that matter. But as far as I can tell, Hakia’s technology is *far* superior to Powerset’s. Why would I say that?

Well, first, as I have already said, it works. It is a real live search engine. I use it. I can’t say the same for Powerset. Powerset has yet to show anything but a search engine for Wikipedia. A big part of the reason Powerset doesn’t seem able to offer a real search engine is that, according to their own reports, it takes them about 25 seconds to index a page, based on an average of 25 sentences per page. According to Hakia, it takes them 1/30th of a second to index a page. Essentially this means that Powerset cannot scale. It is seven hundred fifty times slower than Hakia!
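For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It uses only the two per-page figures quoted above, which are each company's own reported numbers and not anything measured here. The one-billion-page corpus is a made-up round number for illustration, not a figure from either company, and the estimate assumes a single machine indexing pages one at a time.

```python
from fractions import Fraction

# Per-page indexing times as quoted in the post (self-reported, not measured here).
POWERSET_SECONDS_PER_PAGE = Fraction(25)     # about 25 seconds per page
HAKIA_SECONDS_PER_PAGE = Fraction(1, 30)     # 1/30th of a second per page

# Assumption for illustration only: a hypothetical corpus of one billion pages.
HYPOTHETICAL_PAGES = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slow indexer takes per page."""
    return slow_seconds / fast_seconds


def machine_days(pages, seconds_per_page):
    """Days one machine would need to index `pages` sequentially."""
    return pages * seconds_per_page / SECONDS_PER_DAY


ratio = slowdown(POWERSET_SECONDS_PER_PAGE, HAKIA_SECONDS_PER_PAGE)
print(ratio)  # 750

# The same ratio carries over to total machine time for any corpus size.
powerset_days = machine_days(HYPOTHETICAL_PAGES, POWERSET_SECONDS_PER_PAGE)
hakia_days = machine_days(HYPOTHETICAL_PAGES, HAKIA_SECONDS_PER_PAGE)
print(float(powerset_days), float(hakia_days))
```

Because the ratio is per page, it does not depend on the corpus size chosen; a bigger or smaller web changes both totals but leaves the 750x gap intact.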

Now you might assume that Powerset is slower because it’s applying some serious and superior indexing mojo, and therefore what it is doing is much more valuable than what Hakia is doing. But alas, that is not true either.

Hakia really knows how to read

Hakia is doing something called “ontological semantics”. What this means is that over the last four years, Hakia has developed an “ontology” for human expression. In layman's terms, when Hakia indexes a page it looks at each sentence and figures out what *questions* that sentence answers. Any given sentence usually answers 3 or 4 questions. These questions are coded and go into what Hakia calls their Qdex, or question index.

In order to be able to figure out what the relevant questions are for a given sentence, Hakia’s indexer has to literally read the sentence. By “read” I mean it has to understand the actual meaning of the sentence semantically. This is a big deal.

Powerset uses statistics + syntax but can’t actually read

So, while Hakia is actually reading, Powerset does not actually attempt to understand what sentences mean. It uses a system that parses the syntax of the sentence and guesses matches based on statistics. But this approach means that for questions that do not match previously encountered syntactical patterns, the system will not be able to find answers, even if there are in fact answers in the database.

Powerset benefits from the Silicon Valley echo chamber

Now, if, for a moment, you presume that it is true, or even *possibly* true that Hakia is the superior service and technology, or if you even assume that Hakia is just equivalent to Powerset, why would Powerset be so continuously celebrated while Hakia is overshadowed?

The only answer I can come up with is that the west coast is such an echo chamber that very little sound gets in or out. And so it must be shocking when a New York company develops a technology that seems to beat the pants off something that should be pure Silicon Valley. Just a thought.

In any case, it seems, for the record, worth noting that we have the clear leader in natural language processing and search technology right here. And, as an admitted New York partisan, after a while it does get a little annoying to hear such continued fawning over a west coast company that is very likely, at the end of the day, just another Silicon Valley also-ran.

9 comments:

.mike said...

Great post, as usual, Hank. As a fellow east-coaster, I've gotta say the Silicon Valley echo chamber effect is definitely frustrating at times.

BTW, I just tried Hakia for the first time, and I'm pretty impressed. Just for yucks, I asked it some really technical questions about configuring JNDI on application servers, and (in my very subjective opinion) it did a better job than Google.

innonate said...

I've been using Hakia more and more these days... when I have questions I think NLP will do better with. I'm not sure how to articulate why or when I turn to Hakia over Google... but when I do, I've been rather pleased. I'll think about this more and do a follow up post to yours, Hank.

Anonymous said...

Well, I think it’s really hard to compare one search engine to another based on what happens behind the curtain. Search engine technology is a really complex matter. Any speculation done from the surface is probably a guessing game.
I will agree, though, that Hakia seems to be much closer to what they promise to deliver than Powerset. Hakia may have a bit of semantic flavor but remains overall a poor search engine. I always wonder what Powerset is doing with all the money they have raised. I would have felt terribly disappointed if I had given them my money. Building a search engine for Wikipedia (not even a good one) with all that money is a little short.

I will take the opportunity here to express my reservations about semantic search. If semantic search is defined as a search engine that answers questions, here are two reasons why I think it is not a very promising direction for search:

1. It is hard for people in general to type an entire question, users are generally lazy and anything that makes them think is not potentially good.
2. The language factor. Though the web is mostly written in English it will be a challenge for these companies to implement a semantic search in every language. From English to French there’s a whole new world.

Mark Johnson said...

Powerset is trying to foster a community among semantic technology companies and we'd love to see Hakia get more play in the press. However, I wanted to clear up some misconceptions in your article.

We read pages at index time, not at query time. Therefore, the speed of indexing a page is a red herring.

Also, I'm not sure what you mean when you say that Powerset "does not actually attempt to understand what sentences mean." A syntactic parse of a sentence is critical to understanding its meaning. Additionally, Powerset does have information about words that are also used to determine meaning.

As an example, try Powerset vs. Hakia for queries like "who did texaco acquire" vs. "who acquired texaco." Notice that, not only does Powerset get the answer right in both cases, but notice that we match words like "bought" and "purchase." Again, I don't want to seem like I'm knocking Hakia, but we're really doing different things.

I'd love to give you some more detail if you're interested.

Hank Williams said...

Mark,

I am happy to chat with you. But to be clear, *all* search engines read pages at index time, not search time. The significance of speed is that the reason you are only able to do Wikipedia and not the whole web is that you index very slowly. This means that with your current algorithms you will require 750x more computing power. Essentially it means you will *never* be able to index the web at anything like a reasonable cost, at least until processors come down radically in price.

Regarding the statement that you do not understand meaning, yes, you of course must read the syntax of a sentence. This is baseline. And your syntactic analysis is more sophisticated than Hakia's. But you do not maintain an ontology. Your strategy is statistical. I am not saying that this will not yield good results. But I do believe that ultimately you will be no better, and are likely over time to be substantially worse with a statistically driven strategy vs. an ontological strategy. This piece is of course my opinion. The issue about the scaling is not at all opinion, just math.

I also think your assertion that you are doing different things is specious. It is true (you are indexing Wikipedia and they are indexing the web - or at least far more of the web) but at the end of the day we are talking about question based search engines. You guys have what some consider to be a very slick user interface. But I don't think you can get away from comparison by saying you do different things.

garrytan said...

Great points. In a world where great technology always trumps all, Hakia will win as t -> infinity. But branding, marketing, user experience, and overall desirability of a product do matter. And Powerset has an edge.

Matt said...

Of course it sounds great to understand text instead of 'just' gathering statistical information about how words are used. It fits with our folk understanding of how we humans understand language. I believe that at one time, part-of-speech tagging was sorta considered a semantic task, and it was handled by rule-based parsers. But then statistical parsers came and completely solved the task. Statistical methods have worked their way up the food-chain, so to speak, and are now used to handle named-entity recognition, semantic role labeling, and question answering. I don't know the ultimate limits of these techniques, but they've continually exceeded expected limitations.

The term 'semantics' is tossed around casually. It's a moving target -- various tasks are thrown up as proxies for understanding of language -- but when they're solved we realize they don't really transfer to other real, unrelated problems. Let's be clear, none of these systems truly "understand" the content of the document. Extracting factoids from documents with some morphological processing and synonym substitution is not it. We've been trained to ask certain types of simple questions, and for that they 'understand enough'. But consider deeper questions that require more complex analysis of the question and synthesis of content from a variety of sources:

"Assuming another terrorist attack occurs on US soil, is it more likely to be a conventional attack or an attack using a weapon of mass destruction?"

"What is the current status of India’s Prithvi ballistic missile project?"
(taken from http://acl.ldc.upenn.edu/hlt-naacl2004/qa/ps/hickl-interactive.ps)

Obviously neither Hakia nor Powerset delivers this now. It's not clear to me which of them is more likely to get there first.

Marc Doucette said...

It's only fair to let the program speak for itself...

Q: "Does Powerscore suck?"

A: "Hasse Diagram", "Martian Manhunter", "Tank", "Carnivorous Plants", "Sunspot"...

Um, right...

Marc D, again said...

Oh fuck, I typo'd "Powerset" as "Powerscore" in the comment... But those were the actual things returned for searching "Does POWERSET suck?", ha ha
