Archive for September, 2008

Earl grey, hot!

Monday, September 22nd, 2008

Everybody has a different notion of what constitutes semantic search. The initial paper coining the term “semantic search”, by Guha et al., describes a system that shows objects alongside document search results in a web search engine.

Recently, natural language processing outfits in particular have hijacked the term: all of a sudden their 20+ year old research is “semantic”. Their great promise (which is also Powerset’s) is to have a computer answer your questions posed in natural language faster than you can say “whizbang”.

I think computers answering natural language questions is science fiction, and creates expectations that are impossible to fulfill, for now and for years to come. Just because it’d be nice to have a system doesn’t mean that it’s possible to build. Sure, I’d like to tell the computer to replicate a hot beverage for my enjoyment. Does that mean scientists and engineers are able to build that system? Not in your lifetime, I’m afraid.

For now, the best one can do is match keywords against an object graph, compute clever rankings and maybe offer point-and-click query refinement. It’ll be a while until the computer tells you the right answer to “what’s the meaning of life?”. Don’t hold your breath.
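To make “matching keywords on an object graph” a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy graph, the node names, and the ranking formula (keyword hits plus a small bonus per link) are made up and not taken from any real system.

```python
# Hypothetical toy object graph: node id -> (label text, linked node ids).
GRAPH = {
    "tea": ("Earl Grey tea hot beverage", ["picard"]),
    "picard": ("Jean-Luc Picard starship captain", ["tea", "enterprise"]),
    "enterprise": ("USS Enterprise starship", ["picard"]),
}


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


def search(query, graph):
    """Return node ids whose label shares a keyword with the query, best first.

    The score is an assumed stand-in for a "clever ranking": the number of
    matching label tokens plus 0.1 for every link the node has.
    """
    terms = set(tokenize(query))
    scores = {}
    for node, (label, neighbours) in graph.items():
        hits = sum(1 for token in tokenize(label) if token in terms)
        if hits:
            scores[node] = hits + 0.1 * len(neighbours)
    # Highest score first; ties are broken alphabetically by node id.
    return sorted(scores, key=lambda node: (-scores[node], node))


if __name__ == "__main__":
    print(search("hot tea", GRAPH))
    print(search("starship", GRAPH))
```

Note what the sketch cannot do: it finds the “tea” object for the keywords “hot tea”, but it has no idea what “replicate” means.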

OpenID’s teething problems

Wednesday, September 10th, 2008

The idea behind OpenID sounds great. Create one account and re-use that account wherever you need to log in on the Web. Excellent idea, and very enticing because the system is completely decentralised and relies on basic Web technologies. Using Web tech is a great plus when you think about how tedious it can be inside an organisation to get access to an LDAP server because of firewall and external user policies and such.

Combine OpenID with FOAF and you get a completely decentralised social networking platform. Good stuff.

While I like the idea, the implementation side is still sketchy. There are two Java implementations. The first, openid4java, requires you to include more than a dozen jars just to provide a simple login, and its source archive weighs in at a whopping 74 MB, so I didn’t touch it. Luckily there’s joid, which is much smaller.

So joid is the jar of choice. Joid works fine with myopenid.com and Verisign ids (it’s from the Verisign guys after all), but fails on LiveJournal ids. And, more annoyingly, the library doesn’t seem to support delegated ids. As the mailing list moderator is apparently dead, that means waiting another year or two until the kinks are ironed out.

I like being an early adopter. Really.

The timbl number

Thursday, September 4th, 2008

Mathematicians, boring as they are, have the cool Erdős number, which measures how far away they are in the co-author graph from Paul Erdős, famous hobo mathematician. Actors have the Kevin Bacon number, which tells them how many steps away they are in the co-actor graph from Kevin Bacon, mediocre but apparently workaholic actor.

In contrast, Web Science researchers have nothing more than the dubious honour of working in a field which needs to include “science” in its name, and on top of that have to struggle with the scruffy, chaotic, erroneous Web. Nothing too exciting here.

To make our dull work slightly more glamorous, I propose to introduce the “timbl number”, which tells people how many hops they are away in the foaf:knows graph from Web inventor and Semantic Web evangelist Tim Berners-Lee.
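For the curious, computing such a number is a plain breadth-first search over the foaf:knows graph. The Python sketch below is a rough illustration under stated assumptions: the graph is a hand-written dictionary rather than crawled FOAF data, the names in it are hypothetical, and edges are followed in the direction of the foaf:knows statements, starting from Tim.

```python
from collections import deque


def timbl_number(knows, start, target):
    """Return the fewest foaf:knows hops from start to target, or None.

    knows maps a person to the people they state they know; edges are
    treated as directed (an assumption of this sketch).
    """
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, hops = queue.popleft()
        for friend in knows.get(person, ()):
            if friend == target:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None  # no path: the target is on a different island


# Hypothetical toy graph; the names are placeholders, not real FOAF data.
KNOWS = {
    "timbl": ["alice"],
    "alice": ["bob"],
    "bob": ["carol"],
}

if __name__ == "__main__":
    print(timbl_number(KNOWS, "timbl", "bob"))
```

Because breadth-first search visits people in order of distance, the first time it reaches the target it has found the shortest chain.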

My timbl number recently dropped from three (via Richard Cyganiak) to two (via Christoph Bussler); I might be able to get another two-hop connection soon. My goal is to get a timbl number of one someday, i.e. Tim would state that he knows me in his FOAF file. Learn about the progress exclusively here on this blog!

State of the FOAF-sphere

Tuesday, September 2nd, 2008

The data quality on the Semantic Web improves. I’ve been crawling FOAF and RDF for a few years now, and the data available today is better, by leaps and bounds, than what it used to be. However, if the improvement continues at the current pace, it’ll be years before we get to something useful.

Building nice applications on top of real-world data requires more or less connected data, i.e. shared use of URIs. Whilst schema-level URIs (in vocabularies such as FOAF and SIOC) are being used across many sources, instance-level agreement on URIs has yet to happen.

While I’d prefer different sources to reuse common URIs to denote the same instance (a person, say), for now we’re smushing things based on OWL’s inverse functional properties, i.e. treating two resources as the same instance when they share a value for such a property. However, currently a lot of sources don’t even provide property values to smush on (e.g. friendfeed doesn’t provide a homepage, email address or email hashsum), which renders the current Semantic Web pretty much useless for real-world applications. Loads of islands and duplication of data, a grand mess.
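As a rough illustration of smushing, the Python sketch below groups URIs that share a value for an inverse functional property, using a small union-find. It is a sketch under assumptions, not the code behind any actual crawler: the records are hypothetical, and the plain keys "homepage" and "mbox_sha1sum" stand in for foaf:homepage and foaf:mbox_sha1sum.

```python
# Stand-ins for foaf:homepage and foaf:mbox_sha1sum (both inverse functional).
IFPS = ("homepage", "mbox_sha1sum")


def smush(records, ifps=IFPS):
    """Group URIs that share a value for any inverse functional property.

    records maps a URI to a dict of property values. Returns a sorted list
    of sorted URI groups; each group is assumed to denote one instance.
    """
    parent = {uri: uri for uri in records}

    def find(uri):
        # Follow parent links to the group representative (path halving).
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]
            uri = parent[uri]
        return uri

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}  # (property, value) -> first URI carrying that value
    for uri, props in records.items():
        for prop in ifps:
            value = props.get(prop)
            if value is None:
                continue  # nothing to smush on for this property
            key = (prop, value)
            if key in first_seen:
                union(uri, first_seen[key])
            else:
                first_seen[key] = uri

    groups = {}
    for uri in records:
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical records from four sources; the last one provides no keys.
RECORDS = {
    "http://a.example/me": {"homepage": "http://example.org/", "mbox_sha1sum": "abc"},
    "http://b.example/p1": {"mbox_sha1sum": "abc"},
    "http://c.example/u9": {"homepage": "http://example.org/"},
    "http://d.example/x": {},
}

if __name__ == "__main__":
    for group in smush(RECORDS):
        print(group)
```

The first three URIs end up in one group, while the key-less fourth stays an island, which is exactly the problem with sources that provide nothing to smush on.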

Let’s hope that over time sources provide keys that allow instance data from multiple sources to be fused, and that people converge in their use of URIs. My URI, btw, is http://harth.org/andreas/foaf#ah if you want to add me to your FOAF file.