A Build System for Web Data

December 12th, 2008

Systems for graphically constructing processing pipelines for web data are mushrooming. [Your company name here] Pipes. These types of systems make for very nice and impressive demos but are utterly useless for power users who need to process sizable amounts of web data, say datasets from 1MB to 10GB (the “ETL scenario” if you’re a bit older).

Dealing with data from the web requires all sorts of cleansing and pre-processing steps, and cobbling the various steps together is a major pain if you want to build practical systems. We currently create “processing pipelines” (to use a fancy expression) with bash scripts, but error handling and resuming build operations are tricky. Error handling is important since Web data is notoriously noisy, and *will* kill your system at some point. Resuming build operations is important because on large datasets, you don’t want to rerun the entire pipeline if just the last processing step has changed.

Thankfully there are already systems out there for processing source code (such as make or Ant), and these systems could easily be adapted to handle data as well. For me, Ant would be the system of choice since it’s relatively compact and thus easy to learn, and already includes a variety of build primitives for file manipulation. What’s needed now are Ant tasks that do things such as parsing, scraping, cleansing, schema and instance matching, reasoning and whatnot, all steps necessary for preprocessing RDF. Using a common platform (Ant) would enormously simplify the integration of various Java-based RDF processing tools.
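
To make the “resume where you left off” idea concrete, here is a minimal sketch in Python of the timestamp rule that make applies to source code, applied to data files instead. It is not Ant and not our actual pipeline; the function names and the two sample steps (strip_blank_lines, uppercase) are made up for illustration.

```python
import os

def needs_rebuild(source, target):
    """True if target is missing or older than source (the rule make uses)."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)

def run_pipeline(steps, first_input):
    """Run steps in order, skipping any step whose output is still up to date.

    steps is a list of (output_path, function) pairs; each function takes
    (input_path, output_path). Returns the outputs that were rebuilt.
    """
    rebuilt = []
    current = first_input
    dirty = False  # once one step reruns, everything downstream must rerun
    for target, step in steps:
        if dirty or needs_rebuild(current, target):
            tmp = target + ".tmp"
            step(current, tmp)       # write to a temporary file first,
            os.replace(tmp, target)  # so a crash never leaves a half-written target
            rebuilt.append(target)
            dirty = True
        current = target
    return rebuilt

# Two hypothetical processing steps, standing in for real cleansing tasks.
def strip_blank_lines(src, dst):
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.strip():
                fout.write(line.strip() + "\n")

def uppercase(src, dst):
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.read().upper())
```

Calling run_pipeline([("clean.txt", strip_blank_lines), ("upper.txt", uppercase)], "raw.txt") twice in a row does no work the second time; deleting upper.txt reruns only the last step. If a step throws on bad input, the earlier outputs stay on disk and the next run picks up from there.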

Ego Linking Part I

November 17th, 2008

Now’s the time for a major update of your FOAF file! (If you don’t have one already, use FOAF-a-matic to create an initial version to start with.)

URIs for things are much more common now than they were in the pre-http-range-14 days, when blank nodes were en vogue. So, you now may find the following URIs denoting people, or containing information about them:

  • http://dblp.l3s.de/d2r/resource/authors/Firstname_Lastname
  • http://semanticweb.org/id/Firstname_Lastname
  • http://data.semanticweb.org/person/firstname-lastname
  • http://tools.opiumfield.com/twitter/nick/rdf exports data about http://twitter.com/nick (but confuses the Person with the Document and following a person with knowing one)
  • http://friendfeed.com/nick has direct FOAF export (but uses bnodes for people)
  • http://dbpedia.org/resource/Firstname_Lastname (if you are really really famous)

The problem: each URI is separate, and information about the same real-world entity may be connected to multiple identifiers.
OWL provides a number of mechanisms for inferring equality: inverse functional properties (to establish equality on the same values for properties, e.g. SSN, passport number), owl:sameAs (direct equality), and a few more (functional properties and cardinality constraints for example, but that’s a story for another day).

Inverse functional property reasoning doesn’t work too well currently since the data is too noisy (a lot of “unique” property values are “n/a”, “”, “yes”, “mbox:”, and so on, which are not unique at all), which leads to many bogus inferences.
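
As a rough illustration of the problem, here is a small Python sketch of merging identifiers on inverse functional property values while ignoring obviously junk values. It is a toy, not what any real aggregator does: the triples are plain tuples, the junk list contains just the examples above, and the function name smush is my own.

```python
JUNK_VALUES = {"", "n/a", "yes", "mbox:"}  # the junk examples from above; extend as needed

def smush(statements, ifps):
    """Group subjects that share a non-junk value for an inverse functional property.

    statements: iterable of (subject, property, value) string triples.
    ifps: set of properties treated as inverse functional.
    Returns a list of sets; each set holds subjects inferred to be the same entity.
    """
    parent = {}  # union-find forest over subjects

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (property, value) -> first subject carrying that value
    for s, p, v in statements:
        find(s)  # register every subject, even those without a usable key
        if p not in ifps or v.strip().lower() in JUNK_VALUES:
            continue  # not a key property, or a bogus "unique" value
        key = (p, v)
        if key in first_seen:
            union(first_seen[key], s)
        else:
            first_seen[key] = s

    groups = {}
    for s in parent:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())
```

With foaf:mbox as the only inverse functional property, two subjects sharing a real mailbox end up in one group, while two subjects that both say “n/a” stay apart. Without the junk filter they would be merged, which is exactly the kind of bogus inference described above.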

So for now, I suggest adding the respective person URIs via owl:sameAs predicates to your FOAF URI, which enables data aggregators to fuse all information about a person into a single view.
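
If you’d rather generate the statements than type them, a few lines of Python will do. This sketch just prints N-Triples lines; the FOAF URI http://example.org/foaf#me is a placeholder for your own, and the helper name same_as_triples is made up.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def same_as_triples(my_uri, other_uris):
    """Return N-Triples lines stating that my_uri is owl:sameAs each other URI."""
    return [f"<{my_uri}> <{OWL_SAME_AS}> <{other}> ." for other in other_uris]

if __name__ == "__main__":
    # Placeholder URIs; substitute your own FOAF URI and the URIs you found.
    for line in same_as_triples(
        "http://example.org/foaf#me",
        [
            "http://dblp.l3s.de/d2r/resource/authors/Firstname_Lastname",
            "http://data.semanticweb.org/person/firstname-lastname",
        ],
    ):
        print(line)
```

The output can be appended to an N-Triples file as is, or converted to whatever serialisation your FOAF file uses.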

Even before you publish data about something, it might be a good idea to check if there’s already a URI for that thing. A quick search on SWSE can help.

How to build a semantic search engine…

October 18th, 2008

…if you have scores of developers at your disposal.

Let’s imagine you are a senior manager of a large software company. The online advertisement market is huge, and your boss tells you your company deserves at least a quarter of the online revenue pie. The company currently has less than ten percent. So what are you going to do?

Well, first, check out what’s hot on the Web at the moment. Semantic Web! Right, that’s what it has to be. So, you buy a hot semantic search startup - only to figure out later that open domain language-independent natural language technology is just not there yet (who was this Jeeves guy again?).

But fear not, you have enough resources at your disposal - and you were trained in all the relational database magic, the cure to all evil. So: your developers now create “ontologies” which will be the basis for your APIs. In your world, ontology is just the fancy word for a database schema, of course. Ah, chaos ensues, because at first nobody is reusing other people’s things.

To exercise a bit more control, now only ontologists (I really like that word) can create schemas, and the chief ontologist will create the Great Ontology. To keep everything tidy, you’ll coerce instance data from the web into your data warehouse (based on, say, AsterBase or DataAllegro). There are plenty of engineers busy writing extraction rules, designing and maintaining schemas for the database, partitioning the databases and creating indices, and dreaming up nifty user interfaces. For each domain.

Mission accomplished! Because you’re using relational technology, your pilots, demos and prototypes seem to work (except keyword search and ranking), and your boss is happy. You are happy, too, because all your people have work and stay busy because of the high amount of manual labour involved. Another great day in paradise.

Earl grey, hot!

September 22nd, 2008

Everybody has a different notion of what constitutes semantic search. The initial paper coining the term “semantic search” from Guha et al. describes a system that shows objects alongside document search results in a web search engine.

Recently, natural language processing outfits in particular have hijacked the term: all of a sudden their 20+ year old research is “semantic”. Their great promise (which is also Powerset’s) is to have a computer answer your questions posed in natural language faster than you can say “whizbang”.

I think computers answering natural language questions is science fiction, and creates expectations that are impossible to fulfill, for now and the years to come. Just because it would be nice to have a system doesn’t mean that it’s possible to build. Sure, I’d like to tell the computer to replicate a hot beverage for my enjoyment. Does that mean scientists and engineers are able to build that system? Not in your lifetime, I’m afraid.

For now, the best one can do is matching keywords on an object graph, plus computing clever rankings and maybe point-and-click query refinement. It’ll be a while until the computer tells you the right answer to “what’s the meaning of life?”. Don’t hold your breath.

OpenID’s teething problems

September 10th, 2008

The idea behind OpenID sounds great. Create one account and re-use that account wherever you need to log in on the Web. Excellent idea, and very enticing because the system is completely decentralised and relies on basic Web technologies. Using Web tech is a great plus when you think about how tedious it can be inside an organisation to get access to an LDAP server because of firewall and external user policies and such.

Combine OpenID with FOAF and you get a completely decentralised social networking platform. Good stuff.

While I like the idea, the implementation side is still sketchy. There are two Java implementations. openid4java requires you to include more than a dozen jars just to provide a login; the source archive weighs in at a whopping 74M, so I didn’t touch it. Luckily there’s joid, which is much smaller.

So joid is the jar of choice. Joid works fine with myopenid.com and verisign ids (it’s from the Verisign guys after all), but fails on livejournal ids. And, more annoyingly, the library doesn’t seem to support delegated ids. Since the mailing list moderator is apparently dead, this means waiting another year or two until the kinks are ironed out.

I like being an early adopter. Really.

The timbl number

September 4th, 2008

Mathematicians, boring as they are, have the cool Erdős number, which measures how far away they are in the co-author graph from Paul Erdős, famous hobo mathematician. Actors have the Kevin Bacon number, which tells them how many steps away they are in the co-actor graph from Kevin Bacon, mediocre but apparently workaholic actor.

In contrast, Web Science researchers have nothing more than the dubious honour of working in a field which needs to include “science” in its name, and on top of that have to struggle with the scruffy, chaotic, erroneous Web. Nothing too exciting here.

To make our dull work slightly more glamorous, I propose to introduce the “timbl number”, which tells people how many hops they are away in the foaf:knows graph from Web inventor and Semantic Web evangelist Tim Berners-Lee.
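
For the record, computing the number is a plain breadth-first search over foaf:knows statements, starting at Tim. The Python sketch below assumes the graph has already been extracted into a dictionary mapping each person to the set of people they say they know; the function name and the toy names (alice, bob, carol) are invented.

```python
from collections import deque

def timbl_number(knows, person, timbl="timbl"):
    """Number of foaf:knows hops from timbl to person, or None if unreachable.

    knows maps each person to the set of people they state they know,
    so a timbl number of one means Tim lists the person directly.
    """
    if person == timbl:
        return 0
    seen = {timbl}
    queue = deque([(timbl, 0)])
    while queue:
        current, hops = queue.popleft()
        for friend in knows.get(current, ()):
            if friend == person:
                return hops + 1
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    return None

# Toy graph: timbl knows alice, alice knows bob, bob knows carol.
toy_graph = {"timbl": {"alice"}, "alice": {"bob"}, "bob": {"carol"}}
```

In the toy graph, timbl_number(toy_graph, "bob") is 2. Note that the search follows the direction of the statements, so it only counts who Tim (transitively) claims to know, not who claims to know Tim.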

My timbl number recently dropped from three (via Richard Cyganiak) to two (via Christoph Bussler); I might be able to get another two-hop connection soon. My goal is to get a timbl number of one someday, i.e. Tim would state that he knows me in his FOAF file. Learn about the progress exclusively here on this blog!

State of the FOAF-sphere

September 2nd, 2008

The data quality on the Semantic Web is improving. I’ve been crawling FOAF and RDF for a few years now, and the data available today is better, by leaps and bounds, than it used to be. However, if the improvement continues at the current pace, it’ll be years before we get to something useful.

Building nice applications on top of real-world data requires more or less connected data, i.e. shared use of URIs. Whilst schema-level URIs (in vocabularies such as FOAF and SIOC) are being used across many sources, instance-level agreement on URIs has yet to happen.

While I prefer it when different sources reuse common URIs to denote the same instance (a person, say), we’re smushing things based on OWL’s inverse functional properties. However, currently a lot of sources don’t even provide property values to smush on (e.g. friendfeed doesn’t provide homepage or email/email hashsum), which renders the current Semantic Web pretty much useless for real-world applications. Loads of islands and duplication of data, a grand mess.

Let’s hope that over time sources provide keys that allow fusing instance data from multiple sources, and people converge in their use of URIs. My URI, btw, is http://harth.org/andreas/foaf#ah if you want to add me to your FOAF file.

YARS2

August 17th, 2008

I’ve been getting a few inquiries regarding YARS2, our federated RDF repository. It’s really cool to see interest from top-notch AI and scientific computing research institutes in the US and Germany, and from large companies in the pharmaceutical area.

Most people expect the system to be open source; currently YARS2 is closed source but we are rethinking that decision.

Although I consider YARS2 a stable product, and my main focus right now is to finish my Ph.D. thesis, we still incorporate optimisations into the YARS2 codebase, especially functionality required by the reasoning module and the faceted user interface.

Faceted Browsing and the Semantic Web

August 15th, 2008

It looks like everybody and their dog is writing faceted browsers these days.

Benjamin Nowack announced one a few days ago on the crunchbase mailing list, and David Huynh announced a “novel” way of browsing graph structured data based on sets on the swig mailing list.

Good that Michiel Hildebrand noted that set-based browsing existed before, in /facet and Eyal Oren’s work. Our own SWSE system, in fact, had set-based focus change in its first incarnation more than a year ago.

We removed that functionality for the current SWSE interface, because users didn’t seem to get what was going on. But it looks like there are some new design ideas for building an interaction model that is intuitive. Or we just have to be more picky about our users…