seco - Integration Site for Semantic Web Metadata

http://seco.semanticweb.org/

Abstract

seco is a system to enable collaboration in online communities. It collects RDF data from the web, stores it in an index, and makes it accessible via a web interface. At the moment the system contains information about more than 7000 people and 2000 news items, which represents most of the information on the emerging semantic web expressed in the FOAF and RSS 1.0 vocabularies. This data has been created by a large number of people; the challenge is to tidy up this data and integrate it in a way that facilitates easy access and re-use.

Introduction

There is already a lot of data available in "semantic web format". Vocabularies such as RSS 1.0 and FOAF are widely used in the semantic web and weblog communities. However, to date there is no application that makes use of this data.

The goal of the application described here is to provide a large repository of structured information about people and news items that is easy to access for both humans and machines. We aim to establish a repository for trustworthy metadata on the semantic web.

The natural choice when dealing with RDF data is to store all data in the same format, to allow for easy conversion, access, and re-use. In this document, I briefly describe the architecture and the main features of such a system.

Architecture

The components described in the following sections have access to the application's ontology and either add, change, delete, or simply read the data stored there.

Scutter

This component is a web crawler for RDF data. It takes an RDF file as a starting point and follows the rdfs:seeAlso links in it. The harvested data is then stored in a Jena2 repository. A large part of the existing semantic web is connected using links of this type.
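
The following sketch shows, very roughly, how such a crawler can be built on the Jena2 API: rdfs:seeAlso links are followed breadth-first and every fetched document is merged into one model. The seed URL and the fetch limit are made-up values, and error handling is reduced to skipping documents that fail to load.

    import com.hp.hpl.jena.rdf.model.*;
    import com.hp.hpl.jena.vocabulary.RDFS;
    import java.util.*;

    public class MiniScutter {
        public static void main(String[] args) {
            Model store = ModelFactory.createDefaultModel();  // harvested data
            LinkedList queue = new LinkedList();               // URLs to visit
            Set seen = new HashSet();
            queue.add("http://example.org/foaf.rdf");          // hypothetical seed file

            int fetched = 0;
            while (!queue.isEmpty() && fetched < 50) {         // arbitrary limit
                String url = (String) queue.removeFirst();
                if (!seen.add(url)) continue;                  // already visited
                try {
                    Model doc = ModelFactory.createDefaultModel();
                    doc.read(url);                             // fetch and parse the RDF file
                    store.add(doc);                            // merge into the repository
                    fetched++;
                    // follow every rdfs:seeAlso link found in the document
                    StmtIterator it = doc.listStatements(null, RDFS.seeAlso, (RDFNode) null);
                    while (it.hasNext()) {
                        RDFNode o = it.nextStatement().getObject();
                        if (o instanceof Resource && !((Resource) o).isAnon()) {
                            queue.add(((Resource) o).getURI());
                        }
                    }
                } catch (Exception e) {
                    // skip documents that cannot be fetched or parsed
                }
            }
        }
    }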

Provenance information is stored in the repository to be able to delete or update data from a specific source. This is done using RDF's reification mechanism.
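
A minimal sketch of this idea with Jena2's reification API follows; the source property and the URIs are illustrative and not the vocabulary actually used in seco.

    import com.hp.hpl.jena.rdf.model.*;

    public class ProvenanceExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            Property source = model.createProperty("http://example.org/terms/source");

            // an ordinary statement harvested from some document
            Statement stmt = model.createStatement(
                    model.createResource("http://example.org/people#alice"),
                    model.createProperty("http://xmlns.com/foaf/0.1/", "name"),
                    "Alice");
            model.add(stmt);

            // reify the statement and record the document it came from, so that
            // statements from a given source can later be found and removed
            ReifiedStatement rs = stmt.createReifiedStatement();
            rs.addProperty(source, model.createResource("http://example.org/foaf.rdf"));

            model.write(System.out, "RDF/XML-ABBREV");
        }
    }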

Transformer

The source data is transformed using this component. In this step, the data is aligned according to the internal data model (for example, foaf:Person becomes sw:Person). Additionally, different instances are joined based on their uniquely identifying properties. In the FOAF realm, two person instances that have the same email address are considered to describe the same person. The process of merging the data set according to these properties is called smushing.
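
One possible way to smush on foaf:mbox with the Jena2 API is sketched below: resources that share an mbox value are mapped to a single canonical resource, and all statements are rewired accordingly. The mapping to the internal sw: vocabulary is omitted, and the class and method names are made up for illustration.

    import com.hp.hpl.jena.rdf.model.*;
    import java.util.*;

    public class MboxSmusher {
        static final String FOAF = "http://xmlns.com/foaf/0.1/";

        // merge instances that share a foaf:mbox value onto one canonical resource
        public static void smush(Model m) {
            Property mbox = m.createProperty(FOAF, "mbox");
            Map canonical = new HashMap();   // mbox value -> first resource seen
            Map replace = new HashMap();     // duplicate resource -> canonical resource

            StmtIterator it = m.listStatements(null, mbox, (RDFNode) null);
            while (it.hasNext()) {
                Statement s = it.nextStatement();
                String key = s.getObject().toString();
                Resource first = (Resource) canonical.get(key);
                if (first == null) {
                    canonical.put(key, s.getSubject());
                } else if (!first.equals(s.getSubject())) {
                    replace.put(s.getSubject(), first);
                }
            }

            // rewire every statement whose subject or object is a duplicate
            List removals = new ArrayList();
            List additions = new ArrayList();
            StmtIterator all = m.listStatements();
            while (all.hasNext()) {
                Statement s = all.nextStatement();
                Resource newSubj = (Resource) replace.get(s.getSubject());
                RDFNode newObj = (RDFNode) replace.get(s.getObject());
                if (newSubj == null && newObj == null) continue;
                removals.add(s);
                additions.add(m.createStatement(
                        newSubj != null ? newSubj : s.getSubject(),
                        s.getPredicate(),
                        newObj != null ? newObj : s.getObject()));
            }
            for (Iterator i = removals.iterator(); i.hasNext();) m.remove((Statement) i.next());
            for (Iterator i = additions.iterator(); i.hasNext();) m.add((Statement) i.next());
        }
    }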

Wrappers

To enable access to legacy data sources such as email (IMAP4 servers), wrappers are available. The IMAP4 wrapper accesses the data store, fetches the data, and returns it in RDF. In addition to the IMAP4 wrapper, there are wrappers for the Google API (SOAP) and for RSS 0.92 and 2.0. A wrapper basically consists of a servlet that returns an RDF representation of the underlying data store.
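
A wrapper servlet along these lines might look as follows; fetchLegacyData() is a placeholder for the actual IMAP4, Google, or RSS access code, and the URIs are only examples.

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.*;
    import com.hp.hpl.jena.rdf.model.*;

    public class WrapperServlet extends HttpServlet {
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            Model model = fetchLegacyData();
            resp.setContentType("application/rdf+xml");
            model.write(resp.getOutputStream(), "RDF/XML-ABBREV");
        }

        // placeholder: a real wrapper would talk to the underlying data store here
        private Model fetchLegacyData() {
            Model m = ModelFactory.createDefaultModel();
            Resource item = m.createResource("http://example.org/item/1");
            item.addProperty(m.createProperty("http://purl.org/rss/1.0/", "title"),
                    "Example item");
            return m;
        }
    }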

Using wrapper technology, it is possible to encapsulate standard web services and turn them into a kind of semantic web service. For the scutter, it then makes no difference whether a crawled site originally provides RDF or not. The wrapper makes all data sources equally easy to handle and integrate.

User Interface

The user interface is based on J2EE servlet technologies and accessible via standard browsers such as Internet Explorer or Mozilla, text-based browsers such as lynx, and mobile devices. This functionality is enabled by adhering to W3C's XHTML and CSS2 standards.

For constructing the user interface from the ontology, the following steps are conducted:

  1. query the repository using RDQL
  2. translate the results into an XML representation
  3. apply an XSLT stylesheet to the XML results
  4. serve the resulting HTML
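
Steps 3 and 4 can be realized with the standard JAXP transformation API, as in the sketch below; the file names and the stylesheet are placeholders, and the RDQL query (step 1) and the results-to-XML conversion (step 2) are not shown here.

    import java.io.File;
    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    public class RenderResults {
        public static void main(String[] args) throws TransformerException {
            // apply the stylesheet to the XML query results and write out XHTML
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("results-to-xhtml.xsl")));
            t.transform(new StreamSource(new File("query-results.xml")),
                        new StreamResult(new File("page.html")));
        }
    }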

Usage information is collected when users follow a link to a given news item or person. This data is used for ranking purposes and will be exploited to provide advanced personalization for the site.

There is a notion of workflow support in the application. The application supports multiple users with different roles: "users" and "editors", who can carry out different sets of actions. An editor, for example, can post the most interesting items on the home page with a single click.

Technologies Used

The application is 100% pure Java. The system uses current technologies such as Unicode, XHTML, CSS2, XML, RDF, and emerging standards such as OWL. Everything is stored in RDF using HP's Jena2 toolkit.

It's a clean, flexible, lightweight semantic web application built from scratch using open standards.

Main Features

Distributed information sources:
information is collected from hundreds of Internet sites
Integration of heterogeneous data:
using different RDF vocabularies (news, chatlogs, foaf), exploiting www-rdf-interest for names and e-mail addresses
Contains real world data:
try it yourself and search for your name!
The information is never complete:
crawling is carried out continuously
Structured format description:
the meaning of the data is encoded in OWL

Additional Features

Using data sources in ways other than originally intended:
Google, www-rdf-interest
Using the contents of multi-media documents:
pictures from FOAF homepages
Accessibility in multiple languages:
available in English and German
Accessibility via devices other than the PC:
use your PDA and connect wirelessly
Applications other than pure information retrieval:
editor/admin can add items to the home page
Workflow support:
admins can make newsitems available on the front page
The results are accurate:
ranking is conducted according to site usage information
The application is scalable:
currently about 200,000 statements, 2000 news items, and more than 7000 persons

Conclusion

This application barely scratches the surface of what will be possible when enough quality data is publicly available. There are a lot of challenges involved in this type of application. One major issue is how to maintain data integrity when a huge number of people with different backgrounds contribute data to the semantic web.

This submission to the Semantic Web Challenge presented a scalable FOAF and RSS repository using state-of-the-art web technologies. FOAF and RSS are formats that are widely deployed in the weblog and semantic web communities. This application makes use of the data by aggregating and rearranging it. I hope that this application encourages people to participate in the effort to make large quantities of RDF instance data available.

I'd like to thank the people on #rdfig for comments, suggestions, and code which I use in the application. I am especially grateful to Matt Biddulph for making his scutter code available.


Andreas Harth
Last modified: Wed Oct 1 09:37:47 PDT 2003