seco is a system to enable collaboration in online communities. It collects RDF data from the web, stores it in an index, and makes it accessible via a web interface. At the moment the system contains information about more than 7000 people and 2000 news items, which represents most of the information available on the emerging semantic web in the FOAF and RSS 1.0 vocabularies. This data has been created by a large number of people; the challenge is to tidy it up and integrate it in a way that facilitates easy access and re-use.
A lot of data in "semantic web format" is already available. Vocabularies such as RSS 1.0 and FOAF are widely used in the semantic web and weblog communities. However, to date there is no application that makes use of this data.
The goal of the application described here is to provide a large repository of structured information about people and news items. The data is easy to access, for humans and machines alike. We aim to establish a repository for trustworthy metadata on the semantic web.
The natural choice when dealing with RDF data is to store all of it in the same format, to allow for easy conversion, access, and re-use. In this document, I briefly describe the architecture and the main features of such a system.
The components described in the following section have access to the application's ontology and either add, change, delete, or simply access the data stored there.
This component is a web crawler for RDF data. It takes an RDF file as a starting point and follows the rdfs:seeAlso links in it. The harvested data is then stored in a Jena2 repository. A large part of the existing semantic web is connected by links of this type.
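To illustrate, here is a minimal sketch of such a scutter using the Jena2 API. The crawl limit, the in-memory store, and the command-line seed are simplifications for the sketch, not the actual seco code; a real scutter would write into the persistent repository described below.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.NodeIterator;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDFS;

/** Minimal breadth-first scutter: fetch RDF files, follow rdfs:seeAlso links. */
public class Scutter {
    public static void main(String[] args) {
        LinkedList queue = new LinkedList();
        Set visited = new HashSet();
        queue.add(args[0]); // seed, e.g. the URL of a FOAF file

        Model store = ModelFactory.createDefaultModel(); // in-memory stand-in for the repository

        while (!queue.isEmpty() && visited.size() < 100) { // hard crawl limit, for the sketch only
            String url = (String) queue.removeFirst();
            if (!visited.add(url)) continue; // already crawled
            try {
                Model m = ModelFactory.createDefaultModel();
                m.read(url); // fetch and parse the RDF at this URL
                store.add(m);
                // queue every rdfs:seeAlso target for later crawling
                NodeIterator it = m.listObjectsOfProperty(RDFS.seeAlso);
                while (it.hasNext()) {
                    RDFNode node = it.nextNode();
                    if (node instanceof Resource) {
                        queue.add(node.toString()); // toString() on a resource yields its URI
                    }
                }
            } catch (Exception e) {
                System.err.println("skipping " + url + ": " + e.getMessage());
            }
        }
    }
}
```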
Provenance information is stored in the repository so that data from a specific source can later be deleted or updated. This is done using RDF's reification mechanism.
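A sketch of how this can look with Jena2's reification support follows; the choice of dc:source as the provenance property and the example URIs are assumptions for illustration, since the paper does not name the property seco actually uses.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.ReifiedStatement;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.Statement;

/** Sketch: record which source a statement was harvested from, via reification. */
public class Provenance {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource person = model.createResource("http://example.org/people#alice");
        Property name = model.createProperty("http://xmlns.com/foaf/0.1/", "name");
        Property source = model.createProperty("http://purl.org/dc/elements/1.1/", "source");

        // the harvested fact itself
        Statement stmt = model.createStatement(person, name, "Alice");
        model.add(stmt);

        // reify the statement and attach the URL it was crawled from
        // (dc:source is an assumed choice of provenance property)
        ReifiedStatement rs = stmt.createReifiedStatement();
        rs.addProperty(source, model.createResource("http://example.org/alice.rdf"));
    }
}
```

To drop or refresh a source, one can then iterate over the model's reified statements, match on that property, and remove the underlying triples.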
The source data is transformed using this component. In this step, the data is aligned with the internal data model (for example, foaf:Person becomes sw:Person). Additionally, different instances are joined based on their uniquely identifying properties. In the FOAF realm, two person instances that have the same email address are considered to describe the same person. The process of merging the data set according to these properties is called smushing.
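A minimal smusher over foaf:mbox might look like the following sketch; taking the first resource seen as the canonical one and the snapshot-then-modify structure are illustrative assumptions, not the actual seco algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.rdf.model.ResourceFactory;
import com.hp.hpl.jena.rdf.model.Statement;
import com.hp.hpl.jena.rdf.model.StmtIterator;

/** Sketch of smushing: merge person instances that share a foaf:mbox value. */
public class Smusher {
    static final Property MBOX =
            ResourceFactory.createProperty("http://xmlns.com/foaf/0.1/", "mbox");

    public static void smush(Model model) {
        // snapshot the mbox statements first so the model can be modified safely
        List mboxStmts = new ArrayList();
        for (StmtIterator it = model.listStatements(null, MBOX, (RDFNode) null); it.hasNext();) {
            mboxStmts.add(it.nextStatement());
        }
        Map canonicalByMbox = new HashMap(); // mbox value -> first resource seen
        for (Iterator i = mboxStmts.iterator(); i.hasNext();) {
            Statement s = (Statement) i.next();
            String key = s.getObject().toString();
            Resource canonical = (Resource) canonicalByMbox.get(key);
            if (canonical == null) {
                canonicalByMbox.put(key, s.getSubject());
            } else if (!canonical.equals(s.getSubject())) {
                merge(model, s.getSubject(), canonical);
            }
        }
    }

    /** Move every statement about 'dup' over to 'canonical'. */
    static void merge(Model model, Resource dup, Resource canonical) {
        List toMove = new ArrayList();
        for (StmtIterator it = model.listStatements(dup, null, (RDFNode) null); it.hasNext();) {
            toMove.add(it.nextStatement());
        }
        for (Iterator i = toMove.iterator(); i.hasNext();) {
            Statement s = (Statement) i.next();
            model.add(canonical, s.getPredicate(), s.getObject());
            model.remove(s);
        }
        // statements pointing *to* dup (e.g. foaf:knows) would need the same treatment
    }
}
```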
To enable access to legacy data sources such as email (IMAP4 servers), there are wrappers available. The IMAP4 wrapper accesses the data store, fetches the data, and returns it in RDF. In addition to the IMAP4 wrapper, there are wrappers for the Google API (SOAP) and for RSS 0.92 and 2.0. A wrapper basically consists of a servlet that returns an RDF representation of the underlying data store.
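A skeleton of such a wrapper servlet, assuming Jena2 and the standard servlet API, is sketched below; fetchFromBackend() and the placeholder triple are hypothetical stand-ins for the real IMAP4 access code.

```java
import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

/** Skeleton wrapper: query a legacy back end, serve the result as RDF/XML. */
public class WrapperServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        Model model = ModelFactory.createDefaultModel();
        fetchFromBackend(model); // an IMAP4 wrapper would use JavaMail here
        resp.setContentType("application/rdf+xml");
        model.write(resp.getWriter(), "RDF/XML"); // serialize the model onto the response
    }

    private void fetchFromBackend(Model model) {
        // placeholder triple standing in for real message data
        Resource msg = model.createResource("http://example.org/messages/1");
        msg.addProperty(
                model.createProperty("http://purl.org/dc/elements/1.1/", "title"),
                "Hello from the back end");
    }
}
```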
Using wrapper technology, it is possible to encapsulate standard web services and turn them into a kind of semantic web service. For the scutter, it then makes no difference whether the crawled site is originally in RDF or not. The wrapper makes all data sources equal and easy to handle and integrate.
The user interface is based on J2EE servlet technology and is accessible via standard browsers such as Internet Explorer or Mozilla, text-based browsers such as lynx, and mobile devices. This functionality is enabled by adhering to the W3C's XHTML and CSS2 standards.
To construct the user interface from the ontology, the following steps are carried out:
Usage information is collected whenever a user follows a link to a given news item or person. This data is used for ranking purposes and will later be exploited to provide advanced personalization of the site.
There is a notion of workflow support in the application. The application supports multiple users with different roles: there are "users" and "editors", who can carry out different sets of actions. An editor, for example, can post the most interesting items to the home page with a single click.
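With the J2EE servlet API, such a role check could look like the sketch below; the "editor" role name and the promoteItem() helper are assumptions, not the actual seco code.

```java
import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/** Sketch: only authenticated editors may promote an item to the home page. */
public class PromoteServlet extends HttpServlet {
    protected void doPost(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        if (!req.isUserInRole("editor")) { // container-managed security role
            resp.sendError(HttpServletResponse.SC_FORBIDDEN, "editors only");
            return;
        }
        promoteItem(req.getParameter("item")); // hypothetical helper
        resp.sendRedirect("index.jsp");
    }

    private void promoteItem(String itemUri) {
        // would mark the given item as featured in the RDF repository
    }
}
```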
The application is 100% pure Java. The system uses current technologies such as Unicode, XHTML, CSS2, XML, and RDF, as well as emerging standards such as OWL. Everything is stored in RDF using HP's Jena2 toolkit.
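For persistence, Jena2 offers database-backed models; opening one could look like the sketch below, where the MySQL back end, the connection details, and the model name are assumptions, since the paper does not say how the repository is configured.

```java
import com.hp.hpl.jena.db.DBConnection;
import com.hp.hpl.jena.db.IDBConnection;
import com.hp.hpl.jena.db.ModelRDB;

/** Sketch: open a database-backed Jena2 model (connection details are assumed). */
public class Repository {
    public static ModelRDB open() throws Exception {
        Class.forName("com.mysql.jdbc.Driver"); // assumed JDBC driver
        IDBConnection conn =
                new DBConnection("jdbc:mysql://localhost/seco", "user", "secret", "MySQL");
        return conn.containsModel("seco")
                ? ModelRDB.open(conn, "seco")
                : ModelRDB.createModel(conn, "seco");
    }
}
```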
It's a clean, flexible, lightweight semantic web application built from scratch using open standards.
This application barely scratches the surface of what will be possible once enough quality data is publicly available. There are many challenges involved in this type of application. One major issue is how to maintain data integrity when a huge number of people with different backgrounds contribute data to the semantic web.
This submission to the Semantic Web Challenge presented a scalable FOAF and RSS repository built using state-of-the-art web technologies. FOAF and RSS are formats that are widely deployed in the weblog and semantic web communities. This application makes use of the data by aggregating and rearranging it. I hope that it encourages people to participate in the effort to make large quantities of RDF instance data available.
I'd like to thank the people on #rdfig for comments, suggestions, and code which I use in the application. I am especially grateful to Matt Biddulph for making his scutter code available.