Frequently Observed Problems
on the Web of Data

Aidan Hogan and Richard Cyganiak

Introduction

Being the adventurous soul that you are, you have decided to publish some RDF on the Web. Again, thank you! However, on the road ahead lie some common obstacles and pitfalls, but fear not! In this section, we provide a comprehensive (albeit, certainly inexhaustive) list of common problems present in RDF publishing, how they arise, why they are problematic, and how they can be fixed. We will endeavour to speak plainly and objectively. We hope that you will find this list useful, and that after reading, you will be able to deftly sidestep these common pitfalls... And if you have fallen in, don't worry: in most cases, such issues are easy to fix (if you want).

Before we get down to the nitty-gritty, we feel the need to point out that many of the “problems” listed are not actually incorrect according to any of the pertinent standards. Our goal is not merely to enumerate possible contraventions of the various standards, but to highlight and discuss common issues on the Web which are problematic with respect to:

  1. the accessibility of a particular document;
  2. parsing the document;
  3. naming and dereferencability;
  4. interpretation of datatype literals;
  5. reasoning.

Again, to reassure, we are pedants not evangelists. If you intentionally include something which we deem to be a “problem”, and you are aware of the consequences, then no problem! The list presented is for education and reference, and not intended to be a “best-practices guide”.

And so we begin…

1. Accessibility

If you're publishing some RDF on the Web, the first step is to make sure that the data can be retrieved. The following errors relate to how a document is accessed on the Web, with particular reference to HTTP-related issues.

1.1 Document Not Retrievable

Simple: a document is not externally accessible on the Web… Not to dwell too much on the issue—and besides obvious causes such as the document being nonexistent—a publisher should ensure that the document is not an internal or local resource, that authentication is not required, and that the robots.txt settings do not conflict with (at least) low-volume external access.

1.2 Incorrect Content-Type

Related to the issue of content negotiation (discussed in the sections that follow), a server indicates the media type of the returned content by means of the Content-Type field in the HTTP response header (cf. Section 14.17 of the HTTP specification). The responding server should return the most specific media type which applies to the returned document format. The correct media types for various formats likely to be used around the Web of Data are:

Format                            Media type
RDF/XML                           application/rdf+xml
Turtle                            text/turtle
N-Triples                         text/plain
N-Quads                           text/x-nquads
HTML                              text/html
XHTML                             text/html or application/xhtml+xml
XHTML with RDFa                   application/xhtml+xml
General JSON                      application/json
SPARQL Query Result XML format    application/sparql-results+xml
SPARQL Query Result JSON format   application/sparql-results+json

A frequent problem that should be avoided is the use of the generic XML media types text/xml or application/xml for specific XML formats that have their own media type, such as XHTML, RDF/XML, or SPARQL results.

1.3 Content Negotiation for the Sake of Cleverness

Beware! Content negotiation sounds great in theory, but it is a mess in practice. Implementing it correctly is surprisingly hard, or impossible if your server environment is restrictive, and it causes confusion to no end for people trying to access your server. So think twice before you use content negotiation, and avoid it if it's not necessary.

In practice, content negotiation is successfully used in the following scenarios:

If your use case is not one of the above, then don't bother with content negotiation. Just use different URIs (e.g., different file extensions) for your different variants. In your case, the advantages of content negotiation are purely theoretical, because there are no clients yet that can take advantage of it. Any new clients which are created specifically to work with your site are probably better off if you ask them to just access the different content at different URIs. So think twice before burdening your users with the complexity of content negotiation.

(One exception: If you design a standard protocol, which is to be used independently from any specific site, then you should consider content negotiation where appropriate. Hopefully you are well-versed in Web architecture!)

1.4 Content Negotiation between Inappropriate Variants

We will illustrate this problem with an example. Imagine a Web site that uses RDF to express incomplete metadata about its HTML pages. For each HTML page, say, /foo.html, there might be a corresponding RDF page at /foo.rdf that contains basic metadata (information about title, creator, creation date, and the like). But the RDF does not contain the actual main content of the HTML page. In this case, the RDF is not an appropriate alternate version of the HTML, because it does not contain the same information. Content negotiation between both variants from /foo would be inappropriate.

Another example—and one that does not involve RDF—is as follows: Imagine an important document, which is available in English, and in a Spanish translation. But the Spanish translation is not complete: the second half of the document is simply missing from the Spanish version. Again, content negotiation between both variants is inappropriate.

In general, content negotiation between different versions of the same content is only appropriate if all the variants contain the same information. Variances in format (e.g., JSON vs. XML), language, and, to some extent, quality (e.g., pristine English words and a sloppy German translation) are acceptable. But if some variants give you more information than others, then content negotiation is harmful.

Why is it harmful? Because you are cloaking information from some clients without telling them. Let's say client A accesses the URI and finds highly relevant information. Client A sends the URI to client B. Client B accesses the URI. But because of content negotiation, B receives a different version. This version does not include the highly relevant bit. B will never know what happened. The result is a Web site that treats certain clients as second-class citizens by secretly withholding information from them.

But what if you cannot give the same information to all clients? What if parts of the information are unavailable for some version? What should be done about the half-translated document? What should be done about the RDF version that captures only a part of the information in the HTML version?

Firstly, do not use content negotiation, but make the partial information available under a separate, independent URI. Secondly, if the complete resource is accessed by a client with a different preference, just give them the default version anyway. The Spanish client should get the English version; the RDF client should get the HTML. The Spaniard will recognize that it's English, and the RDF client will know based on the Content-Type response header that it cannot parse the content. So the client does not get the important information it was expecting, but it can tell that it's because of a limitation in its own capability.

Alternatively, you could respond with a 406 Not Acceptable HTTP status code. If you design mostly for humans, then this is not a good idea, because humans are quite resourceful. The Spaniard might get an English-speaking friend to translate, or might find a link that you placed in the page which leads to the incomplete Spanish version. The 406 response is more appropriate in the case of APIs or RDF clients, who probably cannot do anything with formats other than those specified in their Accept headers, so you might just as well save some bandwidth and just tell them that there is no variant of this resource that would be useful to them.

1.5 Incorrect interpretation of the Accept Header

Content negotiation is often presented in a simplified way: “If the client sends X in the Accept HTTP header, then the server returns format X. If the client sends Y, then the server returns Y.” If you think that this is the whole story, then you are likely to implement content negotiation incorrectly.

Accept headers have a fairly complex syntax. In particular:

In the common case of negotiating between RDF and an HTML rendering thereof, commonly observed problems include:

In particular, clients that accept both RDF/XML and HTML (e.g., browser plugins and clients that support RDFa as well as RDF/XML) run into problems because of server implementation problems… So please make sure that your server is not guilty of any of the problems above!

If, for whatever reason, it is impossible to implement the full algorithm in your server environment, including q values, then an approximation will have to do. Here is a good one:

  1. If no Accept header is sent by the client, assume that the client wants raw data; i.e., RDF/XML. (This is probably an unsophisticated client that has not been properly written to actually emit an appropriate Accept header, and it's much more likely that such a client is a quickly hacked data processing script than an HTML-processing Web browser.)
  2. If a raw data format—such as application/rdf+xml—is mentioned, then send that format. (A client that can process HTML and RDF/XML can probably do more interesting things with the raw data, rather than its human-readable rendering.)
  3. In all other cases, send HTML. (It's probably a Web browser.)
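
The three steps above can be sketched in a few lines of Python. This is only an illustration of the approximation, not of full content negotiation: the function name choose_format is our own, an empty header is treated like a missing one (an assumption), and q values are deliberately ignored.

```python
def choose_format(accept_header):
    """Approximate content negotiation between RDF/XML and HTML (no q values)."""
    # Step 1: no Accept header at all, so assume a data-processing script.
    if accept_header is None or not accept_header.strip():
        return "application/rdf+xml"
    # Reduce "type/subtype;q=0.9, ..." to bare media types, dropping parameters.
    media_types = [part.split(";")[0].strip().lower()
                   for part in accept_header.split(",")]
    # Step 2: a raw data format is mentioned, so send that format.
    if "application/rdf+xml" in media_types:
        return "application/rdf+xml"
    # Step 3: in all other cases it is probably a Web browser.
    return "text/html"


print(choose_format(None))
print(choose_format("text/html;q=0.9, application/rdf+xml"))
print(choose_format("text/html,application/xhtml+xml,*/*;q=0.8"))
```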

Note, however, that the existence of such heuristics is no excuse for not implementing correct handling of q values. We may sometimes show our more sensitive and considerate side, but we are still pedants after all.

1.6 Content Negotiation with Missing Vary Header

Caches are essential to the efficient operation of the Web. HTTP caches sit between client and server, and store any cacheable server responses. When another client later on requests the same resource, then the cache may directly return the stored response. So the client receives a response without the origin server being hit at all. This can significantly reduce server load.

But for this to work, the cache has to know which responses are cacheable for what kinds of requests, and for how long. Servers can indicate this by using various HTTP headers in their responses.

Content negotiation and the Vary header. If a resource has multiple representations subject to content negotiation (e.g., it has an HTML representation and an RDF representation), then caches must be made aware of this. Otherwise they might return a cached HTML response to a client requesting RDF, not knowing that the server would handle these two requests differently.

To make caches aware of multiple representations, the server must include a Vary HTTP header with any response that is subject to content negotiation. The value of the Vary header is one or more names of other HTTP headers: the headers that the server uses to select a representation.

The typical case for content negotiation with RDF is that the Accept header is used to select the appropriate representation. Therefore, a Vary HTTP header like this has to be included in content-negotiated responses:

Vary: Accept

This will prevent caches from returning representations that were generated for a different Accept header, and will prevent hard-to-debug issues where a client inexplicably sees responses in an unexpected format.
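
To see why the header matters, here is a toy model of a cache in Python. It is a sketch under heavy simplifying assumptions: the class ToyCache and its methods are invented for illustration, header names are matched case-sensitively, and expiry is ignored entirely. Real HTTP caches are considerably more involved.

```python
class ToyCache:
    """A deliberately naive cache, keyed on the URL plus any headers named in Vary."""

    def __init__(self):
        self._store = {}
        self._vary = {}  # URL -> request header names from the last Vary seen

    def _key(self, url, request_headers):
        names = self._vary.get(url, ())
        return (url,) + tuple(request_headers.get(name, "") for name in names)

    def put(self, url, request_headers, response_headers, body):
        vary = response_headers.get("Vary", "")
        self._vary[url] = tuple(n.strip() for n in vary.split(",") if n.strip())
        self._store[self._key(url, request_headers)] = body

    def get(self, url, request_headers):
        return self._store.get(self._key(url, request_headers))


url = "http://example.org/foo"  # hypothetical negotiated resource
html_request = {"Accept": "text/html"}
rdf_request = {"Accept": "application/rdf+xml"}

# Without Vary, the RDF client is handed the cached HTML.
without_vary = ToyCache()
without_vary.put(url, html_request, {}, "<html>...</html>")
print(without_vary.get(url, rdf_request))

# With Vary: Accept, the cache misses and the request goes to the server.
with_vary = ToyCache()
with_vary.put(url, html_request, {"Vary": "Accept"}, "<html>...</html>")
print(with_vary.get(url, rdf_request))
```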

2. Parsing and Syntax

RDF is a framework for representing data in a structured fashion that machines can consume; there are various concrete syntaxes for representing RDF, such as N3, N-Triples, Turtle, RDFa and RDF/XML. If RDF broadly defines the structure of the language in the form of triples, etc., then the syntaxes offer a grammar which states how to delimit parts of the language (using slashes, brackets, commas, special names, etc.) such that a machine can parse.

Here, we focus on the two most commonly used formats for RDF Web publishing; namely RDF/XML and RDFa. Part of the popularity of these formats can be attributed to their origin from two existing Web standards: resp. XML and XHTML. Indeed, syntax errors in these formats are relatively rare, with the presence of well-known syntactic validators: resp. the W3C RDF/XML validation service and the W3C Markup Validation service. Instead of enumerating all possible syntactic errors, in this section we focus on common misunderstandings in using RDF/XML and RDFa syntactic shortcuts that are not syntactic errors (and will not be flagged by the corresponding validator), but will still result in parsing triples other than intended.

2.1 RDF/XML and RDFa: Ambiguous Base-URI

Just like in HTML, in certain RDF syntaxes use of relative URIs is allowed. This allows use of abbreviated names in the document which will be resolved against the base URI: usually determined as the URL from which the document is retrieved. Although XML (and thus RDF/XML and RDFa) allows specification of an unambiguous base URI, oftentimes, such a base URI is unspecified.

So what's the problem we hear you ask? Consider a document which can be retrieved from two different locations; e.g., http://example.org/doc.rdf and http://www.example.org/doc.rdf. This document uses relative URIs but doesn't explicitly specify a base URI. Now, an agent which accesses the document from both locations will resolve the relative URIs against different base URIs, with different resulting URIs. The agent will see the same resource—when identified by a relative URI—as two different resources with distinct URIs (one version with, and one version without the www.).

Thus, unless you are sure that your base URI is unambiguous or you don't use relative URIs, we encourage use of the xml:base construct to explicitly specify the base URI, and ultimately avoid confusion.
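
The two-locations scenario can be reproduced with urljoin from Python's standard library, which resolves a relative reference against a base URI. The relative name #me is a made-up example; the two document locations are the ones used above.

```python
from urllib.parse import urljoin

relative = "#me"  # hypothetical relative URI used inside the document
bases = ["http://example.org/doc.rdf", "http://www.example.org/doc.rdf"]

# With no explicit base, each retrieval location acts as the base URI.
resolved = [urljoin(base, relative) for base in bases]
for uri in resolved:
    print(uri)

# The same resource ends up with two distinct names.
print(resolved[0] == resolved[1])
```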

One other word of warning about base URIs: depending on the combination of the base URI and the relative URI being resolved against it, a parser may unexpectedly strip part of the base URI to create what it deems to be the intended full URI. For example:
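
As a rough illustration of such stripping, the sketch below again uses urljoin from Python's standard library, which follows the standard resolution rules. The base URI and the name alice are made up for the purpose of the example.

```python
from urllib.parse import urljoin

base = "http://example.org/data/doc.rdf"  # hypothetical base URI

print(urljoin(base, "alice"))     # the last path segment (doc.rdf) is replaced
print(urljoin(base, "../alice"))  # the data/ directory is stripped as well
print(urljoin(base, "/alice"))    # the entire path is replaced
print(urljoin(base, "#alice"))    # only a fragment keeps the full base URI
```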

The moral of the story here is to be careful if using relative URIs: ensure that your base URI is unambiguous and double-check that the URIs resolve as expected. Also, if using RDF/XML, be wary of the fact that rdf:ID relative names have a different means of being resolved against base URIs…

2.2 RDF/XML: rdf:ID/rdf:nodeID/rdf:about/rdf:resource

In RDF/XML, there are four constructs for identifying things: rdf:ID, rdf:nodeID, rdf:about and rdf:resource. Jumbling them up is surprisingly easy and can result in a document which although valid, represents something completely different from what you intended. We now briefly clarify the intended use of the four constructs, and then discuss some common mistakes and confusion:

Problems mainly arise when rdf:ID is mistakenly used instead of rdf:about, rdf:nodeID or rdf:resource; or indeed, vice-versa. Firstly, on node elements, and unlike rdf:about, rdf:ID values have a '#' prepended. Secondly, when used on node elements, rdf:ID creates URIs and rdf:nodeID creates blank nodes. Thirdly, when used on property elements, rdf:ID (unlike rdf:nodeID) identifies a reified statement, and not the object of the property—to identify an object URI, rdf:resource should be used.

Again, even though a validator may give your document the thumbs up, this is only an indication that the document can be parsed into triples, not necessarily that the document parses into the triples that you intended and with the names that you intended. You should also verify that the parsed triples are as expected, and that any relative URIs resolve as expected.
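
The first of these differences, the prepended '#', can be mimicked in Python. This is a sketch of the naming rule only, not an RDF/XML parser; the function names and the base URI are our own assumptions.

```python
from urllib.parse import urljoin

BASE = "http://example.org/doc.rdf"  # hypothetical base URI of the document


def uri_for_rdf_id(value, base=BASE):
    # rdf:ID: a '#' is prepended to the value before resolving against the base.
    return urljoin(base, "#" + value)


def uri_for_rdf_about(value, base=BASE):
    # rdf:about: the value is resolved as an ordinary (possibly relative) URI.
    return urljoin(base, value)


print(uri_for_rdf_id("me"))      # names a fragment of the document
print(uri_for_rdf_about("me"))   # names a sibling of the document instead
print(uri_for_rdf_about("#me"))  # matches the rdf:ID result
```

Swapping one attribute for the other without adjusting the value therefore yields a valid document that names a different resource.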

3. Naming and Dereferencability

In RDF, we name things, give things values for named properties, define named relations to other named things and organise named things into named classes; in RDF we use URIs as names, which enables dereferencing: the URI name of a resource can be accessed, with the expectation that an RDF document is returned with some description of the named resource. Now, instead of copying and pasting all information available about all resources named in your document (or exhaustively linking to other documents using, e.g., rdfs:seeAlso), you can simply use the dereferencable URI which an agent can resolve for more information.

There are two “recipes” for creating dereferencable URIs: one uses hash-based URIs whereas the other uses slash-based URIs. The best-practices for both have been covered extensively in many documents, such as Best Practice Recipes for Publishing RDF Vocabularies and How to Publish Linked Data on the Web. To summarise here—and possibly over-simplifying—dereferencable hash-based URIs are best suited to group the descriptions of a small or moderate number of related terms into one document and one location, allowing an agent to retrieve the descriptions of multiple related terms with one HTTP lookup; dereferencable slash-based URIs are best suited to provide individual documents for each of a large number of terms, such that an agent will not need to download a massive document to find the description of one term. In any case, the choice of recipe is yours; we only wish to encourage use of dereferencable URIs where appropriate according to some best-practice such as enumerated above… for now, we list common problems you might encounter along the way…

3.1 Redirects Other Than 303

Redirects are often used to point from the URI of a non-information resource to the document which describes it; in particular the 303 See Other redirect is recommended. Although most agents will support other redirect schemes—such as 301 Moved Permanently, or 302 Found—the 303 redirect has been agreed upon as the most suitable for accessing resource descriptions and should be used...
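
On the server side, the redirect amounts to a status line and a Location header. The sketch below is framework-independent Python; both function names and the example URL are hypothetical, and wiring the result into an actual server is left out.

```python
from http import HTTPStatus


def redirect_to_description(description_url):
    """Return (status line, headers) for a 303 redirect to a describing document."""
    status = HTTPStatus.SEE_OTHER
    return f"{status.value} {status.phrase}", {"Location": description_url}


def is_recommended_redirect(status_code):
    """True only for 303; 301 and 302 also redirect, but are not the agreed-upon choice."""
    return status_code == HTTPStatus.SEE_OTHER


status_line, headers = redirect_to_description("http://example.org/doc/alice.rdf")
print(status_line, headers)
print([code for code in (301, 302, 303) if is_recommended_redirect(code)])
```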

4. Datatypes

At first, datatypes are a tricky beast. Datatypes are a means of classifying literals such that their value can be interpreted by a machine: they define how a specific class of literals should look and how they can be interpreted. The literal classes used in RDF are mostly borrowed from XML Schema and represent common types of values used to describe things on the Web, such as: integer, date, boolean, string. To keep things simple, we refine our scope to the set of built-in datatypes. However, the definition of these datatypes, and their interpretation in RDF, is replete with gotchas; even the Web-standards boffins sometimes disagree on how datatypes should be defined and handled. Panic not! We are here to help, and in this section we will enumerate some of the more commonly encountered problems in using datatypes.

4.1 Malformed Datatype Literals

If you try to tell a datatype-aware agent that A is an integer, that agent will disagree. Datatype classes have what is called a lexical representation which defines the sequences of characters which are allowed in a literal of that class. The lexical representations for all datatype classes are defined in XML Schema Part 2: Datatypes Second Edition. For example:

integer has a lexical representation consisting of a finite-length sequence of decimal digits (#x30-#x39) with an optional leading sign. If the sign is omitted, "+" is assumed. For example: -1, 0, 12678967543233, +100000.

From this, a datatype-aware agent will know that A is not a valid integer literal: in fact, asserting otherwise is an inconsistency. Although this example is fairly straightforward, many datatype classes have more complex lexical forms. In particular, the datatype classes relating to date and time are subject to errors, the most common being dateTime:

The lexical space of dateTime consists of finite-length sequences of characters of the form: '-'? yyyy '-' mm '-' dd 'T' hh ':' mm ':' ss ('.' s+)? (zzzzzz)?, where

For example, 2002-10-10T12:00:00-05:00 (noon on 10 October 2002, Central Daylight Savings Time as well as Eastern Standard Time in the U.S.) is 2002-10-10T17:00:00Z, five hours later than 2002-10-10T12:00:00Z.

The most common errors relating to dateTime include the use of plain-text values such as 12:32 Feb 7 2008, the omission of the mandatory seconds field, and the omission of the : delimiters.
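
A quick way to catch such errors before publishing is to test values against the lexical forms quoted above. The Python sketch below transcribes the two forms into regular expressions. It checks shape only (it would accept a month of 13, for instance), and the names INTEGER, DATETIME and is_lexically_valid are our own.

```python
import re

# xsd:integer: an optional sign followed by decimal digits.
INTEGER = re.compile(r"[+-]?[0-9]+")

# xsd:dateTime: '-'? yyyy '-' mm '-' dd 'T' hh ':' mm ':' ss ('.' s+)? (zzzzzz)?
DATETIME = re.compile(
    r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}"          # date part
    r"T[0-9]{2}:[0-9]{2}:[0-9]{2}(\.[0-9]+)?"  # time part; seconds are mandatory
    r"(Z|[+-][0-9]{2}:[0-9]{2})?"              # optional timezone
)


def is_lexically_valid(pattern, literal):
    """True if the whole literal matches the given lexical pattern."""
    return pattern.fullmatch(literal) is not None


for value in ["A", "-1", "+100000"]:
    print(value, is_lexically_valid(INTEGER, value))

for value in ["2002-10-10T12:00:00-05:00", "12:32 Feb 7 2008",
              "2002-10-10T12:00", "2002-10-10T120000"]:
    print(value, is_lexically_valid(DATETIME, value))
```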

4.2 Incompatibility with Range Datatype

Properties can have a defined range which states that a value of that property (object of a triple with that property in the predicate position) must be of a certain class. Not only can a range be a class of individuals (e.g., the knows property has range Person), but it can also be a datatype class (e.g., the lastModified property has range dateTime).

To understand how problems arise, we need to look a bit deeper into the interpretation of datatypes and datatype literals. Firstly, it is important to note that the class of plain literals without language tags (literals without a datatype or language tag) can be considered equivalent to the datatype class xsd:string. Secondly, a literal cannot have both a datatype and a language tag (if you try to give a literal both a language tag and a datatype in RDF/XML, the language tag will most often be ignored). Thirdly, there are two types of XML Schema datatypes: primitive datatypes and derived datatypes, where the latter are defined in terms of (derived from) a parent datatype; all derived datatypes have exactly one primitive datatype ancestor, and a member of a derived datatype is also considered a member of all ancestor datatypes. To take an example (see the type hierarchy in XML Schema Part 2 for the full tree), nonPositiveInteger is a datatype derived from integer, which is in turn derived from decimal; a member of nonPositiveInteger is also a member of integer and decimal. Finally, all of the primitive XML Schema datatypes are disjoint from each other; this means that a literal cannot be a member of more than one primitive datatype (or, as it follows, of derived datatypes with different primitive datatype ancestors).

Okay, deep breath. Clearly, what we have from above is a recipe for confusion. Plain literals are xsd:strings? … unless they have language tags? … in which case they cannot have datatypes? … and float and decimal are disjoint? …

Respectively, yes, yes, yes and yes. What is more, remember that properties can have datatypes defined as range. Now, every time you use that property, you must ensure that you give a value whose datatype is compatible (not disjoint) with the defined range. One common misconception is that if the range of a property (e.g., lastModified) is a certain datatype (e.g., xsd:dateTime), and a plain literal value is given for that property (e.g., “2002-10-10T12:00:00Z”), then that plain literal will be converted into a typed literal (e.g., “2002-10-10T12:00:00Z”^^xsd:dateTime). This is not so; in fact, it is an inconsistency, since the plain literal is treated as an xsd:string, which is disjoint with the property's range xsd:dateTime.

The safest option to avoid such confusion is fairly straightforward: if the range of a property that you're using is a datatype, specifically type each value for that property using that exact datatype, and ensure that the value abides by the lexical form of that datatype; e.g., every time you use lastModified, specify that the value is an xsd:dateTime, and ensure that the value is a lexically valid xsd:dateTime.
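
For publishers who generate their data programmatically, the advice boils down to always emitting the datatype alongside the value. A minimal Python sketch follows; the helper name datetime_literal is hypothetical, and the output uses N-Triples-style syntax with the full datatype URI.

```python
from datetime import datetime, timezone

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def datetime_literal(moment):
    """Serialise a timezone-aware datetime as an explicitly typed xsd:dateTime literal."""
    # Normalise to UTC so that the lexical form can end in 'Z'.
    lexical = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f'"{lexical}"^^<{XSD_DATETIME}>'


print(datetime_literal(datetime(2002, 10, 10, 12, 0, 0, tzinfo=timezone.utc)))
```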

5. Reasoning

In RDF, like in many aspects of life, it seems that (i) the more complex something is, the greater the potential for it to go wrong; and (ii) the more powerful something is, the greater the potential for it to be noticeable when it does go wrong. This brings us neatly onto reasoning (in fact, fellow pedants will be outraged that we already snuck in some reasoning issues in the last section under the subterfuge of “datatypes”).

Although there exist standard entailment regimes and reasoning fragments, there does not exist a perfect solution to all problems. Hence, reasoning takes many forms and is many things to many people. We're not so much concerned that you'll make mistakes which will cause Description Logics aficionados to spill coffee over their pristine lab coats: the people over in FOAF have been doing that for years by, e.g., defining foaf:mbox_sha1sum (a datatype-property) as inverse-functional—(how dare they!). Even if you're not worrying about reasoning examples involving the patricidal and incestuous destiny of Oedipus, we're on your side! Instead, in this section we hope to cover those aspects of reasoning which are of immediate importance to the Web.

5.1 Bogus Values for Inverse-Functional Properties

An inverse-functional property is a property whose value uniquely identifies a resource. Examples include properties such as ISBN codes for books, social security numbers for people, physical MAC addresses for devices, and so on. Inverse-functional properties are pretty handy on the Web: oftentimes people don't agree on what URI to use for a particular resource; e.g., a book; but as long as they give a consistent value for a consistent inverse-functional property; e.g., use the same ISBN property with the same value for that book; people don't have to agree on URIs and a reasoner will be able to conclude that it's the same book being discussed. In other words, since people don't always agree upon URIs, inverse-functional properties allow people to identify resources according to values already agreed upon (ISBNs, SSNs, MACs, etc.).

Great. So what's the problem? Well, unfortunately, publishers sometimes give nonsensical values for inverse-functional properties. The most common example of this is the FOAF inverse-functional property foaf:mbox_sha1sum, intended to represent an encoded version of a person's email address, and defined to uniquely identify a person. This property is commonly instantiated—particularly from social networking exporters which externalise a public FOAF profile for each of their users—and is subsequently used to match descriptions of people across different sites and different URI naming schemes. Unfortunately however, many exporters do not bother to validate user-input correctly (e.g., allow users to leave email fields blank) and hence export bogus values for foaf:mbox_sha1sum such as 08445a31a78661b5c746feff39a9db6e4e2cc5cf and da39a3ee5e6b4b0d3255bfef95601890afd80709; the former is the encoded sha1-sum of the string “mailto:” and the latter is the sha1-sum of an empty string. A quick Google of the former value will reveal hundreds of thousands of results, which upon quick inspection, are mostly RDF files and values for foaf:mbox_sha1sum. Now, a reasoner will interpret any individual with this value for foaf:mbox_sha1sum as being the same person, resulting in what we call the “God Entity”: an omnipresent individual with hundreds of thousands of names, locations, friends, homepages, and so on.

The moral of the story? If using an inverse-functional property, ensure that the corresponding values are correct and do apply to the resource(s) in question. If creating or maintaining an exporter for RDF data based on user input, try to make sure that omitted input translates into omitted output, not blank or bogus data.
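
An exporter can guard against this in a few lines. The Python sketch below computes the sha1sum of the mailto: URI, as foaf:mbox_sha1sum expects, and returns nothing at all for missing input; the function names are our own, and real exporters would likely want stricter validation of the address itself.

```python
import hashlib


def sha1_hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# The sums produced by hashing missing input; these should never be exported.
BOGUS_SUMS = {sha1_hex(""), sha1_hex("mailto:")}


def mbox_sha1sum(email):
    """Return the sha1sum of the mailto: URI, or None if no usable address was given."""
    address = (email or "").strip()
    if not address:
        return None  # omitted input translates into omitted output
    if not address.startswith("mailto:"):
        address = "mailto:" + address
    digest = sha1_hex(address)
    return None if digest in BOGUS_SUMS else digest


print(mbox_sha1sum(""))                   # no triple should be emitted
print(mbox_sha1sum("mailto:"))            # likewise
print(mbox_sha1sum("alice@example.org"))  # hypothetical address
```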

5.2 Inconsistencies

In the immortal words of ancient cartographers: HC SVNT DRACONES! Inconsistencies, simply put, are the assertion or implication that contradictory statements are true, or indeed, that a single impossible statement is true. Inconsistencies can take many forms and can occur for various reasons.

Inconsistencies can occur if trying to reconcile different world-views from different data publishers. For example, an atheist will assert that God is an ImaginaryBeing whereas a theist will assert that God is a RealBeing, although ImaginaryBeing is clearly disjoint with RealBeing. Such disagreement can occur even in more concrete domains: a botanist will assert that a Tomato is a Fruit whereas a taxman will tell you that a Tomato is a Vegetable and apply a tariff accordingly. Such inconsistencies are due to genuine disagreement between publishers and (with the danger here of getting more coffee stains on lab-coats) are not a bad thing at all and probably best left unresolved.

However, most inconsistencies currently found on the Web result directly from mistakes in RDF documents or disagreement on the identification of resources, and can be resolved. Also, almost all inconsistencies are caused by resources found to be members of disjoint classes. One of the most common causes is using a URI to describe two completely different things: e.g., using a person's homepage URI to identify both the homepage and the person (clearly, a resource cannot be both a homepage and a person). Another common cause is using a property or class on the basis of its label and not verifying that its semantics are suitable. For example, the somewhat generically named foaf:img property is used to relate people to pictures they appear in, and so has its domain defined as foaf:Person (thus, every resource described with a value for foaf:img must be a foaf:Person); however, publishers commonly use this property on anything from documents to countries, leading to inconsistencies.

Tracing and resolving the latter category of inconsistencies can be difficult, especially considering that multiple parties may be involved. However, publishers can help to avoid inconsistencies by double-checking the semantics of existing terms they wish to re-use, and by carefully choosing new URIs for identifying resources.
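
To make the foaf:img case concrete, here is a toy Python check that infers class membership from a property's domain and then looks for membership of disjoint classes. It is a sketch, not a reasoner: the class ex:Country, its disjointness with foaf:Person, and the data are all invented for illustration; only the domain of foaf:img comes from the discussion above.

```python
# Schema knowledge: foaf:img has domain foaf:Person.
DOMAINS = {"foaf:img": "foaf:Person"}
# Assumption for illustration only: a made-up class declared disjoint with foaf:Person.
DISJOINT = {frozenset({"foaf:Person", "ex:Country"})}


def find_inconsistencies(triples):
    """Return (resource, disjoint class pair) for every clash found."""
    types = {}
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            types.setdefault(subject, set()).add(obj)
        if predicate in DOMAINS:
            # Using the property implies membership of its domain.
            types.setdefault(subject, set()).add(DOMAINS[predicate])
    clashes = []
    for resource, classes in sorted(types.items()):
        for pair in DISJOINT:
            if pair <= classes:
                clashes.append((resource, tuple(sorted(pair))))
    return clashes


data = [
    ("ex:Ireland", "rdf:type", "ex:Country"),
    ("ex:Ireland", "foaf:img", "ex:flag.png"),  # property chosen by its label alone
]
print(find_inconsistencies(data))
```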