Introduction
Being the adventurous soul that you are, you have decided to publish some RDF on the Web. Again, thank you! However, on the road ahead lie some common obstacles and pitfalls, but fear not! In this section, we provide a comprehensive (albeit, certainly inexhaustive) list of common problems present in RDF publishing, how they arise, why they are problematic, and how they can be fixed. We will endeavour to speak plainly and objectively. We hope that you will find this list useful, and that after reading, you will be able to deftly sidestep these common pitfalls... And if you have fallen in, don't worry: in most cases, such issues are easy to fix (if you want).
Before we get down to the nitty-gritty, we feel the need to point out that many of the “problems” listed are not actually incorrect according to any of the pertinent standards. Our goal is not merely to enumerate possible contraventions of the various standards, but to highlight and discuss common issues on the Web which are problematic with respect to:
- the accessibility of a particular document;
- parsing the document;
- naming and dereferencability;
- interpretation of datatype literals;
- reasoning.
Again, to reassure, we are pedants not evangelists. If you intentionally include something which we deem to be a “problem”, and you are aware of the consequences, then no problem! The list presented is for education and reference, and not intended to be a “best-practices guide”.
And so we begin…
Table of Contents
- 1. Accessibility
- 2. Parsing and Syntax
- 3. Naming and Dereferencability
- 4. Datatype Literals
- 5. Reasoning
1. Accessibility
If you're publishing some RDF on the Web, the first step is to make sure that the data can be retrieved. The following errors relate to how a document is accessed on the Web, with particular reference to HTTP-related issues.
1.1 Document Not Retrievable
Simple: a document is not externally accessible on the Web… Not to dwell too much on the issue, and besides obvious causes such as the document being nonexistent, a publisher should ensure that the document is not an internal or local resource, that authentication is not required, and that the robots.txt settings do not conflict with (at least) low-volume external access.
1.2 Incorrect Content-Type
Related to the issue of content negotiation, a server declares the media type of the returned content by means of the Content-Type field in the HTTP response header (cf. Section 14.17 of the HTTP specification). Again, the responding server should return the most specific media type which applies to the returned document format. The correct media types for various formats likely to be used around the Web of Data are:
Format | Media type |
---|---|
RDF/XML | application/rdf+xml |
Turtle | text/turtle |
N-Triples | text/plain |
N-Quads | text/x-nquads |
HTML | text/html |
XHTML | text/html or application/xhtml+xml |
XHTML with RDFa | application/xhtml+xml |
General JSON | application/json |
SPARQL Query Result XML format | application/sparql-results+xml |
SPARQL Query Result JSON format | application/sparql-results+json |
A frequent problem that should be avoided is the use of the generic XML media types text/xml or application/xml for specific XML formats that have their own media type, such as XHTML, RDF/XML, or SPARQL results.
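As a rough illustration of the table above, a publisher could select the Content-Type from the file extension and flag the generic XML types. This is only a sketch: the extension-to-type mapping and the function names are assumptions made for this example and are not defined by any standard.

```python
import os

# Media types from the table above, keyed by (hypothetical) file extensions.
MEDIA_TYPES = {
    ".rdf": "application/rdf+xml",
    ".ttl": "text/turtle",
    ".nt": "text/plain",
    ".nq": "text/x-nquads",
    ".html": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".json": "application/json",
    ".srx": "application/sparql-results+xml",
    ".srj": "application/sparql-results+json",
}

# Generic XML types that should not be used for RDF/XML, XHTML or SPARQL results.
GENERIC_XML = {"text/xml", "application/xml"}


def content_type_for(filename):
    """Return the specific media type for a filename, or None if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return MEDIA_TYPES.get(extension)


def is_too_generic(content_type):
    """True if a Content-Type value is one of the generic XML media types."""
    # Drop parameters such as ";charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() in GENERIC_XML
```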
1.3 Content Negotiation for the Sake of Cleverness
Beware! Content negotiation sounds great in theory, but it is a mess in practice. Implementing it correctly is surprisingly hard, or impossible if your server environment is restrictive, and it causes confusion to no end for people trying to access your server. So think twice before you use content negotiation, and avoid it if it's not necessary.
In practice, content negotiation is successfully used in the following scenarios:
- Redirecting users to country- or language-specific sub-sites based on their IP address.
- Delivering browser-specific versions of a site to work around browser incompatibilities, based on the browser's User-Agent header. (This is considered poor practice by Web standards advocates, but is common nonetheless.)
- Different translations of the same document are displayed based on the Accept-Language header. Browsers send different Accept-Language headers based on the settings chosen by the user in the browser or system preferences.
- Some REST-style Web APIs allow clients to choose between different data formats, such as XML and JSON, based on the client's Accept HTTP header. (The usefulness of this can be questioned, because clients are custom-built for those APIs, and that would be an easier task if the API would simply use different URIs for the different variants.)
- The Linked Data style of publishing RDF uses content negotiation to provide convenient human-readable versions of the published RDF data. Based on the Accept header, either RDF or HTML is delivered. Linked data browsers and other RDF clients have to send an appropriate Accept header to get the RDF.
If your use case is not one of the above, then don't bother with content negotiation. Just use different URIs (e.g., different file extensions) for your different variants. In your case, the advantages of content negotiation are purely theoretical, because there are no clients yet that can take advantage of it. Any new clients which are created specifically to work with your site are probably better off if you ask them to just access the different content at different URIs. So think twice before burdening your users with the complexity of content negotiation.
(One exception: If you design a standard protocol, which is to be used independently from any specific site, then you should consider content negotiation where appropriate. Hopefully you are firm in Web architecture!)
1.4 Content Negotiation between Inappropriate Variants
We will illustrate this problem with an example. Imagine a Web site that uses RDF to express incomplete metadata about its HTML pages. For each HTML page, say, /foo.html, there might be a corresponding RDF page at /foo.rdf that contains basic metadata (information about title, creator, creation date, and the like). But the RDF does not contain the actual main content of the HTML page. In this case, the RDF is not an appropriate alternate version of the HTML, because it does not contain the same information. Content negotiation between both variants from /foo would be inappropriate.
Another example—and one that does not involve RDF—is as follows: Imagine an important document, which is available in English, and in a Spanish translation. But the Spanish translation is not complete: the second half of the document is simply missing from the Spanish version. Again, content negotiation between both variants is inappropriate.
In general, content negotiation between different versions of the same content is only appropriate if all the variants contain the same information. Variances in format (e.g., JSON vs. XML), language, and quality (to some extent; e.g., pristine English words and a sloppy German translation) are acceptable. But if some variants give you more information than others, then content negotiation is harmful.
Why is it harmful? Because you are cloaking information from some clients without telling them. Let's say client A accesses the URI and finds highly relevant information. Client A sends the URI to client B. Client B accesses the URI. But because of content negotiation, B receives a different version. This version does not include the highly relevant bit. B will never know what happened. The result is a Web site that treats certain clients as second-class citizens by secretly withholding information from them.
But what if you cannot give the same information to all clients? What if parts of the information are unavailable for some version? What should be done about the half-translated document? What should be done about the RDF version that captures only a part of the information in the HTML version?
Firstly, do not use content negotiation, but make the partial information available under a separate, independent URI. Secondly, if the complete resource is accessed by a client with a different preference, just give them the default version anyway. The Spanish client should get the English version; the RDF client should get the HTML. The Spaniard will recognize that it's English, and the RDF client will know based on the Content-Type response header that it cannot parse the content. So the client does not get the important information it was expecting, but it can tell that it's because of a limitation in its own capability.
Alternatively, you could respond with a 406 Not Acceptable HTTP status code. If you design mostly for humans, then this is not a good idea, because humans are quite resourceful. The Spaniard might get an English-speaking friend to translate, or might find a link that you placed in the page which leads to the incomplete Spanish version. The 406 response is more appropriate in the case of APIs or RDF clients, who probably cannot do anything with formats other than those specified in their Accept headers, so you might just as well save some bandwidth and just tell them that there is no variant of this resource that would be useful to them.
1.5 Incorrect interpretation of the Accept Header
Content negotiation is often presented in a simplified way: “If the client sends X in the Accept HTTP header, then the server returns format X. If the client sends Y, then the server returns Y.” If you think that this is the whole story, then you are likely to implement content negotiation incorrectly.
Accept headers have a fairly complex syntax. In particular:
- Accept headers can include multiple media types, separated by comma. The following header would indicate that the client prefers either RDF/XML or Turtle: application/rdf+xml,text/turtle.
- Media types, such as text/html, can include additional parameters appended after a semicolon: text/html;charset=utf-8. It is often sufficient to just ignore the parameters.
- One parameter is of particular importance though: the quality parameter—also known as the q value. Clients use q values to indicate preference of some media types over others. In the following example, the client indicates that it prefers RDF/XML, but would also accept HTML with a lower preference: application/rdf+xml;q=1.0,text/html;q=0.4. Note that a q value of 1.0 is the default and can be omitted. If a server has both RDF/XML and HTML, it should return RDF/XML, because the client has indicated a higher preference.
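The points above can be sketched in a few lines of Python. This is a simplified illustration rather than a complete implementation of HTTP content negotiation: it ignores all media type parameters other than q, the function names are our own, and it expects the caller to handle a missing Accept header separately.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    prefs = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1.0 when omitted
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q value as "not acceptable"
        prefs.append((media_range, q))
    return prefs


def match_q(media_type, prefs):
    """Return the q value that applies to media_type (0.0 if none does)."""
    main_type = media_type.split("/")[0]
    # An exact match takes precedence over type/* which takes precedence over */*.
    for candidate in (media_type, main_type + "/*", "*/*"):
        for media_range, q in prefs:
            if media_range == candidate:
                return q
    return 0.0


def choose_variant(header, available):
    """Pick the available media type with the highest q value.

    `available` is given in the server's own order of preference, which is
    only used to break ties. Returns None if no variant is acceptable.
    """
    prefs = parse_accept(header)
    best, best_q = None, 0.0
    for media_type in available:
        q = match_q(media_type, prefs)
        if q > best_q:
            best, best_q = media_type, q
    return best
```

With the header from the last bullet, `choose_variant("application/rdf+xml;q=1.0,text/html;q=0.4", ["text/html", "application/rdf+xml"])` selects RDF/XML even though the server lists HTML first, because the decision is made on q values and not on position.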
In the common case of negotiating between RDF and an HTML rendering thereof, commonly observed problems include:
- not recognising media types if they include a parameter—e.g., text/html;charset=utf-8 or application/rdf+xml;q=0.9;
- always sending HTML when several media types are specified in the Accept header;
- always sending HTML when both RDF and HTML are in the Accept header, even if RDF has a higher q value;
- choosing between RDF and HTML based on which appears first (or last) in the Accept header, rather than based on their q values;
- redirecting to a nonexistent URI, such as something.rdf.html, when both RDF and HTML are in the Accept header.
In particular, clients that accept both RDF/XML and HTML (e.g., browser plugins and clients that support RDFa as well as RDF/XML) run into problems because of server implementation problems… So please make sure that your server is not guilty of any of the problems above!
If, for whatever reason, it is impossible to implement the full algorithm in your server environment, including q values, then an approximation will have to do. Here is a good one:
- If no Accept header is sent by the client, assume that the client wants raw data; i.e., RDF/XML. (This is probably an unsophisticated client that has not been properly written to actually emit an appropriate Accept header, and it's much more likely that such a client is a quickly hacked data processing script than an HTML-processing Web browser.)
- If a raw data format—such as application/rdf+xml—is mentioned, then send that format. (A client that can process HTML and RDF/XML can probably do more interesting things with the raw data, rather than its human-readable rendering.)
- In all other cases, send HTML. (It's probably a Web browser.)
Note, however, that the existence of such heuristics is no excuse for not implementing correct handling of q values. We may sometimes show our more sensitive and considerate side, but we are still pedants after all.
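For environments where only the approximation is feasible, the three rules above might look as follows. This is a sketch under stated assumptions: the function name is ours, and the list of raw data formats assumes a server that can produce both RDF/XML and Turtle.

```python
# Raw data formats this (hypothetical) server can produce, in order of preference.
RAW_DATA_TYPES = ("application/rdf+xml", "text/turtle")


def approximate_variant(accept_header):
    """Approximate content negotiation without q values (see the rules above)."""
    # Rule 1: no Accept header at all, so assume a script that wants raw data.
    if accept_header is None or not accept_header.strip():
        return "application/rdf+xml"
    # Rule 2: a raw data format is mentioned anywhere in the header.
    header = accept_header.lower()
    for media_type in RAW_DATA_TYPES:
        if media_type in header:
            return media_type
    # Rule 3: everything else is probably a Web browser.
    return "text/html"
```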
1.6 Content Negotiation with Missing Vary Header
Caches are essential to the efficient operation of the Web. HTTP caches sit between client and server, and store any cacheable server responses. When another client later on requests the same resource, then the cache may directly return the stored response. So the client receives a response without the origin server being hit at all. This can significantly reduce server load.
But for this to work, the cache has to know which responses are cacheable for what kinds of requests, and for how long. Servers can indicate this by using various HTTP headers in their responses.
Content negotiation and the Vary header. If a resource has multiple representations subject to content negotiation (e.g., it has an HTML representation and an RDF representation), then caches must be made aware of this. Otherwise they might return a cached HTML response to a client requesting RDF, not knowing that the server would handle these two requests differently.
To make caches aware of multiple representations, the server must include a Vary HTTP header with any response that is subject to content negotiation. The value of the Vary header is one or more names of other HTTP headers: the headers that the server uses to select a representation.
The typical case for content negotiation with RDF is that the Accept header is used to select the appropriate representation. Therefore, a Vary HTTP header like this has to be included in content-negotiated responses:
Vary: Accept
This will prevent caches from returning representations that were generated for a different Accept header, and will prevent hard-to-debug issues where a client inexplicably sees responses in an unexpected format.
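To make this concrete, here is a minimal WSGI sketch of a content-negotiated resource that always sends the Vary header. The application name, the placeholder response bodies and the deliberately crude selection logic are assumptions made for the example; a real server should honour q values as described in the previous section.

```python
def negotiated_app(environ, start_response):
    """Minimal WSGI app that negotiates between RDF/XML and HTML."""
    accept = environ.get("HTTP_ACCEPT", "")
    # Crude selection for brevity: RDF/XML if mentioned, HTML otherwise.
    if "application/rdf+xml" in accept.lower():
        content_type = "application/rdf+xml"
        body = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    else:
        content_type = "text/html"
        body = b"<html><body>Human-readable version</body></html>"
    headers = [
        ("Content-Type", content_type),
        ("Content-Length", str(len(body))),
        # Tell caches that the response depends on the request's Accept header.
        ("Vary", "Accept"),
    ]
    start_response("200 OK", headers)
    return [body]
```

The header is included in both branches: a cache must learn that the response varies regardless of which representation it happens to see first. For local experiments, such an application can be served with `wsgiref.simple_server.make_server` from the Python standard library.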
2. Parsing and Syntax
RDF is a framework for representing data in a structured fashion that machines can consume; there are various concrete syntaxes for representing RDF, such as N3, N-Triples, Turtle, RDFa and RDF/XML. If RDF broadly defines the structure of the language in the form of triples, etc., then the syntaxes offer a grammar which states how to delimit parts of the language (using slashes, brackets, commas, special names, etc.) such that a machine can parse.
Here, we focus on the two most commonly used formats for RDF Web publishing; namely RDF/XML and RDFa. Part of the popularity of these formats can be attributed to their origin from two existing Web standards: XML and XHTML, respectively. Indeed, syntax errors in these formats are relatively rare, thanks to the presence of well-known syntactic validators: respectively, the W3C RDF/XML validation service and the W3C Markup Validation service. Instead of enumerating all possible syntactic errors, in this section we focus on common misunderstandings in using RDF/XML and RDFa syntactic shortcuts which are not syntactic errors (and will not be flagged by the corresponding validator), but will still result in parsing triples other than intended.
2.1 RDF/XML and RDFa: Ambiguous Base-URI
Just like in HTML, in certain RDF syntaxes use of relative URIs is allowed. This allows use of abbreviated names in the document which will be appended onto the base URI: usually determined as the URL from which the document is retrieved. Although XML (and thus RDF/XML and RDFa) allows specification of an unambiguous base URI, oftentimes, such a base URI is unspecified.
So what's the problem, we hear you ask? Consider a document which can be retrieved from two different locations; e.g., http://example.org/doc.rdf and http://www.example.org/doc.rdf. This document uses relative URIs but doesn't explicitly specify a base URI. Now, an agent which accesses the document from both locations will resolve the relative URIs against different base URIs, with different resulting URIs. The agent will see the same resource, when identified by a relative URI, as two different resources with distinct URIs (one version with, and one version without the www.).
Thus, unless you are sure that your base URI is unambiguous or you don't use relative URIs, we encourage use of the xml:base construct to explicitly specify the base URI, and ultimately avoid confusion.
One other word of warning about base URIs: depending on the combination of the base URI and the relative URI being resolved against it, a parser may unexpectedly strip part of the base URI to create what it deems to be the intended full URI. For example:
- “http://example.org/dangling/” + “name” = “http://example.org/dangling/name”
- “http://example.org/dangling” + “name” = “http://example.org/name”
- “http://example.org/dangling” + “” = “http://example.org/dangling”
- “http://example.org/dangling” + “/name” = “http://example.org/name”
- “http://example.org/dangling/” + “/name” = “http://example.org/name”
- “http://example.org/dangling#” + “name” = “http://example.org/name”
- “http://example.org/dangling” + “#name” = “http://example.org/dangling#name”
- “http://example.org/dangling#” + “#name” = “http://example.org/dangling#name”
- “http://example.org/dangling#” + “/name” = “http://example.org/name”
The moral of the story here is to be careful if using relative URIs: ensure that your base URI is unambiguous and double-check that the URIs resolve as expected. Also, if using RDF/XML, be wary of the fact that rdf:ID relative names have a different means of being resolved against base URIs…
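One way to double-check how relative URIs resolve is to run them through a generic URI resolver before publishing. The sketch below uses `urljoin` from Python's standard library on the combinations discussed in this section; a given RDF parser may use its own resolution routine, so treat the output as a sanity check rather than a guarantee.

```python
from urllib.parse import urljoin

# (base URI, relative URI) pairs from the examples in this section.
CASES = [
    ("http://example.org/dangling/", "name"),
    ("http://example.org/dangling", "name"),
    ("http://example.org/dangling", ""),
    ("http://example.org/dangling", "/name"),
    ("http://example.org/dangling/", "/name"),
    ("http://example.org/dangling#", "name"),
    ("http://example.org/dangling", "#name"),
    ("http://example.org/dangling#", "#name"),
    ("http://example.org/dangling#", "/name"),
]


def resolve_all(cases):
    """Resolve each relative URI against its base URI."""
    return [(base, relative, urljoin(base, relative)) for base, relative in cases]


for base, relative, resolved in resolve_all(CASES):
    print(f"{base!r} + {relative!r} = {resolved!r}")
```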
2.2 RDF/XML: rdf:ID/rdf:nodeID/rdf:about/rdf:resource
In RDF/XML, there are four constructs for identifying things: rdf:ID, rdf:nodeID, rdf:about and rdf:resource. Jumbling them up is surprisingly easy and can result in a document which, although valid, represents something completely different from what you intended. We now briefly clarify the intended use of the four constructs, and then discuss some common mistakes and confusion:
- rdf:about: Used solely as an attribute on a "node element" to uniquely identify a resource by means of a URI. The URI can be specified in full, or as a relative URI which will be resolved against the in-scope base URI.
- rdf:resource: Used solely as an attribute on a "property element" to specify a URI value for an object. Similarly to rdf:about, the URI may be given in full, or as a relative URI which will be resolved against the in-scope base URI.
- rdf:ID: Used as an attribute to provide unique relative XML names which will be appended onto the base URI. When used on a node element, rdf:ID="xmlname" acts roughly like rdf:about="#xmlname"; however, rdf:ID values must be unique names and must be valid XML names. Can also be used on a "property element" to identify a reified statement (valid, but rare usage).
- rdf:nodeID: When used on a node element (in the subject position), acts similarly to rdf:ID, but provides unique names which are used to create blank nodes instead of URIs. When used on a property element, allows for specifying a blank-node object.
Problems mainly arise when rdf:ID is mistakenly used instead of rdf:about, rdf:nodeID or rdf:resource; or indeed, vice-versa. Firstly, on node elements, and unlike rdf:about, rdf:ID values have a '#' prepended. Secondly, when used on node elements, rdf:ID creates URIs and rdf:nodeID creates blank nodes. Thirdly, when used on property elements, rdf:ID (unlike rdf:nodeID) identifies a reified statement, and not the object of the property; to identify an object URI, rdf:resource should be used.
Again, even though a validator may give your document the thumbs up, this is only an indication that the document can be parsed into triples, not necessarily that the document parses into the triples that you intended and with the names that you intended. You should also verify that the parsed triples are as expected, and that any relative URIs resolve as expected.
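The difference in how rdf:about and rdf:ID values are turned into URIs can be mimicked with a generic URI resolver. The following is only a sketch of the resolution step: the function names are ours, and the additional requirements on rdf:ID (uniqueness, valid XML names) are not checked.

```python
from urllib.parse import urljoin


def resolve_about(base, value):
    """rdf:about and rdf:resource: resolved as an ordinary relative URI."""
    return urljoin(base, value)


def resolve_id(base, name):
    """rdf:ID: a '#' is prepended to the name before it is resolved."""
    return urljoin(base, "#" + name)


base = "http://example.org/doc.rdf"
print(resolve_about(base, "alice"))  # replaces the last path segment
print(resolve_id(base, "alice"))     # becomes a fragment of the document URI
```

The same string therefore names two different resources depending on which attribute carries it, which is exactly the kind of difference that a validator will not flag.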
3. Naming and Dereferencability
In RDF, we name things, give things values for named properties, define named relations to other named things and organise named things into named classes; in RDF we use URIs as names, which enables dereferencing: the URI name of a resource can be accessed, with the expectation that an RDF document is returned with some description of the named resource. Now, instead of copying and pasting all information available about all resources named in your document (or exhaustively linking to other documents using, e.g., rdfs:seeAlso), you can simply use the dereferencable URI which an agent can resolve for more information.
There are two “recipes” for creating dereferencable URIs: one uses hash-based URIs whereas the other uses slash-based URIs. The best-practices for both have been covered extensively in many documents, such as Best Practice Recipes for Publishing RDF Vocabularies and How to Publish Linked Data on the Web. To summarise here—and possibly over-simplifying—dereferencable hash-based URIs are best suited to group the descriptions of a small or moderate number of related terms into one document and one location, allowing an agent to retrieve the descriptions of multiple related terms with one HTTP lookup; dereferencable slash-based URIs are best suited to provide individual documents for each of a large number of terms, such that an agent will not need to download a massive document to find the description of one term. In any case, the choice of recipe is yours; we only wish to encourage use of dereferencable URIs where appropriate according to some best-practice such as enumerated above… for now, we list common problems you might encounter along the way…
3.1 Redirects Other Than 303
Redirects are often used to point from the URI of a non-information resource to the document which describes it; in particular the 303 See Other redirect is recommended. Although most agents will support other redirect schemes (such as 301 Moved Permanently, or 302 Found), the 303 redirect has been agreed upon as the most suitable for accessing resource descriptions and should be used...
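As an illustration, a slash-based setup might redirect from the URI of a thing to the document describing it as in the WSGI sketch below. The /id/ and /doc/ path layout, the application name and the placeholder body are hypothetical choices for this example, not a prescribed scheme.

```python
def redirect_app(environ, start_response):
    """WSGI sketch: 303 from /id/<name> (the thing) to /doc/<name> (its description)."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/id/"):
        # The URI names a non-information resource: point to its description.
        location = "/doc/" + path[len("/id/"):]
        start_response("303 See Other", [("Location", location), ("Content-Length", "0")])
        return [b""]
    if path.startswith("/doc/"):
        body = b"(description of the resource would go here)"
        start_response("200 OK", [("Content-Type", "text/plain"),
                                  ("Content-Length", str(len(body)))])
        return [body]
    start_response("404 Not Found", [("Content-Length", "0")])
    return [b""]
```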
4. Datatypes
At first, datatypes are a tricky beast. Datatypes are a means of classifying literals such that their value can be interpreted by a machine: they define how a specific class of literals should look and how they can be interpreted. The literal classes used in RDF are mostly borrowed from XML Schema and represent common types of values used to describe things on the Web, such as: integer, date, boolean, string. To keep things simple, we refine our scope to the set of built-in datatypes. However, the definition of these datatypes, and their interpretation in RDF, is replete with gotchas: even the Web-standards boffins sometimes disagree on how datatypes should be defined and handled. Panic not! We are here to help, and in this section we will enumerate some of the more commonly encountered problems in using datatypes.
4.1 Malformed Datatype Literals
If you try to tell a datatype-aware agent that A is an integer, that agent will disagree. Datatype classes have what is called a lexical representation which defines the sequences of characters which are allowed in a literal of that class. The lexical representations for all datatype classes are defined in XML Schema Part 2: Datatypes Second Edition. For example:
integer has a lexical representation consisting of a finite-length sequence of decimal digits (#x30-#x39) with an optional leading sign. If the sign is omitted, "+" is assumed. For example: -1, 0, 12678967543233, +100000.
From this, a datatype-aware agent will know that A is not a valid integer literal: in fact, asserting otherwise is an inconsistency. Although this example is fairly straightforward, many datatype classes have more complex lexical forms. In particular, the datatype classes relating to date and time are subject to errors, the most common being dateTime:
The lexical space of dateTime consists of finite-length sequences of characters of the form: '-'? yyyy '-' mm '-' dd 'T' hh ':' mm ':' ss ('.' s+)? (zzzzzz)?, where
- '-'? yyyy is a four-or-more digit optionally negative-signed numeral that represents the year; if more than four digits, leading zeros are prohibited, and '0000' is prohibited…;
- the remaining '-'s are separators between parts of the date portion;
- the first mm is a two-digit numeral that represents the month;
- dd is a two-digit numeral that represents the day;
- 'T' is a separator indicating that time-of-day follows;
- hh is a two-digit numeral that represents the hour; '24' is permitted if the minutes and seconds represented are zero, and the dateTime value so represented is the first instant of the following day (the hour property of a dateTime… cannot have a value greater than 23);
- ':' is a separator between parts of the time-of-day portion;
- the second mm is a two-digit numeral that represents the minute;
- ss is a two-integer-digit numeral that represents the whole seconds;
- '.' s+ (if present) represents the fractional seconds;
- zzzzzz (if present) represents the timezone (as described below).
For example, 2002-10-10T12:00:00-05:00 (noon on 10 October 2002, Central Daylight Savings Time as well as Eastern Standard Time in the U.S.) is 2002-10-10T17:00:00Z, five hours later than 2002-10-10T12:00:00Z.
The most common errors relating to dateTime include the use of plain text values such as 12:32 Feb 7 2008, the omission of the mandatory seconds field, and the omission of : delimiters.
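A publisher can catch most of these errors with a simple lexical check before exporting data. The regular expressions below are our own approximation of the lexical forms quoted above, not an official grammar: they check only the shape of the literal, so an impossible date such as 31 February would still pass.

```python
import re

# Optional sign followed by one or more decimal digits.
XSD_INTEGER = re.compile(r"[+-]?[0-9]+")

XSD_DATETIME = re.compile(
    r"-?(?!0000)([1-9][0-9]{3,}|0[0-9]{3})"   # year: 4+ digits, not 0000, no extra leading zeros
    r"-(0[1-9]|1[0-2])"                        # month: 01-12
    r"-(0[1-9]|[12][0-9]|3[01])"               # day: 01-31 (month length is not checked)
    r"T(([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9](\.[0-9]+)?"  # hh:mm:ss with optional fraction
    r"|24:00:00(\.0+)?)"                       # hour 24 only with zero minutes and seconds
    r"(Z|[+-]((0[0-9]|1[0-3]):[0-5][0-9]|14:00))?"            # optional timezone
)


def is_xsd_integer(lexical_form):
    """True if the string has the lexical form of an xsd:integer."""
    return XSD_INTEGER.fullmatch(lexical_form) is not None


def is_xsd_datetime(lexical_form):
    """True if the string has (approximately) the lexical form of an xsd:dateTime."""
    return XSD_DATETIME.fullmatch(lexical_form) is not None
```

Under this check, the three common errors listed above (plain text dates, missing seconds, missing delimiters) are all rejected.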
4.2 Incompatibility with Range Datatype
Properties can have a defined range which states that a value of that property (object of a triple with that property in the predicate position) must be of a certain class. Not only can a range be a class of individuals (e.g., the knows property has range Person), but it can also be a datatype class (e.g., the lastModified property has range dateTime).
To understand how problems arise, we need to look a bit deeper into the interpretation of datatypes and datatype literals. Firstly, it is important to note that the class of plain literals without language tags (literals without a datatype or language tag) can be considered equivalent to the datatype class xsd:string. Secondly, a literal cannot have both a datatype and a language tag (if you try to give a literal both a language tag and a datatype in RDF/XML, the language tag will most often be ignored). Thirdly, there are two types of XML Schema datatypes: primitive datatypes and derived datatypes, where the latter are defined in terms of (derived from) a parent datatype; all derived datatypes have exactly one primitive datatype ancestor and a member of a derived datatype is also considered a member of all ancestor datatypes. To take an example (see here for full tree), nonPositiveInteger is a datatype derived from integer, which is in turn derived from decimal; a member of nonPositiveInteger is also a member of integer and decimal. Finally, all of the primitive XML Schema datatypes are disjoint from each other; this means that a literal cannot be a member of more than one primitive datatype (or, as it follows, of derived datatypes with different primitive datatype ancestors).
Okay, deep breath. Clearly, what we have from above is a recipe for confusion. Plain literals are xsd:strings? … unless they have language tags? … in which case they cannot have datatypes? … and float and decimal are disjoint? …
Respectively, yes, yes, yes and yes. What is more, remember that properties can have datatypes defined as range. Now, every time you use that property, you must ensure that you give a value whose datatype is compatible (not disjoint) with the defined range. One common misconception is that if the range of a property (e.g., lastModified) is a certain datatype (e.g., xsd:dateTime) and a plain literal value is given for that property (e.g., “2002-10-10T12:00:00Z”), then that plain literal will be converted into a typed literal (e.g., “2002-10-10T12:00:00Z”^^xsd:dateTime). This is not so, and is in fact an inconsistency, since the plain literal value is treated analogously to an xsd:string, which is disjoint with the property's range xsd:dateTime.
The safest option to avoid such confusion is fairly straightforward: if the range of a property that you're using is a datatype, specifically type each value for that property using that exact datatype, and ensure that the value abides by the lexical form of that datatype; e.g., every time you use lastModified, specify that the value is an xsd:dateTime, and ensure that the value is a lexically valid xsd:dateTime.
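In an exporter, this amounts to always attaching the datatype when writing out the value. Below is a minimal sketch that serialises a typed literal in N-Triples style; the helper names and the example subject and property URIs are hypothetical, and only backslashes and double quotes are escaped.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def typed_literal(lexical_form, datatype_local_name):
    """Serialise a typed literal, e.g. "..."^^<http://www.w3.org/2001/XMLSchema#dateTime>."""
    escaped = lexical_form.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{XSD}{datatype_local_name}>'


def triple(subject, predicate, literal):
    """Build one N-Triples line with a literal in the object position."""
    return f"<{subject}> <{predicate}> {literal} ."


# Hypothetical document and property URIs, used for illustration only.
print(triple(
    "http://example.org/doc.rdf",
    "http://example.org/vocab#lastModified",
    typed_literal("2002-10-10T12:00:00Z", "dateTime"),
))
```

Note that the datatype name is case-sensitive: the XML Schema type is dateTime, not datetime.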
5. Reasoning
In RDF, like in many aspects of life, it seems that (i) the more complex something is, the greater the potential for it to go wrong; and (ii) the more powerful something is, the greater the potential for it to be noticeable when it does go wrong. This brings us neatly onto reasoning (in fact, fellow pedants will be outraged that we already snuck in some reasoning issues in the last section under the subterfuge of “datatypes”).
Although there exist standard entailment regimes and reasoning fragments, there does not exist a perfect solution to all problems. Hence, reasoning takes many forms and is many things to many people. We're not so much concerned that you'll make mistakes which will cause Description Logics aficionados to spill coffee over their pristine lab coats: the people over in FOAF have been doing that for years by, e.g., defining foaf:mbox_sha1sum (a datatype-property) as inverse-functional (how dare they!). Even if you're not worrying about reasoning examples involving the patricidal and incestuous destiny of Oedipus, we're on your side! Instead, in this section we hope to cover those aspects of reasoning which are of immediate importance to the Web.
5.1 Bogus Values for Inverse-Functional Properties
An inverse-functional property is a property whose value uniquely identifies a resource. Examples include properties such as ISBN codes for books, social security numbers for people, physical MAC addresses for devices, and so on. Inverse-functional properties are pretty handy on the Web: oftentimes people don't agree on what URI to use for a particular resource; e.g., a book; but as long as they give a consistent value for a consistent inverse-functional property; e.g., use the same ISBN property with the same value for that book; people don't have to agree on URIs and a reasoner will be able to conclude that it's the same book being discussed. In other words, since people don't always agree upon URIs, inverse-functional properties allow people to identify resources according to values already agreed upon (ISBNs, SSNs, MACs, etc.).
Great. So what's the problem? Well, unfortunately, publishers sometimes give nonsensical values for inverse-functional properties. The most common example of this is the FOAF inverse-functional property foaf:mbox_sha1sum, intended to represent an encoded version of a person's email address, and defined to uniquely identify a person. This property is commonly instantiated (particularly by social networking exporters which externalise a public FOAF profile for each of their users) and is subsequently used to match descriptions of people across different sites and different URI naming schemes. Unfortunately however, many exporters do not bother to validate user-input correctly (e.g., allow users to leave email fields blank) and hence export bogus values for foaf:mbox_sha1sum such as 08445a31a78661b5c746feff39a9db6e4e2cc5cf and da39a3ee5e6b4b0d3255bfef95601890afd80709; the former is the encoded sha1-sum of the string “mailto:” and the latter is the sha1-sum of an empty string. A quick Google of the former value will reveal hundreds of thousands of results, which upon quick inspection, are mostly RDF files and values for foaf:mbox_sha1sum. Now, a reasoner will interpret any individual with this value for foaf:mbox_sha1sum as being the same person, resulting in what we call the “God Entity”: an omnipresent individual with hundreds of thousands of names, locations, friends, homepages, and so on.
The moral of the story? If using an inverse-functional property, ensure that the corresponding values are correct and do apply to the resource(s) in question. If creating or maintaining an exporter for RDF data based on user input, try to make sure that omitted input translates into omitted output, not blank or bogus data.
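For exporters, "omitted input translates into omitted output" can be as simple as the following sketch. The function name and the handling of an optional mailto: prefix are our own choices; the two bogus values are computed rather than hard-coded, so the check does not depend on copying the hashes correctly.

```python
import hashlib


def mbox_sha1sum(email):
    """Return the foaf:mbox_sha1sum for an address, or None if no address was given."""
    if email is None:
        return None
    address = email.strip()
    if address.lower().startswith("mailto:"):
        address = address[len("mailto:"):]
    if not address:
        return None  # omitted input becomes omitted output
    # The hash is taken over the mailto: URI of the mailbox.
    return hashlib.sha1(("mailto:" + address).encode("utf-8")).hexdigest()


# The bogus values discussed above: the hashes of "" and of "mailto:".
BOGUS_SUMS = {
    hashlib.sha1(b"").hexdigest(),
    hashlib.sha1(b"mailto:").hexdigest(),
}


def is_bogus(sha1sum):
    """True if a foaf:mbox_sha1sum value is one of the known bogus hashes."""
    return sha1sum.lower() in BOGUS_SUMS
```

An exporter would then emit the foaf:mbox_sha1sum triple only when `mbox_sha1sum` returns a value, and a consumer could use `is_bogus` to discard the known offenders before reasoning.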
5.2 Inconsistencies
In the immortal words of ancient cartographers: HC SVNT DRACONES! Inconsistencies, simply put, are the assertion or implication that contradictory statements are true, or indeed, that a single impossible statement is true. Inconsistencies can take many forms and can occur for various reasons.
Inconsistencies can occur if trying to reconcile different world-views from different data publishers. For example, an atheist will assert that God is an ImaginaryBeing whereas a theist will assert that God is a RealBeing, although ImaginaryBeing is clearly disjoint with RealBeing. Such disagreement can occur even in more concrete domains: a botanist will assert that a Tomato is a Fruit whereas a taxman will tell you that a Tomato is a Vegetable and apply a tariff accordingly. Such inconsistencies are due to genuine disagreement between publishers and (at the risk of getting more coffee stains on lab-coats) are not a bad thing at all and probably best left unresolved.
However, most inconsistencies currently found on the Web result directly from mistakes in RDF documents or disagreement on the identification of resources, and can be resolved. Also, almost all inconsistencies are caused by resources found to be members of disjoint classes. One of the most common causes is using a URI to describe two completely different things: e.g., using a person's homepage URI to identify both the homepage and the person (clearly, a resource cannot be both a homepage and a person). Another common cause is using a property or class on the basis of its label and not verifying that its semantics are suitable. For example, the somewhat generically named foaf:img property is used to relate people to pictures they appear in, and so has its domain defined as foaf:Person (thus, every resource described with a value for foaf:img must be a foaf:Person); however, publishers commonly use this property on anything from documents to countries, leading to inconsistencies.
Tracing and resolving the latter category of inconsistencies can be difficult, especially considering that multiple parties may be involved. However, one can help to avoid inconsistencies by double-checking the semantics of existing terms they wish to re-use, and by carefully choosing new URIs for identifying resources.
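To see how such an inconsistency surfaces, consider the toy check below. It applies property domains to infer class memberships and then looks for resources that end up in two disjoint classes. The ex: terms and the disjointness axiom between foaf:Person and ex:Country are hypothetical and chosen purely for illustration; a real reasoner does considerably more than this.

```python
def infer_types(triples, domains, asserted_types):
    """Add the domain class of each used property to the subject's types."""
    types = {resource: set(classes) for resource, classes in asserted_types.items()}
    for subject, predicate, _obj in triples:
        if predicate in domains:
            types.setdefault(subject, set()).add(domains[predicate])
    return types


def find_clashes(types_by_resource, disjoint_pairs):
    """Return (resource, class_a, class_b) for each membership in two disjoint classes."""
    clashes = []
    for resource, classes in types_by_resource.items():
        for class_a, class_b in disjoint_pairs:
            if class_a in classes and class_b in classes:
                clashes.append((resource, class_a, class_b))
    return clashes


# foaf:img has domain foaf:Person; here it is (mis)used on a country.
triples = [("ex:Ireland", "foaf:img", "ex:flag.png")]
domains = {"foaf:img": "foaf:Person"}
asserted_types = {"ex:Ireland": {"ex:Country"}}
disjoint_pairs = [("foaf:Person", "ex:Country")]  # hypothetical axiom

clashes = find_clashes(infer_types(triples, domains, asserted_types), disjoint_pairs)
print(clashes)
```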