Ontology mapping with owl:sameAs property

GraphDB's owl:sameAs optimization is used for mapping the same concepts across two or more datasets, where each dataset may describe a concept with different features and relations to other concepts. Merging such datasets therefore yields more complete data. In RDF, each concept is identified by a unique resource name built from a namespace, and the namespace differs from dataset to dataset. It is therefore useful to unify all names of a single concept, so that when querying data you work with concepts rather than names (i.e., IRIs).

For example, when merging 4 different datasets, you can use the following query on DBpedia to select everything about Sofia:

SELECT * {
  {
    <http://dbpedia.org/resource/Sofia> ?p ?o .
  }
  UNION
  {
    <http://data.nytimes.com/nytimes:N82091399958465550531> ?p ?o .
  }
  UNION
  {
    <http://sws.geonames.org/727011/> ?p ?o .
  }
  UNION
  {
    <http://rdf.freebase.com/ns/m/0ftjx> ?p ?o .
  }
}

Or you can even use a shorter one:

SELECT * {
  ?s ?p ?o
  FILTER (?s IN (
    <http://dbpedia.org/resource/Sofia>,
    <http://data.nytimes.com/nytimes:N82091399958465550531>,
    <http://sws.geonames.org/727011/>,
    <http://rdf.freebase.com/ns/m/0ftjx>))
}

As you can see, Sofia appears here with 4 different URIs, although they all denote the same concept. Of course, this is a very simple query. Sofia also has relations to other entities in these datasets, such as Plovdiv, which itself appears as <http://dbpedia.org/resource/Plovdiv>, <http://sws.geonames.org/653987/>, and <http://rdf.freebase.com/ns/m/1aihge>.

What’s more, it is not only the instances of a concept that have multiple names; their properties also appear under many names. Some properties are specific to a given dataset (e.g., GeoNames has longitude and latitude, while DBpedia provides wikilinks), but class hierarchies, labels, and other common properties are used by most of the datasets.

This means that even for the simplest query you may have to write the following:

SELECT * {
  ?s ?p1 ?x .
  ?x ?p2 ?o .
  FILTER (?s IN (
    <http://dbpedia.org/resource/Sofia>,
    <http://data.nytimes.com/nytimes:N82091399958465550531>,
    <http://sws.geonames.org/727011/>,
    <http://rdf.freebase.com/ns/m/0ftjx>))
  FILTER (?p1 IN (
    <http://dbpedia.org/property/wikilink>,
    <http://sws.geonames.org/p/relatesTo>))
  FILTER (?p2 IN (
    <http://dbpedia.org/property/wikilink>,
    <http://sws.geonames.org/p/relatesTo>))
  FILTER (?o IN (
    <http://dbpedia.org/resource/Plovdiv>,
    <http://sws.geonames.org/653987/>,
    <http://rdf.freebase.com/ns/m/1aihge>))
}

But if you can state, through rules and assertions, that the given URIs are the same, then you can simply write:

SELECT * {
  <http://dbpedia.org/resource/Sofia> <http://sws.geonames.org/p/relatesTo> ?x .
  ?x <http://sws.geonames.org/p/relatesTo> <http://dbpedia.org/resource/Plovdiv> .
}

If you link two nodes with owl:sameAs, every statement in which the first node appears as subject, predicate, or object is copied, with the second node substituted for the first in that position (and vice versa, since owl:sameAs is symmetric).

For example, given that <http://dbpedia.org/resource/Sofia> owl:sameAs <http://data.nytimes.com/N82091399958465550531> and also that:

<http://dbpedia.org/resource/Sofia> a <http://dbpedia.org/resource/Populated_place> .
<http://data.nytimes.com/N82091399958465550531> a <http://www.opengis.net/gml/_Feature> .
<http://dbpedia.org/resource/Plovdiv> <http://dbpedia.org/property/wikilink> <http://dbpedia.org/resource/Sofia> .

then you can conclude with the given rules that:

<http://dbpedia.org/resource/Sofia> a <http://www.opengis.net/gml/_Feature> .
<http://data.nytimes.com/N82091399958465550531> a <http://dbpedia.org/resource/Populated_place> .
<http://dbpedia.org/resource/Plovdiv> <http://dbpedia.org/property/wikilink> <http://data.nytimes.com/N82091399958465550531> .
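
The substitution behind this conclusion can be sketched in a few lines of plain Python. This is only an illustration of the rule as described above, not GraphDB's implementation: the function name `expand_same_as` is hypothetical, triples are modeled as tuples of strings, `rdf:type` stands in for `a`, and the sketch handles direct owl:sameAs pairs only (no chains).

```python
from itertools import product

SOFIA_DBP = "http://dbpedia.org/resource/Sofia"
SOFIA_NYT = "http://data.nytimes.com/N82091399958465550531"
PLOVDIV = "http://dbpedia.org/resource/Plovdiv"
WIKILINK = "http://dbpedia.org/property/wikilink"
POPULATED_PLACE = "http://dbpedia.org/resource/Populated_place"
GML_FEATURE = "http://www.opengis.net/gml/_Feature"


def expand_same_as(triples, same_as_pairs):
    """Copy each triple, substituting sameAs-equivalent nodes in every position.

    Hypothetical helper: only direct pairs are considered, not chains.
    """
    aliases = {}
    for a, b in same_as_pairs:
        # owl:sameAs is symmetric, so register the pair in both directions.
        aliases.setdefault(a, {a}).add(b)
        aliases.setdefault(b, {b}).add(a)

    expanded = set()
    for s, p, o in triples:
        # A node without aliases is only "the same as" itself.
        choices = [aliases.get(node, {node}) for node in (s, p, o)]
        expanded.update(product(*choices))
    return expanded


explicit = {
    (SOFIA_DBP, "rdf:type", POPULATED_PLACE),
    (SOFIA_NYT, "rdf:type", GML_FEATURE),
    (PLOVDIV, WIKILINK, SOFIA_DBP),
}

# Keep only the statements that were not stated explicitly.
inferred = expand_same_as(explicit, [(SOFIA_DBP, SOFIA_NYT)]) - explicit

for triple in sorted(inferred):
    print(triple)
```

The three printed triples correspond to the three concluded statements listed above.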

The challenge with owl:sameAs is that it becomes inefficient when there are many ‘mappings’ of nodes between datasets, and especially when long chains of owl:sameAs appear. owl:sameAs is defined as Symmetric and Transitive, so given that A sameAs B sameAs C, it also follows that A sameAs A, A sameAs C, B sameAs A, B sameAs B, C sameAs A, C sameAs B, C sameAs C. If you have such a chain with N nodes, then N^2 owl:sameAs statements will be produced (including the N-1 explicit owl:sameAs statements that form the chain). In addition, the owl:sameAs rules will copy each statement that mentions one of these nodes N times, assuming that the statement contains only one node from the chain and that its other nodes are not sameAs anything. But you can also have a statement <S P O> where S sameAs Sx, P sameAs Py, and O sameAs Oz. If S has K owl:sameAs statements, P has L, and O has M, this yields K*L*M copies of the statement overall.
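
To make these counts concrete, the following plain-Python sketch applies the symmetry and transitivity rules naively until nothing new is derived, then counts the results. The function names are hypothetical, and K, L, and M are interpreted here as the sizes of the equivalence classes of S, P, and O, each counting the node itself; that interpretation is an assumption of this sketch.

```python
from itertools import product


def same_as_closure(chain):
    """Return every owl:sameAs pair entailed by a chain A sameAs B sameAs C ..."""
    pairs = set(zip(chain, chain[1:]))  # the N-1 explicit statements
    while True:
        derived = set(pairs)
        # Symmetry: A sameAs B  =>  B sameAs A
        derived |= {(b, a) for a, b in pairs}
        # Transitivity: A sameAs B, B sameAs C  =>  A sameAs C
        derived |= {
            (a, d) for (a, b), (c, d) in product(pairs, pairs) if b == c
        }
        if derived == pairs:
            return pairs
        pairs = derived


def statement_copies(s_class, p_class, o_class):
    """Count the copies of <S P O> when each position has several equivalent names."""
    return len(set(product(s_class, p_class, o_class)))


three = same_as_closure(["A", "B", "C"])
five = same_as_closure(["A", "B", "C", "D", "E"])
print(len(three))  # 9, i.e. N^2 for N = 3
print(len(five))   # 25, i.e. N^2 for N = 5

# K = 2 names for S, L = 3 for P, M = 4 for O
copies = statement_copies(
    ["S", "S1"], ["P", "P1", "P2"], ["O", "O1", "O2", "O3"]
)
print(copies)  # 24 = K * L * M
```

Reflexive pairs such as A sameAs A are not added by a separate rule here; they fall out of combining symmetry with transitivity, which is why the total reaches N^2.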

Therefore, instead of using these simple rules and axioms for owl:sameAs (in fact, two axioms stating that it is Symmetric and Transitive), GraphDB offers an efficient non-rule implementation, i.e., the owl:sameAs support is hard-coded. The corresponding rules are commented out in the PIE files and are kept only for reference.
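
The text above does not describe how the hard-coded support works internally. As a generic illustration of why a non-rule approach can be cheaper, the sketch below tracks equivalence classes with a union-find (disjoint-set) structure, which stores one link per node instead of N^2 owl:sameAs statements. The class `SameAsClasses` and its methods are hypothetical, and this should be read as an assumption about one possible technique, not as GraphDB's actual implementation.

```python
class SameAsClasses:
    """Hypothetical union-find over node names; one parent link per node."""

    def __init__(self):
        self._parent = {}

    def find(self, node):
        """Return the representative of the equivalence class containing node."""
        self._parent.setdefault(node, node)
        root = node
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[node] != root:
            self._parent[node], node = root, self._parent[node]
        return root

    def union(self, a, b):
        """Record that a owl:sameAs b by merging their classes."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same(self, a, b):
        """Answer 'a sameAs b?' without materializing the closure."""
        return self.find(a) == self.find(b)

    def stored_links(self):
        """Number of parent links kept in memory (one per known node)."""
        return len(self._parent)


classes = SameAsClasses()
chain = ["A", "B", "C", "D", "E"]
for left, right in zip(chain, chain[1:]):
    classes.union(left, right)

print(classes.same("A", "E"))  # True, by transitivity
print(classes.same("E", "A"))  # True, by symmetry
print(classes.same("A", "Z"))  # False, Z is in no mapping
```

For the five-node chain this keeps five parent links (plus one for `Z` after the last query), compared with the 25 owl:sameAs statements that the rule-based closure would produce.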