RDF* and SPARQL*

The modeling challenge

RDF is an abstract knowledge representation model that does not differentiate data from metadata. This makes it difficult to extend an existing model with statement-level metadata annotations such as certainty scores, weights, temporal restrictions, and provenance information (for example, whether a statement was modified manually). Several approaches discussed on this page mitigate the lack of native support for such annotations in RDF. Each of them has its own advantages and disadvantages, which we look at below.

Standard reification

Reification means expressing an abstract construct with the existing concrete methods supported by the language. The RDF specification sets a standard vocabulary for representing references to statements like:

:man :hasSpouse :woman .
:id1 rdf:type rdf:Statement ;
    rdf:subject :man ;
    rdf:predicate :hasSpouse ;
    rdf:object :woman ;
    :startDate "2020-02-11"^^xsd:date .

Standard reification requires stating four additional triples to refer to the triple for which we want to provide metadata. The subject of these four additional triples has to be a new identifier (IRI or blank node), which later on may be used for providing the metadata. The existence of a reference to a triple does not automatically assert it. The main advantage of this method is the standard support by every RDF store. Its disadvantages are the inefficiency related to exchanging or persisting the RDF data and the cumbersome syntax to access and match the corresponding four reification triples.
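
To make the overhead concrete, the following is a minimal Python sketch that builds the reification triples for one statement. Prefixed names are kept as plain strings, and the :idN identifier scheme and the reify helper are assumptions made for illustration, not part of any RDF library.

```python
from itertools import count

_ids = count(1)  # hypothetical source of fresh statement identifiers


def reify(triple, metadata):
    """Return the standard-reification triples that describe `triple`.

    `metadata` maps predicates to values attached to the statement node.
    The referenced triple itself is not included, because a reference
    does not assert it.
    """
    s, p, o = triple
    node = f":id{next(_ids)}"
    triples = [
        (node, "rdf:type", "rdf:Statement"),
        (node, "rdf:subject", s),
        (node, "rdf:predicate", p),
        (node, "rdf:object", o),
    ]
    triples += [(node, key, value) for key, value in metadata.items()]
    return triples


statement = (":man", ":hasSpouse", ":woman")
reified = reify(statement, {":startDate": '"2020-02-11"^^xsd:date'})
for t in reified:
    print(" ".join(t), ".")
```

Each annotated statement therefore costs four triples before any metadata is attached, and the original statement still has to be asserted separately.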

N-ary relations

The approach for representing N-ary relations in RDF is to model it via a new relationship concept that connects all arguments like:

:Marriage1 rdf:type :Marriage ;
    :partner1 :man ;
    :partner2 :woman ;
    :startDate "2020-02-11"^^xsd:date .

The approach is similar to standard reification, but it adopts a schema specific to the domain model that is presumably understood by its consumers. The disadvantage is that this approach increases the complexity of the ontology model, and such models have proven difficult to evolve in a backward-compatible way.

Singleton properties

Singleton properties are a hacky way to introduce statement identifiers as a part of the predicate like:

:man :hasSpouse#1 :woman .
:hasSpouse#1 :startDate "2020-02-11"^^xsd:date .

The local name of the predicate after the # encodes a unique identifier. The approach is compact for exchanging data since it uses only two statements, but is highly inefficient for querying data. A query to return all :hasSpouse links must parse all predicate values with a regular expression.
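
The querying cost can be illustrated with a small Python sketch over an in-memory list of triples. The sample data and the links helper are assumptions for illustration only; the point is that every predicate value has to be inspected with a regular expression to recover the base property.

```python
import re

triples = [
    (":man", ":hasSpouse#1", ":woman"),
    (":hasSpouse#1", ":startDate", '"2020-02-11"^^xsd:date'),
    (":anotherMan", ":hasSpouse#2", ":anotherWoman"),
    (":man", ":hasChild#3", ":child"),
]

# Splits a singleton property into its base property and statement identifier.
SINGLETON = re.compile(r"^(?P<base>.+)#(?P<id>\d+)$")


def links(triples, base_predicate):
    """Yield (subject, object) pairs linked by any singleton of base_predicate."""
    for s, p, o in triples:
        match = SINGLETON.match(p)
        if match and match.group("base") == base_predicate:
            yield s, o


spouse_links = list(links(triples, ":hasSpouse"))
print(spouse_links)
```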

Warning

GraphDB supports singleton properties, but rather inefficiently. The database expects the number of unique predicates to be much smaller than the total number of statements. Our recommendation is to avoid this modeling approach for models of significant size.

Named graphs

The named graph approach is a variation of the singleton properties, where a unique value on the named graph position identifies the statement like:

:man :hasSpouse :woman :statementId#1 .
:statementId#1 :startDate "2020-02-11"^^xsd:date :metadata .

The approach has multiple advantages over singleton properties and eliminates the need for regular expression parsing. A significant drawback is that the named graph position is overloaded with a statement identifier instead of the file or source that produced the triple. Updates based on the triple source become more complicated and cumbersome to maintain.

Tip

If a repository stores a large number of named graphs, make sure to enable the context indexes.

RDF* and SPARQL*

RDF* is an extension of the RDF 1.1 standard that proposes a more efficient reification serialization syntax. The main advantages of this representation include reduced document size that increases the efficiency of data exchange, as well as shorter SPARQL queries for improved comprehensibility.

:man :hasSpouse :woman .
<<:man :hasSpouse :woman>> :startDate "2020-02-11"^^xsd:date .

The RDF* extension captures the notion of an embedded triple by enclosing the referenced triple in the strings << and >>. Embedded triples, like blank nodes, may appear only in the subject or object position. Their meaning is aligned with the semantics of standard reification, but with a much more efficient serialization syntax. To simplify querying embedded triples, the RDF* proposal also extends the query syntax with SPARQL*, enabling queries like:

# List all metadata for the given reference to a statement
SELECT *
WHERE {
    <<:man :hasSpouse :woman>> ?p ?o
}

The embedded triple in SPARQL* also supports free variables for retrieving a list of reference statements:

# List all metadata for every reference to a statement that matches the pattern
SELECT *
WHERE {
    <<?man :hasSpouse :woman>> ?p ?o
    FILTER (?man = :man)
}
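
As a rough mental model of how an embedded triple pattern with free variables is evaluated, the following Python sketch represents embedded triples as nested tuples and variables as strings starting with "?". The unify and query helpers and the sample data are assumptions for illustration, not GraphDB's query engine.

```python
def is_var(term):
    """Variables are written as strings that start with '?'."""
    return isinstance(term, str) and term.startswith("?")


def unify(pattern, term, bindings):
    """Match a pattern against a data term; return extended bindings or None."""
    if is_var(pattern):
        if pattern in bindings:
            return bindings if bindings[pattern] == term else None
        return {**bindings, pattern: term}
    if isinstance(pattern, tuple):
        # An embedded triple pattern only matches an embedded triple.
        if not isinstance(term, tuple) or len(term) != len(pattern):
            return None
        for sub_pattern, sub_term in zip(pattern, term):
            bindings = unify(sub_pattern, sub_term, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if pattern == term else None


def query(data, pattern):
    """Return one bindings dictionary per matching triple."""
    results = []
    for triple in data:
        bindings = unify(pattern, triple, {})
        if bindings is not None:
            results.append(bindings)
    return results


data = [
    ((":man", ":hasSpouse", ":woman"), ":startDate", '"2020-02-11"^^xsd:date'),
    ((":man", ":hasSpouse", ":woman"), ":source", ":TheNationalEnquirer"),
    ((":man", ":worksFor", ":company"), ":startDate", '"2019-01-01"^^xsd:date'),
]

# Counterpart of: <<?man :hasSpouse :woman>> ?p ?o
rows = query(data, (("?man", ":hasSpouse", ":woman"), "?p", "?o"))
for row in rows:
    print(row)
```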

How do the different approaches compare?

To test the different approaches, we benchmark a subset of Wikidata, whose data model heavily uses statement-level metadata. The authors of the paper Reifying RDF: What works well with Wikidata? have done an excellent job with remodeling the dataset in various formats, and kindly shared with our team the output datasets. According to their modeling approach, the dataset includes:

Modeling approach      Total statements   Loading time (min)   Repository image size (MB)
---------------------  -----------------  -------------------  --------------------------
Standard reification   391,652,270        52.4                 36,768
N-ary relations        334,571,877        50.6                 34,519
Named graphs           277,478,521        56                   35,146
RDF*                   220,375,702        34                   22,465

We did not test the singleton properties approach due to the high number of unique predicates.
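
To read the figures in relative terms, the short Python sketch below computes by what percentage RDF* is smaller than each of the other approaches for every measured column. The numbers are copied from the table above; the reduction helper is only an illustration.

```python
# (total statements, loading time in minutes, repository image size in MB)
results = {
    "Standard reification": (391_652_270, 52.4, 36_768),
    "N-ary relations": (334_571_877, 50.6, 34_519),
    "Named graphs": (277_478_521, 56.0, 35_146),
    "RDF*": (220_375_702, 34.0, 22_465),
}


def reduction(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


star = results["RDF*"]
savings = {}
for approach, figures in results.items():
    if approach == "RDF*":
        continue
    savings[approach] = tuple(
        round(reduction(base, value), 1) for base, value in zip(figures, star)
    )
    print(f"{approach}: statements, loading time, size reduced by {savings[approach]} %")
```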

Syntax and examples

This section provides more in-depth details on how GraphDB implements the RDF*/SPARQL* syntax. Let’s say we have a statement like the one above, together with the metadata fact that we are 90% certain about this statement. The RDF* syntax allows us to represent both the data and the metadata by using an embedded triple as follows:

<<:man :hasSpouse :woman>> ex:certainty 0.9 .

According to the formal semantics of RDF*, each embedded triple also asserts the referenced statement, and retracting the embedded triple deletes that statement. Unfortunately, this requirement breaks compatibility with standard reification and causes non-transparent behavior when dealing with triples stored in multiple named graphs. GraphDB implements embedded triples by introducing a new RDF type next to IRI, blank node, and literal. So in the previous example, the engine will store only a single triple.

Warning

GraphDB will not explicitly assert the referenced statement by an embedded triple! Every embedded triple acts as a new RDF type, which means only a reference to a statement.
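
The consequence of treating an embedded triple as just another kind of term can be sketched in Python as follows. The Triple type and the in-memory store are assumptions for illustration and say nothing about GraphDB's internal storage; they only show that adding an annotation stores one triple and leaves the referenced statement unasserted until it is added explicitly.

```python
from typing import NamedTuple, Union


class Triple(NamedTuple):
    """A triple that can itself be used as a subject or object term."""
    subject: "Term"
    predicate: str
    obj: "Term"


Term = Union[str, float, Triple]

store = set()


def add(subject, predicate, obj):
    store.add(Triple(subject, predicate, obj))


referenced = Triple(":man", ":hasSpouse", ":woman")

# <<:man :hasSpouse :woman>> ex:certainty 0.9 .
add(referenced, "ex:certainty", 0.9)
size_after_annotation = len(store)
asserted_before = referenced in store

# :man :hasSpouse :woman .   (must be added explicitly)
add(":man", ":hasSpouse", ":woman")
asserted_after = referenced in store

print(size_after_annotation, asserted_before, asserted_after)
```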

Below are a few more examples of how this syntax can be utilized.

  • Object relation qualifiers:

    <<:man :hasSpouse :woman>> :startDate "2020-02-11"^^xsd:date .

    :hasSpouse is a symmetric relation, so it can be inferred in the opposite direction. However, the metadata in the opposite direction is not asserted automatically, so it needs to be added:

    <<:woman :hasSpouse :man>> :startDate "2020-02-11"^^xsd:date .
    
  • Data value qualifiers:

    <<:painting :height 32.1>>
      :unit :cm;
      :measurementTechnique :laserScanning;
      :measuredOn "2020-02-11"^^xsd:date.
    
  • Statement sources/references:

    <<:man :hasSpouse :woman>>
      :source :TheNationalEnquirer;
      :webpage <http://nationalenquirer.com/news/2020-02-12>;
      :retrieved "2020-02-13"^^xsd:date.
    
  • Nested embedded triples:

    << <<:man :hasSpouse :woman>> :startDate "2020-02-11"^^xsd:date >>
        :webpage <http://nationalenquirer.com/news/2020-02-12> .
    

Carried over into the syntax of the extended query language SPARQL*, triple patterns can be embedded as well. This provides a query syntax in which accessing specific metadata about a triple is just a matter of mentioning the triple in the subject or object position of a metadata-related triple pattern. For example, by adopting the aforementioned syntax for nesting, we can query for all age statements and their respective certainty as follows:

PREFIX ex: <http://example.com/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?p ?a ?c WHERE {
    <<?p foaf:age ?a>> ex:certainty ?c .
}

Additionally, SPARQL* extends the BIND clause so that it can select a group of embedded triples by using free variables:

PREFIX ex: <http://example.com/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?p ?a ?c WHERE {
    BIND (<<?p foaf:age ?a>> AS ?t)
    ?t ex:certainty ?c .
}

The semantics of BIND for embedded triples deviates from its semantics for the other RDF types. When binding an embedded triple, BIND creates an iterator over the stored triple entities that match its components and binds these to the target variable. As a result, BIND with three constants works like a FILTER: it yields a binding only if that embedded triple already exists in the repository. The same does not apply to VALUES, which returns the listed value regardless of whether it is stored.

PREFIX ex: <http://example.com/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT * WHERE {

    {
        # Binds the value to ?literal variable
        BIND ("new value for the store" as ?literal)
    }
    UNION
    {
        # Returns empty value and acts like a FILTER
        BIND (<<ex:subject foaf:name "new value for the store">> AS ?triple)
    }
    UNION
    {
        # Values generates new values
        VALUES ?newTriple { <<ex:subject foaf:name "new value for the store">> }
    }
}
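
The difference between the two clauses can be sketched in Python over an in-memory set of triples. This is only a model of the behavior described above, with None standing for a free variable; the bind and values helpers and the sample data are assumptions, not GraphDB internals.

```python
store = {
    ((":man", ":hasSpouse", ":woman"), "ex:certainty", 0.9),
    ((":woman", ":hasSpouse", ":man"), "ex:certainty", 0.8),
}


def known_embedded(store):
    """Collect every embedded triple used as a subject or object in the store."""
    found = set()
    for s, _, o in store:
        for term in (s, o):
            if isinstance(term, tuple):
                found.add(term)
    return found


def bind(store, pattern):
    """Model of BIND(<<...>> AS ?t): iterate over stored embedded triples
    whose components match the pattern (None marks a free variable)."""
    return sorted(
        t for t in known_embedded(store)
        if all(p is None or p == c for p, c in zip(pattern, t))
    )


def values(*terms):
    """Model of VALUES: the listed terms are returned as given."""
    return list(terms)


unknown = ("ex:subject", "foaf:name", '"new value for the store"')

bound_with_variables = bind(store, (None, ":hasSpouse", None))
bound_with_constants = bind(store, unknown)
from_values = values(unknown)

print(bound_with_variables)
print(bound_with_constants)
print(from_values)
```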

To avoid any parsing of the embedded triple, GraphDB introduces multiple new SPARQL functions:

PREFIX : <http://example.com/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT * WHERE {

    VALUES ?triple { <<:man :hasSpouse :woman>> }

    # Checks if the variable is of type embedded triple
    BIND (rdf:isTriple(?triple) as ?isTriple)

    # Extract the subject, predicate or object from an embedded triple
    BIND (rdf:subject(?triple) as ?subject)
    BIND (rdf:predicate(?triple) as ?predicate)
    BIND (rdf:object(?triple) as ?object)

    # Create a new embedded statement
    BIND (rdf:Statement(?subject, ?predicate, ?object) as ?newTriple)
}

This also showcases the fact that in SPARQL*, variables in query results may be bound not only to IRIs, literals, or blank nodes, but also to full RDF* triples.

Convert standard reification to RDF*

The RDF* support in GraphDB does not exclude any of the other modeling approaches. It is possible to independently maintain RDF* and standard reification statements in the same repository, like:

:man :hasSpouse :woman .
:id1 rdf:type rdf:Statement ;
    rdf:subject :man ;
    rdf:predicate :hasSpouse ;
    rdf:object :woman ;
    :startDate "2020-02-11"^^xsd:date .

<<:man :hasSpouse :woman>> :startDate "2020-02-11"^^xsd:date .

Still, keeping both representations is likely to cause confusion, so GraphDB provides the reification-convert command line tool for converting standard reification to RDF* outside of the database. If the data is already imported, use the following SPARQL update for the conversion:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
DELETE {
    ?reification a rdf:Statement .
    ?reification rdf:subject ?subject .
    ?reification rdf:predicate ?predicate .
    ?reification rdf:object ?object .
    ?reification ?p ?o .
} INSERT {
    <<?subject ?predicate ?object>> ?p ?o .
} WHERE {
    ?reification a rdf:Statement .
    ?reification rdf:subject ?subject .
    ?reification rdf:predicate ?predicate .
    ?reification rdf:object ?object .
    ?reification ?p ?o .
    FILTER (?p NOT IN (rdf:subject, rdf:predicate, rdf:object) &&
    (?p != rdf:type || ?o != rdf:Statement))
}
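
The intent of the update can also be sketched in Python over an in-memory set of triples: the four reification triples are dropped and every other property of the statement node is re-attached to an embedded triple, represented here as a nested tuple. The convert helper is an assumption for illustration and is not the reification-convert tool. Unlike the SPARQL update, it also removes reification nodes that carry no metadata.

```python
REIFICATION_PREDICATES = {"rdf:subject", "rdf:predicate", "rdf:object"}


def convert(triples):
    """Rewrite standard reification into embedded-triple annotations."""
    triples = set(triples)
    nodes = {s for s, p, o in triples if (p, o) == ("rdf:type", "rdf:Statement")}
    result = set(triples)
    for node in nodes:
        parts = {p: o for s, p, o in triples
                 if s == node and p in REIFICATION_PREDICATES}
        if set(parts) != REIFICATION_PREDICATES:
            continue  # incomplete reification: leave it untouched
        embedded = (parts["rdf:subject"], parts["rdf:predicate"], parts["rdf:object"])
        for s, p, o in triples:
            if s != node:
                continue
            result.discard((s, p, o))
            is_reification_triple = (
                p in REIFICATION_PREDICATES
                or (p, o) == ("rdf:type", "rdf:Statement")
            )
            if not is_reification_triple:
                result.add((embedded, p, o))
    return result


data = {
    (":man", ":hasSpouse", ":woman"),
    (":id1", "rdf:type", "rdf:Statement"),
    (":id1", "rdf:subject", ":man"),
    (":id1", "rdf:predicate", ":hasSpouse"),
    (":id1", "rdf:object", ":woman"),
    (":id1", ":startDate", '"2020-02-11"^^xsd:date'),
}

converted = convert(data)
for triple in sorted(converted, key=repr):
    print(triple)
```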

MIME types and file extensions for RDF* in RDF4J

GraphDB extends the existing RDF and query results formats with dedicated formats that encode embedded triples natively (for example, <<:subject :predicate :object>> in Turtle*). Each new format has its own MIME type and file extension:

RDF* format         MIME type                                File extension
------------------  ---------------------------------------  --------------
Binary RDF          application/x-binary-rdf                 brf
Turtle*             text/x-turtlestar                        ttls
                    application/x-turtlestar
TriG*               application/x-trigstar                   trigs
JSON query result   application/x-sparqlstar-results+json    srjs
TSV query result    text/x-tab-separated-values-star         tsvs
                    application/x-sparqlstar-results+tsv

For the benefit of older clients, in all other formats the embedded triples are serialized as special IRIs in the format urn:rdf4j:triple:xxx. Here, xxx stands for the Base64 URL-safe encoding of the N-Triples representation of the embedded triple. This is controlled by a boolean writer setting, and is ON by default. The setting is ignored by writers that support RDF* natively.

Such special IRIs are converted back to triples on parsing. This is controlled by a boolean parser setting, and is ON by default. It is respected by all parsers, including those with native RDF* support.
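
The round trip between an embedded triple and its special IRI can be sketched in Python with the standard base64 module. The exact string that RDF4J encodes (shown here as an N-Triples-style <<...>> form with full IRIs) and its handling of Base64 padding are assumptions in this sketch; only the urn:rdf4j:triple: prefix and the URL-safe Base64 encoding come from the description above.

```python
import base64

PREFIX = "urn:rdf4j:triple:"


def encode(embedded_triple: str) -> str:
    """Wrap the URL-safe Base64 form of the triple text in the special IRI."""
    payload = base64.urlsafe_b64encode(embedded_triple.encode("utf-8"))
    return PREFIX + payload.decode("ascii")


def decode(iri: str) -> str:
    """Recover the triple text from a special IRI produced by encode()."""
    if not iri.startswith(PREFIX):
        raise ValueError("not an encoded embedded triple")
    return base64.urlsafe_b64decode(iri[len(PREFIX):]).decode("utf-8")


# Assumed textual form of the embedded triple, for illustration only.
embedded = (
    "<<<http://example.com/man> "
    "<http://example.com/hasSpouse> "
    "<http://example.com/woman>>>"
)

iri = encode(embedded)
print(iri)
print(decode(iri) == embedded)
```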