SPARQL Federation¶
Overview¶
SPARQL 1.1 Federation provides extensions to the query syntax for executing distributed queries over any number of SPARQL endpoints. This makes it possible to integrate RDF data from different sources in a single query.
For example, to discover DBpedia resources about people who have the same names as those stored in a local repository, use the following query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?dbpedia_id
WHERE {
    ?person a foaf:Person ;
            foaf:name ?name .
    SERVICE <http://dbpedia.org/sparql> {
        ?dbpedia_id a dbpedia-owl:Person ;
                    foaf:name ?name .
    }
}
It matches the first part against the local repository and for each person it finds, it checks the DBpedia SPARQL endpoint to see if a person with the same name exists and, if so, returns the ID.
Note
Use federation with caution, both to avoid excessive querying of remote (public) SPARQL endpoints and because it can lead to inefficient query patterns.
The following example finds resources in the second SPARQL endpoint that have a similar rdfs:label to the rdfs:label of <http://dbpedia.org/resource/Vaccination> in the first SPARQL endpoint:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?endpoint2_id {
    SERVICE <http://faraway_endpoint.org/sparql> {
        ?endpoint1_id rdfs:label ?l1 .
        FILTER( langMatches(lang(?l1), "en") )
    }
    SERVICE <http://remote_endpoint.com/sparql> {
        ?endpoint2_id rdfs:label ?l2 .
        FILTER( str(?l2) = str(?l1) )
    }
}
VALUES ?endpoint1_id { <http://dbpedia.org/resource/Vaccination> }
However, such a query is very inefficient, because no intermediate bindings are passed between endpoints. Instead, both subqueries execute independently, requiring the second subquery to return all X rdfs:label Y statements that it stores. These are then joined locally to the (likely much smaller) results of the first subquery.
Query execution can be optimized by batching multiple values where the following is valid:
The default batching size is 15, which is OK to use in most cases. You can change the default via the graphdb.federation.block.join.size global property. By using a system graph, you can set a value only for a particular query evaluation.
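The difference between independent evaluation and batched evaluation can be illustrated with a toy model. The sketch below is not GraphDB's implementation: the in-memory `REMOTE_LABELS` dictionary stands in for a remote endpoint, and the function names are hypothetical. Only the block size of 15 comes from the text above.

```python
# Toy stand-in for the data behind a remote endpoint: label -> resource ID.
# (Hypothetical data; a real endpoint would be queried over HTTP.)
REMOTE_LABELS = {f"label {i}": f"http://remote.example/r{i}" for i in range(1000)}


def independent_join(local_labels):
    """Fetch every remote row, then join locally (the inefficient pattern)."""
    fetched = list(REMOTE_LABELS.items())  # the whole remote data set is transferred
    wanted = set(local_labels)
    matches = [(label, rid) for label, rid in fetched if label in wanted]
    return matches, len(fetched)


def batched_join(local_labels, block_size=15):
    """Send local bindings to the remote side in blocks of `block_size`."""
    matches, requests, rows = [], 0, 0
    for start in range(0, len(local_labels), block_size):
        block = local_labels[start:start + block_size]
        requests += 1
        # The remote side only answers for the values contained in this block.
        answer = [(label, REMOTE_LABELS[label]) for label in block if label in REMOTE_LABELS]
        rows += len(answer)
        matches.extend(answer)
    return matches, requests, rows
```

With 40 local labels, `independent_join` transfers all 1000 remote rows, while `batched_join` issues 3 requests and transfers only the 40 matching rows. Both return the same matches.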
Internal SPARQL federation¶
Since RDF4J repositories are also SPARQL endpoints, it is possible to use the federation mechanism to do distributed querying over several repositories on a local server. You can do it by referring to them as a standard SERVICE with their full path, or, if they are running on the same GraphDB instance, you can use the optimized local repository prefix. The prefix triggers the internal federation mechanism. The internal SPARQL federation is used in almost the same way as the standard SPARQL federation over HTTP, and has several advantages:
- Speed
The HTTP transport layer is bypassed and iterators are accessed directly. The speed is comparable to accessing data in the same repository.
- Security
When security is ON, you can access every repository that is readable by the currently authenticated user. Standard SPARQL 1.1 federation does not support authentication.
- Flexibility
Inline parameters provide control over inference and statement expansion over owl:sameAs.
Usage¶
Instead of providing a URL to a remote repository, you need to provide a special URL of the form repository:NNN, where NNN is the ID of the repository you want to access. For example, to access the repository authors via internal federation, use a query like this:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX books: <http://example.com/books/>
SELECT ?authorName WHERE {
    ?book rdfs:label "The Hitchhiker's Guide to the Galaxy" ;
          books:author ?author .
    SERVICE <repository:authors> {
        ?author rdfs:label ?authorName
    }
}
The approach applied for DBpedia, that is, referring to the repository by its full URL as in SERVICE <http://localhost:7200/repositories/my_labels>, is also valid, but is less efficient.
Parameters¶
There are four parameters that control how the federated part of the query is executed:
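Parameter | Definition |
---|---|
infer | Controls if inferred statements are included. When set to false, inferred statements are not included in the results. |
sameAs | Controls if statements are expanded over owl:sameAs. When set to false, the expanded statements are not included in the results. |
from | Can be repeated multiple times, translates to FROM. |
fromNamed | Can be repeated multiple times, translates to FROM NAMED. |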
To set a parameter, put a comma after the special URL referring to the internal repository, then the parameter name, an equals sign, and finally the value of the parameter. If you need to set more than one parameter, put another comma, parameter name, equals sign, and value.
Some examples:
- repository:NNN,infer=false
Turns off inference and inferred statements are not included in the results.
- repository:NNN,sameAs=false
Turns off the expansion of statements over owl:sameAs and they are not included in the results.
- repository:NNN,infer=false,sameAs=false
Turns off the inferred statements and they are not included in the results. Turns off the expansion of statements over owl:sameAs and they are not included in the results.
- service <repository:repo1>
No FROM and FROM NAMED.
- service <repository:repo1,from=http://test.com>
Adds FROM <http://test.com>.
- service <repository:repo1,fromNamed=http://test.com/named>
Adds FROM NAMED <http://test.com/named>.
- service <repository:repo1,from=http://test.com,fromNamed=http://test.com/named,sameAs=false>
Adds FROM <http://test.com>, adds FROM NAMED <http://test.com/named>, does not expand over owl:sameAs.
Note
The special URL, including its parameters, needs to be a valid URL, so it cannot contain spaces/blanks.
The example SPARQL query from above will look like this if you want to skip the inferred statements and disable the expansion over owl:sameAs:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX books: <http://example.com/books/>
SELECT ?authorName WHERE {
    ?book rdfs:label "The Hitchhiker's Guide to the Galaxy" ;
          books:author ?author .
    SERVICE <repository:authors,infer=false,sameAs=false> {
        ?author rdfs:label ?authorName
    }
}
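When queries are generated by a client application, the special URL can be assembled from the comma-separated name=value pairs described above. The following is a minimal sketch: the function `internal_service_iri` and its argument names are hypothetical, and the parameter order it emits simply mirrors the examples in this section (whether order matters is not stated here). It only emits infer and sameAs when they are switched off, because false is the only value shown above.

```python
def internal_service_iri(repo_id, infer=True, same_as=True,
                         from_graphs=(), from_named_graphs=()):
    """Build a 'repository:NNN,...' URL for internal federation (hypothetical helper)."""
    parts = [f"repository:{repo_id}"]
    # from / fromNamed may be repeated, one pair per graph.
    parts.extend(f"from={graph}" for graph in from_graphs)
    parts.extend(f"fromNamed={graph}" for graph in from_named_graphs)
    if not infer:
        parts.append("infer=false")
    if not same_as:
        parts.append("sameAs=false")
    iri = ",".join(parts)
    # The result must be a valid URL, so spaces/blanks are rejected.
    if any(ch.isspace() for ch in iri):
        raise ValueError(f"service URL must not contain whitespace: {iri!r}")
    return iri
```

For example, `internal_service_iri("authors", infer=False, same_as=False)` yields `repository:authors,infer=false,sameAs=false`, which can then be placed inside `SERVICE <...>`.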
Federated query to a remote password-protected repository¶
You can also use federation to query a remote password-protected GraphDB repository. There are two ways to do this:
- By editing the repository configuration as follows:
  1. Download the configuration file.
  2. In it, edit the repositoryURL (<http://user:password@localhost:7200/repositories/<RepositoryName>>) by placing your login details and the remote repository name.
  3. Stop GraphDB if it is running.
  4. Create a new directory in $GDB_HOME/data/repositories/ with the same name as repositoryID from the config file.
  5. Place the edited config file in the newly created folder. Make sure that it is named config.ttl, as otherwise GraphDB will not recognize it and the repository will not be created.
  6. Start GraphDB again.
- By importing the repository configuration file in the Workbench (does not require stopping GraphDB):
  1. Download the mentioned configuration file.
  2. In it, change rep:repositoryID "<RepoName>" to the name of your repository.
  3. Edit the repositoryURL (<http://user:password@localhost:7200/repositories/<RepositoryName>>) by placing your login details and the remote repository name.
  4. Open GraphDB Workbench and go to .
  5. Upload the file. The newly created repository will have the same name used for <RepoName>.
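For the first approach, the directory-creation and file-placement steps can be scripted. This is a minimal sketch, assuming GraphDB is already stopped and the configuration file has already been edited by hand. The function name `install_repository_config` and its arguments are hypothetical; the directory layout and the config.ttl file name are the ones given above.

```python
import shutil
from pathlib import Path


def install_repository_config(gdb_home, repository_id, edited_config):
    """Copy an edited config file to <gdb_home>/data/repositories/<repository_id>/config.ttl."""
    target_dir = Path(gdb_home) / "data" / "repositories" / repository_id
    # exist_ok=False: fail with FileExistsError rather than touch an existing repository.
    target_dir.mkdir(parents=True, exist_ok=False)
    target = target_dir / "config.ttl"  # the file must have exactly this name
    shutil.copyfile(edited_config, target)
    return target
```

The `repository_id` argument must match the repositoryID value inside the config file; the sketch does not parse the file to check this.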
This will enable you to query the remote repository like in the above example:
SELECT ?id ?label
WHERE {
    ?id a ex:Concept .
    SERVICE <repository:my_labels> {
        ?id rdfs:label ?label .
    }
}
Any URL parameters supported by the remote endpoint can be used. For example, if it is an RDF4J/GraphDB repository, a URL like http://factforge.net/repositories/ff-news?infer=false includes only explicit statements.
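If the endpoint URL is built in client code, the query string can be appended with the standard library instead of by hand. A minimal sketch: the helper name `remote_service_clause` is hypothetical, and which parameters are accepted depends entirely on the remote endpoint.

```python
from urllib.parse import urlencode


def remote_service_clause(endpoint_url, **params):
    """Return a SERVICE clause header for a remote endpoint with optional URL parameters."""
    # urlencode percent-encodes names and values, so the result stays a valid URL.
    url = f"{endpoint_url}?{urlencode(params)}" if params else endpoint_url
    return f"SERVICE <{url}>"
```

Calling `remote_service_clause("http://factforge.net/repositories/ff-news", infer="false")` reproduces the URL from the example above inside a SERVICE clause header.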