FedX Federation¶
Overview¶
In addition to the standard SPARQL 1.1 Federation to other SPARQL endpoints and the internal SPARQL federation to other repositories in the same database instance, GraphDB also supports FedX – the federation engine of the RDF4J framework, a data partitioning technology that provides transparent federation of multiple SPARQL endpoints under a single virtual endpoint.
In the context of the growing need for scalability of RDF technology and sophisticated optimization techniques for querying linked data, it is a useful framework that allows efficient SPARQL query processing on heterogeneous, virtually integrated linked data sources. With it, explicitly addressing specific endpoints using SERVICE clauses is no longer necessary – instead, FedX offers novel join processing and grouping techniques that minimize the number of remote requests by automatically selecting the relevant sources, sending statement patterns to these sources for evaluation, and joining the individual results. It extends the RDF4J framework with a federation layer and is incorporated in it as a SAIL (Storage and Inference Layer).
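For illustration, here is a sketch of the difference, assuming a FedX repository whose members are the public DBpedia and Wikidata endpoints (the federation set up later in this document). The first query uses explicit SPARQL 1.1 federation with SERVICE clauses; the second, functionally equivalent query is posed against the FedX repository and lets the engine pick the relevant member for each statement pattern.

# Explicit SPARQL 1.1 federation: every remote endpoint is named in a SERVICE clause
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?wdEntity ?p ?o WHERE {
    SERVICE <https://dbpedia.org/sparql> {
        <http://dbpedia.org/resource/Sofia> owl:sameAs ?wdEntity .
        FILTER (STRSTARTS(STR(?wdEntity), "http://www.wikidata.org/entity/"))
    }
    SERVICE <https://query.wikidata.org/sparql> {
        ?wdEntity ?p ?o .
    }
}

# The same intent against a FedX repository: no SERVICE clauses are needed, as
# FedX sends each statement pattern to the member endpoint(s) that can answer it
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?wdEntity ?p ?o WHERE {
    <http://dbpedia.org/resource/Sofia> owl:sameAs ?wdEntity .
    FILTER (STRSTARTS(STR(?wdEntity), "http://www.wikidata.org/entity/"))
    ?wdEntity ?p ?o .
}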
Note
Please keep in mind that the GraphDB FedX federation is currently an experimental feature.
Features¶
GraphDB supports the following FedX features:
Virtual joins of distributed knowledge graphs: Following the idea of the Linked Open Data Initiative for connecting distributed RDF data, FedX federation combines distributed data sources with the goal of virtual integration. This means that data from multiple heterogeneous sources can be queried transparently as if it were in the same database.
Federation of sharded knowledge graphs: A virtual knowledge graph can consist of multiple knowledge sub-graphs, each distributed in a separate endpoint, that are virtually integrated using FedX joins. FedX is optimized for such scenarios where each graph has a different schema, i.e., the graph is separated into exclusive groups.
Easy integration as a GraphDB repository
Transparent access to data sources through federation
Streamlined query processing in federated environments
Usage scenarios¶
In the following sections, we will demonstrate how semantic technology in the context of FedX federation lowers the cost of searching and analyzing data sources by implementing a two-step integration process: (1) mapping any dataset to an ontology, and (2) using GraphDB to access the data. With this integration methodology, the cost of adding more supported sources remains linear, unlike with classic data warehousing approaches.
Linked data cloud exploration¶
The first type of use case that we will look at is creating a unifying repository where we can query data from multiple linked data sources regardless of their location, such as DBpedia and Wikidata. In such cases, there is often a significant overlap between the schemas, i.e., predicates or types are frequently repeated across the different sources.
Note
Keep in mind that blank nodes (bnodes) are not supported between FedX members.
Before we start exploring, let’s first create a federation between the DBpedia and Wikidata endpoints.
Create a FedX repository. In the configuration screen that you are taken to, click Add remote repository.
From the endpoint options in the dialog, select Generic SPARQL endpoint.
For the DBpedia Endpoint URL, enter https://dbpedia.org/sparql. Unselect the Supports ASK queries option, as ASK support differs from endpoint to endpoint.
Repeat the same steps for the Wikidata endpoint URL – https://query.wikidata.org/sparql.
Save the repository and connect to it.
Now, let’s perform some queries against the federated repository.
Scenario 1: Querying one of the endpoints using predicates and nodes specific to it
The following query is run against the Wikidata endpoint and will return all instances of “house cat”.
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{
?item wdt:P31 wd:Q146.
?item rdfs:label ?label.
FILTER (LANG(?label) = 'en')
}
Here, we have used two Wikidata identifiers: the predicate wdt:P31, which stands for “instance of”, and the entity wd:Q146, which stands for “house cat”.
These will be the first 15 house cats returned:

Scenario 2: Querying both endpoints
With a CONSTRUCT query, we can get data about a specific cat from both endpoints - CC (“CopyCat” or “Carbon Copy”), the first cloned pet.
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT {
?s ?p ?o
} WHERE
{
{
BIND(<http://www.wikidata.org/entity/Q378619> as ?s)
?s ?p ?o.
} UNION {
BIND(<http://www.wikidata.org/entity/Q378619> as ?o)
?s ?p ?o.
}
}

Scenario 3: Querying both endpoints for a specific resource
Let’s explore both DBpedia and Wikidata for products made by the company Amazon.
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
CONSTRUCT{
?dbpCompany ?p ?o .
?wdCompany ?p1 ?o .
?dbpCompany owl:sameAs ?wdCompany .
} WHERE {
BIND( dbr:Amazon_\(company\) as ?dbpCompany)
{
# Get all products from DBpedia
?dbpCompany dbo:product ?o .
?dbpCompany ?p ?o .
} UNION {
# Get all products from Wikidata
?dbpCompany owl:sameAs ?wdCompany .
?o wdt:P176 ?wdCompany .
?o ?p1 ?wdCompany .
}
}

Scenario 4: Creating an advanced graph configuration for a query
As we saw in the previous example, we can explore a specific resource from both endpoints. Now, let’s see how to create an advanced graph configuration for a query, with which we will then be able to explore any resource that we input.
With the following steps, create a graph config query for all companies and all products in both datasets:
Go to the Visual graph view.
From Advanced graph configurations, select Create graph config.
Enter a name for your graph.
The default Starting point is Start with a search box. In it, select the Graph expansion tab and enter the following query:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
CONSTRUCT {
    ?node ?p1 ?o .
    ?s ?p ?o .
    ?node owl:sameAs ?s .
} WHERE {
    {
        ?node dbo:product ?o .
        ?node ?p1 ?o .
    } UNION {
        ?node owl:sameAs ?s .
        ?o wdt:P176 ?s .
        ?o ?p ?s .
    }
}
The two databases are connected through the owl:sameAs predicate. The DBpedia property dbo:product corresponds to the wdt:P176 property in Wikidata.
Since Wikidata shows information in a less readable way, we can clear it up a little by pulling the node labels. To do so, go to the Node basics tab and enter the query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?label {
    ?node rdfs:label | skos:prefLabel ?label .
    FILTER (lang(?label) = 'en')
}
Note
The ?node variable is required and will be replaced with the IRI of the node that we are exploring.
Click Save. You will be taken back to the Visual graph screen where the newly created graph configuration is now visible.
Now let’s explore the information about the nodes as a visual graph mapped in both data sources. Click the name of the graph and, in the search field that opens, enter the DBpedia resource http://dbpedia.org/resource/Amazon_(company) and click Show.
On the left are the DBpedia resources related to Amazon, and on the right – the Wikidata ones.
We can do the same for the company http://dbpedia.org/resource/Nvidia:
And for http://dbpedia.org/resource/BMW:
Note
Some SPARQL endpoints whose implementation is not GraphDB may enforce limitations that could result in some features of the GraphDB FedX repository not working as expected. One such example is the class hierarchy, which may send big queries and therefore not work with https://dbpedia.org/sparql, which has a query length limit.
Virtual knowledge graph over local native and RDBMS-based repositories¶
The second type of scenario demonstrates how to create a federated repository over two local repositories – a native GraphDB one and an Ontop one. We will divide a dataset between them and then explore the relationships.
We will be using segments of two public datasets:
The native GraphDB repository uses data from the acquisitions.csv, ipos.csv, and objects.csv files of the Startup investments dataset, a Crunchbase snapshot data source that contains metadata about companies and investors. It tracks the evolution of startups into multi-billion corporations. The data has been RDF-ized using the GraphDB OntoRefine mapping UI.
The acquisitions file contains M&A deals between companies, listing all buyers and acquired companies and the date of the deal.
The objects file contains details about the companies, such as ID, name, country, etc.
The ipos file contains data about companies’ IPOs.
The Ontop repository uses the prices.csv file of the NYSE dataset, a data source listing the opening/closing stock prices and traded volumes on the New York Stock Exchange. The file lists stock symbols and opening/closing stock market prices for particular dates. Most of the data spans from 2010 to the end of 2016, and for companies new to the stock market the date range is shorter.
To set up the native GraphDB repository:
Create a new repository.
Download the ipo.nq, acquisitions.ttl, and objects.ttl files.
Load them into the repository.
To set up the Ontop repository:
Download the prices-mapping.obda file.
Create an Ontop repository using the OBDA file.
Then create the FedX repository in which these two local repositories will be federated, adding both as members.
Now that we have created the federation, let’s see how we can explore the two data sources with it.
Scenario 1: List European companies acquired by US companies
The following query is run against the Crunchbase data source and returns all companies registered in European countries that have been acquired by US companies.
PREFIX cb: <https://crunchbase.com/resource/cb/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sellingCompanyName ?buyerCompanyName
WHERE {
?id cb:buyer ?buyerCompany .
?id cb:target ?acquiredCompany .
?buyerCompany cb:country "USA" .
?acquiredCompany cb:country ?country .
FILTER (?country IN ("AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FIN", "FRA", "DEU", "GBR", "GRC", "HUN", "IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD", "POL", "PRT", "ROU", "SVK", "SVN", "ESP", "SWE"))
?acquiredCompany rdfs:label ?sellingCompanyName .
?buyerCompany rdfs:label ?buyerCompanyName .
}
The first two triple patterns bind the acquiring and the acquired company. The “USA” literal specifies that the buyer company is based there. The target company has to be European. The country of each company is represented by a country code. To get only the European companies that have been acquired, a filter checks whether a given country code is among the listed ones.
The first 15 returned results look like this:

Scenario 2: List European companies acquired by US companies where the stock market price of the buyer company has increased on the date of the M&A deal.
This query is run against the Crunchbase and the NYSE datasets and is similar to the one above, but with one additional condition – that on the day of the deal, the stock price of the buying company has increased. This means that when the stock market closed, that price was higher than when the market opened. Since the M&A deals data are in the Crunchbase dataset and the stock prices data in the NYSE dataset, we join them on the stockSymbol field, which is present in both datasets, through the IPO of the buyer company.
We also make sure that the date of the M&A deal (from Crunchbase) is the same as the date for which we retrieve the opening and closing stock prices (from NYSE). In the SELECT clause, we include only the names of the buyer and seller companies. The opening and closing prices are chosen for a particular date and stock symbol.
PREFIX ny: <https://www.kaggle.com/dgawlik/nyse/>
PREFIX cb: <https://crunchbase.com/resource/cb/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sellingCompanyName ?buyerCompanyName
WHERE {
?id cb:buyer ?buyerCompany .
?id cb:target ?acquiredCompany .
?acquiredCompany cb:country ?country .
?buyerCompany cb:country "USA" .
?buyerCompany cb:ipo ?buyerCompanyIpo .
?buyerCompanyIpo cb:stockSymbol ?stockSymbol .
?id cb:acquiredAt ?date .
?dayTrade ny:date ?date .
?dayTrade ny:ticker ?stockSymbol .
?dayTrade ny:close ?close .
?dayTrade ny:open ?open .
FILTER (?country IN ("AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FIN", "FRA", "DEU", "GBR", "GRC", "HUN", "IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD", "POL", "PRT", "ROU", "SVK", "SVN", "ESP", "SWE") && (?close > ?open))
?acquiredCompany rdfs:label ?sellingCompanyName .
?buyerCompany rdfs:label ?buyerCompanyName .
}
The first 15 returned results look like this:

FedX with enabled GraphDB security¶
When creating a FedX repository with local members, we can specify whether the FedX repository should respect the security rights of the member repositories.
Configuring security for local members¶
First, we will create two repositories, “Sofia” and “London”, in which we will insert some statements from factforge.net:
Create a repository called Sofia.
Go to the SPARQL editor of factforge.net and execute:
CONSTRUCT WHERE { ?s ?p <http://dbpedia.org/resource/Sofia> . } LIMIT 20
Download the results in Turtle format.
Import the file into the Sofia repository.
Repeat the same steps for the “London” repository with FactForge data about London.
Create a FedX repository with the “Sofia” and “London” repositories as members.
In it, execute the SPARQL editor default query that returns all statements in the repository:
SELECT * WHERE { ?s ?p ?o . }
We can observe that all statements for both Sofia and London are returned as results (here ordered by subject in alphabetical order so as to show results for both):
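If you want to reproduce that ordering, a sorted variant of the default query (a minor variation, not a required step) looks like this:

SELECT * WHERE {
    ?s ?p ?o .
}
ORDER BY ?s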
Now, to see how this works with GraphDB security enabled, go to the Users and Access view and set Security to ON.
From the same page, create a new user called “sofia” with read rights for the “Sofia” and the FedX repositories:
From the Repositories view, click the edit icon of the FedX repository to enter its configuration.
Click the edit icon of either of the “Sofia” or “London” member repositories. This will open a security settings dialog where you can see that the default setting of each member is to respect the repository’s access rights, meaning that if a user has no rights to this repository, they will see a federated view that does not include results from it.
Log out as admin and log in as user “sofia”.
In the SPARQL editor, execute:
SELECT * WHERE { ?s ?p ?o . }
We can see that only results for the Sofia repository are shown, because the current user has no access to the London repository and the FedX repository is instructed to respect the rights for it.
Log out from the “sofia” user and log back in as admin.
Open the edit screen of the FedX repository and set the security of both its members to ignore the repository’s access rights. This means that in the federation, users will see results from the respective repository regardless of their access rights for it.
After editing the Sofia and London repositories this way, Save the changes in the FedX repository.
Log out as admin and log in as user “sofia”.
In the SPARQL editor, execute:
SELECT * WHERE { ?s ?p ?o . }
We will see that the returned results (193 in total, like in the above example without security enabled) include statements from both the “Sofia” and the “London” members of the federated repository.
Configuring security for remote endpoints¶
Basic authentication for remote members¶
GraphDB supports configuration of basic authentication when attaching a remote endpoint. Let’s see how this works with the following example:
Run a second GraphDB instance on localhost:7201. The easiest way to do this is to:
Make a copy of your GraphDB distribution.
Run it with graphdb -Dgraphdb.connector.port=7201.
In it, create a repository called “remote-repo-paris” with enabled security and default admin user, i.e., username: “admin”, password: “root”.
Go to the SPARQL editor of factforge.net and execute:
CONSTRUCT WHERE { ?s ?p <http://dbpedia.org/resource/Paris> . } LIMIT 20
Download the results as a Turtle file and import them into “remote-repo-paris”.
Go to the first GraphDB instance on port 7200 and open the “fedx-sofia-london” repository that we created earlier. It already has two members - “Sofia” and “London”.
In it, include as a member the “remote-repo-paris” repository we just created, providing the basic authentication credentials (username “admin”, password “root”):
Restart the repository.
In the SPARQL editor, execute:
SELECT * WHERE { ?s ?p ?o . }
We see that all the Paris data from the remote endpoint are available in our FedX repository.
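As an additional check (not part of the original steps), a query restricted to the Paris resource should return exactly the statements imported into the remote “remote-repo-paris” member:

# Only statements whose object is the DBpedia Paris resource,
# i.e., the data served by the remote member on port 7201
SELECT * WHERE {
    ?s ?p <http://dbpedia.org/resource/Paris> .
}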
Security of a remote repository from a known location¶
The context is the same as in the previous scenario – two running GraphDB instances with the second one being secured. The difference is that when the remote repository is at a known location, we can configure its security credentials when adding it as a location instead of when adding it as a remote FedX member. Let’s see how to do it.
Start the same way as in the example above:
Run a second GraphDB instance on localhost:7201.
In it, create a repository called “remote-repo-paris” with enabled security and default admin user, i.e., username: “admin”, password: “root”.
Import the Paris data in it.
In the first GraphDB instance on port 7200, attach “remote-repo-paris” as a remote location following these steps. For Authentication type, select Basic auth, and input the credentials.
Again in the 7200 GraphDB instance, open the edit view of the “fedx-sofia-london” repository.
In it, include as member the “remote-repo-paris” from the 7201 port. Note that this time, we are not inputting the security credentials.
Restart the FedX repository.
In the SPARQL editor, execute:
SELECT * WHERE { ?s ?p ?o . }
Again, we see that all the Paris data from the remote location are available in the FedX repository.
Hint
You can configure signature authentication for remote endpoints in the same way.
Configuration parameters¶
When configuring a FedX repository, several configuration options (described in detail below) can be set:
Include inferred default: Whether to include inferred statements. Default is true.
Enable service as bound join: Determines whether vectored evaluation using the VALUES clause should be applied for SERVICE expressions. Default is true.
Log queries: Enables/disables query logging. Prints the query in the logs. Default is false.
Log query plan: Enables/disables query plan logging. Default is false.
Debug query plan: The debug mode for the query execution plan. If enabled, the plan is printed in the logs. Default is false.
Query timeout (seconds): Sets the maximum query time in seconds used for query evaluation. Can be set to 0 or less in order to disable query timeouts. If the limit is exceeded, an exception “Query evaluation error: Source selection has run into a timeout” is thrown. Default is 0.
Bound join block size: The block size for a bound join, i.e., the number of bindings that are integrated in a single subquery. Default is 15.
Join worker threads: The (maximum) number of join worker threads used in the ControlledWorkerScheduler for join operations. Default is 20.
Left join worker threads: The (maximum) number of left join worker threads used in the ControlledWorkerScheduler for join operations. Sets the number of threads that can work in parallel evaluating a query with OPTIONAL. Default is 10.
Union worker threads: The (maximum) number of union worker threads used in the ControlledWorkerScheduler for join operations. Sets the number of threads that can work in parallel evaluating a query with UNION. Default is 20.
Source selection cache spec: Parameters should be passed as key1=value1,key2=value2,... in order to be parsed correctly. The parameters that can be passed are:
recordStats (boolean)
initialCapacity (int)
maximumSize (long)
maximumWeight (long)
concurrencyLevel (int)
refreshDuration (long)
expireAfterWrite (TimeUnit/long)
expireAfterAccess (TimeUnit/long)
refreshAfterWrite (TimeUnit/long)
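For illustration, assuming the value is parsed as a standard cache builder specification string (as the key names above suggest), a possible value for this field could be:

maximumSize=1000,expireAfterWrite=6h

Here maximumSize caps the number of cached source selection entries and expireAfterWrite evicts entries six hours after they were written; these numbers are only an example, not a recommended setting.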
Limitations¶
Some limitations of the current implementation of the GraphDB FedX federation are:
DESCRIBE queries are not supported.
FedX is not stable with queries of the type {?s ?p ?o} UNION {?s ?p1 ?o} FILTER (xxx).
Currently, the federation only works with remote repositories, i.e., everything goes through HTTP, which is slower compared to direct access to local repositories.
Queries with a Cartesian product or cyclic connections are not stable due to connections that remain open and to blocked threads.
There is a small possibility of threads being blocked on complex queries due to implementation flaws in the parallelization.