FedX Federation

Overview

In addition to standard SPARQL 1.1 federation to other SPARQL endpoints and internal SPARQL federation to other repositories in the same database instance, GraphDB also supports FedX, the federation engine of the RDF4J framework. FedX is a data partitioning technology that provides transparent federation of multiple SPARQL endpoints under a single virtual endpoint.

With the growing need for scalable RDF technology and for sophisticated optimization techniques for querying linked data, FedX is a useful framework for efficient SPARQL query processing on heterogeneous, virtually integrated linked data sources. With it, explicitly addressing specific endpoints using SERVICE clauses is no longer necessary. Instead, FedX applies join processing and grouping techniques that minimize the number of remote requests: it automatically selects the relevant sources, sends statement patterns to these sources for evaluation, and joins the individual results. It extends the RDF4J framework (formerly Sesame) with a federation layer and is incorporated in it as a SAIL (Storage and Inference Layer).
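To make the three steps above more concrete (source selection, sending statement patterns to the relevant sources, joining the results), here is a minimal Python sketch. It is a toy simulation under stated assumptions, not FedX's actual implementation: the two members are hypothetical in-memory sets of triples, and all names (MEMBERS, select_sources, evaluate, join, the ex: terms) are invented for illustration.

```python
from itertools import product

# Hypothetical in-memory stand-ins for two federation members.
# Each member is just a set of (subject, predicate, object) triples.
MEMBERS = {
    "member_a": {("ex:cc", "ex:type", "ex:Cat")},
    "member_b": {("ex:cc", "ex:label", "CC")},
}

def matches(pattern, triple):
    """A pattern term starting with '?' is a variable and matches anything."""
    return all(p.startswith("?") or p == t for p, t in zip(pattern, triple))

def select_sources(pattern):
    """Source selection: keep only members that can contribute to the pattern."""
    return [name for name, triples in MEMBERS.items()
            if any(matches(pattern, t) for t in triples)]

def evaluate(pattern):
    """Send the pattern only to its relevant members and collect bindings."""
    results = []
    for name in select_sources(pattern):
        for triple in sorted(MEMBERS[name]):
            if matches(pattern, triple):
                results.append({p: t for p, t in zip(pattern, triple)
                                if p.startswith("?")})
    return results

def join(left, right):
    """Join two lists of bindings on the variables they share."""
    joined = []
    for left_row, right_row in product(left, right):
        shared = left_row.keys() & right_row.keys()
        if all(left_row[v] == right_row[v] for v in shared):
            joined.append({**left_row, **right_row})
    return joined

type_pattern = ("?s", "ex:type", "ex:Cat")
label_pattern = ("?s", "ex:label", "?label")
rows = join(evaluate(type_pattern), evaluate(label_pattern))
```

In this sketch, each pattern is sent to only one member, and the query author never names an endpoint. A real federation engine additionally caches source selection results and batches requests, which the toy version leaves out.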

Note

Please keep in mind that the GraphDB FedX federation is currently an experimental feature.

Features

GraphDB supports the following FedX features:

  • Virtual joins of distributed knowledge graphs: Following the idea of the Linked Open Data Initiative for connecting distributed RDF data, FedX federation combines distributed data sources with the goal of virtual interaction. This means that data from multiple heterogeneous sources can be queried transparently as if it were in the same database.

  • Federation of sharded knowledge graphs: A virtual knowledge graph can consist of several knowledge sub-graphs, each hosted at a separate endpoint, that are virtually integrated using FedX joins. FedX is optimized for scenarios where each graph has a different schema, i.e., the graph is separated into exclusive groups.

  • Easy integration as a GraphDB repository

  • Transparent access to data sources through federation

  • Streamlined query processing in federated environments

Usage scenarios

In the following sections, we demonstrate how semantic technology in the context of FedX federation lowers the cost of searching and analyzing data sources through a two-step integration process: (1) mapping any dataset to an ontology and (2) using GraphDB to access the data. With this integration methodology, the cost of adding supported sources remains linear, unlike classic data warehousing approaches.

Linked data cloud exploration

The first type of use case that we will look at is creating a unifying repository where we can query data from multiple linked data sources regardless of their location, such as DBpedia and Wikidata. In such cases, there is often a significant overlap between the schemas, i.e., predicates or types are frequently repeated across the different sources.

Note

Keep in mind that bnodes are not supported between FedX members.

Before we start exploring, let’s first create a federation between the DBpedia and Wikidata endpoints.

  1. Create a FedX repository via Setup ‣ Repositories ‣ Create new repository ‣ FedX Virtual SPARQL.

  2. In the configuration screen that you are taken to, click Add remote repository.

  3. From the endpoint options in the dialog, select Generic SPARQL endpoint.

  4. For the DBpedia Endpoint URL, enter https://dbpedia.org/sparql.

  5. Unselect the Supports ASK queries option, as this differs from endpoint to endpoint.

    _images/fedx-linked-data-add-remote-repo.png
  6. Repeat the same steps for the Wikidata endpoint URL – https://query.wikidata.org/sparql.

  7. Save the repository and connect it.

Now, let’s perform some queries against the federated repository.

Scenario 1: Querying one of the endpoints using predicates and nodes specific to it

The following query is run against the Wikidata endpoint and will return all instances of “house cat”.

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT * WHERE
{
    ?item wdt:P31 wd:Q146.
    ?item rdfs:label ?label.
    FILTER (LANG(?label) = 'en')
}

Here, we have used a Wikidata predicate and a Wikidata entity: wdt:P31, which stands for “instance of”, and wd:Q146, which stands for “house cat”.

These will be the first 15 house cats returned:

_images/fedx-linked-data-scenario1.png

Scenario 2: Querying both endpoints

With a CONSTRUCT query, we can get data from both endpoints about a specific cat: CC (“CopyCat” or “Carbon Copy”), the first cloned pet.

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

CONSTRUCT {
    ?s ?p ?o
} WHERE
{
    {
        BIND(<http://www.wikidata.org/entity/Q378619> as ?s)
        ?s ?p ?o.
    } UNION {
        BIND(<http://www.wikidata.org/entity/Q378619> as ?o)
        ?s ?p ?o.
    }
}
_images/fedx-linked-data-scenario2.png

Scenario 3: Querying both endpoints for a specific resource

Let’s explore both DBpedia and Wikidata for products made by the company Amazon.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

CONSTRUCT{
    ?dbpCompany ?p ?o .
    ?wdCompany ?p1 ?o .
    ?dbpCompany owl:sameAs ?wdCompany .
} WHERE {
    BIND( dbr:Amazon_\(company\) as ?dbpCompany)
    {
        # Get all products from DBpedia
        ?dbpCompany dbo:product ?o .
        ?dbpCompany ?p ?o .
    } UNION {
        # Get all products from Wikidata
        ?dbpCompany owl:sameAs ?wdCompany .
        ?o wdt:P176 ?wdCompany .
        ?o ?p1 ?wdCompany .
    }
}
_images/fedx-linked-data-scenario3.png

Scenario 4: Creating an advanced graph configuration for a query

As we saw in the previous example, we can explore a specific resource from both endpoints. Now, let’s see how to create an advanced graph configuration for a query, with which we will then be able to explore any resource that we input.

With the following steps, create a graph config query for all companies and all products in both datasets:

  1. Go to Explore ‣ Visual graph.

  2. From Advanced graph configurations, select Create graph config.

  3. Enter a name for your graph.

  4. The default Starting point is Start with a search box. In it, select the Graph expansion tab and enter the following query:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    
    CONSTRUCT{
        ?node ?p1 ?o .
        ?s ?p ?o .
        ?node owl:sameAs ?s .
    } WHERE {
        {
            ?node dbo:product ?o .
            ?node ?p1 ?o .
        } UNION {
            ?node owl:sameAs ?s .
            ?o wdt:P176 ?s .
            ?o ?p ?s .
        }
    }
    

    The two databases are connected through the owl:sameAs predicate. The DBpedia property dbo:product corresponds to the wdt:P176 property in Wikidata.

  5. Since Wikidata shows information in a less readable way, we can clear it up a little by pulling the node labels. To do so, go to the Node basics tab and enter the query:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    
    SELECT ?label {
        {
            ?node rdfs:label | skos:prefLabel ?label.
            FILTER (lang(?label) = 'en')
        }
    }
    

    Note

    The ?node variable is required and will be replaced with the IRI of the node that we are exploring.

  6. Click Save. You will be taken back to the Visual graph screen where the newly created graph configuration is now visible.

    _images/fedx-linked-data-scenario4-graph-config.png
  7. Now let’s explore the information about the nodes as visual graphs mapped in both data sources. Click on the name of the graph and in the search field that opens, enter the DBpedia resource http://dbpedia.org/resource/Amazon_(company) and click Show.

    On the left are the DBpedia resources related to Amazon, and on the right are the Wikidata ones.

    _images/fedx-linked-data-amazon.png

We can do the same for the company http://dbpedia.org/resource/Nvidia:

_images/fedx-linked-data-nvidia.png

And for http://dbpedia.org/resource/BMW:

_images/fedx-linked-data-bmw.png

Note

Some SPARQL endpoints that are not implemented with GraphDB may enforce limits that prevent some features of the GraphDB FedX repository from working as expected. One example is the class hierarchy, which may send large queries and therefore may not work with https://dbpedia.org/sparql, which limits the query length.

Virtual knowledge graph over local native and RDBMS-based repositories

The second type of scenario demonstrates how to create a federated repository over two local repositories – a local native and an Ontop one. We will divide a dataset between them and then explore the relationships.

We will be using segments of two public datasets:

  • The native GraphDB repository uses data from the acquisitions.csv, ipos.csv, and objects.csv files of the Startup investments dataset, a Crunchbase snapshot data source that contains metadata about companies and investors. It tracks the evolution of startups into multi-billion corporations. The data has been RDF-ized using Ontotext Refine.

  • The Ontop repository uses the prices.csv file of the NYSE dataset, a data source listing the opening/closing stock price and traded volumes on the New York Stock Exchange. The file lists stock symbols and opening/closing stock market prices for particular dates. Most data span from 2010 to the end of 2016, and for companies new on the stock market, the date range is shorter.

  1. To set up the native GraphDB repository:

    1. Create a new repository.

    2. Download the ipo.nq, acquisitions.ttl and objects.ttl files.

    3. Load them into the repository via Import ‣ User data ‣ Upload RDF files.

  2. To set up the Ontop repository:

    1. Download the prices-mapping.obda file.

    2. Create an Ontop repository using the OBDA file.

  3. To create the FedX repository where these two will be federated:

    1. Create it via Setup ‣ Repositories ‣ Create new repository ‣ FedX Virtual SPARQL.

    2. In the configuration screen, include the two repositories that we created as members by clicking on them.

      _images/fedx-repo-native-ontop.png
    3. After it has been created, connect to it.

Now that we have created the federation, let’s see how we can explore the two data sources with it.

Scenario 1: List European companies acquired by US companies

The following query is run against the Crunchbase data source and returns all companies registered in European countries that have been acquired by US companies.

PREFIX cb: <https://crunchbase.com/resource/cb/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?sellingCompanyName ?buyerCompanyName
WHERE {
    ?id cb:buyer ?buyerCompany .
    ?id cb:target ?acquiredCompany .
    ?buyerCompany cb:country "USA" .
    ?acquiredCompany cb:country ?country .
    FILTER (?country IN ("AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FIN", "FRA", "DEU", "GBR", "GRC", "HUN", "IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD", "POL", "PRT", "ROU", "SVK", "SVN", "ESP", "SWE"))
    ?acquiredCompany rdfs:label ?sellingCompanyName .
    ?buyerCompany rdfs:label ?buyerCompanyName .
}

The first two triple patterns identify the acquiring and the acquired company. The “USA” literal specifies that the buyer company is based there. The target company has to be European. The country of each company is represented by a country code. To get only the European companies that have been acquired, a filter checks whether a given country’s code is among the listed ones.

The first 15 returned results look like this:

_images/fedx-ontop-scenario1-results.png

Scenario 2: List European companies acquired by US companies where the stock market price of the buyer company has increased on the date of the M&A deal

This query is run against the Crunchbase and the NYSE datasets and is similar to the one above, but with one additional condition: on the day of the deal, the stock price of the buying company has increased, meaning that the price when the stock market closed was higher than when it opened. Since the M&A deal data are in the Crunchbase dataset and the stock price data are in the NYSE dataset, we join them on the stock symbol, which is present in both datasets: in Crunchbase as cb:stockSymbol of the IPO of the buyer company, and in NYSE as ny:ticker.

We also make sure that the date of the M&A deal (from Crunchbase) is the same as the date for which we retrieve the opening and closing stock prices (from NYSE). In the SELECT clause, we include only the names of the buyer and seller companies. The opening and closing prices are chosen for a particular date and stock symbol.

PREFIX ny: <https://www.kaggle.com/dgawlik/nyse/>
PREFIX cb: <https://crunchbase.com/resource/cb/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sellingCompanyName ?buyerCompanyName
WHERE {
    ?id cb:buyer ?buyerCompany .
    ?id cb:target ?acquiredCompany .
    ?acquiredCompany cb:country ?country .
    ?buyerCompany cb:country "USA" .
    ?buyerCompany cb:ipo ?buyerCompanyIpo .
    ?buyerCompanyIpo cb:stockSymbol ?stockSymbol .
    ?id cb:acquiredAt ?date .
    ?dayTrade ny:date ?date .
    ?dayTrade ny:ticker ?stockSymbol .
    ?dayTrade ny:close ?close .
    ?dayTrade ny:open ?open .
    FILTER (?country IN ("AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FIN", "FRA", "DEU", "GBR", "GRC", "HUN", "IRL", "ITA", "LVA", "LTU", "LUX", "MLT", "NLD", "POL", "PRT", "ROU", "SVK", "SVN", "ESP", "SWE") && (?close > ?open))
    ?acquiredCompany rdfs:label ?sellingCompanyName .
    ?buyerCompany rdfs:label ?buyerCompanyName .
}

The first 15 returned results look like this:

_images/fedx-ontop-scenario2-results.png
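The join logic of this scenario can be sketched outside SPARQL as well. The following Python snippet is only an illustration of the condition described above (same stock symbol, same date, closing price higher than opening price). The rows are hypothetical, made-up values and not taken from the Crunchbase or NYSE datasets, and the function name is an assumption.

```python
from datetime import date

# Hypothetical rows standing in for the two members; the values are made up.
# Crunchbase side: (buyer name, seller name, buyer stock symbol, deal date)
acquisitions = [
    ("Buyer A", "Seller X", "AAA", date(2012, 3, 1)),
    ("Buyer B", "Seller Y", "BBB", date(2013, 6, 5)),
]
# NYSE side: (ticker, trading date, opening price, closing price)
day_trades = [
    ("AAA", date(2012, 3, 1), 10.0, 11.5),  # price rose on the deal date
    ("BBB", date(2013, 6, 5), 20.0, 19.0),  # price fell on the deal date
    ("AAA", date(2012, 3, 2), 11.5, 12.0),  # different date, no deal
]

def deals_with_price_increase(acquisitions, day_trades):
    """Join on (stock symbol, date) and keep deals where close > open."""
    # Index the trades by the join key so each deal needs a single lookup.
    trades_by_key = {(ticker, day): (open_price, close_price)
                     for ticker, day, open_price, close_price in day_trades}
    result = []
    for buyer, seller, symbol, deal_date in acquisitions:
        prices = trades_by_key.get((symbol, deal_date))
        if prices is not None and prices[1] > prices[0]:
            result.append((seller, buyer))
    return result

matches = deals_with_price_increase(acquisitions, day_trades)
```

Only the first deal survives: the second one has a matching trade, but its closing price is lower than its opening price.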

FedX with enabled GraphDB security

When creating a FedX repository with local members, we can specify whether the FedX repo should respect the security rights of the member repositories.

Configuring security for local members

  1. First, we will create two repositories, “Sofia” and “London”, in which we will insert some statements from factforge.net:

    1. Create a repository called Sofia.

    2. Go to the FactForge SPARQL editor and execute:

      CONSTRUCT WHERE {
          ?s ?p <http://dbpedia.org/resource/Sofia> .
      } LIMIT 20
      
    3. Download the results in Turtle format.

    4. Import the file in the Sofia repository via Import ‣ User data ‣ Upload RDF files.

    5. Repeat the same steps for the “London” repository with FactForge data about London.

  2. Create a FedX repository with the “Sofia” and “London” repositories as members.

    _images/fedx-sofia-london-create-repo.png

    The icons next to the name of each member are for:

    • editing the repository’s access rights

    • removing the repository as a FedX member (this will move it back to the repository list)

    • setting the repository as writable. Note that only one member repository can be writable.

  3. In it, execute the SPARQL editor default query that returns all statements in the repository:

    SELECT * WHERE {
        ?s ?p ?o .
    }
    
  4. We can observe that all statements for both Sofia and London are returned as results (here ordered by subject in alphabetical order so as to show results for both):

    _images/fedx-sofia-london-results1.png
  5. Now, to see how this works with GraphDB security enabled, go to Setup ‣ Users and Access, and set Security to ON.

  6. From the same page, create a new user called “sofia” with read rights for the “Sofia” and the FedX repositories:

    _images/fedx-sofia-london-create-user.png
  7. From Setup ‣ Repositories, click the edit icon of the FedX repository to enter its configuration.

  8. Click the edit icon of either the “Sofia” or the “London” member repository. This will open a security setting dialog where you can see that the default setting of each member is to respect the repository’s access rights, meaning that if a user has no rights to this repository, they will see a federated view that does not include results from it.

    _images/fedx-edit-local-repo-rights.png
  9. Log out as admin and log in as user “sofia”.

  10. In the SPARQL editor, execute:

    SELECT * WHERE {
        ?s ?p ?o .
    }
    

    We can see that only results for the Sofia repository are shown, because the current user has no access to the London repository and the FedX repository is instructed to respect the rights for it.

    _images/fedx-security-respect-rights.png
  11. Log out as user “sofia” and log back in as admin.

  12. Open the edit screen of the FedX repository and set the security of both its members to ignore the repository’s access rights. This means that in the federation, users will see results from the respective repository regardless of their access rights for it.

  13. After editing the Sofia and London repositories this way, Save the changes in the FedX repository.

  14. Log out as admin and log in as user “sofia”.

  15. In the SPARQL editor, execute:

    SELECT * WHERE {
        ?s ?p ?o .
    }
    
  16. We will see that the returned results include statements from both the “Sofia” and the “London” members of the federated repository.
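The effect of the two member settings can be summarized in a few lines of Python. This is a sketch of the behavior described in the steps above, not of GraphDB's internal code. The function name and data structures are assumptions, and it presumes that the user already has read rights for the FedX repository itself.

```python
def visible_members(members, readable_by_user):
    """Return the members whose results a user sees through the federation.

    members: mapping of member name -> True if the member is set to respect
             the repository's access rights, False if set to ignore them.
    readable_by_user: set of repository names the user has read rights for.
    """
    return sorted(name for name, respects_rights in members.items()
                  if not respects_rights or name in readable_by_user)

sofia_rights = {"Sofia"}  # user "sofia" can read only the Sofia member

# Default setting: both members respect the repository's access rights.
default_view = visible_members({"Sofia": True, "London": True}, sofia_rights)

# Both members set to ignore the repository's access rights.
ignore_view = visible_members({"Sofia": False, "London": False}, sofia_rights)
```

With the default setting, the user “sofia” sees only the Sofia member. Once both members ignore the access rights, results from London are included as well.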

Configuring security for remote endpoints

Basic authentication for remote members

GraphDB supports configuration of basic authentication when attaching a remote endpoint. Let’s see how this works with the following example:

  1. Run a second GraphDB instance on localhost:7201. The easiest way to do this is to:

    • Make a copy of your GraphDB distribution.

    • Run it with graphdb -Dgraphdb.connector.port=7201.

  2. In it, create a repository called “remote-repo-paris” with enabled security and default admin user, i.e., username: “admin”, password: “root”.

  3. Go to the FactForge SPARQL editor and execute:

    CONSTRUCT WHERE {
        ?s ?p <http://dbpedia.org/resource/Paris> .
    } LIMIT 20
    
  4. Download the results as a Turtle file and import them into “remote-repo-paris”.

  5. Go to the first GraphDB instance on port 7200 and open the “fedx-sofia-london” repository that we created earlier. It already has two members: “Sofia” and “London”.

  6. In it, include as member the “remote-repo-paris” we just created:

    1. Select the GraphDB/RDF4J server option.

    2. As Server URL, enter the URL of the remote GraphDB instance: http://localhost:7201/.

    3. As Repository ID, enter the name of the remote repository: remote-repo-paris.

    4. As Authentication credentials, enter the user and password for the remote repository.

    5. Click Add.

      _images/fedx-remote-member-basic-auth.png
  7. Restart the repository.

  8. In the SPARQL editor, execute:

    SELECT * WHERE {
        ?s ?p ?o .
    }
    

We see that all the Paris data from the remote endpoint are available in our FedX repository.

_images/fedx-remote-member-basic-auth-results.png
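For readers who want to see what basic authentication amounts to at the HTTP level, here is a hedged Python sketch that builds (but does not send) a SPARQL protocol request to the remote member. The server URL, repository ID, and credentials are the ones from the steps above. The /repositories/<id> endpoint path follows the RDF4J REST convention, and the helper name build_sparql_request and the LIMIT 10 are assumptions made for the example. It does not claim to reproduce the exact requests that FedX issues.

```python
import base64
import urllib.parse
import urllib.request

def build_sparql_request(server_url, repository_id, query, username, password):
    """Build (but do not send) a SPARQL protocol GET request with basic auth."""
    endpoint = f"{server_url.rstrip('/')}/repositories/{repository_id}"
    url = f"{endpoint}?{urllib.parse.urlencode({'query': query})}"
    # Basic authentication: base64 of "username:password" in the header.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/sparql-results+json",
    })

request = build_sparql_request(
    "http://localhost:7201/", "remote-repo-paris",
    "SELECT * WHERE { ?s ?p ?o . } LIMIT 10",
    "admin", "root",
)
# To execute it against a running instance:
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```

Without the Authorization header, a secured instance would be expected to reject the request, which is why the credentials have to be provided when the remote member is added.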
Security of a remote repository from a known location

The context is the same as in the previous scenario: two running GraphDB instances, with the second one being secured. The difference is that when the remote repository is a known location, we can configure its security credentials when adding it as a location instead of when adding it as a remote FedX member. Let’s see how to do it.

  1. Start the same way as in the example above:

    • Run a second GraphDB instance on localhost:7201.

    • In it, create a repository called “remote-repo-paris” with enabled security and default admin user, i.e., username: “admin”, password: “root”.

    • Import the Paris data in it.

  2. In the first GraphDB instance on port 7200, attach “remote-repo-paris” as a remote location following these steps. For Authentication type, select Basic auth, and input the credentials.

    _images/fedx-attach-remote-location.png
  3. Again in the 7200 GraphDB instance, open the edit view of the “fedx-sofia-london” repository.

  4. In it, include as member the “remote-repo-paris” from the instance on port 7201. Note that this time, we are not inputting the security credentials.

    _images/fedx-add-remote-location-as-member.png
  5. Restart the FedX repository.

  6. In the SPARQL editor, execute:

    SELECT * WHERE {
        ?s ?p ?o .
    }
    

    Again, we see that all the Paris data from the remote location are available in the FedX repository.

Hint

You can configure signature authentication for remote endpoints in the same way.

Configuration parameters

When configuring a FedX repository, several configuration options (described in detail below) can be set:

_images/fedx-config-parameters.png
  • Include inferred default: Whether to include inferred statements. Default is true.

  • Enable service as bound join: Determines whether vectored evaluation using the VALUES clause should be applied for SERVICE expressions. Default is true.

  • Log queries: Enables/disables query logging. Prints the query in the logs. Default is false.

  • Log query plan: Enables/disables query plan logging. Default is false.

  • Debug query plan: The debug mode for the query execution plan. If enabled, the plan is printed in the logs. Default is false.

  • Query timeout (seconds): Sets the maximum query time in seconds used for query evaluation. Can be set to 0 or less in order to disable query timeouts. If the limit is exceeded, an exception “Query evaluation error: Source selection has run into a timeout” is thrown. Default is 0.

  • Bound join block size: The block size for a bound join, i.e., the number of bindings that are integrated in a single subquery. Default is 15.

  • Join worker threads: The (maximum) number of join worker threads used in the ControlledWorkerScheduler for join operations. Default is 20.

  • Left join worker threads: The (maximum) number of left join worker threads used in the ControlledWorkerScheduler for left join operations. Sets the number of threads that can work in parallel evaluating a query with OPTIONAL. Default is 10.

  • Union worker threads: The (maximum) number of union worker threads used in the ControlledWorkerScheduler for union operations. Sets the number of threads that can work in parallel evaluating a query with UNION. Default is 20.

  • Source selection cache spec: Parameters for the source selection cache. They should be passed as key1=value1,key2=value2,... in order to be parsed correctly, for example maximumSize=1000,expireAfterWrite=6h.

    Parameters that can be passed:

    • recordStats (boolean)

    • initialCapacity (int)

    • maximumSize (long)

    • maximumWeight (long)

    • concurrencyLevel (int)

    • refreshDuration (long)

    • expireAfterWrite (TimeUnit/long)

    • expireAfterAccess (TimeUnit/long)

    • refreshAfterWrite (TimeUnit/long)
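As an illustration of the Bound join block size parameter described above, the following Python sketch shows the idea of grouping intermediate bindings into blocks so that one subquery is sent per block instead of one per binding. It is a simplified model under stated assumptions: the helper names are invented, the stock symbols are hypothetical placeholders, and the VALUES-based subquery shape is only one possible rendering and may differ from what FedX actually sends.

```python
def bound_join_blocks(bindings, block_size=15):
    """Split intermediate bindings into blocks of at most block_size."""
    return [bindings[i:i + block_size]
            for i in range(0, len(bindings), block_size)]

def values_subquery(variable, pattern, block):
    """Render one subquery that evaluates a pattern for a whole block at once.

    Note: values are inserted as plain string literals without escaping,
    which is sufficient for this illustration only.
    """
    values = " ".join(f'"{value}"' for value in block)
    return f"SELECT * WHERE {{ VALUES ?{variable} {{ {values} }} {pattern} }}"

# Hypothetical stock symbols produced by an earlier join step.
symbols = [f"SYM{i}" for i in range(40)]
blocks = bound_join_blocks(symbols)
subqueries = [
    values_subquery(
        "stockSymbol",
        "?dayTrade <https://www.kaggle.com/dgawlik/nyse/ticker> ?stockSymbol .",
        block,
    )
    for block in blocks
]
```

With the default block size of 15, the 40 bindings in this sketch result in three remote requests (15, 15, and 10 bindings) rather than 40. A larger block size means fewer but longer requests.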

Limitations

Some limitations of the current implementation of the GraphDB FedX federation are:

  • DESCRIBE queries are not supported.

  • FedX is not stable with queries of the type {?s ?p ?o} UNION {?s ?p1 ?o} FILTER (xxx).

  • Currently, the federation only works with remote repositories, i.e., everything goes through HTTP, which is slower compared to direct access to local repositories.

  • Queries with a Cartesian product or cyclic connections are not stable due to connections that are still open and to blocked threads.

  • There is a small possibility of threads being blocked on complex queries due to implementation flaws in parallelization.
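Since DESCRIBE queries are not supported, one possible substitute is a CONSTRUCT query that returns the outgoing statements of a resource. The Python helper below is a sketch of this idea, not an official workaround. The function name is an assumption, and the result only approximates DESCRIBE, whose output is implementation-specific. The query deliberately avoids the UNION and FILTER combination listed above.

```python
def construct_for_resource(iri):
    """Build a CONSTRUCT query returning the outgoing statements of a resource."""
    # Reject characters that cannot appear inside an IRI reference in SPARQL.
    if any(ch in iri for ch in '<>"{}|^`\\') or any(ch.isspace() for ch in iri):
        raise ValueError(f"not a usable IRI: {iri!r}")
    return f"CONSTRUCT {{ <{iri}> ?p ?o }} WHERE {{ <{iri}> ?p ?o }}"

# The Wikidata entity used in the linked data scenarios above.
query = construct_for_resource("http://www.wikidata.org/entity/Q378619")
```

The resulting query string can be pasted into the SPARQL editor of the FedX repository.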