Virtualization

Overview and features

The data virtualization in GraphDB enables direct access to relational databases with SPARQL queries, which eliminates the need to replicate data. The implementation exposes a virtual SPARQL endpoint, which translates the queries to SQL using a declarative mapping. To achieve this functionality, GraphDB integrates with the open-source Ontop project and extends it with multiple GraphDB specific features.

The following SPARQL features are supported:

  • SELECT and CONSTRUCT queries

  • Default and named graph triple patterns

  • Triple pattern combining: OPTIONAL, UNION, blank node path

  • Result filtering and value bindings: FILTER, BIND, VALUES

  • Projection modifiers: DISTINCT, LIMIT, ORDER BY

  • Aggregates (GROUP BY, SUM, COUNT, AVG, MIN, MAX, GROUP_CONCAT)

  • SPARQL functions (STR, IRI, LANG, REGEX)

  • SPARQL data type support and their mapping to SQL types

  • SUBQUERY

The most common scenario for using data virtualization is when the integrated data is highly dynamic or too big to be replicated. For practical reasons, it is easier to not copy it and accept all limitations like data quality, integrity, and type of supported queries of the underlying information source.

A second common scenario is to maintain a declarative mapping between the relational model and RDF, where the user periodically dumps all statements and writes them to a native RDF database so it can support property paths and faster data joins.

Note

The virtual repository has the following specifics:

  • it is read-only, meaning that write operations cannot be executed in it;

  • COUNT queries cannot be executed;

  • sameAs is disabled;

  • executing an explain plan is disabled, meaning that graph queries are converted to simple SELECT queries without the graph segment. This will convert a graph query of the type

    SELECT * from <some_graph> WHERE {
       ?s ?p ?o .
    }
    

    to

    SELECT * WHERE {
       ?s ?p ?o .
    }
    

See more about the Ontop framework in its official documentation.

Usage scenario

Exposing a virtual endpoint as a repository in GraphDB is done in the following way:

The relational database is loaded in an RDBMS of your choice. After that, a relational database JDBC driver is necessary (e.g., PostgreSQL JDBC driver). It is placed in the lib directory of the GraphDB distribution.

Four additional files are needed as well:

  1. An OBDA or R2RML file describing the mapping of SPARQL queries to SQL data

  2. An OWL file describing the ontology of your data (optional)

  3. A properties file for the configuration of the JDBC driver parameters of the following type (here with example values from the sample data we will look at further down in this tutorial):

    jdbc.url=<database-jdbc-driver-connection-string>
    jdbc.driver=<database-jdbc-driver-class>
    jdbc.user=<your-database-username>
    jdbc.password=<your-database-password>
    
  4. A repository config file of the following type, here again with example values (optional):

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix rep: <http://www.openrdf.org/config/repository#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    
    <#university-virtual> a rep:Repository;
      rep:repositoryID "university-virtual";
      rep:repositoryImpl [
          <http://inf.unibz.it/krdb/obda/quest#obdaFile> "university.obda";
          <http://inf.unibz.it/krdb/obda/quest#owlFile> "university.ttl";
          <http://inf.unibz.it/krdb/obda/quest#propertiesFile> "university.properties";
          rep:repositoryType "graphdb:OntopRepository"
        ];
      rdfs:label "Ontop virtual store with OBDA" .
    

    that references the aforementioned OBDA (or R2RML), ontology, and properties files. This file is automatically generated when creating a virtual repository through the Workbench, and is used when creating such a repository via cURL command as described further below.

These files are used to create a virtual repository in GraphDB, in which you can then query the relational database.

Let’s consider the following relational database containing university data.

It has tables describing students, academic staff, courses and two relation schemas (uni1 and uni2) with many-to-many links between academic staff -> course and students -> course. The descriptions below are for the uni1 tables.

uni1.student

s_id

first_name

last_name

1

Mary

Smith

2

John

Doe

This table contains the local ID, first and last names of the students. The column s_id is a primary key.

uni1.academic

a_id

first_name

last_name

position

1

Anna

Chambers

1

2

Edward

May

9

3

Rachel

Ward

8

Similarly, this table contains the local ID, first and last names of the academic staff, but also information about their position. The column a_id is a primary key.

The column position is populated with magic numbers:

  • 1 -> Full Professor

  • 2 -> Associate Professor

  • 3 -> Assistant Professor

  • 8 -> External Teacher

  • 9 -> PostDoc

uni1.course

c_id

title

1234

Linear Algebra

This table contains the local ID and the title of the courses. The column c_id is a primary key.

uni1.teaching

c_id

a_id

1234

1

1234

2

This table contains the n-n relation between courses and teachers. There is no primary key, but two foreign keys to the tables uni1.course and uni1.academic.

uni1.course-registration

c_id

s_id

1234

1

1234

2

This table contains the n-n relation between courses and students. There is no primary key, but two foreign keys to the tables uni1.course and uni1.student.

Setup and configuration

JDBC driver

As mentioned above, in order to create a virtual repository in GraphDB, you need to first install a JDBC driver for your respective relational database.

Place the driver .jar file in the lib directory of the GraphDB distribution. In case you are using GraphDB from a native installation, the driver file name should be jdbc-driver.jar. Restart GraphDB if it is running.

Configuration files

Before creating a virtual repository, you will need the following files (available for download below):

Creating a virtual repository from the Workbench

With generic JDBC driver

  1. When creating a repository from the Workbench, select the Ontop option.

  2. GraphDB supports several database JDBC drivers. When creating an Ontop repository, the default setting is Generic JDBC Driver. This means that you need to configure and upload your own JDBC properties file (available as a template for download above).

  3. In the fields for JDBC properties file and OBDA or R2RML file, upload the corresponding files. The Ontology and Constraint files are optional.

  4. You can also test the connection to your SQL database with the button on the right.

  5. Click Create.

_images/create-ontop-repo.png

Note

Once you have created an Ontop repository, its type cannot be changed.

With one of the other supported database drivers

For ease of use, GraphDB also supports drivers for five other commonly used databases integrated into the Ontop framework: MySQL, PostgreSQL, Oracle, MS SQL Server, and DB2. Selecting one of them offers the advantage of not having to configure the JDBC properties file yourself, as its Driver class and URL property values are generated by GraphDB.

To use one of these database drivers:

  1. Select the type of SQL database you want to use from the drop-down menu.

  2. Download the corresponding driver by clicking the Download JDBC driver link on the right of the Driver class field, place it in the lib directory of the GraphDB distribution, and restart GraphDB if it is running.

  3. Fill in the required fields for each driver (Hostname, Database name, etc.).

  4. Upload the OBDA/R2RML file. (The Ontology and Constraint files are optional, just as with the generic JDBC driver)

  5. You can also test the connection to your SQL database with the button on the right.

  6. Click Create.

_images/create-ontop-repo-other-drivers.png

Creating a virtual repository using cURL

To create a virtual repository via the API, you need the following files described above, all placed in the same directory (here, we are using the universities examples again):

As mentioned earlier, the OWL ontology and the constraint files are optional.

Execute the following cURL command:

curl -X POST http://localhost:7200/rest/repositories -H 'Content-Type: multipart/form-data' -F "config=@repo-config.ttl" -F "obdaFile=@university.obda" -F "propertiesFile=@university.properties"

You will see the newly created repository under Setup ‣ Repositories in the GraphDB Workbench.

Mapping language

The underlying Ontop engine supports two mapping languages. The first one is the official W3C RDB2RDF mapping language known as R2RML, which provides excellent interoperability between the various tools. The second one is the native Ontop mapping known as OBDA, which is much shorter and easier to learn, and supports an automatic bidirectional transformation to R2RML.

Mappings represent OWL assertions: one set of OWL assertions for each result row is returned by the SQL query in the mapping. The assertions are those that are obtained by replacing the placeholders with the values from the relational database.

Mappings consist of:

  • source: a SQL query that retrieves some data from the database

  • target: a form of template that indicates how to generate OWL assertions in a Turtle-like syntax.

All examples in this documentation use the internal OBDA mapping language.

Let’s map the uni1-student table using an OBDA template.

The information source is the following:

SELECT *
FROM "uni1"."student"

And the target mapping file is:

ex:uni1/student/{s_id} a :Student ;
    foaf:firstName {first_name}^^xsd:string ;
    foaf:lastName {last_name}^^xsd:string .

The target part is described using a Turtle-like syntax while the source part is a regular SQL query.

We used the primary key s_id to create the URI. This practice enables Ontop to remove self-joins, which is very important for optimizing the query performance.

This entry could be split into three mapping assertions:

ex:uni1/student/{s_id} a :Student .
ex:uni1/student/{s_id} foaf:firstName {first_name}^^xsd:string .
ex:uni1/student/{s_id} foaf:lastName {last_name}^^xsd:string .

Mapping the uni1-course table would look as follows:

The source will be:

SELECT *
FROM "uni1"."course"

And the target:

ex:uni1/course/{c_id} a :Course ;
    :title {title} ;
    :isGivenAt ex:uni1/university .

SPARQL endpoint

Below are some examples of the SPARQL queries that are supported in a GraphDB virtual repository.

  1. Return the IDs of all persons that are faculty members:

    PREFIX : <http://example.org/voc#>
    
    SELECT ?p
    WHERE {
      ?p a :FacultyMember .
    }
    
  2. Return the IDs of all full Professors together with their first and last names:

    PREFIX : <http://example.org/voc#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    
    SELECT DISTINCT ?prof ?lastName ?firstName {
      ?prof a :FullProfessor ;
            foaf:firstName ?firstName ;
            foaf:lastName ?lastName .
    }
    
  3. Return all Associate Professors, Assistant Professors, and Full Professors with their last names and first name if available, and the title of the course they are teaching:

    PREFIX : <http://example.org/voc#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    
    SELECT ?title ?fName ?lName {
      ?teacher rdf:type :Professor .
      ?teacher :teaches ?course .
      ?teacher foaf:lastName ?lName .
    
      ?course :title ?title .
      OPTIONAL {
        ?teacher foaf:firstName ?fName .
      }
    }
    

Query federation

GraphDB also supports querying the virtual read-only repositories using the highly efficient Internal SPARQL Federation.

Its usage is the same as with the internal federation of regular repositories. Instead of providing a URL to a remote repository, you need to provide a special URL of the form repository:NNN, where NNN is the ID of the virtual repository you want to access.

Let’s see how this works with our university database example.

  1. Create a new, empty RDF repository called university-rdf.

  2. From the ontop_repo virtual repository with university data, insert some data in the new, empty university-rdf repository: teachers with first name and last name that give courses that are not held at university2:

    PREFIX : <http://example.org/voc#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    insert {
        ?person a :UniversityTeacher;
                :firstName ?firstName;
                :lastName ?lastName .
    } where {
        service <repository:ontop_repo> {
            SELECT DISTINCT ?person ?firstName ?lastName
            WHERE {
               ?person foaf:firstName ?firstName ;
                  foaf:lastName ?lastName ;
                  :teaches [ :isGivenAt ?institution ]
                FILTER(?institution != "http://example.org/voc#uni2/university")
            }
        }
    }
    
  3. To observe the results, again in the university-rdf repository, execute the following query that will return the teachers that were inserted with their first and last name:

    PREFIX : <http://example.org/voc#>
    SELECT * WHERE {
        ?teacherId a :UniversityTeacher;
            :firstName ?firstName;
            :lastName ?lastName;
    } LIMIT 200
    

    Result:

    _images/ontop-federation-query2.png
  4. Then:

  • get the teachers from the virtual repository that teach courses in an institution that is not university2

  • merge the result of that with the RDF repository by getting the firstName and lastName of those teachers

  • the IDs of the teachers are the common property for both repositories which makes the selection possible. For the purposes of our demonstration, this query filters them by firstName that contains the letter “a”.

    PREFIX : <http://example.org/voc#>
    select * where {
        SERVICE <repository:ontop_repo> {
            ?teacherId :teaches [ :isGivenAt ?institution] .
            FILTER (?institution != "http://example.org/voc#uni2/university")
        }
        ?teacherId :firstName ?firstName;
                :lastName ?lastName
        FILTER (regex(?firstName, "a"))
    }
    

    Result:

    _images/ontop-federation-query3.png

Limitations

Data virtualization also comes with certain limitations due to the distributed nature of the data. In this sense, it works best for information that requires little or no integration. For instance, if in databases X and Y, we have two instances of the person John Smith, which do not share a unique key or other exact match attributes like “John Smith” and “John E. Smith”, it will be quite inefficient to match the records at runtime.

One potential drawback is also the type of supported queries. If the underlying storage has no indexes, it will be slow to answer queries such as “tell me how resource X connects to resource Y”.

The number of stacked data sources also significantly affects the efficiency of data retrieval.

Lastly, it is not possible to efficiently perform auto-suggest/auto-complete type of indexes nor graph traversals or inferencing.