Virtualization

Overview and features

GraphDB’s data virtualization capabilities enable direct access using SPARQL to relational databases, eliminating the need to replicate data. Data virtualization is a useful tool when working with highly dynamic data, as well as data sources that are too big to be replicated. Moreover, in practice it is easier not to copy data and accept all limitations such as data quality, integrity, and types of supported queries of the underlying information source.

A common scenario is to maintain a declarative mapping between the relational model and RDF, where the user periodically dumps all statements and writes them to a native RDF database so it can support property paths and faster data joins.

The implementation exposes a virtual SPARQL endpoint, which translates the queries to SQL using declarative mapping. This is achieved through an integration with the open-source Ontop project, extended with multiple GraphDB-specific features.

GraphDB uses Ontop version 5.0.2. As of June 2023, Ontop supports the following database systems: PostgreSQL, MySQL, MariaDB (since 5.0.0), SQL Server, Oracle, DB2, Snowflake (since 5.0.0), Databricks (since 5.0.0), Google BigQuery (since 5.0.2), AWS Redshift (since 5.0.2) and DuckDB (since 5.0.2); and database federators such as Denodo, Dremio (since 4.1.0), Teiid (since 4.1.1), Apache Spark (since 4.2.0) and Trino / PrestoDB / AWS Athena (since 5.0.2).

See more about the Ontop framework in its official documentation.

Setup and configuration

To expose a virtual endpoint as a repository in GraphDB, you must first load the relational database into an RDBMS of your choice.

JDBC driver

In order to create a virtual repository in GraphDB, you need to first install a JDBC driver for your respective relational database (e.g., PostgreSQL JDBC driver).

In the lib directory of the GraphDB distribution, create a subdirectory called jdbc and place the driver .jar file there.

Note

The driver can also be placed in the lib directory; however, this requires a restart of GraphDB.

If you want to set a JDBC driver directory different from the lib/jdbc location, you can define it via the graphdb.ontop.jdbc.path property in the conf/graphdb.properties file of the GraphDB distribution.

Configuration files

In addition to the relational database and the JDBC driver, there are several other files used to create a virtual repository.

Note

You can download examples versions of those files in the usage scenario section below.

Required files

  1. An R2RML or OBDA mapping file describing the mapping of SPARQL queries to SQL data

Optional files

  1. An OWL ontology file describing the ontology of your data

  2. A constraint file providing user-supplied implicit DB Constraints (e.g. unique constraints and foreign keys) to Ontop to generate efficient SQL

  3. An Ontop lenses file that specifies integrity constraints on views defined at the Ontop level

  4. A DB metadata file that provides extracted database information such as DB connection metadata, DB schema and integrity constraints (for instance primary keys)

Automatically generated files

There are also several files that are automatically generated when creating a virtual repository through the Workbench. Unless you are creating such a repository via cURL command as described further below, you do not need to worry about those files.

  1. A properties file for the configuration of the JDBC driver parameters of the following type:

    jdbc.url=<database-jdbc-driver-connection-string>
    jdbc.driver=<database-jdbc-driver-class>
    jdbc.user=<your-database-username>
    jdbc.password=<your-database-password>
    
  2. A repository config file of the following type that references the aforementioned properties, R2RML or OBDA mapping, ontology, constraints, lenses, and DB metadata files:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix rep: <http://www.openrdf.org/config/repository#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    
    <#example-virtual> a rep:Repository;
      rep:repositoryID "example-virtual";
      rep:repositoryImpl [
          <http://inf.unibz.it/krdb/obda/quest#propertiesFile> "example.properties";
          <http://inf.unibz.it/krdb/obda/quest#obdaFile> "example-mapping.obda";
          <http://inf.unibz.it/krdb/obda/quest#owlFile> "example-ontology.ttl";
          <http://inf.unibz.it/krdb/obda/quest#constraintFile> "example-constraints.lst";
          <http://inf.unibz.it/krdb/obda/quest#lensesFile> "example-lenses.json";
          <http://inf.unibz.it/krdb/obda/quest#dbMetadataFile> "example-db-metadata.json";
          rep:repositoryType "graphdb:OntopRepository"
        ];
      rdfs:label "Ontop virtual store with OBDA" .
    

Ontop configuration properties and query logging

Ontop configuration properties can be configured through the GraphDB config file using properties of the kind graphdb.ontop.<property>. The graphdb. part will be stripped and the resultant ontop.<property> property will be passed to Ontop.

For example, to configure Ontop query logging, set the following properties:

# Enables query logging
graphdb.ontop.queryLogging = true

# Includes the SPARQL query string into the query log (true by default)
graphdb.ontop.queryLogging.includeSparqlQuery = true

# Includes the reformulated (SQL) query into the query log (false by default)
graphdb.ontop.queryLogging.includeReformulatedQuery = true

The Ontop log will be written the same way as any other GraphDB log messages.

Creating a virtual repository from the Workbench

To create a new virtual repository, select the Ontop option from the repositories creation menu.

When creating an Ontop repository in GraphDB, the default setting is Generic JDBC Driver where you need to specify the Driver class yourself. However, you can also create an Ontop repository by using the predefined settings for one of several commonly used drivers.

Warning

Once you have created an Ontop repository, its type cannot be changed.

With a generic JDBC driver

Connection information

  1. After you create a new Ontop repository, leave the default setting as Generic JDBC Driver.

  2. Enter your Username and Password (if applicable).

  3. Define your Driver class.

  4. Enter your JDBC URL.

  5. You can also test the connection to your SQL database with the button on the right.

_images/create-ontop-repo-connection-info.png

Ontop settings

  1. Upload the R2RML or OBDA file in its respective field. The Ontology, Constraint, Lenses, and DB metadata files are optional.

  2. To include optional properties, type them in the field for Additional Ontop/JDBC properties, using the properties file format.

  3. You can also test the connection to your SQL database with the button on the right.

  4. Click Create.

_images/create-ontop-repo-ontop-settings.png

With one of the other supported database drivers

For ease of use, GraphDB also supports drivers for eight other commonly used databases integrated into the Ontop framework: MySQL, PostgreSQL, Oracle, MS SQL Server, DB2, Dremio, Databricks, and Snowflake. Those options come with predefined Driver class properties. In addition to that, when working with one of those options, the URL property values are generated by GraphDB.

To use one of these database drivers:

Connection information

  1. Select the type of SQL database you want to use from the drop-down menu.

  2. Fill in the required fields for each driver (Hostname, Database name, etc.).

  3. Optionally, fill in the Port, Username and Password properties.

Note

After you select the type of SQL database, you can download the corresponding driver by clicking the Download JDBC driver link on the right of the Driver class field. For more information on how to set up your JDBC driver, read the JDBC driver section above.

Note

Although you cannot modify the existing provided template, you can still add to the JDBC URL field value after its automatically generated. This is required when working with complex data sources (such as Databricks) which expect additional properties in order to establish the connection.

_images/create-ontop-repo-supported-drivers.png

Ontop settings

  1. Upload the R2RML or OBDA file in its respective field. The Ontology, Constraint, Lenses, and DB metadata files are optional.

  2. To include optional properties, type them in the field for Additional Ontop/JDBC properties, using the properties file format.

  3. You can also test the connection to your SQL database with the button on the right.

  4. Click Create.

Creating a virtual repository using cURL

To create a virtual repository via the API, you need the three required files. If necessary, you can also include one or more of the optional files:

  • example-repo-config.ttl: the config file for the repository, required

  • example.properties: the JDBC properties file, required

  • example-mapping.obda: the R2RML/OBDA file, required

  • example-ontology.ttl: the Ontop ontology file, optional

  • example-constraints.lst: the Ontop constraint file, optional

  • example-lenses.json: the Ontop lenses file, optional

  • example-db-metadata.json: the Ontop DB metadat file, optional

You can see a description of each file in virtualization configuration files section above.

Execute the following cURL command (here including the required and all optional files):

curl -X POST http://localhost:7200/rest/repositories -H "Content-Type: multipart/form-data"\
    -F "config=@example-repo-config.ttl"\
    -F "propertiesFile=@example.properties"\
    -F "obdaFile=@example-mapping.obda"\
    -F "owlFile=@example-ontology.ttl"\
    -F "constraintFile=@example-constraints.lst"\
    -F "lensesFile=@example-lenses.json"\
    -F "dbMetadataFile=@example-db-metadata.json"

You will see the newly created repository under Setup ‣ Repositories in the GraphDB Workbench.

Mapping language

The underlying Ontop engine supports two mapping languages. The first one is the official W3C RDB2RDF mapping language known as R2RML, which provides excellent interoperability between the various tools. The second one is the native Ontop mapping known as OBDA, which is much shorter and easier to learn, and supports an automatic bidirectional transformation to R2RML.

SPARQL endpoint

For a comprehensive list of supported SPARQL features, please see Ontop’s documentation on standard compliance.

You can see examples of SPARQL queries supported by GraphDB virtual repositories under Usage scenario.

Query federation

GraphDB also supports querying the virtual read-only repositories using the highly efficient Internal SPARQL federation.

Its usage is the same as with the internal federation of regular repositories. Instead of providing a URL to a remote repository, you need to provide a special URL of the form repository:NNN, where NNN is the ID of the virtual repository you want to access.

You can see an example usage scenario of query federation under Usage scenario.

Usage scenario

Let’s use a relational database containing university data.

Note

You can download the required files for this usage scenario below:
  1. The university SQL data (import it into an SQL database of your choice).

  2. OBDA mapping file.

  3. Ontology file.

  4. DB metadata file.

  5. Config file for the repository (necessary when creating a virtual repository using cURL).

  6. JDBC Properties file (necessary when creating a virtual repository using cURL).

It has tables describing students, academic staff, courses and two relation schemas (uni1 and uni2) with many-to-many links between academic staff → course and students → course. The descriptions below are for the uni1 tables.

uni1.student

s_id

first_name

last_name

1

Mary

Smith

2

John

Doe

This table contains the local ID, first and last names of the students. The column s_id is a primary key.

uni1.academic

a_id

first_name

last_name

position

1

Anna

Chambers

1

2

Edward

May

9

2

Rachel

Ward

8

Similarly, this table contains the local ID, first and last names of the academic staff, but also information about their position. The column a_id is a primary key.

The column position is populated with magic numbers:

  • 1 → Full Professor

  • 2 → Associate Professor

  • 3 → Assistant Professor

  • 8 → External Teacher

  • 9 → PostDoc

uni1.course

c_id

title

1234

Linear Algebra

This table contains the local ID and the title of the courses. The column c_id is a primary key.

uni1.teaching

c_id

a_id

1234

1

1234

2

This table contains the n-n relation between courses and teachers. There is no primary key, but two foreign keys to the tables uni1.course and uni1.academic.

uni1.course-registration

c_id

s_id

1234

1

1234

2

This table contains the n-n relation between courses and students. There is no primary key, but two foreign keys to the tables uni1.course and uni1.student.

SPARQL example queries

Below are some examples of the SPARQL queries that are supported in a GraphDB virtual repository.

  1. Return the IDs of all persons that are faculty members:

    PREFIX voc: <http://example.org/voc#>
    
    SELECT ?p {
        ?p a voc:FacultyMember .
    }
    
  2. Return the IDs of all full Professors together with their first and last names:

    PREFIX voc: <http://example.org/voc#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    
    SELECT DISTINCT ?prof ?lastName ?firstName {
        ?prof a voc:FullProfessor ;
              foaf:firstName ?firstName ;
              foaf:lastName ?lastName .
    }
    
  3. Return all Associate Professors, Assistant Professors, and Full Professors with their last names and first name if available, and the title of the course they are teaching:

    PREFIX voc: <http://example.org/voc#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    
    SELECT ?title ?fName ?lName {
        ?teacher rdf:type voc:Professor .
        ?teacher voc:teaches ?course .
        ?teacher foaf:lastName ?lName .
    
        ?course voc:title ?title .
        OPTIONAL {
            ?teacher foaf:firstName ?fName .
        }
    }
    

Query federation example

Let’s see how querying using Internal SPARQL federation works with our university database example.

  1. Create a new, empty RDF repository called university-rdf.

  2. From the ontop_repo virtual repository with university data, insert some data in the new, empty university-rdf repository: teachers with first name and last name that give courses that are not held at university2:

    PREFIX voc: <http://example.org/voc#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    
    INSERT {
        ?person a voc:UniversityTeacher;
                voc:firstName ?firstName;
                voc:lastName ?lastName .
    } WHERE {
        SERVICE <repository:my_repo> {
            SELECT DISTINCT ?person ?firstName ?lastName {
                ?person foaf:firstName ?firstName ;
                        foaf:lastName ?lastName ;
                        voc:teaches [ voc:isGivenAt ?institution ]
                FILTER(?institution != <http://example.org/voc#uni2/university>)
            }
        }
    }
    
  3. To observe the results, again in the university-rdf repository, execute the following query that will return the teachers that were inserted with their first and last name:

    PREFIX voc: <http://example.org/voc#>
    
    SELECT * {
        ?teacherId a voc:UniversityTeacher;
            voc:firstName ?firstName;
            voc:lastName ?lastName;
    }
    

    Result:

    _images/ontop-federation-query2.png
  4. Then:

  • get the teachers from the virtual repository that teach courses in an institution that is not university2

  • merge the result of that with the RDF repository by getting the firstName and lastName of those teachers

  • the IDs of the teachers are the common property for both repositories which makes the selection possible. For the purposes of our demonstration, this query filters them by firstName that contains the letter “a”.

    PREFIX voc: <http://example.org/voc#>
    
    SELECT * {
        SERVICE <repository:ontop_repo> {
            ?teacherId voc:teaches [ voc:isGivenAt ?institution] .
            FILTER (?institution != "http://example.org/voc#uni2/university")
        }
        ?teacherId voc:firstName ?firstName;
                   voc:lastName ?lastName
        FILTER (regex(?firstName, "a"))
    }
    

    Result:

    _images/ontop-federation-query3.png

Limitations

Data virtualization also comes with certain limitations due to the distributed nature of the data. In this sense, it works best for information that requires little or no integration. For instance, if in databases X and Y, we have two instances of the person John Smith, which do not share a unique key or other exact match attributes like “John Smith” and “John E. Smith”, it will be quite inefficient to match the records at runtime.

The virtual repository has the following specifics:

  • it is read-only, meaning that write operations cannot be executed in it;

  • COUNT queries cannot be executed;

  • sameAs is disabled;

  • running an Alpine Linux Based Docker GraphDB with DuckDB Driver will result in a Runtime Exception;

  • GraphDB explain plan is not available;

  • Named graphs are not supported and will be ignored if used (as illustrated by the example queries below);

    SELECT * from <some_graph> WHERE {
        ?s ?p ?o .
    }
    

    to

    SELECT * WHERE {
        ?s ?p ?o .
    }
    

One potential drawback is also the type of supported queries. If the underlying storage has no indexes, it will be slow to answer queries such as “tell me how resource X connects to resource Y”.

The number of stacked data sources also significantly affects the efficiency of data retrieval.

Lastly, it is not possible to efficiently perform auto-suggest/auto-complete type of indexes nor graph traversals or inferencing.