# Virtualization¶

## Overview and features¶

The data virtualization in GraphDB enables direct access to relational databases with SPARQL queries, which eliminates the need to replicate data. The implementation exposes a virtual SPARQL endpoint, which translates the queries to SQL using a declarative mapping. To achieve this functionality, GraphDB integrates with the open-source Ontop project and extends it with multiple GraphDB specific features.

The following SPARQL features are supported:

• SELECT and CONSTRUCT queries

• Default and named graph triple patterns

• Triple pattern combining: OPTIONAL, UNION, blank node path

• Result filtering and value bindings: FILTER, BIND, VALUES

• Projection modifiers: DISTINCT, LIMIT, ORDER BY

• Aggregates (GROUP BY, SUM, COUNT, AVG, MIN, MAX, GROUP_CONCAT)

• SPARQL functions (STR, IRI, LANG, REGEX)

• SPARQL data type support and their mapping to SQL types

• SUBQUERY

The most common scenario for using data virtualization is when the integrated data is highly dynamic or too big to be replicated. For practical reasons, it is easier to not copy it and accept all limitations like data quality, integrity, and type of supported queries of the underlying information source.

A second common scenario is to maintain a declarative mapping between the relational model and RDF, where the user periodically dumps all statements and writes them to a native RDF database so it can support property paths and faster data joins.

Note

The virtual repository has the following specifics:

• it is read-only, meaning that write operations cannot be executed in it;

• COUNT queries cannot be executed;

• sameAs is disabled;

• executing an explain plan is disabled, meaning that graph queries are converted to simple SELECT queries without the graph segment. This will convert a graph query of the type

SELECT * from <some_graph> WHERE {
?s ?p ?o .
}


to

SELECT * WHERE {
?s ?p ?o .
}


See more about the Ontop framework in its official documentation.

## Usage scenario¶

Exposing a virtual endpoint as a repository in GraphDB is done in the following way:

The relational database is loaded in an RDBMS of your choice. After that, a relational database JDBC driver is necessary (e.g., PostgreSQL JDBC driver).

Four additional files are needed as well:

1. An OBDA or R2RML file describing the mapping of SPARQL queries to SQL data

2. An OWL file describing the ontology of your data (optional)

3. A properties file for the configuration of the JDBC driver parameters of the following type (here with example values from the sample data we will look at further down in this tutorial):

jdbc.url=<database-jdbc-driver-connection-string>
jdbc.driver=<database-jdbc-driver-class>

4. A repository config file of the following type, here again with example values (optional):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rep: <http://www.openrdf.org/config/repository#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<#university-virtual> a rep:Repository;
rep:repositoryID "university-virtual";
rep:repositoryImpl [
<http://inf.unibz.it/krdb/obda/quest#obdaFile> "university.obda";
<http://inf.unibz.it/krdb/obda/quest#owlFile> "university.ttl";
<http://inf.unibz.it/krdb/obda/quest#propertiesFile> "university.properties";
rep:repositoryType "graphdb:OntopRepository"
];
rdfs:label "Ontop virtual store with OBDA" .


that references the aforementioned OBDA (or R2RML), ontology, and properties files. This file is automatically generated when creating a virtual repository through the Workbench, and is used when creating such a repository via cURL command as described further below.

5. A database metadata file that provides information about uniqueness and non-null constraints for database tables and views (optional).

These files are used to create a virtual repository in GraphDB, in which you can then query the relational database.

Let’s consider the following relational database containing university data.

It has tables describing students, academic staff, courses and two relation schemas (uni1 and uni2) with many-to-many links between academic staff -> course and students -> course. The descriptions below are for the uni1 tables.

uni1.student

s_id

first_name

last_name

1

Mary

Smith

2

John

Doe

This table contains the local ID, first and last names of the students. The column s_id is a primary key.

a_id

first_name

last_name

position

1

Anna

Chambers

1

2

Edward

May

9

3

Rachel

Ward

8

Similarly, this table contains the local ID, first and last names of the academic staff, but also information about their position. The column a_id is a primary key.

The column position is populated with magic numbers:

• 1 -> Full Professor

• 2 -> Associate Professor

• 3 -> Assistant Professor

• 8 -> External Teacher

• 9 -> PostDoc

uni1.course

c_id

title

1234

Linear Algebra

This table contains the local ID and the title of the courses. The column c_id is a primary key.

uni1.teaching

c_id

a_id

1234

1

1234

2

This table contains the n-n relation between courses and teachers. There is no primary key, but two foreign keys to the tables uni1.course and uni1.academic.

uni1.course-registration

c_id

s_id

1234

1

1234

2

This table contains the n-n relation between courses and students. There is no primary key, but two foreign keys to the tables uni1.course and uni1.student.

## Setup and configuration¶

### JDBC driver¶

As mentioned above, in order to create a virtual repository in GraphDB, you need to first install a JDBC driver for your respective relational database.

In the lib directory of the GraphDB distribution, create a subdirectory called jdbc and place the driver .jar file there. In case you are using GraphDB from a native installation, the driver file name should be jdbc-driver.jar.

Note

The driver can also be placed in the lib directory; however, this requires a restart of GraphDB.

If you want to set a JDBC driver directory different from the lib/jdbc location, you can define it via the graphdb.ontop.jdbc.path property in the conf/graphdb.properties file of the GraphDB distribution.

### Configuration files¶

Before creating a virtual repository, you will need the following files (available for download below):

### Creating a virtual repository from the Workbench¶

#### With generic JDBC driver¶

1. When creating a repository from the Workbench, select the Ontop option.

2. GraphDB supports several database JDBC drivers. When creating an Ontop repository, the default setting is Generic JDBC Driver. This means that you need to configure and upload your own JDBC properties file (available as a template for download above).

3. In the fields for JDBC properties file and OBDA or R2RML file, upload the corresponding files. The Ontology, Constraint, and DB metadata files are optional.

4. You can also test the connection to your SQL database with the button on the right.

5. Click Create.

Note

Once you have created an Ontop repository, its type cannot be changed.

#### With one of the other supported database drivers¶

For ease of use, GraphDB also supports drivers for five other commonly used databases integrated into the Ontop framework: MySQL, PostgreSQL, Oracle, MS SQL Server, and DB2. Selecting one of them offers the advantage of not having to configure the JDBC properties file yourself, as its Driver class and URL property values are generated by GraphDB.

To use one of these database drivers:

1. Select the type of SQL database you want to use from the drop-down menu.

2. Download the corresponding driver by clicking the Download JDBC driver link on the right of the Driver class field, place it in the lib directory of the GraphDB distribution, and restart GraphDB if it is running.

3. Fill in the required fields for each driver (Hostname, Database name, etc.).

4. Upload the OBDA/R2RML file. (The Ontology, Constraint, and DB metadata files are optional, just as with the generic JDBC driver)

5. You can also test the connection to your SQL database with the button on the right.

6. Click Create.

### Creating a virtual repository using cURL¶

To create a virtual repository via the API, you need the following files described above, all placed in the same directory (here, we are using the universities examples again):

As mentioned earlier, the OWL ontology, constraint, and DB metadata files are optional.

Execute the following cURL command (here including the DB metadata file):

curl -X POST http://localhost:7200/rest/repositories -H 'Content-Type: multipart/form-data' -F "config=@repo-config.ttl" -F "obdaFile=@university.obda" -F "propertiesFile=@university.properties" -F "dbMetadataFile=@university-complete-db-metadata.json"


You will see the newly created repository under Setup ‣ Repositories in the GraphDB Workbench.

## Mapping language¶

The underlying Ontop engine supports two mapping languages. The first one is the official W3C RDB2RDF mapping language known as R2RML, which provides excellent interoperability between the various tools. The second one is the native Ontop mapping known as OBDA, which is much shorter and easier to learn, and supports an automatic bidirectional transformation to R2RML.

Mappings represent OWL assertions: one set of OWL assertions for each result row is returned by the SQL query in the mapping. The assertions are those that are obtained by replacing the placeholders with the values from the relational database.

Mappings consist of:

• source: a SQL query that retrieves some data from the database

• target: a form of template that indicates how to generate OWL assertions in a Turtle-like syntax.

All examples in this documentation use the internal OBDA mapping language.

Let’s map the uni1-student table using an OBDA template.

The information source is the following:

SELECT *
FROM "uni1"."student"


And the target mapping file is:

ex:uni1/student/{s_id} a :Student ;
foaf:firstName {first_name}^^xsd:string ;
foaf:lastName {last_name}^^xsd:string .


The target part is described using a Turtle-like syntax while the source part is a regular SQL query.

We used the primary key s_id to create the URI. This practice enables Ontop to remove self-joins, which is very important for optimizing the query performance.

This entry could be split into three mapping assertions:

ex:uni1/student/{s_id} a :Student .
ex:uni1/student/{s_id} foaf:firstName {first_name}^^xsd:string .
ex:uni1/student/{s_id} foaf:lastName {last_name}^^xsd:string .


Mapping the uni1-course table would look as follows:

The source will be:

SELECT *
FROM "uni1"."course"


And the target:

ex:uni1/course/{c_id} a :Course ;
:title {title} ;
:isGivenAt ex:uni1/university .


## SPARQL endpoint¶

Below are some examples of the SPARQL queries that are supported in a GraphDB virtual repository.

1. Return the IDs of all persons that are faculty members:

PREFIX voc: <http://example.org/voc#>

SELECT ?p
WHERE {
?p a voc:FacultyMember .
}

2. Return the IDs of all full Professors together with their first and last names:

PREFIX voc: <http://example.org/voc#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?prof ?lastName ?firstName {
?prof a voc:FullProfessor ;
foaf:firstName ?firstName ;
foaf:lastName ?lastName .
}

3. Return all Associate Professors, Assistant Professors, and Full Professors with their last names and first name if available, and the title of the course they are teaching:

PREFIX voc: <http://example.org/voc#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?title ?fName ?lName {
?teacher rdf:type voc:Professor .
?teacher voc:teaches ?course .
?teacher foaf:lastName ?lName .

?course voc:title ?title .
OPTIONAL {
?teacher foaf:firstName ?fName .
}
}


## Query federation¶

GraphDB also supports querying the virtual read-only repositories using the highly efficient Internal SPARQL federation.

Its usage is the same as with the internal federation of regular repositories. Instead of providing a URL to a remote repository, you need to provide a special URL of the form repository:NNN, where NNN is the ID of the virtual repository you want to access.

Let’s see how this works with our university database example.

1. Create a new, empty RDF repository called university-rdf.

2. From the ontop_repo virtual repository with university data, insert some data in the new, empty university-rdf repository: teachers with first name and last name that give courses that are not held at university2:

PREFIX voc: <http://example.org/voc#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
insert {
?person a voc:UniversityTeacher;
voc:firstName ?firstName;
voc:lastName ?lastName .
} where {
service <repository:my_repo> {
SELECT DISTINCT ?person ?firstName ?lastName
WHERE {
?person foaf:firstName ?firstName ;
foaf:lastName ?lastName ;
voc:teaches [ voc:isGivenAt ?institution ]
FILTER(?institution != <http://example.org/voc#uni2/university>)
}
}
}

3. To observe the results, again in the university-rdf repository, execute the following query that will return the teachers that were inserted with their first and last name:

PREFIX voc: <http://example.org/voc#>
SELECT * WHERE {
?teacherId a voc:UniversityTeacher;
voc:firstName ?firstName;
voc:lastName ?lastName;
} LIMIT 200


Result:

4. Then:

• get the teachers from the virtual repository that teach courses in an institution that is not university2

• merge the result of that with the RDF repository by getting the firstName and lastName of those teachers

• the IDs of the teachers are the common property for both repositories which makes the selection possible. For the purposes of our demonstration, this query filters them by firstName that contains the letter “a”.

PREFIX voc: <http://example.org/voc#>
select * where {
SERVICE <repository:ontop_repo> {
?teacherId voc:teaches [ voc:isGivenAt ?institution] .
FILTER (?institution != "http://example.org/voc#uni2/university")
}
?teacherId voc:firstName ?firstName;
voc:lastName ?lastName
FILTER (regex(?firstName, "a"))
}


Result:

## Limitations¶

Data virtualization also comes with certain limitations due to the distributed nature of the data. In this sense, it works best for information that requires little or no integration. For instance, if in databases X and Y, we have two instances of the person John Smith, which do not share a unique key or other exact match attributes like “John Smith” and “John E. Smith”, it will be quite inefficient to match the records at runtime.

One potential drawback is also the type of supported queries. If the underlying storage has no indexes, it will be slow to answer queries such as “tell me how resource X connects to resource Y”.

The number of stacked data sources also significantly affects the efficiency of data retrieval.

Lastly, it is not possible to efficiently perform auto-suggest/auto-complete type of indexes nor graph traversals or inferencing.