## OntoRefine – overview and features¶

GraphDB OntoRefine is an upgraded version of the open-source OpenRefine data transformation tool. It allows you to quickly map any structured data to an RDF schema stored locally in GraphDB. The visual interface guides you in choosing the right predicates and types, defining datatype-to-RDF mappings, and implementing complex transformations using GREL, the Google Refine Expression Language.

OntoRefine is integrated in the GraphDB Workbench and supports the formats TSV, CSV, *SV, XLS, XLSX, JSON, XML, RDF/XML, and Google Sheets. It enables you to:

• Create an RDF model of the cleaned data.

• Transform your data using SPIN functions.

• Further modify your RDFized data in the GraphDB SPARQL endpoint.

## Example data¶

For the examples in this guide, we will be working with the following dataset:

### Open OntoRefine in the Workbench¶

To transform your data into RDF, you need a working GraphDB database.

1. Start GraphDB in Workbench mode.

2. Open http://localhost:7200/ in a browser.

3. Create a repository.

4. Go to Import ‣ Tabular (OntoRefine).

All data files in OntoRefine are organized as projects. One project can have more than one data file.

The Create Project action area consists of three tabs corresponding to the source of the data. You can upload a file from your computer, specify the URL of publicly accessible data, or paste data from the clipboard.

### Create a project¶

1. Click Create Project ‣ Get data from.

2. Select one or more files to upload, or provide the data via a URL or from the clipboard.

3. Click Next.

4. (Optional) Change the table configurations and update the preview.

When the file is first opened, OntoRefine tries to recognize the text encoding and the delimiters.

5. Click Create Project.
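The delimiter detection mentioned above can be illustrated with Python's standard `csv.Sniffer` (a sketch of the general idea only; OntoRefine's own detection is more involved and also guesses the text encoding):

```python
import csv

# Inspect a small sample of the uploaded file and guess the delimiter.
# The sample below is a hypothetical two-line excerpt of a ;-separated file.
sample = "Trcid;Title;Latitude\n669d7d82;Smits Koffiehuis;52,3775440\n"
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
print(dialect.delimiter)  # ;
```

The sniffer picks the character that occurs a consistent number of times per line, which is why the stray comma in the latitude value does not confuse it.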

### Import a project¶

To import an already existing OntoRefine project:

1. Go to Import Project.

2. Select a file (.tar or .tar.gz).

3. Import it.

### Open a project¶

Once the project is created:

1. Go to Open Project.

2. Click the one you want to work on.

3. (Optional) You can also delete your project if you want to.

The result of each of these actions is a table similar to an Excel or Google Sheets spreadsheet:

## RDFize tabular data¶

### Mapping interface¶

This walk-through will show you how to map your tabular data against an existing ontology, in this case the schema-org.rdf ontology. Upload it into the repository as shown here. After that, go to Setup ‣ Autocomplete and enable the autocomplete index. Return to the OntoRefine project.

The RDF Mapping button will take you to the mapping editor, which is an extension of the OntoRefine functionality. Here, you can:

• configure and preview the mapping model of your data

• upload a JSON model of a mapping

• convert the result of the mapping to RDF data that is downloaded as a .ttl file

• generate a SPARQL query of your mapping and open it in a GraphDB SPARQL endpoint

• create a new mapping

You can close the mapping editor with the X button on the top right.

The headers of all columns in the tabular data that we imported are displayed as boxes that you can drag and drop into a mapping cell to configure them.

Each row in the table represents an RDF triple constructed from the tabular data. If two or more triples have the same subject, they are displayed as one subject with multiple predicates. Similarly, if two or more triples share the same subject and predicate but have different objects, they are displayed as one subject and predicate with multiple objects.

Important

To save your mapping model in Git or to automate the import of structured data into GraphDB, you can download the JSON model and generate the SPARQL query of your mapping, which can be downloaded in several formats including JSON, Turtle, and TriG. The mapping API will then reference these files.

### Value mapping¶

The value mapping describes how a single tabular cell is converted to an RDF value. Each such mapping requires a value source, a value type, and may have an optional value transformation.

#### Value type¶

The value type defines the type of RDF value that will be produced by the value mapping. The possible types are:

Resource (abstract)

An RDF resource. This is an abstract type that specifies the common features of IRIs and blank nodes. A resource value type may have type mappings and property mappings on its own.

IRI

An RDF IRI. The transformed value is the IRI, with illegal characters escaped automatically. This is a subtype of the abstract Resource value type.

Blank node based on value (value Bnode)

An RDF blank node. The transformed value is used to calculate a reasonable blank node identifier such that identical transformed values produce the same blank node. This is a subtype of the abstract Resource value type.

Unique blank node (unique Bnode)

An RDF blank node. The transformed value is ignored and a unique blank node is created every time. This is a subtype of the abstract Resource value type.
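The difference between the two blank node types can be sketched in Python. The label schemes below are hypothetical, purely for illustration; GraphDB generates its own identifiers:

```python
import hashlib

def value_bnode(value):
    # Value Bnode: equal source values deterministically map to the
    # same blank node label (hypothetical hashing scheme).
    return "_:v" + hashlib.sha1(value.encode("utf-8")).hexdigest()[:12]

_counter = 0

def unique_bnode(value=None):
    # Unique Bnode: the source value is ignored; every call yields a
    # fresh blank node.
    global _counter
    _counter += 1
    return f"_:node{_counter}"
```

Calling `value_bnode` twice with the same Trcid would yield the same node, while `unique_bnode` never repeats.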

Any literal (abstract)

Any kind of RDF literal (plain, language, or datatype). This is an abstract type that unifies all literal value types.

Literal

An RDF plain literal. The transformed value is the literal’s label.

Literal with a language

An RDF literal with a language. The transformed value is the literal’s label. The language is a simple literal value mapping, i.e., identical to a value mapping with type literal.

Literal with a datatype

An RDF literal with a datatype. The transformed value is the literal’s label. The datatype is a simple IRI value mapping.
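As a rough illustration, the three literal forms serialize in Turtle as follows (a simplified sketch that ignores escaping of special characters):

```python
def plain_literal(label):
    # Turtle form of a plain literal
    return f'"{label}"'

def lang_literal(label, lang):
    # Turtle form of a language-tagged literal, e.g. "..."@en
    return f'"{label}"@{lang}'

def datatype_literal(label, datatype):
    # Turtle form of a typed literal, e.g. "52.37"^^xsd:float
    return f'"{label}"^^{datatype}'

print(lang_literal("Smits Noord-Zuid Hollandsch Koffiehuis", "en"))
# "Smits Noord-Zuid Hollandsch Koffiehuis"@en
```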

#### Value source¶

Each value mapping has an associated value source: the row or record index, a column identified by name, or a constant value.

#### Value transformation¶

Each value mapping may have an optional transformation applied to the data received from the value source before the value is created. Each transformation has an associated language and expression. The expression is evaluated according to the rules of the language in the context of the value source. The languages are:

Language “prefix”

The expression is a namespace prefix applied to the value received from the source, if that value is not an absolute IRI already.
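The "prefix" rule can be sketched as follows (the absolute-IRI check here is a simplification of whatever heuristic the tool actually applies):

```python
def apply_prefix_transform(value, namespace):
    # Prepend the namespace unless the value is already an absolute IRI.
    if "://" in value or value.startswith("urn:"):
        return value
    return namespace + value

ns = "https://data.amsterdam.nl/resource/restaurant/"
print(apply_prefix_transform("669d7d82", ns))
# https://data.amsterdam.nl/resource/restaurant/669d7d82
```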

Language “grel”

The expression is a GREL expression.

### Prefixes¶

The mapping tree contains a set of prefixes that are used in the cell configuration. They are defined in the prefix area and can be of three types:

• default prefixes from commonly used RDF schemas, such as foaf, geo, rdf, rdfs, skos, and xsd. You can select entities from these schemas without importing them into your repository;

• prefixes that you select from the imported ontology;

• prefixes that you create yourself.

Add the following prefixes in the Prefix field:

PREFIX dbo: <http://dbpedia.org/ontology/>

PREFIX amsterdam: <https://data.amsterdam.nl/resource/>


### Triple configuration¶

Tip

You may find it convenient to preview the final RDF results while still configuring them. To do so, choose Both from the options on the top left.

Restaurant ID and type

For the RDF subject value, we will take the value of the Trcid column. You do not need to know all the headers of your data, as typing @ in the field will display a drop-down list of all available column headers.

Set the predicate to rdf:type, written with its shorthand a.

As object value, enter schema:Restaurant as present in the schema-org ontology.

Now let’s edit the subject value. The edit icon will open the Subject mapping configuration window. We can see that the Source for this value is the value of the Trcid column. In the Prefix field, type amsterdam:restaurant/, which will extend the predefined amsterdam prefix.

Hint

The preview of an IRI is a clickable link that points to the corresponding resource in the repository, e.g., the restaurant ID below.

To view the RDF results at any time, click the RDF button. This will download a result-triples.ttl file where we can see the @prefix amsterdam: <https://data.amsterdam.nl/resource/> namespace that we defined, extended with restaurant, as well as the IDs of the restaurants that come from the values in the Trcid column.

Note

If you try to navigate away from the mapping screen, close it, or open a new mapping, a warning message will inform you that if you proceed with that action, all mappings will be lost.

As our goal here is to map the tabular data against the schema-org ontology, we will continue adding predicates and objects. Since the mapping table is modeled after the Turtle syntax, we can attach more than one predicate to one subject and more than one object to one predicate without having to repeat the same triples in every row.

Title

In the next row, let’s add the schema:title predicate by autocompleting it from the imported ontology: as Source, select Constant, then enter schema for Prefix and title for Constant. The object’s RDF Type will be Literal, and its value will come from the Title column of the tabular data. As explained above, type @ and select Title from the drop-down list.

We will also add another object that will be the title in English, i.e., TitleEN. When a type is a Literal, we can further specify its Literal attributes to be Language Literal or Datatype Literal. Here, we choose the secondary type to be Language Literal, which will configure a Language tag for that Literal.

Hint

Note that the names of the columns that are already being used in the mapping now appear in grey.

Description

In the next row, let’s add another predicate from the ontology – schema:description. As its object, we will use the value from the Shortdescription column.

If you download the .ttl file with the RDF results and open it, you will see that the data is now mapped the way we intended, e.g.:

@base <http://example/base/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix amsterdam: <https://data.amsterdam.nl/resource/> .
@prefix sf: <http://www.opengis.net/ont/sf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
a schema:Restaurant;
schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;
schema:description "Het Smits Koffiehuis ontleent haar ontstaan aan de stoomtram die de verbinding onderhield met Amsterdam naar het noorden van de provincie en is in 1919 gebouwd. Nu is er een restaurant en een koffiebar. Ook is hier een informatiekantoor van Amsterdam Marketing gehuisvest.";


One block corresponds to one row in the OntoRefine mapping table.

Latitude

Another predicate that we can use from the ontology is schema:latitude. Its object will be the value of the Latitude column, edited as follows: RDF Type ‣ Literal, Source ‣ GREL with the expression row.cells["Latitude"].value.replace(",", "."), which replaces the comma in the latitude value with a decimal point. Then set Literal attributes ‣ Datatype, and Datatype source ‣ Constant with prefix xsd and constant float.

The same can be done for Longitude.
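In Python terms, the transformation applied to each coordinate cell is simply the following (an illustrative equivalent of the GREL expression, not OntoRefine code):

```python
def grel_replace_comma(row, column):
    # Python equivalent of the GREL expression
    # row.cells["Latitude"].value.replace(",", ".")
    return row[column].replace(",", ".")

row = {"Latitude": "52,3775440", "Longitude": "4,9003230"}
lat = grel_replace_comma(row, "Latitude")   # "52.3775440", typed as xsd:float
lon = grel_replace_comma(row, "Longitude")  # "4.9003230"
```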

Note

When writing GREL expressions, keep in mind that column names are case-sensitive.

Zip code

Next, we will map the zip code of the data. Since the schema-org ontology has no suitable property for it, we will create one – amsterdam:zipcode. The object value will come from the Zipcode column.

Image

We can also map the images in the data. As the predicate, use the ontology property schema:image, for which the corresponding column in the tabular data is Media. We will edit it to be an IRI, as it is an image location: Type ‣ IRI.

Geographical point (as a nested triple)

Another thing we can map is a geographical point for the restaurant. To do this, we will define a GeoSPARQL point constructed from the latitude and longitude. For the predicate, select geo:hasGeometry by autocompleting it from the pre-imported GeoSPARQL schema. For the object, we will need a bridging IRI that will be the GeoSPARQL point. We will take it from the Trcid column: Type ‣ IRI, Source ‣ Column with value Trcid. In the Prefix field, enter amsterdam:geometry/, which will extend the predefined amsterdam prefix.

When something is an IRI, we can continue with the mapping by attaching more predicates and objects to it, so that it becomes the subject of the next triples. We call these nested triples. You can add one by clicking the Add nested triple (->) icon outside the right border of the object cell. This will open new predicate and object cells below, which are in a green frame together with the ones on top, thus indicating the nesting.

For this IRI that is now the subject of the nested triple, add a as the predicate in the next row, and sf:Point as the object (Source ‣ Constant, with Point as its value). Then click -> again to add another nested triple, which will be the point that we will construct. As its predicate, enter geo:asWKT (Source ‣ Constant, then prefix geo and constant asWKT). For the object, we will use the row_index:

Hint

Besides using the values from the columns, we can also use the row index, for example: if we did not have a column for the ID in our data (Trcid), we could use the row index to construct an IRI.

Let’s edit this point. In the object field, set the RDF type to Literal and choose GREL as a source. Type the following GREL expression: "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT (" + cells["Longitude"].value.replace(',', '.') + " " + cells["Latitude"].value.replace(',', '.') + ")". It specifies how the value for the Point will be constructed. Note that we combine the values of two columns from our tabular data, Latitude and Longitude, to construct a single GeoSPARQL point from them. While typing a GREL expression, a preview of the results is shown to guide you. Click the info icon to reach the GREL documentation.

Further, for Literal attributes we will set Datatype to configure the Datatype for this Literal. In the Constant field, add prefix geo and then constant wktLiteral to add the geo:wktLiteral Datatype to our Point.
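Putting the pieces together, here is a Python sketch of the literal value this mapping produces, assuming the same comma-to-point replacement and the CRS IRI in angle brackets as it appears in the serialized .ttl output:

```python
def wkt_point(row):
    # Mirrors the GREL concatenation: CRS IRI, then "POINT (longitude latitude)"
    # with decimal commas replaced by points.
    lon = row["Longitude"].replace(",", ".")
    lat = row["Latitude"].replace(",", ".")
    return ("<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT ("
            + lon + " " + lat + ")")

row = {"Latitude": "52,3775440", "Longitude": "4,9003230"}
print(wkt_point(row))
# <http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT (4.9003230 52.3775440)
```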

We can also use blank nodes in the mapping.

Unique Bnode example:

Add the following predicate: Source ‣ Constant with prefix amsterdam and value uniquelocation. For the object, set RDF type ‣ Unique Bnode, and the value of the Trcid column as Source. In the preview, we see that a unique blank node is created.

Now, let’s add a nested triple to it. Clicking the right-hand arrow, add a predicate as follows: Source ‣ Constant with prefix amsterdam and value address. For the object, set -, and the value of the Adres column as Source.

Value Bnode example:

Let’s add another predicate: Source ‣ Constant with prefix amsterdam and value valuelocation. For the object, set RDF type ‣ Value Bnode, and the value of the Trcid column as Source. In the preview, we see that a blank node is created with the source value of the Trcid column.

We will also add a nested triple to it the way we did above: Source ‣ Constant with prefix amsterdam and value city. For the object, set RDF type ‣ Literal, and the value of the City column as Source.

Hint

The source for a Bnode is needed because it allows distinguishing between these two types of nodes. Also, if a source is missing (i.e., the column does not have a value for that identifier), a Bnode will not be created.

This concludes our example of several basic parameters that are commonly used when RDFizing structured data.

* The last two examples with Bnodes are not included in the image for better visibility.

@base <http://example/base/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix amsterdam: <https://data.amsterdam.nl/resource/> .
@prefix sf: <http://www.opengis.net/ont/sf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
a schema:Restaurant;
schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;
schema:description "Het Smits Koffiehuis ontleent haar ontstaan aan de stoomtram die de verbinding onderhield met Amsterdam naar het noorden van de provincie en is in 1919 gebouwd. Nu is er een restaurant en een koffiebar. Ook is hier een informatiekantoor van Amsterdam Marketing gehuisvest.";
schema:latitude "52.3775440"^^xsd:float;
amsterdam:zipcode "1012 AB";
schema:image <https%3A//media.iamsterdam.com/ndtrc/Images/20101122/ec8faec5-5cd5-43d6-b0fa-eb0dab65e278.jpg>;
geo:hasGeometry <https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0>;
amsterdam:uniquelocation _:node1f5diav34x133457;
amsterdam:valuelocation _:669d7d82-8962-4e88-b2e1-7b8706633aa0 .

<https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0>
a sf:Point;
geo:asWKT "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT (4.9003230 52.3775440)"^^geo:wktLiteral .

_:669d7d82-8962-4e88-b2e1-7b8706633aa0 amsterdam:city "AMSTERDAM" .


You can download the JSON file of the mapping we just created, and import it into the RDF mapper for a closer look.

### RDFize data using SPARQL¶

If you are a more proficient SPARQL user and want to configure your data in a different way, OntoRefine provides that option as well. The SPARQL button will open a new SPARQL Query & Update endpoint where a CONSTRUCT query based on the newly configured RDF model is generated.

You can also download the mapping SPARQL query results in various formats, including JSON, Turtle, and TriG.

## OntoRefine CLI¶

The main OntoRefine functionalities can also be used via the OntoRefine command line tool (CLI), which calls the OntoRefine REST API to execute operations without interacting with the UI. This is quite useful for automating steps and processes when cleaning up, transforming, enriching, and storing various datasets, as well as exposing search operations over them.

The tool is located under bin/ontorefine-cli in the GraphDB distribution.

To see details about each of the available commands described below, execute:

ontorefine-cli help [COMMAND]


For the following examples, we will be using the same tabular dataset with Netherlands restaurants.

### Create a project¶

To create an OntoRefine project with the Netherlands_restaurants.csv dataset, execute the following command:

bin/ontorefine-cli create "Netherlands_restaurants.csv" --name nl_restaurants -u http://localhost:7200


If you open the GraphDB Workbench and go to Import ‣ Tabular (OntoRefine) ‣ Open Project, you will see the newly created project.

You can create a project with datasets in other formats as well, e.g., TSV, XLS, etc.

#### Create a project with default name¶

If you want to create a project with the same name as that of the dataset file, execute:

bin/ontorefine-cli create "Netherlands_restaurants.csv" -u http://localhost:7200


### Extract transformations¶

Let’s see how some changes that we have made in the OntoRefine project can be extracted via the CLI.

For example, since all columns in OntoRefine are imported as text, we will transform the cell content of the Latitude and Longitude columns from text to number:

1. Go to Import ‣ Tabular (OntoRefine) ‣ Open Project and open the nl_restaurants project that we just created.

2. Since the decimal separator in the data is a comma (,), we need to change it to a period (.):

• In the Latitude column, go to Edit cells ‣ Transform.

• In the Expression field in the dialog that opens, write the GREL expression value.replace(",","."). Click OK.

3. In the same column, select Edit cells ‣ Common transforms ‣ To number. Observe how the cell content has changed.

4. Repeat the same steps for the Longitude column.

5. In the terminal, execute:

bin/ontorefine-cli extract <project-id> -u http://localhost:7200


The tool will print out a JSON format of the transformation in the console:

[
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "Latitude",
    "expression": "grel:value.replace(\",\",\".\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column Latitude using expression grel:value.replace(\",\",\".\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "Latitude",
    "expression": "value.toNumber()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column Latitude using expression value.toNumber()"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "Longitude",
    "expression": "grel:value.replace(\",\",\".\")",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column Longitude using expression grel:value.replace(\",\",\".\")"
  },
  {
    "op": "core/text-transform",
    "engineConfig": {
      "facets": [],
      "mode": "row-based"
    },
    "columnName": "Longitude",
    "expression": "value.toNumber()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10,
    "description": "Text transform on cells in column Longitude using expression value.toNumber()"
  }
]
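As a sketch of what these operations do to the data, here is a minimal Python interpreter for just the two expressions used in this example (a hypothetical helper for illustration, not part of the CLI; OntoRefine itself evaluates full GREL):

```python
def apply_text_transforms(rows, operations):
    # Apply each core/text-transform operation to the named column of
    # every row, supporting only the two expressions used above.
    for op in operations:
        col, expr = op["columnName"], op["expression"]
        for row in rows:
            if expr == 'grel:value.replace(",",".")':
                row[col] = row[col].replace(",", ".")
            elif expr == "value.toNumber()":
                row[col] = float(row[col])
    return rows

rows = [{"Latitude": "52,3775440", "Longitude": "4,9003230"}]
ops = [
    {"columnName": "Latitude", "expression": 'grel:value.replace(",",".")'},
    {"columnName": "Latitude", "expression": "value.toNumber()"},
    {"columnName": "Longitude", "expression": 'grel:value.replace(",",".")'},
    {"columnName": "Longitude", "expression": "value.toNumber()"},
]
apply_text_transforms(rows, ops)
print(rows[0])  # {'Latitude': 52.377544, 'Longitude': 4.900323}
```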


### Apply transformations¶

We can also apply a transformation like the one above through the CLI:

1. Save the transformation of the Latitude and Longitude columns that we just made as a JSON file - e.g., transformation.json.

2. Like above, open the nl_restaurants project.

3. To revert it to its initial state (i.e., without any transformations applied yet), in the panel on the left, go to the Undo/Redo tab and click on the first action that we performed - 0. Create project.

4. In the terminal, execute:

bin/ontorefine-cli apply "transformation.json" <project-id> -u http://localhost:7200

5. Observe how the transformation has been applied to the content of both columns.

### Export RDF mapping¶

Through the CLI, we can also export an RDF mapping made in the Workbench mapping UI.

1. Open the nl_restaurants project and go to RDF Mapping ‣ Edit RDF Mapping.

2. Create a simple example mapping:

• Subject: Trcid

• Predicate: a

• Object: Restaurant

3. Save it.

4. In the terminal, execute:

bin/ontorefine-cli rdf <project-id> -u http://localhost:7200


The tool will print out the RDF mapping in the console.

### Data reconciliation¶

The CLI can also be used to reconcile project data against other databases.

#### Reconcile against Wikidata¶

The OntoRefine UI has a built-in capability for reconciling data against the free Wikidata knowledge graph. Let’s see how we can use it with the CLI.

In our project, we will reconcile the Title column. To do so, execute:

bin/ontorefine-cli reconcile <project-id> Title https://wikidata.reconci.link/en/api -u http://localhost:7200


Note

The operation may take a while.

Once the process is complete, if you go to the project in the OntoRefine UI and refresh the page, you will see that the Title column has been reconciled against Wikidata.

The command can also be run asynchronously:

bin/ontorefine-cli reconcile -a <project-id> Title https://wikidata.reconci.link/en/api -u http://localhost:7200


#### Reconcile against other registered services¶

We can also register a different reconciliation service and reconcile the data against it. Here, we will use the Getty Vocabularies Reconciliation Service.

bin/ontorefine-cli register-service -u http://localhost:7200 https://services.getty.edu/vocab/reconcile/


Note

Since this is a global command that will affect all OntoRefine projects in your database, it does not contain a <project-id>.

Refresh the OntoRefine project in the UI. From the name of any column, go to Reconcile ‣ Start reconciling. You will see that the newly added service is now in the list.

To reconcile the Title column of nl_restaurants against it, execute:

bin/ontorefine-cli reconcile <project-id> Title https://services.getty.edu/vocab/reconcile/ -u http://localhost:7200


Once the process is complete, if you go to the project in the OntoRefine UI and refresh the page, you will see that the Title column has been reconciled against the Getty service.

### Export a project¶

Via the CLI, you can export projects in the following formats:

• JSON

bin/ontorefine-cli export <project-id> -u http://localhost:7200 json

• CSV

bin/ontorefine-cli export <project-id> -u http://localhost:7200 csv


Where <project-id> is the ID of the project we created as returned by the CLI.

### Delete a project¶

To delete a project via the CLI, execute:

bin/ontorefine-cli delete <project-id> -u http://localhost:7200


If you go to the OntoRefine UI and refresh it, you will see that the project has been deleted.

## Benchmarks¶

The table below illustrates the expected scalability of OntoRefine during data load, operations, and export. The tests were conducted on a 6-core/12-thread system with an NVMe drive and 2, 4, 8, or 16 gigabytes of RAM.

For the purpose of the test, we have used a tool that:

1. Generates a CSV file with 4 columns.

2. Creates a project and uploads the file.

3. Performs some operations, such as creating and removing a column (see them in detail here).

4. Exports the project in CSV format.

| Number of lines | 2GB load (s) | 2GB operations (s) | 2GB export (s) | 4GB load (s) | 4GB operations (s) | 4GB export (s) | 8GB load (s) | 8GB operations (s) | 8GB export (s) | 16GB load (s) | 16GB operations (s) | 16GB export (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 500,000 | 0.80 | 3.98 | 0.38 | 0.81 | 4.31 | 0.41 | 0.76 | 4.37 | 0.41 | 0.70 | 4.00 | 0.38 |
| 1,000,000 | 1.72 | 8.27 | 0.82 | 1.72 | 8.48 | 0.80 | 1.51 | 8.48 | 0.79 | 1.38 | 8.20 | 0.75 |
| 1,500,000 | 2.39 | 21.97 | 1.34 | 2.2 | 24.10 | 1.24 | 2.15 | 12.53 | 1.18 | 2.14 | 11.86 | 1.16 |
| 2,000,000 | 3.15 | 18.13 | 1.74 | 3.08 | 33.72 | 1.63 | 2.93 | 16.17 | 1.66 | 2.79 | 15.59 | 1.55 |

The next table shows the expected scalability of the OntoRefine mapping editor during the generation and download of an RDF file with the mapping results, as well as during the execution of the generated SPARQL CONSTRUCT query. The applied mapping includes the following operations:

• Adding type to value from a column;

• Creating a new predicate with values from columns as objects;

• The same as the above two, but using GREL and prefixes;

• Reusing IRI cell to add triples (children);

• Using literal transformations on many columns;

• Using prefixes from the selected repository, adding new ones, and using them in OntoRefine.

The tests have also been conducted on a 6-core/12-thread system with an NVMe drive and 20 gigabytes of RAM (Xmx20g). The dataset contains 56 columns.

| Number of rows | RDF: execution time (s) | RDF: execution time (min) | SPARQL CONSTRUCT: execution time (s) | SPARQL CONSTRUCT: execution time (min) |
|---|---|---|---|---|
| 100,000 | 11.26 | 0.19 | 16.33 | 0.27 |
| 200,000 | 21.58 | 0.36 | 28.49 | 0.47 |
| 500,000 | 53.70 | 0.89 | 70.15 | 1.17 |
| 1,000,000 | 110.33 | 1.84 | 168.70 | 2.81 |
| 2,000,000 | 223.13 | 3.72 | 331.70 | 5.53 |