Loading data using OntoRefine

OntoRefine – overview and features

GraphDB OntoRefine is an upgraded version of the open source OpenRefine data transformation tool. It allows the quick mapping of any structured data to an RDF schema stored locally in GraphDB. The visual interface is optimized to guide you in choosing the right predicates and types, defining the datatype to RDF mappings, and implementing complex transformations using GREL, the Google Refine Expression Language.
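
For example, both of the following GREL expressions are used later in this guide – the first replaces a decimal comma with a decimal point, and the second reads the value of another column in the same row:

value.replace(',', '.')
cells["Longitude"].value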

OntoRefine is integrated in the GraphDB Workbench and supports the TSV, CSV, *SV, XLS, XLSX, JSON, XML, and RDF/XML formats, as well as Google Sheets. It enables you to:

  • Upload your data file(s) and create a project.

  • Create an RDF model of the cleaned data.

  • Transform your data using SPIN functions.

  • Further modify your RDFized data in the GraphDB SPARQL endpoint.

Example data

For the examples in this guide, we will be working with a tabular dataset of Amsterdam restaurants. Its columns include Trcid (a unique identifier), Title, TitleEN, Shortdescription, Latitude, Longitude, Zipcode, Media, Adres, and City.

Upload data in OntoRefine

Open OntoRefine in the Workbench

To transform your data into RDF, you need a working GraphDB database.

  1. Start GraphDB in Workbench mode.

  2. Open http://localhost:7200/ in a browser.

  3. Create a repository.

  4. Go to Import -> Tabular (OntoRefine).

All data files in OntoRefine are organized as projects. One project can have more than one data file.

The Create Project action area consists of three tabs corresponding to the source of the data. You can upload a file from your computer, specify the URL of a publicly accessible data file, or paste data from the clipboard.

Create a project

  1. Click Create Project -> Get data from.

  2. Select one or more files to upload:

    • from your computer

      _images/ontoRefine-upload-file.png
    • from web addresses (URLs)

      _images/ontoRefine-upload-file-url.png
    • from clipboard

      _images/ontoRefine-upload-file-clipboard.png
  3. Click Next.

  4. (Optional) Change the table configurations and update the preview.

When the file is first opened, OntoRefine tries to recognize the encoding of the text file and all delimiters.

    _images/ontoRefine-table-configurations.png
  5. Click Create Project.

Import a project

To import an already existing OntoRefine project:

  1. Go to Import Project.

  2. Select a file (.tar or .tar.gz).

  3. Import it.

    _images/ontoRefine-import-project.png

Open a project

Once the project is created:

  1. Go to Open Project.

  2. Click the one you want to work on.

  3. (Optional) You can also delete your project if you want to.

_images/ontoRefine-rename-delete.png

The result of each of these actions is a table similar to an Excel or Google sheet:

_images/ontoRefine-view-project.png

RDFize tabular data

Mapping interface

This walk-through will show you how to map your tabular data against an existing ontology, in this case the schema-org.rdf ontology. Upload it into the repository as shown here. After that, go to Setup -> Autocomplete and enable the autocomplete index. Return to the OntoRefine project.

The RDF Mapping button will take you to the mapping editor, which is an extension of the OntoRefine functionality. Here, you can:

  • configure and preview the mapping model of your data

  • save your mapping

  • download a JSON model of your mapping

  • upload a JSON model of a mapping

  • convert the result of the mapping to RDF data that is downloaded as a .ttl file

  • generate a SPARQL query of your mapping and open it in a GraphDB SPARQL endpoint

  • create a new mapping

You can close the mapping editor with the X button on the top right.

The headers of all columns in the tabular data that we imported are displayed as boxes that you can drag and drop into a mapping cell to configure them.

Each row in the table represents an RDF triple constructed from the tabular data. If two (or more) triples have the same subject, they are displayed as a triple with one subject and two (or more) predicates. Similarly, if two (or more) triples have the same subject and predicate but different objects, they are displayed as a triple with one subject and predicate and multiple different objects.

_images/ontoRefine-mapping-home-screen.png

Important

To store your mapping model in Git or to automate the import of structured data into GraphDB, you can download the JSON model and generate the SPARQL query of your mapping. The mapping API can then reference these files.

Value mapping

The value mapping describes how a single tabular cell is converted to an RDF value. Each such mapping requires a value source and a value type, and may have an optional value transformation.

Value type

The value type defines the type of RDF value that will be produced by the value mapping. The possible types are:

Resource (abstract)

An RDF resource. This is an abstract type that specifies the common features of IRIs and blank nodes. A resource value type may have type mappings and property mappings on its own.

IRI

An RDF IRI. The transformed value is used as the IRI, with illegal characters escaped automatically. This is a subtype of the abstract Resource value type.

Blank node based on value (value Bnode)

An RDF blank node. The transformed value is used to calculate a reasonable blank node identifier such that identical transformed values produce the same blank node. This is a subtype of the abstract Resource value type.

Unique blank node (unique Bnode)

An RDF blank node. The transformed value is ignored and a unique blank node is created every time. This is a subtype of the abstract Resource value type.

Any literal (abstract)

Any kind of RDF literal (plain, language, or datatype). This is an abstract type that unifies all literal value types.

Literal

An RDF plain literal. The transformed value is the literal’s label.

Literal with a language

An RDF literal with a language. The transformed value is the literal’s label. The language is a simple literal value mapping, i.e., identical to a value mapping with type literal.

Literal with a datatype

An RDF literal with a datatype. The transformed value is the literal’s label. The datatype is a simple IRI value mapping.
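
In Turtle, the three kinds of literals look as follows (the values are adapted from this guide's example output):

amsterdam:zipcode "1012 AB" ;                                # plain literal
schema:title "Smits Noord-Zuid Hollandsch Koffiehuis"@en ;   # literal with a language
schema:latitude "52.3775440"^^xsd:float ;                    # literal with a datatype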

Value source

Each value mapping has an associated value source: the row or record index, a column identified by name, or a constant value.

Value transformation

Each value mapping may have an optional transformation applied to the data received from the value source before the value is created. Each transformation has an associated language and expression. The expression is evaluated according to the rules of the language in the context of the value source. The languages are:

Language “prefix”

The expression is a namespace prefix applied to the value received from the source, if that value is not an absolute IRI already.

Language “grel”

The expression is a GREL expression.
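
As a quick illustration with values used later in this guide: a “prefix” transformation with expression xsd applied to the constant float produces the IRI xsd:float, while a “grel” transformation such as:

value.replace(',', '.')

turns a source value like 52,3775440 into 52.3775440.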

Prefixes

The mapping tree contains a set of prefixes that are used in the cell configuration. They are defined in the prefix area and can be of three types:

  • default prefixes from commonly used RDF schemas, such as foaf, geo, rdf, rdfs, skos, xsd. You can select entities from these schemas without importing them in your repository.

  • prefixes that you select from the imported ontology,

  • and prefixes that you create yourself.

Add the following prefixes in the Prefix field:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX amsterdam: <https://data.amsterdam.nl/resource/>

Triple configuration

Tip

You may find it convenient to preview the end RDF results while still being able to configure them. To do so, choose Both from the options on the top left.

Restaurant ID and type

For the RDF subject value, we will take the value of the Trcid column. You do not need to know all the headers of your data, as typing @ in the field will display a drop-down list of all available column headers.

For the predicate, set rdf:type, written as a in Turtle.

As the object value, enter schema:Restaurant, as defined in the schema-org ontology.

Now let’s edit the subject value. The edit icon will open the Subject mapping configuration window. We can see that the Source for this value is the value of the Trcid column. In the Prefix field, type amsterdam:restaurant/, which will extend the predefined amsterdam prefix.

_images/ontoRefine-triple1.png
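
With this configuration, a Trcid value such as 669d7d82-8962-4e88-b2e1-7b8706633aa0 is expanded into a full subject IRI, producing the first triple of the output shown later in this guide:

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0> a schema:Restaurant .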

Hint

The preview of an IRI is a clickable link that points to the corresponding resource in the repository, e.g., the restaurant ID below.

_images/ontoRefine-clickable-iri-link.png

To view the RDF results at any time, click the RDF button. This downloads a result-triples.ttl file, where we can see the @prefix amsterdam: <https://data.amsterdam.nl/resource/> namespace that we defined, extended with restaurant, as well as the IDs of the restaurants that come from the values in the Trcid column.

Note

If you try to navigate away from the mapping screen, close it, or open a new mapping, a warning message will inform you that all mappings will be lost if you proceed with that action.

As our goal here is to map the tabular data against the schema-org ontology, we will continue adding predicates and objects. Since the mapping table is modeled after the Turtle syntax, we can attach more than one predicate to one subject and more than one object to one predicate without having to repeat the same triples in every row.

Title

In the next row, let’s add the schema:title predicate by autocompleting it from the imported ontology: as Source, select Constant, then enter schema for Prefix and title for Constant. The object’s RDF Type will be Literal, and its value will come from the Title column of the tabular data. As explained above, type @ and select Title from the list that appears.

We will also add another object, which will be the title in English, i.e., TitleEN. When a type is Literal, we can further specify its Literal attributes as Language Literal or Datatype Literal. Here, we choose the secondary type Language Literal, which configures a language tag for that Literal.

_images/ontoRefine-predicate-title.png
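
With both objects configured, this row of the mapping produces two schema:title values per restaurant – a plain literal and a language-tagged one – as in the RDF output shown later:

schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;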

Hint

Note that the names of the columns that are already being used in the mapping now appear in grey.

Description

In the next row, let’s add another predicate from the ontology – schema:description. As its object, we will use the value from the Shortdescription column.

If you download the .ttl file with the RDF results and open it, you will see that the data is now mapped the way we intended, e.g.:

@base <http://example/base/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix amsterdam: <https://data.amsterdam.nl/resource/> .
@prefix sf: <http://www.opengis.net/ont/sf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
  a schema:Restaurant;
  schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;
  schema:description "Het Smits Koffiehuis ontleent haar ontstaan aan de stoomtram die de verbinding onderhield met Amsterdam naar het noorden van de provincie en is in 1919 gebouwd. Nu is er een restaurant en een koffiebar. Ook is hier een informatiekantoor van Amsterdam Marketing gehuisvest.";

One block corresponds to one row in the OntoRefine mapping table.

Latitude

Another predicate that we can use from the ontology is schema:latitude. Its object will be the value of the Latitude column, but we want to edit it as follows: RDF Type -> Literal, Source -> GREL with the following expression, which replaces the comma in the latitude value with a decimal point: value.replace(',','.'). Then set Literal attributes -> Datatype, and Datatype source -> Constant with prefix xsd and constant float.

The same can be done for Longitude.
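
Assuming a Latitude source value such as 52,3775440, this configuration yields a triple of the form:

schema:latitude "52.3775440"^^xsd:float;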

Note

When writing GREL expressions, keep in mind that column names are case-sensitive.

Zip code

Next, we will map the zip code of the data. Since there is no suitable predicate for it in the schema-org ontology, we will create one – amsterdam:zipcode. The object value will come from the Zipcode column.

Image

We can also map the images in the data. As the predicate, use the ontology prefix schema:image; the corresponding column in the tabular data is Media. Since the value is an image location, we will edit it to be an IRI: Type -> IRI.
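
In the RDF output, these two mappings appear as, e.g.:

amsterdam:zipcode "1012 AB";
schema:image <https%3A//media.iamsterdam.com/ndtrc/Images/20101122/ec8faec5-5cd5-43d6-b0fa-eb0dab65e278.jpg>;

Note that the automatic escaping of illegal IRI characters described in the Value type section has been applied to the image URL.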

Geographical point (as a nested triple)

Another thing we can map is a geographical point for the restaurant. To do this, we will define a GeoSPARQL point constructed from the latitude and longitude. For the predicate, select geo:hasGeometry by autocompleting it from the pre-imported GeoSPARQL schema mentioned earlier. For the object, we will need a bridging IRI that will be the GeoSPARQL point. We will take it from the Trcid column: Type -> IRI, Source -> Column with value Trcid. In the Prefix field, enter amsterdam:geometry/, which will extend the predefined amsterdam prefix.

When something is an IRI, we can continue the mapping by attaching more predicates and objects to it, so that it becomes the subject of the next triples. We call these nested triples. You can add one by clicking the Add nested triple (->) icon outside the right border of the object cell. This opens new predicate and object cells below, framed in green together with the ones on top, thus indicating the nesting.

For this IRI that is now the subject of the nested triple, add a as the predicate in the next row, and sf:Point as the object (Source -> Constant, and Point as its value). Then click -> again to add another nested triple, which will be the point that we will construct. As its predicate, enter geo:asWKT (Source -> Constant, then prefix geo and constant asWKT). For the object, we will use the row_index:

Hint

Besides using the values from the columns, we can also use the row index. For example, if we did not have a column for the ID in our data (Trcid), we could use the row index to construct an IRI.

Let’s edit this point. In the object field, set the RDF type to Literal and choose GREL as the source. Type the following GREL expression: "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT (" + cells["Longitude"].value.replace(',', '.') + " " + cells["Latitude"].value.replace(',', '.') + ")". It specifies how the value for Point will be constructed. Note that we combine the values of two columns of our tabular data – Latitude and Longitude – to construct a single GeoSPARQL point from them. While typing a GREL expression, a preview of the results is shown to guide you. Click the info icon to reach the GREL documentation.

Further, for Literal attributes we will set Datatype to configure the Datatype for this Literal. In the Constant field, add prefix geo and then constant wktLiteral to add the geo:wktLiteral Datatype to our Point.

_images/ontoRefine-geosparql-point.png
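
Putting it all together, the nested mapping links each restaurant to a separate geometry resource, which appears in the RDF output as:

geo:hasGeometry <https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0>;

<https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0>
  a sf:Point;
  geo:asWKT "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT (4.9003230 52.3775440)"^^geo:wktLiteral .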

We can also use blank nodes in the mapping.

Unique Bnode example:

Add the following predicate: Source -> Constant with prefix amsterdam and value uniquelocation. For the object, set RDF type -> Unique Bnode, and the value of the Trcid column as Source. In the preview, we see that a unique blank node is created.

Now, let’s add a nested triple to it. Clicking the right-hand arrow, add a predicate as follows: Source -> Constant with prefix amsterdam and value address. For the object, set RDF type -> Literal, and the value of the Adres column as Source.

_images/ontoRefine-unique-bnode.png

Value Bnode example:

Let’s add another predicate: Source -> Constant with prefix amsterdam and value valuelocation. For the object, set RDF type -> Value Bnode, and the value of the Trcid column as Source. In the preview, we see that a blank node is created with the source value of the Trcid column.

We will also add a nested triple to it the way we did above: Source -> Constant with prefix amsterdam and value city. For the object, set RDF type -> Literal, and the value of the City column as Source.

_images/ontoRefine-value-bnode.png
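
In the RDF output, the two kinds of blank nodes are easy to tell apart – the unique Bnode gets a generated identifier, while the value Bnode's identifier is derived from the Trcid value:

amsterdam:uniquelocation _:node1em9j7qmhx179149;
amsterdam:valuelocation _:669d7d82-8962-4e88-b2e1-7b8706633aa0 .

_:node1em9j7qmhx179149 amsterdam:address "Stationsplein 10" .
_:669d7d82-8962-4e88-b2e1-7b8706633aa0 amsterdam:city "AMSTERDAM" .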

Hint

The source for a Bnode is needed, as it allows distinguishing between these two types of nodes. Also, if the source is missing (i.e., the column does not have a value for that identifier), no Bnode will be created.

This concludes our example of several basic parameters that are commonly used when RDFizing structured data.

_images/ontoRefine-final-mapping.png

* For better visibility, the last two Bnode examples are not included in the image.

The RDF end result should look like this:

@base <http://example/base/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix amsterdam: <https://data.amsterdam.nl/resource/> .
@prefix sf: <http://www.opengis.net/ont/sf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
  a schema:Restaurant;
  schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;
  schema:description "Het Smits Koffiehuis ontleent haar ontstaan aan de stoomtram die de verbinding onderhield met Amsterdam naar het noorden van de provincie en is in 1919 gebouwd. Nu is er een restaurant en een koffiebar. Ook is hier een informatiekantoor van Amsterdam Marketing gehuisvest.";
  schema:latitude "0"^^xsd:float;
  amsterdam:zipcode "1012 AB";
  schema:image <https%3A//media.iamsterdam.com/ndtrc/Images/20101122/ec8faec5-5cd5-43d6-b0fa-eb0dab65e278.jpg>;
  geo:hasGeometry <https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0>;
  amsterdam:uniquelocation _:node1em9j7qmhx179149;
  amsterdam:valuelocation _:669d7d82-8962-4e88-b2e1-7b8706633aa0 .

<https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0>
  a sf:Point;
  geo:asWKT "<http://www.opengis.net/def/crs/OGC/1.3/CRS84> POINT (4.9003230 52.3775440)"^^geo:wktLiteral .

_:node1em9j7qmhx179149 amsterdam:address "Stationsplein 10" .

_:669d7d82-8962-4e88-b2e1-7b8706633aa0 amsterdam:city "AMSTERDAM" .

You can download the JSON file of the mapping we just created, and import it into the RDF mapper for a closer look.

RDFize data using SPARQL

If you are a more proficient SPARQL user and want to shape your data differently, OntoRefine provides that option as well. The SPARQL button opens a new SPARQL Query & Update endpoint where a CONSTRUCT query based on the newly configured RDF model is generated.
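
The generated query is specific to your project: its WHERE clause reads the rows of the OntoRefine project, while its CONSTRUCT template mirrors the mapping you configured. As an illustrative sketch only – not the verbatim generated query – the template for the mapping built in this guide has roughly the following shape, with a single row simulated via VALUES so the query is self-contained:

PREFIX schema: <http://schema.org/>

CONSTRUCT {
    ?restaurant a schema:Restaurant ;
        schema:title ?title .
} WHERE {
    # In the real generated query, these variables are bound from the
    # OntoRefine project's rows; here one row is simulated for illustration.
    VALUES (?restaurant ?title) {
        (<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
         "Smits Noord-Zuid Hollandsch Koffiehuis")
    }
}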

You can also download the mapping SPARQL query results in various formats, including JSON, Turtle, and TriG.

_images/ontoRefine-sparql-view.png

Benchmarks

The table below illustrates the expected scalability of OntoRefine during data load, operations, and export. The tests were conducted on a 6-core/12-thread system with an NVMe drive and 2, 4, 8, and 16 gigabytes of RAM, respectively.

For the purpose of the test, we have used a tool that:

  1. Generates a CSV file with 4 columns.

  2. Creates a project and uploads the file.

  3. Performs some operations, such as creating and removing a column (see them in detail here).

  4. Exports the project in CSV format.

| Number of lines | 2GB RAM (load / operations / export, s) | 4GB RAM (load / operations / export, s) | 8GB RAM (load / operations / export, s) | 16GB RAM (load / operations / export, s) |
|-----------------|------------------------------------------|------------------------------------------|------------------------------------------|-------------------------------------------|
| 500,000         | 0.80 / 3.98 / 0.38                       | 0.81 / 4.31 / 0.41                       | 0.76 / 4.37 / 0.41                       | 0.70 / 4.00 / 0.38                        |
| 1,000,000       | 1.72 / 8.27 / 0.82                       | 1.72 / 8.48 / 0.80                       | 1.51 / 8.48 / 0.79                       | 1.38 / 8.20 / 0.75                        |
| 1,500,000       | 2.39 / 21.97 / 1.34                      | 2.20 / 24.10 / 1.24                      | 2.15 / 12.53 / 1.18                      | 2.14 / 11.86 / 1.16                       |
| 2,000,000       | 3.15 / 18.13 / 1.74                      | 3.08 / 33.72 / 1.63                      | 2.93 / 16.17 / 1.66                      | 2.79 / 15.59 / 1.55                       |

The next table shows the expected scalability of the OntoRefine mapping editor during the generation and download of an RDF file with the mapping results, as well as during the execution of the generated SPARQL CONSTRUCT query.

About 20 transformations have been made, including:

  • Adding a type to a value from a column;

  • Creating a new predicate with values from columns as objects;

  • The same as the above two, but using GREL and prefixes;

  • Reusing an IRI cell to add triples (children);

  • Using literal transformations on many columns;

  • Using prefixes from the selected repository, adding new ones, and using them in OntoRefine.

These tests were also conducted on a 6-core/12-thread system with an NVMe drive, this time with 20 gigabytes of RAM (-Xmx20g). The dataset contains 56 columns.

| Number of rows | RDF: execution time (s) | RDF: execution time (min) | SPARQL CONSTRUCT: execution time (s) | SPARQL CONSTRUCT: execution time (min) |
|----------------|--------------------------|---------------------------|---------------------------------------|----------------------------------------|
| 100,000        | 11.26                    | 0.19                      | 16.33                                 | 0.27                                   |
| 200,000        | 21.58                    | 0.36                      | 28.49                                 | 0.47                                   |
| 500,000        | 53.70                    | 0.89                      | 70.15                                 | 1.17                                   |
| 1,000,000      | 110.33                   | 1.84                      | 168.70                                | 2.81                                   |
| 2,000,000      | 223.13                   | 3.72                      | 331.70                                | 5.53                                   |