Loading data using OntoRefine

OntoRefine – overview and features

GraphDB OntoRefine is an upgraded version of the open source OpenRefine data transformation tool. It allows you to quickly map any structured data to an RDF schema stored locally in GraphDB. The visual interface guides you in choosing the right predicates and types, defining datatype-to-RDF mappings, and implementing complex transformations with the Google Refine Expression Language (GREL).

OntoRefine is integrated into the GraphDB Workbench and supports the TSV, CSV, *SV, XLS, XLSX, JSON, XML, RDF as XML, and Google Sheets formats. It enables you to:

  • Upload your data file(s) and create a project.

  • Create an RDF model of the cleaned data.

  • Transform your data using SPIN functions.

  • Further modify your RDFized data in the GraphDB SPARQL endpoint.

Example data

For the examples in this guide, we will be working with a tabular dataset of Amsterdam restaurants. Its columns – among them Trcid, Title, TitleEN, Shortdescription, Latitude, Longitude, Zipcode, and Media – are referenced throughout the walkthrough below.

Upload data in OntoRefine

Open OntoRefine in the Workbench

To transform your data into RDF, you need a working GraphDB database.

  1. Start GraphDB in Workbench mode.

  2. Open http://localhost:7200/ in a browser.

  3. Create a repository.

  4. Go to Import -> Tabular (OntoRefine).

All data files in OntoRefine are organized as projects. One project can have more than one data file.

The Create Project action area consists of three tabs corresponding to the source of the data. You can upload a file from your computer, specify the URL of publicly accessible data, or paste data from the clipboard.

Create a project

  1. Click Create Project -> Get data from.

  2. Select one or more files to upload:

    • from your computer

      _images/ontoRefine-upload-file.png
    • from web addresses (URLs)

      _images/ontoRefine-upload-file-url.png
    • from clipboard

      _images/ontoRefine-upload-file-clipboard.png
  3. Click Next.

  4. (Optional) Change the table configurations and update the preview.

    When a file is first opened, OntoRefine tries to detect its text encoding and delimiters.

    _images/ontoRefine-table-configurations.png
  5. Click Create Project.

Import a project

To import an already existing OntoRefine project:

  1. Go to Import Project.

  2. Select a file (.tar or .tar.gz).

  3. Import it.

    _images/ontoRefine-import-project.png

Open a project

Once the project is created:

  1. Go to Open Project.

  2. Click the project you want to work on.

  3. (Optional) You can also rename or delete a project from this screen.

_images/ontoRefine-rename-delete.png

The result of each of these actions is a table similar to an Excel or Google Sheets spreadsheet:

_images/ontoRefine-view-project.png

RDFize tabular data

Mapping interface

This walkthrough will show you how to map your tabular data against an existing ontology, in this case the schema-org.rdf ontology. First, upload it into the repository. Then go to Setup -> Autocomplete, enable the autocomplete index, and return to the OntoRefine project.

The RDF Mapping button will take you to the mapping editor, which is an extension of the OntoRefine functionality. Here, you can:

  • configure and preview the mapping model of your data

  • save your mapping

  • download a JSON model of your mapping

  • upload a JSON model of a mapping

  • convert the result of the mapping to RDF data that is downloaded as a .ttl file

  • generate a SPARQL query of your mapping and open it in a GraphDB SPARQL endpoint

  • create a new mapping

You can close the mapping editor with the X button on the top right.

The headers of all columns in the tabular data that we imported are displayed as boxes that you can drag and drop into a mapping cell to configure them.

Each row in the table defines RDF triples constructed from the tabular data. A mapping can have one or more subjects as roots, with each subject having one or more predicates as child elements, and each predicate having one or more objects as child elements.

_images/ontoRefine-mapping-home-screen.png

Important

To keep your mapping model in Git or to automate the import of structured data into GraphDB, you can download the JSON model and generate the SPARQL query of your mapping, whose results can be downloaded in several formats, including JSON, Turtle, and TriG. The mapping API can then reference these files.

Prefixes

The mapping tree contains a set of prefixes that are used in the cell configuration. They are defined in the prefix area and can be of three types:

  • default prefixes from commonly used RDF schemas, such as foaf, geo, rdf, rdfs, skos, and xsd. You can select entities from these schemas without importing them into your repository;

  • prefixes that you select from the imported ontology;

  • prefixes that you create yourself.

Add the following prefix in the Prefix field:

PREFIX dbo: <http://dbpedia.org/ontology/>

Triple configuration

Hint

You may find it convenient to preview the resulting RDF while still configuring the mapping. To do so, choose Both from the options on the top left.

Restaurant ID and type

For the RDF subject value, we will take the value of the Trcid column. You do not need to know all the headers of your data, as typing @ in the field will display a drop-down list of all available column headers.

For the predicate, set rdf:type, abbreviated as a.

As the object value, enter schema:Restaurant, as defined in the schema-org ontology.

Now let’s edit the subject value. The edit icon opens the Subject mapping configuration window, where we can see that the Source for this value is the Trcid column. In the Prefix field, type amsterdam:restaurant/, which will extend the predefined amsterdam prefix.

_images/ontoRefine-triple1.png
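For orientation, the same subject construction can be expressed in SPARQL. The snippet below is only a sketch: a VALUES clause with a sample Trcid value stands in for the project data, which the actual generated query binds from the Trcid column.

# Sketch: build the subject IRI by prepending the extended prefix to the ID.
SELECT ?restaurant WHERE {
  VALUES ?Trcid { "669d7d82-8962-4e88-b2e1-7b8706633aa0" }
  BIND(IRI(CONCAT("https://data.amsterdam.nl/resource/restaurant/", ?Trcid))
       AS ?restaurant)
}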

Hint

Note that the names of the columns that are already being used in the mapping now appear in grey.

To view the RDF results at any time, click the RDF button. This will download a result-triples.ttl file, where we can see the @prefix amsterdam: <https://data.amsterdam.nl/resource/> namespace that we defined, extended with restaurant/, as well as the IDs of the restaurants that come from the values in the Trcid column.

Note

If you try to navigate away from the mapping screen, close it, or open a new mapping, a warning message will inform you that proceeding with that action will discard the mapping.

As our goal here is to map the tabular data against the schema-org ontology, we will continue adding predicates and objects. Since the mapping table is modeled after the Turtle syntax, we can attach more than one predicate to one subject and more than one object to one predicate without having to repeat the same triples in every row.

Title

In the next row, let’s add the schema:title predicate by autocompleting it from the imported ontology: as Source, select Constant, then enter schema for Prefix and title for Constant. The object’s RDF Type will be Literal, and its value will come from the Title column of the tabular data. As explained above, type @ and select Title from the drop-down list.

We will also add another object for the title in English, i.e., TitleEN. When a type is Literal, we can further specify its Literal attributes as Language Literal or Datatype Literal. Here, we choose the secondary type Language Literal, which configures a language tag for that Literal.

_images/ontoRefine-predicate-title.png
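In SPARQL terms, such a language literal corresponds to the STRLANG function. A minimal sketch, with a sample value standing in for the TitleEN column:

SELECT ?titleEn WHERE {
  VALUES ?TitleEN { "Smits Noord-Zuid Hollandsch Koffiehuis" }
  # STRLANG attaches the language tag, yielding "..."@en
  BIND(STRLANG(?TitleEN, "en") AS ?titleEn)
}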

Description

In the next row, let’s add another predicate from the ontology – schema:description. As its object, we will use the value from the Shortdescription column.

If you download the .ttl file with the RDF results and open it, you will see that the data is now mapped the way we intended, e.g.:

@base <http://example/base/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix amsterdam: <https://data.amsterdam.nl/resource/> .
@prefix sf: <http://www.opengis.net/ont/sf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
  a schema:Restaurant;
  schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;
  schema:description "Het Smits Koffiehuis ontleent haar ontstaan aan de stoomtram die de verbinding onderhield met Amsterdam naar het noorden van de provincie en is in 1919 gebouwd. Nu is er een restaurant en een koffiebar. Ook is hier een informatiekantoor van Amsterdam Marketing gehuisvest.";

One block corresponds to one row in the OntoRefine mapping table.

Latitude

Another predicate that we can use from the ontology is schema:latitude. Its object will be the value of the Latitude column, edited as follows: Type is Datatype Literal; the Datatype Source is Constant with value float, and its Type transformation is Prefix with value xsd:, producing the xsd:float datatype. We also want to replace the comma in the latitude value with a decimal point, which is done with the Transformation GREL expression cells["Latitude"].value.replace(',','.').

The same can be done for Longitude.
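The equivalent of this edit in SPARQL combines REPLACE (for the decimal comma) with STRDT (for the datatype). A minimal, self-contained sketch, using the latitude value from the example data:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?lat WHERE {
  # Sample latitude with the decimal comma still in place.
  VALUES ?Latitude { "52,3775440" }
  # Replace the comma with a point, then attach the xsd:float datatype.
  BIND(STRDT(REPLACE(?Latitude, ",", "."), xsd:float) AS ?lat)
}

The same pattern applies to the Longitude column.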

Zip code

Next, we will map the zip code of the data. Since there is no suitable predicate for it in the schema-org ontology, we will create one – amsterdam:zipcode. The object value will come from the Zipcode column.

Image

We can also map the images in the data. As the predicate, use schema:image from the ontology; the corresponding column in the tabular data is Media. Since its value is an image location, we will edit it to be an IRI: Type -> IRI.

Geographical point (as a nested triple)

Another thing we can map is a geographical point for the restaurant. To do this, we will define a GeoSPARQL point constructed from the latitude and longitude. For the predicate, select geo:hasGeometry by autocompleting it from the pre-imported GeoSPARQL schema mentioned earlier. For the object, we will need a bridging IRI that will represent the GeoSPARQL point. We will take it from the Trcid column: Type -> IRI, Source -> Column with value Trcid. In the Prefix field, enter amsterdam:geometry/, which will extend the predefined amsterdam prefix and keep the geometry IRI distinct from the restaurant IRI.

When a mapping node is an IRI, we can continue the mapping by attaching more predicates and objects to it, so that it becomes the subject of the next triples. We call these nested triples. You can add one by clicking the Add nested triple (->) icon outside the right border of the object cell. This will open new predicate and object cells below, enclosed in a green frame together with the cells above, thus indicating the nesting.

For this IRI, which is now the subject of the nested triple, add in the next row a as the predicate, and sf:Point as the object (Source -> Constant, with Point as its value and sf as its prefix). Then click -> again to add another nested triple, which will be the point that we will construct. As its predicate, enter geo:asWKT (Source -> Constant, then prefix geo and constant asWKT). For the object, we will start from the row_index:

Hint

Besides using the values from the columns, we can also use the row index. For example, if our data did not have an ID column (Trcid), we could use the row index to construct an IRI.

Let’s edit this point. In the object field, set the RDF type to Literal and choose GREL as the source. Type the following GREL expression: "http://www.opengis.net/def/crs/OGC/1.3/CRS84 POINT (" + cells["Longitude"].value.replace(',', '.') + " " + cells["Latitude"].value.replace(',', '.') + ")". It specifies how the value of the point will be constructed. Note that here we combine the values of two columns from our tabular data – Latitude and Longitude – to construct a single GeoSPARQL point. While you are typing a GREL expression, a preview of the results is shown to guide you. Click the info icon to reach the GREL documentation.

Then, under Literal attributes, we set Datatype to configure the datatype for this Literal. In the Constant field, add prefix geo and constant wktLiteral, giving our point the geo:wktLiteral datatype.

_images/ontoRefine-geosparql-point.png
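Expressed in SPARQL, the same construction is a CONCAT over the two column values, typed as a geo:wktLiteral via STRDT. This is again only a sketch; the latitude sample comes from the example data, while the longitude sample value is invented for illustration:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>

SELECT ?wkt WHERE {
  VALUES (?Latitude ?Longitude) { ("52,3775440" "4,9000000") }
  # Build the WKT string and attach the geo:wktLiteral datatype.
  BIND(STRDT(CONCAT("http://www.opengis.net/def/crs/OGC/1.3/CRS84 POINT (",
                    REPLACE(?Longitude, ",", "."), " ",
                    REPLACE(?Latitude, ",", "."), ")"),
             geo:wktLiteral) AS ?wkt)
}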

This concludes our example of several basic parameters that can often be used when RDFizing structured data.

_images/ontoRefine-complete-mapping.png

The RDF end result should look like this:

@base <http://example/base/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix schema: <http://schema.org/> .
@prefix geo: <http://www.opengis.net/ont/geosparql#> .
@prefix amsterdam: <https://data.amsterdam.nl/resource/> .
@prefix sf: <http://www.opengis.net/ont/sf#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .

<https://data.amsterdam.nl/resource/restaurant/669d7d82-8962-4e88-b2e1-7b8706633aa0>
  a schema:Restaurant;
  schema:title "Smits Noord-Zuid Hollandsch Koffiehuis", "Smits Noord-Zuid Hollandsch Koffiehuis"@en;
  schema:description "Het Smits Koffiehuis ontleent haar ontstaan aan de stoomtram die de verbinding onderhield met Amsterdam naar het noorden van de provincie en is in 1919 gebouwd. Nu is er een restaurant en een koffiebar. Ook is hier een informatiekantoor van Amsterdam Marketing gehuisvest.";
  schema:latitude "52.3775440"^^xsd:float;
  amsterdam:zipcode "1012 AB";
  schema:image <https://media.iamsterdam.com/ndtrc/Images/20101122/ec8faec5-5cd5-43d6-b0fa-eb0dab65e278.jpg>;
  geo:hasGeometry <https://data.amsterdam.nl/resource/geometry/669d7d82-8962-4e88-b2e1-7b8706633aa0> .

RDFize data using SPARQL

If you are a more proficient SPARQL user and want to configure your data in a different way, OntoRefine provides that option as well. The SPARQL button will open a new SPARQL Query & Update endpoint containing a CONSTRUCT query generated from the newly configured RDF model.
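The exact query OntoRefine generates depends on the GraphDB version and on how the project rows are exposed as variable bindings, but its overall shape resembles the following self-contained sketch, in which a VALUES row with values from the example data stands in for the project:

PREFIX schema: <http://schema.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?restaurant a schema:Restaurant ;
              schema:title ?title ;
              schema:latitude ?lat .
} WHERE {
  # One variable per column; the real query binds these from the project.
  VALUES (?Trcid ?Title ?Latitude) {
    ("669d7d82-8962-4e88-b2e1-7b8706633aa0"
     "Smits Noord-Zuid Hollandsch Koffiehuis"
     "52,3775440")
  }
  BIND(IRI(CONCAT("https://data.amsterdam.nl/resource/restaurant/", ?Trcid))
       AS ?restaurant)
  BIND(STR(?Title) AS ?title)
  BIND(STRDT(REPLACE(?Latitude, ",", "."), xsd:float) AS ?lat)
}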

You can also download the mapping SPARQL query results in various formats, including JSON, Turtle, and TriG.

_images/ontoRefine-sparql-view.png

Benchmarks

The table below illustrates the expected scalability of OntoRefine during data load, operations, and export. The tests were conducted on a 6-core/12-thread system with an NVMe drive and 2, 4, 8, and 16 gigabytes of RAM, respectively.

For the purpose of the test, we have used a tool that:

  1. Generates a CSV file with 4 columns.

  2. Creates a project and uploads the file.

  3. Performs some operations, such as creating and removing a column.

  4. Exports the project in CSV format.

All times are in seconds; each cell shows load / operations / export.

Number of lines | 2GB RAM             | 4GB RAM             | 8GB RAM             | 16GB RAM
500,000         | 0.80 / 3.98 / 0.38  | 0.81 / 4.31 / 0.41  | 0.76 / 4.37 / 0.41  | 0.70 / 4.00 / 0.38
1,000,000       | 1.72 / 8.27 / 0.82  | 1.72 / 8.48 / 0.80  | 1.51 / 8.48 / 0.79  | 1.38 / 8.20 / 0.75
1,500,000       | 2.39 / 21.97 / 1.34 | 2.20 / 24.10 / 1.24 | 2.15 / 12.53 / 1.18 | 2.14 / 11.86 / 1.16
2,000,000       | 3.15 / 18.13 / 1.74 | 3.08 / 33.72 / 1.63 | 2.93 / 16.17 / 1.66 | 2.79 / 15.59 / 1.55

The next table shows the expected scalability of the OntoRefine mapping editor during the generation and download of an RDF file with the mapping results, as well as during the execution of the generated SPARQL CONSTRUCT query.

About 20 transformations have been made, including:

  • Adding a type to values from a column;

  • Creating a new predicate with values from columns as objects;

  • The same as the above two, but using GREL and prefixes;

  • Reusing an IRI cell to add triples (children);

  • Using literal transformations on many columns;

  • Using prefixes from the selected repository, adding new ones, and using them in OntoRefine.

These tests were likewise conducted on a 6-core/12-thread system with an NVMe drive and 20 gigabytes of RAM (Xmx20g). The dataset contains 56 columns.

Execution times are shown in seconds and minutes.

Number of rows | RDF file (s / min) | SPARQL CONSTRUCT (s / min)
100,000        | 11.26 / 0.19       | 16.33 / 0.27
200,000        | 21.58 / 0.36       | 28.49 / 0.47
500,000        | 53.70 / 0.89       | 70.15 / 1.17
1,000,000      | 110.33 / 1.84      | 168.70 / 2.81
2,000,000      | 223.13 / 3.72      | 331.70 / 5.53