Loading data using OntoRefine

OntoRefine - overview and features

GraphDB OntoRefine is a data transformation tool based on OpenRefine and integrated into the GraphDB Workbench. It can be used for converting tabular data into RDF and importing it into a GraphDB repository, using simple SPARQL queries and a virtual endpoint. The supported formats are TSV, CSV, *SV, XLS, XLSX, JSON, XML, RDF as XML, and Google Sheets. Using OntoRefine, you can easily filter your data, edit its inconsistencies, convert it into RDF, and import it into a repository.

OntoRefine enables you to:

  • Upload your data file(s) and create a project.
  • View the cleaned data as RDF.
  • Transform your data using SPIN functions.
  • Import the newly created RDF directly into a GraphDB repository by using the GraphDB SPARQL endpoint.

Example Data

The data used for the examples can be found at:

Upload data in OntoRefine

Open OntoRefine in the Workbench

To transform your data into RDF, you need a working GraphDB database.

  1. Start GraphDB in Workbench mode.
  2. Open http://localhost:7200/ in a browser.
  3. Go to Import -> Tabular (OntoRefine).

All data files in OntoRefine are organized as projects. One project can have more than one data file.

The Create Project action area consists of three tabs corresponding to the source of the data. You can upload a file from your computer, specify the URL of publicly accessible data, or paste data from the clipboard.

Create a project

  1. Click Create Project -> Get data from.

  2. Select one or more files to upload:

    • from your computer

      _images/ontoRefine-upload-file.png
    • from web addresses (URLs)

      _images/ontoRefine-upload-file-url.png
    • from clipboard

      _images/ontoRefine-upload-file-clipboard.png
  3. Click Next.

  4. (Optional) Change the table configurations and update the preview.

    When the file is first opened, OntoRefine tries to recognize its text encoding and delimiters.

    _images/ontoRefine-table-configurations.png
  5. Click Create Project.

    The result is a table similar to an Excel or Google Sheets spreadsheet.

Import a project

To import an already existing OntoRefine project:

  1. Go to Import Project.

  2. Select a file (.tar or .tar.gz).

  3. Import it.

    _images/ontoRefine-import-project.png

Open a project

Once the project is created:

  1. Go to Open Project.
  2. Click the one you want to work on.
  3. (Optional) You can also delete your project if you want to.
_images/ontoRefine-rename-delete.png

Viewing tabular data as RDF

Unlike OpenRefine, which supports RDF only as input, OntoRefine also supports RDF as output. To view your data as triples, click the OntoRefine RDF button. A settings screen appears where you can configure the RDF data that OntoRefine generates.

_images/refineRDFSettings.png

You can change the IRIs generated for your data. Leaving the defaults generates the following query; click Run to obtain the results.
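The generated query has the following general shape. This is a sketch assuming the default mydata: base IRI and the presidents dataset used below; the column properties in your project are derived from your own table's headers.

```sparql
PREFIX mydata: <http://example.com/resource/>

# Selects every row of the project together with its column values.
# Column properties (mydata:President, mydata:Party, ...) mirror the
# column headers of the uploaded table.
SELECT * {
    ?row a mydata:Row ;
        mydata:rowNumber ?rowNumber ;
        mydata:President ?President ;
        mydata:Party ?Party
} LIMIT 100
```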

_images/refineRowsResults.png

Warning

The OntoRefine SPARQL button sends queries only to the currently open OntoRefine project. It should not be confused with the GraphDB Workbench SPARQL tab, which is directly connected to the current repository.

RDFizing data

You can transform your data by running a CONSTRUCT query against the OntoRefine SPARQL endpoint.

Using CONSTRUCT query in the OntoRefine SPARQL endpoint

Construct new RDF data based on the RDFized model of your tabular data. Use SPIN functions to construct proper triples.

GraphDB 8.0 supports the following SPIN functions:

  • SPARQL functions for splitting a string
  • SPARQL functions for parsing dates
  • SPARQL functions for encoding URIs

Parsing dates and encoding URIs

PREFIX mydata: <http://example.com/resource/>
PREFIX spif: <http://spinrdf.org/spif#>

# Example query returning RDF data
CONSTRUCT  {
  ?presidentIRI a mydata:President ;
        mydata:tookOffice ?tookOfficeParsed ;
        mydata:leftOffice ?leftOfficeParsed ;
        mydata:nominatedBy ?party ;
        mydata:homeState ?stateIRI
} WHERE {
    # Triple patterns for accessing each row and the columns it contains
    # Note that no triples will be generated for NULL values in the table
    # You should inspect your data in Refine mode and add OPTIONAL accordingly
    ?row a mydata:Row ;
        mydata:Presidency ?Presidency ;
        mydata:President ?President ;
        mydata:Wikipedia_Entry ?Wikipedia_Entry ;
        mydata:Took_office ?Took_office ;
        mydata:Left_office ?Left_office ;
        mydata:Party ?Party ;
        mydata:Portrait ?Portrait ;
        mydata:Thumbnail ?Thumbnail ;
        mydata:Home_State ?Home_State .

    # Uses SPIN functions to parse the dates
    bind(spif:parseDate(?Took_office, "dd/MM/yyyy") as ?tookOfficeParsed)
    bind(spif:parseDate(?Left_office, "dd/MM/yyyy") as ?leftOfficeParsed)
    # Uses several functions to construct IRIs for the presidents and their states
    bind(iri(concat("http://example.com/", spif:encodeURL(?President))) as ?presidentIRI)
    bind(iri(concat("http://example.com/", spif:encodeURL(?Home_State))) as ?stateIRI)
} LIMIT 100

The rows and columns from the table are mapped with the help of the following URIs:

  • mydata:Row - the RDF type for each row;
  • mydata:rowNumber - the property for the row number as an integer literal;
  • mydata:<column_name> - a property for each column.
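As an illustration, a row from the presidents table might be exposed as triples like the following. The mydata:row1 subject IRI and the literal values here are hypothetical; inspect the actual output via the RDF button.

```turtle
@prefix mydata: <http://example.com/resource/> .

# Hypothetical rendering of the first table row
mydata:row1 a mydata:Row ;
    mydata:rowNumber 1 ;
    mydata:President "George Washington" ;
    mydata:Party "Independent" .
```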

Importing data in GraphDB

Once you are satisfied with the transformation of your data, you can import it into the current repository without leaving the GraphDB Workbench.

  1. Click Data -> Open in main SPARQL editor

    _images/refineOpenInMainSparql.png
  2. The query is the same as the previous one, only with the addition of a SERVICE clause. You only have to change CONSTRUCT to INSERT and remove the LIMIT. Instead of showing the RDF, GraphDB will insert it into the current repository.

    _images/refineInsertFromRefine.png
  3. Execute the query to import the results.
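Sketched under the same assumptions as the CONSTRUCT query above, the update has roughly this shape. The <project-endpoint> IRI is a placeholder; the Workbench fills in the real SERVICE IRI for your project when you open the query in the main SPARQL editor.

```sparql
PREFIX mydata: <http://example.com/resource/>
PREFIX spif: <http://spinrdf.org/spif#>

INSERT {
  ?presidentIRI a mydata:President ;
      mydata:nominatedBy ?Party
} WHERE {
  # <project-endpoint> stands for the OntoRefine project's SPARQL
  # endpoint, inserted automatically by the Workbench
  SERVICE <project-endpoint> {
    ?row a mydata:Row ;
        mydata:President ?President ;
        mydata:Party ?Party .
    bind(iri(concat("http://example.com/", spif:encodeURL(?President))) as ?presidentIRI)
  }
}
```

Note that, as described above, the LIMIT clause is removed so that all rows are inserted.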

Rows and records

OpenRefine can view the same data both as rows and as records. See https://github.com/OpenRefine/OpenRefine/wiki/Variables#record and http://kb.refinepro.com/2012/03/difference-between-record-and-row.html.

Records

For the purpose of the demo, we can create records for each party. Go to Party column -> Edit column -> Move column to the beginning. By clicking Party column -> Text facet, we can see that some values contain spaces.

_images/refineFacetBefore.png
  1. Clean them by Party column -> Edit cells -> Transform and type value.trim(). Observe how the text facet is updated.

    _images/refineTrimValues.png
  2. Now click Party column -> Sort -> OK. Just use the defaults.
  3. Then Sort -> Reorder rows permanently.

    _images/refineSortPermanently.png
  4. And Party column -> Edit cells -> Blank down. Click on Records. The result looks like this:

    _images/refineRecord.png
  5. Now, the RDF button leads to the following query and results:

    _images/refineRecordsQuery.png

When there are records, the following URIs will be used in addition to the row ones:

  • mydata:recordNNN - identifies a record, NNN is the 1-based record number;
  • mydata:Record - the RDF type for each record;
  • mydata:recordId - the property for record id (the value from the first column that creates an OpenRefine record);
  • mydata:hasRow - the property that links records to rows.
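For example, after blanking down, a party record might look like the following. The record number, record id, and row IRIs here are illustrative.

```turtle
@prefix mydata: <http://example.com/resource/> .

# Hypothetical record grouping two rows that share the same Party value
mydata:record1 a mydata:Record ;
    mydata:recordId "Democratic" ;
    mydata:hasRow mydata:row1 , mydata:row2 .
```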

Splitting content in a cell

Load the movie dataset for this example. Sometimes you may have multiple values in the same cell. You can easily split them and generate separate URIs using SPIF functions.

PREFIX mydata: <http://example.com/resource/>
PREFIX spif: <http://spinrdf.org/spif#>

SELECT * {
    ?row a mydata:Row ;
        mydata:genres ?genres
} LIMIT 100
_images/ontoRefine-splitting-cells1.png
PREFIX mydata: <http://example.com/resource/>
PREFIX spif: <http://spinrdf.org/spif#>

CONSTRUCT {
  ?m a mydata:Movie ;
    mydata:hasGenre ?genre
} WHERE {
  ?m a mydata:Row ;
     mydata:genres ?genres .
  ?genre spif:split (?genres "\\|")
} LIMIT 100
_images/ontoRefine-splitting-cells2.png