Loading data using the ImportRDF tool

ImportRDF is a tool designed for offline loading of datasets; it cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data, by serializing the data directly into GraphDB’s internal indexes and producing a ready-to-use repository.

The ImportRDF tool resides in the bin folder of the GraphDB distribution. It loads data into a new repository created from the Workbench or from the standard configuration Turtle file found in configs/templates, or into an existing repository. In the latter case, the existing repository data is automatically overwritten.

Note

Before using the methods below, make sure you have set up a valid GraphDB license.

Load vs Preload

The ImportRDF tool supports two sub-commands: Load and Preload (supported as separate commands in GraphDB versions 9.x and older).

Despite the many similarities between Load and Preload (both commands perform a parallel offline transformation of RDF files into a GraphDB image), there are substantial differences in their implementation. Load uses an algorithm very similar to online data loading. As the data variety grows, the loading speed starts to drop because of page splits and tree rebalancing. After a continuous data load, the disk image ends up fragmented in the same way it would if the RDF files had been imported into a running engine.

Preload eliminates this performance drop by implementing a two-phase load. In the first phase, all RDF statements are processed in memory in chunks, which are then flushed to disk as multiple sorted GraphDB images. In the second phase, all sorted chunks are merged into a single non-fragmented repository image with a merge join algorithm. As a result, the Preload sub-command requires almost twice as much disk space to complete the import.
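The two phases described above can be sketched with plain coreutils. This is an illustration of the sort-and-merge idea only, not GraphDB's actual implementation, and the file names are made up:

```shell
# Illustration only: emulate Preload's two-phase load with coreutils.
# Phase 1 sorts fixed-size chunks; phase 2 merges the already-sorted
# chunks in a single pass (a k-way merge join), producing one
# non-fragmented, fully sorted output.
printf '%s\n' s3 s1 s4 s2 > statements.txt    # unsorted input "statements"
split -l 2 statements.txt chunk.              # cut the input into 2-line chunks
for c in chunk.*; do sort -o "$c" "$c"; done  # phase 1: sort each chunk in place
sort -m chunk.* > merged.txt                  # phase 2: merge join of sorted chunks
paste -sd' ' merged.txt                       # prints: s1 s2 s3 s4
```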

Preload does not perform inference on the data.

Warning

During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.

Note

The ImportRDF tool supports valid RDF files, .zip and .gz archives, and directories.

Loading data

There are two ways to load data with the ImportRDF tool:

Into a repository created from the Workbench

  1. Configure the ImportRDF repository location by setting the property graphdb.home.data in conf/graphdb.properties. If no property is set, the default repository location will be the data directory of the GraphDB distribution.

  2. Start GraphDB.

  3. In a browser, open the Workbench web application at http://localhost:7200. If necessary, substitute localhost and the 7200 port number as appropriate.

  4. Go to Setup ‣ Repositories.

  5. Create and configure a repository.

  6. Stop GraphDB.

  7. Start the bulk load with the following command:

    $ <graphdb-dist>/bin/importrdf load -f -i <repo-id> -m parallel <RDF data file(s)>
    

    or if using the preload sub-command:

    $ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> <RDF data file(s)>
    
  8. Start GraphDB.
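For step 1 above, the data location entry in conf/graphdb.properties might look like this (the path is a hypothetical example, not a required value):

```
# conf/graphdb.properties
graphdb.home.data = /mnt/fast-disk/graphdb-data
```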

Into a repository created from a config file

  1. Stop GraphDB.

  2. Configure the ImportRDF repository location by setting the property graphdb.home.data in conf/graphdb.properties. If no property is set, the default repository location will be the data directory of the GraphDB distribution.

  3. Create a configuration file.

  4. Start the bulk load with the following command:

    $ <graphdb-dist>/bin/importrdf load -c <repo-config.ttl> -m parallel <RDF data file(s)>
    

    or if using the preload sub-command:

    $ <graphdb-dist>/bin/importrdf preload -f -c <repo-config.ttl> <RDF data file(s)>
    
  5. Start GraphDB.

Repository configuration template

This is an example configuration template using a minimal set of parameters. You can add more of the optional parameters from the examples in configs/templates:

# Configuration template for a GraphDB repository

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix graphdb: <http://www.ontotext.com/trree/graphdb#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:SailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:Sail" ;

            # ruleset to use
            graphdb:ruleset "empty" ;

            # disable the context index (because the data does not use contexts)
            graphdb:enable-context-index "false" ;

            # indexes to speed up read queries
            graphdb:enablePredicateList "true" ;
            graphdb:enable-literal-index "true" ;
            graphdb:in-memory-literal-properties "true" ;
        ]
].
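Assuming the template above is saved as repo-test-1.ttl (a hypothetical file name) and the dataset is in a file such as dataset.nt.gz, a preload run against it would follow the pattern shown earlier:

```
$ <graphdb-dist>/bin/importrdf preload -f -c repo-test-1.ttl dataset.nt.gz
```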

Tuning Load

The ImportRDF tool accepts Java command-line options via -D. Supply them before the sub-command, as follows:

$ <graphdb-dist>/bin/importrdf -Dgraphdb.inference.concurrency=6 load -c <repo-config.ttl> -m parallel <RDF data file(s)>

The following options are used to fine-tune the behavior of the Load sub-command:

  • -Dgraphdb.inference.buffer: the buffer size (in number of statements) for each stage. Defaults to 200,000 statements. Use this parameter to balance memory usage against the overhead of inserting data:

    • a smaller buffer reduces the memory required;

    • a bigger buffer reduces the overhead, as the operations performed by the threads are less likely to wait on the operations they depend on, and the CPU is kept busy most of the time.

  • -Dgraphdb.inference.concurrency: the number of inference threads in parallel mode. The default value is the number of processor cores on the machine. A bigger pool theoretically means a faster load, provided there are enough unoccupied cores and the inference does not wait on the other load stages to complete.
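For example, both system properties can be combined in one invocation; the values below are illustrative, not recommendations, and must appear before the sub-command:

```
$ <graphdb-dist>/bin/importrdf -Dgraphdb.inference.concurrency=8 -Dgraphdb.inference.buffer=400000 load -f -i <repo-id> -m parallel <RDF data file(s)>
```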

Tuning Preload

The Preload sub-command accepts the following options to fine-tune its operation:

  • --chunk: the size of the in-memory buffer used to sort RDF statements before flushing them to disk. A bigger chunk consumes more RAM but reduces the number of chunks to merge. We recommend the default value of 20 million for datasets of up to 20 billion RDF statements.

  • --iterator-cache: the number of triples to cache from each chunk during the merge phase. A bigger value is likely to eliminate the I/O wait time at the cost of more RAM. We recommend the default value of 64,000 for datasets of up to 20 billion RDF statements.

  • --parsing-tasks: the number of parallel threads that parse the input files.

  • --queue-folder: the file system location where all temporary chunks are stored.
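Putting these options together, a preload invocation tuned for a large dataset on a machine with plenty of RAM might look like the following. The values and the temporary folder path are illustrative assumptions only:

```
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> \
    --chunk 40000000 --iterator-cache 128000 \
    --parsing-tasks 4 --queue-folder /mnt/fast-disk/preload-tmp \
    <RDF data file(s)>
```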

Resuming data loading with Preload

Loading a huge dataset is a long-running batch process, and a single run may take many hours. Preload supports resuming the process if something goes wrong (insufficient disk space, out of memory, etc.) and the load terminates abnormally. In this case, data processing restarts from an intermediate restore point instead of from the beginning. The data collected for each restore point is sufficient to initialize all internal components correctly and continue the load from that moment, thus saving time. The following options can be used to configure resuming:

  • --interval: sets the recovery point interval in seconds. The default is 3,600s (60min).

  • --restart: if set to true, the loading will start from the beginning, ignoring an existing recovery point. The default is false.
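For example, to take recovery points every 30 minutes instead of the default hour, or to discard an existing recovery point and start over (values illustrative):

```
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> --interval 1800 <RDF data file(s)>
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> --restart true <RDF data file(s)>
```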