Loading data using the LoadRDF tool

LoadRDF is a tool designed for offline loading of data sets. It cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data, by serializing them directly into GraphDB’s internal indexes and producing a ready-to-use repository.

The LoadRDF tool resides in the bin/ folder of the GraphDB distribution. It loads data either into a new repository, created from the Workbench or from the standard configuration Turtle file found in configs/templates, or into an existing repository. In the latter case, the repository data is automatically overwritten.

Warning

During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.

Note

For loading datasets bigger than several billion RDF statements, consider using the Preload tool.

Command Line Options

usage: loadrdf [OPTION]... [FILE]...
Loads data in a newly created repository or overwrites an existing one.
 -c,--configFile <file_path>   repo definition .ttl file
 -f,--force                    overwrite existing repo
 -i,--id <repository-id>       existing repository id
 -m,--mode <serial|parallel>   singlethread | multithread parse/load/infer
 -p,--partialLoad              allow partial load of file that contains
                               corrupt line
 -s,--stopOnFirstError         stop process if the dataset contains a
                               corrupt file
 -v,--verbose                  print metrics during load

The mode specifies the way the data is loaded in the repository:

  • serial - parsing is followed by entity resolution, which is then followed by load, optionally followed by inference, all done in a single thread.
  • parallel - uses multi-threaded parse, entity resolution, load, and inference. This gives a significant boost when loading large data sets with inference enabled.
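
As a minimal sketch of how the mode fits into a command line, the helper below assembles a loadrdf invocation from the options in the usage text and prints it instead of running it. The function name build_loadrdf_cmd, the repository ID my-repo, and the file paths are hypothetical and serve only as illustration.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a loadrdf command for a given mode.
# Only flags listed in the usage text are used: -f, -i and -m.
build_loadrdf_cmd() {
    mode="$1"      # serial or parallel
    repo_id="$2"   # existing repository id, passed to -i
    shift 2
    case "$mode" in
        serial|parallel) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    printf 'loadrdf -f -i %s -m %s' "$repo_id" "$mode"
    # Append every remaining argument as an input file or directory.
    for f in "$@"; do
        printf ' %s' "$f"
    done
    printf '\n'
}

# Prints: loadrdf -f -i my-repo -m serial data/part1.ttl data/part2.nt.gz
build_loadrdf_cmd serial my-repo data/part1.ttl data/part2.nt.gz
```

Printing the command first makes it easy to review the chosen mode and inputs before starting a long-running load.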

Note

The LoadRDF tool supports .zip and .gz files, and directories. If specified, the directories can be processed recursively.

There are two common cases for loading data with the LoadRDF tool:

Load data in a repository created from the Workbench

  1. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  2. Start GraphDB.

  3. Open a browser and go to the Workbench web application at a URL of the form http://localhost:7200, substituting the host name and port number as appropriate.

  4. Set up a valid license for GraphDB.

  5. Go to Setup -> Repositories.

  6. Create and configure a repository.

  7. Shut down GraphDB.

  8. Start the bulk load with the following command:

    $ <graphdb_dist>/bin/loadrdf -f -i <repository-id> -m parallel <RDF data file(s)>
    
  9. Start GraphDB.
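
Step 1 can be checked from the command line. The sketch below, under stated assumptions, prints the repositories location that LoadRDF would use: the value of graphdb.home.data if it is set, otherwise <graphdb_dist>/data. The function name graphdb_data_dir and the path /opt/graphdb are hypothetical, and the parsing is simplified: it handles only plain key = value lines, not the full Java properties syntax.

```shell
#!/bin/sh
# Hypothetical helper: print the repositories location described in step 1.
# Reads graphdb.home.data from <graphdb_dist>/conf/graphdb.properties and
# falls back to <graphdb_dist>/data when the property is not set.
graphdb_data_dir() {
    dist="$1"
    props="$dist/conf/graphdb.properties"
    value=""
    if [ -f "$props" ]; then
        # Use the last uncommented assignment of the property, if any.
        value=$(sed -n 's/^[[:space:]]*graphdb\.home\.data[[:space:]]*=[[:space:]]*//p' "$props" | tail -n 1)
    fi
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$dist/data"
    fi
}

# /opt/graphdb is an assumed install path; replace it with your <graphdb_dist>.
graphdb_data_dir /opt/graphdb
```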

Load data in a new repository initialized by a config file

  1. Stop GraphDB.

  2. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  3. Create a configuration file.

  4. Make sure that a valid license has been configured for the LoadRDF tool.

  5. Start the bulk load with the following command:

    $ <graphdb_dist>/bin/loadrdf -c <repo-config.ttl> -m parallel <RDF data file(s)>
    
  6. Start GraphDB.

A GraphDB Repository Configuration Sample

The following example configuration template uses a minimal set of parameters. You can add further optional parameters from the examples in configs/templates:

#
# Configuration template for a GraphDB-Free repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;

            # ruleset to use
            owlim:ruleset "rdfsplus-optimized" ;

            # disable the context index (because my data does not use contexts)
            owlim:enable-context-index "false" ;

            # indexes to speed up the read queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ].
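
Before running loadrdf with -c, it can help to confirm which repository ID a configuration file declares. The following is a minimal sketch, not a Turtle parser: it does a plain-text match on the rep:repositoryID line as written in the sample above. The function name repo_id_from_config and the file name repo-config.ttl are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of rep:repositoryID from a
# repository configuration file. Plain-text match only; it assumes the
# ID is written as a double-quoted string on the same line as the property.
repo_id_from_config() {
    sed -n 's/.*rep:repositoryID[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Usage (repo-config.ttl is an assumed file name):
#   repo_id_from_config repo-config.ttl
```

For the sample configuration, this would print repo-test-1.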

Tuning LoadRDF

The LoadRDF tool accepts Java system properties, passed as -D command line options. To change them, edit the command line script.

The following options can tune the behaviour of the parallel loading:

  • -Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
    • a smaller buffer size reduces the memory required;
    • a bigger buffer size reduces the overhead, because threads are less likely to wait for the operations on which they depend, so the CPU stays busy most of the time.
  • -Dinfer.pool.size - the number of inference threads in parallel mode. The default value is the number of cores of the machine processor or 4, as set in the command line scripts. A bigger pool theoretically means a faster load if there are enough unoccupied cores and the inference does not wait for the other load stages to complete.
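
To illustrate the two options together, the sketch below builds the -D option string for a chosen buffer size and inference pool size. The function name loadrdf_java_opts is hypothetical. The defaults of 200,000 and 4 are taken from the text above. How the resulting string reaches the JVM (for example, by editing the loadrdf script) depends on your GraphDB version and is not shown.

```shell
#!/bin/sh
# Hypothetical helper: print the -D options described above as one string.
# It only formats the options; it does not start loadrdf.
loadrdf_java_opts() {
    buffer_size="${1:-200000}"   # documented default: 200,000 statements
    infer_threads="${2:-4}"      # 4 is the fallback value named in the text
    printf '%s\n' "-Dpool.buffer.size=$buffer_size -Dinfer.pool.size=$infer_threads"
}

loadrdf_java_opts            # defaults
loadrdf_java_opts 100000 8   # smaller buffers, eight inference threads
```

Halving the buffer size, as in the second call, trades some throughput for lower memory use; raising the inference pool size only helps if the machine has idle cores.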