Loading data using the LoadRDF tool

The LoadRDF tool is an offline tool designed for fast loading of large data sets. It cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data, by serializing them directly into GraphDB’s internal indexes and producing a ready-to-use repository.

The LoadRDF tool resides in the bin/ folder of the GraphDB distribution. It loads data either into a new repository, created from the Workbench or from the standard configuration Turtle file found in configs/templates, or into an existing repository. In the latter case, the repository data is automatically overwritten.

Warning

During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.

Command Line Options

usage: loadrdf [OPTION]... [FILE]...
Loads data in a newly created repository or overwrites an existing one.
 -c,--configFile <file_path>   repo definition .ttl file
 -f,--force                    overwrite existing repo
 -i,--id <repository-id>       existing repository id
 -m,--mode <serial|parallel>   singlethread | multithread parse/load/infer
 -p,--partialLoad              allow partial load of file that contains
                               corrupt line
 -s,--stopOnFirstError         stop process if the dataset contains a
                               corrupt file

The mode specifies the way the data is loaded in the repository:

  • serial - parsing is followed by entity resolution, then by load, and optionally by inference, all in a single thread.
  • parallel - parsing, entity resolution, load and inference run in multiple threads. This gives a significant boost when loading large data sets with inference enabled.
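
As a rough illustration of choosing between the two modes, the sketch below suggests parallel when more than one CPU core is available and serial otherwise. The pick_mode helper and the one-core threshold are assumptions made for this example, not part of the LoadRDF tool; whether parallel mode actually pays off depends on the size of the data set and on whether inference is enabled.

```shell
#!/bin/sh
# Hypothetical helper: suggest a value for the -m option from a core count.
# The "more than one core" threshold is an assumption, not a GraphDB rule.
pick_mode() {
    if [ "$1" -gt 1 ]; then
        echo parallel
    else
        echo serial
    fi
}

# getconf reports the online core count on most Linux and macOS systems;
# fall back to 1 if it is unavailable or prints something non-numeric.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case $cores in
    ''|*[!0-9]*) cores=1 ;;
esac

echo "suggested option: -m $(pick_mode "$cores")"
```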

Note

The LoadRDF tool supports .zip and .gz files, and directories. If specified, the directories can be processed recursively.

Procedure

There are several typical use-cases for loading data with the LoadRDF tool:

Initial load using the workbench

  1. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  2. Start GraphDB.

  3. Open a browser and go to the Workbench web application at a URL of the form http://localhost:7200, substituting the host name and port number as appropriate.

  4. Set up a valid license for GraphDB.

  5. Go to Setup -> Repositories.

  6. Create and configure a repository.

  7. Shut down GraphDB.

  8. Start the bulk load with one of the following commands, depending on the mode you want:
    $ <graphdb_dist>/bin/loadrdf -f -i <repository-id> -m parallel <RDF data file(s)>
    
    $ <graphdb_dist>/bin/loadrdf -f -i <repository-id> -m serial <RDF data file(s)>
    
  9. Start GraphDB.
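
Steps 7 and 8 can be combined in a small wrapper that refuses to start the load while the server still answers. This is only a sketch: server_is_up, bulk_load_existing and the GRAPHDB_URL variable are names invented for the example, curl is assumed to be installed, and an HTTP probe is just one possible way to detect a running server.

```shell
#!/bin/sh
# Succeeds if anything answers HTTP at the given URL (assumes curl is installed).
server_is_up() {
    curl --silent --output /dev/null --max-time 5 "$1"
}

# Usage: bulk_load_existing <graphdb_dist> <repository-id> <serial|parallel> <file>...
# Runs loadrdf with the documented -f, -i and -m options only.
bulk_load_existing() {
    dist=$1
    repo=$2
    mode=$3
    shift 3
    # GRAPHDB_URL is a hypothetical variable for this sketch.
    if server_is_up "${GRAPHDB_URL:-http://localhost:7200}"; then
        echo "GraphDB is still running; shut it down before loading." >&2
        return 1
    fi
    "$dist/bin/loadrdf" -f -i "$repo" -m "$mode" "$@"
}
```

For example, `bulk_load_existing /opt/graphdb my-repo parallel data.ttl.gz` would run the parallel variant of the command in step 8; the path, repository id and file name here are placeholders.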

Initial load using a config file

  1. Stop GraphDB.

  2. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  3. Create a configuration file.

  4. Make sure that a valid license has been configured for the LoadRDF tool.

  5. Start the bulk load with one of the following commands, depending on the mode you want:
    $ <graphdb_dist>/bin/loadrdf -c <repo-config.ttl> -m parallel <RDF data file(s)>
    
    $ <graphdb_dist>/bin/loadrdf -c <repo-config.ttl> -m serial <RDF data file(s)>
    
  6. Start GraphDB.
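
To see which repositories location a load will use, the rule from step 2 (the graphdb.home.data property if set, otherwise <graphdb_dist>/data) can be sketched as follows. The helper name resolve_data_dir is made up for this example. The parsing assumes that the property is set only in that file and written as a simple key = value line, so it ignores line continuations and other properties-file subtleties.

```shell
#!/bin/sh
# Usage: resolve_data_dir <graphdb_dist>
# Prints the value of graphdb.home.data from conf/graphdb.properties,
# or <graphdb_dist>/data when the property is absent, commented out or empty.
resolve_data_dir() {
    dist=$1
    props="$dist/conf/graphdb.properties"
    value=""
    if [ -f "$props" ]; then
        value=$(awk -v key="graphdb.home.data" '
            {
                eq = index($0, "=")
                if (eq == 0) next
                name = substr($0, 1, eq - 1)
                gsub(/^[ \t]+|[ \t]+$/, "", name)
                if (name == key) {
                    found = substr($0, eq + 1)
                    gsub(/^[ \t]+|[ \t]+$/, "", found)
                }
            }
            END { if (found != "") print found }
        ' "$props")
    fi
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "$dist/data"
    fi
}
```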

Initial load into an independent data location

Note

This procedure works regardless of whether the GraphDB server is running.

  1. Point graphdb.home.data to the independent location by setting the property in <graphdb_dist>/conf/graphdb.properties.

  2. Start the bulk load with one of the following commands (the load reads the changed configuration without affecting a running server):
    $ <graphdb_dist>/bin/loadrdf -c <repo-config.ttl> -m parallel <RDF data file(s)>
    
    $ <graphdb_dist>/bin/loadrdf -c <repo-config.ttl> -m serial <RDF data file(s)>
    
  3. Restore the original graphdb.home.data location.

  4. Choose an existing repository into which to deploy the loaded data, or create one using the same <repo-config.ttl> configuration.

    Note

    If you choose an existing repository, make sure it is not active and has the same <repo-config.ttl> configuration.

  5. Replace the repository’s data (/storage directory) with the corresponding loaded /storage directory.
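
Step 5 amounts to swapping one directory for another. The sketch below does this for two repository directories passed as arguments. replace_storage is a hypothetical helper, and keeping the previous storage directory as a backup is a precaution added for this example rather than something the procedure requires.

```shell
#!/bin/sh
# Usage: replace_storage <loaded-repository-dir> <target-repository-dir>
# Both arguments are the directories that contain a storage/ subdirectory.
replace_storage() {
    loaded=$1
    target=$2
    if [ ! -d "$loaded/storage" ]; then
        echo "No storage directory found in $loaded" >&2
        return 1
    fi
    if [ ! -d "$target" ]; then
        echo "Target repository directory $target does not exist" >&2
        return 1
    fi
    # Keep the old data aside instead of deleting it (extra safety, optional).
    if [ -d "$target/storage" ]; then
        mv "$target/storage" "$target/storage.bak.$$" || return 1
    fi
    cp -R "$loaded/storage" "$target/storage"
}
```

The target repository must not be active while this runs, for the same reason given in the note under step 4.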

A GraphDB Repository Configuration Sample

The following example configuration template uses a minimal set of parameters. You can add more optional parameters from the examples in configs/templates:

#
# Configuration template for a GraphDB-Free repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;

            # I want inference
            owlim:ruleset "owl-horst-optimized" ;

            # disable the context index (because my data does not use contexts)
            owlim:enable-context-index "false" ;

            # nice to have, will speed up future queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ].
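
When a load is started with -c, the repository is defined by the configuration file, so it can help to check which repositoryID the file declares. The following sketch pulls that value out with sed. It is a plain text match that assumes the rep:repositoryID statement sits on one line, as in the sample above, and it is not a real Turtle parser. The name repo_id_from_config is chosen for this example.

```shell
#!/bin/sh
# Usage: repo_id_from_config <repo-config.ttl>
# Prints the first quoted value that follows rep:repositoryID.
repo_id_from_config() {
    sed -n 's/.*rep:repositoryID[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}
```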

Tuning LoadRDF

The LoadRDF tool accepts Java command line options, passed with -D. To change them, edit the command line script.

The following options can tune the behaviour of the parallel loading:

  • -Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
    • a smaller buffer reduces the memory required;
    • a larger buffer reduces the overhead, because threads are less likely to wait for the operations they depend on, so the CPU stays busy most of the time.
  • -Dinfer.pool.size - the number of inference threads in parallel mode. The default value is the number of cores of the machine processor or 4, as set in the command line scripts. A bigger pool theoretically means a faster load, provided there are enough unoccupied cores and the inference does not have to wait for the other load stages to complete.
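
Since these are ordinary Java system properties, they take the usual -Dname=value form. The sketch below only prints such a string so that it can be copied into the java invocation inside the command line script. tuning_opts is a hypothetical helper, and the values in the usage note are illustrative rather than recommendations.

```shell
#!/bin/sh
# Usage: tuning_opts <buffer-size-in-statements> <inference-threads>
# Prints the two -D options described above; rejects non-numeric input.
tuning_opts() {
    for n in "$1" "$2"; do
        case $n in
            ''|*[!0-9]*)
                echo "Expected a whole number, got '$n'" >&2
                return 1
                ;;
        esac
    done
    echo "-Dpool.buffer.size=$1 -Dinfer.pool.size=$2"
}
```

For instance, `tuning_opts 200000 4` prints the documented default buffer size together with four inference threads.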