LoadRDF tool

The LoadRDF tool is designed for fast loading of large data sets. It resides in the bin/ folder of the GraphDB distribution. The LoadRDF tool loads data either into a new repository, which is created from the standard configuration Turtle file found in configs/templates, or into an existing one. In both cases, the repository data directory is automatically overwritten. It is also important to note that the GraphDB plugins are ignored during the bulk load.

Procedure

To create a new repository from the config file and load data into it, execute:

bin/loadrdf -c <config.ttl> -m <serial|parallel> <files...>

To load data into an existing repository, execute:

bin/loadrdf -i <repository-id> -m <serial|parallel> <files...>

Usage manual:

usage: loadrdf [OPTION]... [FILE]...
Loads data in newly created or exiting empty repository.
 -c,--configFile <file_path>   repo definition .ttl file
 -f,--force                    overwrite existing repo
 -i,--id <repository-id>       existing repository id
 -m,--mode <serial|parallel>   singlethread | multithread parse/load/infer
 -p,--partialLoad              allow partial load of file that contains
                               corrupt line
 -s,--stopOnFirstError         stop process if the dataset contains a
                               corrupt file

As input, the LoadRDF tool accepts a standard config TTL file (or an existing repository ID), the mode, and a list of files for loading.

The mode specifies the way the data is loaded in the repository. It can be:

  • serial - parsing is followed by entity resolution, then by load, and optionally by inference, all done in a single thread.
  • parallel - parsing, entity resolution, load and inference run as multi-threaded stages. This gives a significant boost when loading large data sets with inference enabled.

Note

A zip archive of files is supported.
Gzipped files (.gz) are also supported, e.g. file.nt.gz.
Directories can also be specified in addition to files; they are processed recursively.
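
The two invocation forms above can be sketched as a small dry-run helper that only prints the command line it would run, which can be handy for checking arguments before starting a long load. The function name loadrdf_cmd, the repository id my-repo and the file names are hypothetical; the flags -c, -i and -m are the ones listed in the usage manual. The helper assumes it is run from the root of the GraphDB distribution and, as a choice of this sketch, requires at least one input file.

```shell
# Dry-run sketch: prints a loadrdf command line instead of executing it.
# Assumption: the current directory is the GraphDB distribution root.
# Arguments: <new|existing> <config.ttl|repository-id> <serial|parallel> <files...>
loadrdf_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: loadrdf_cmd <new|existing> <config|id> <serial|parallel> <files...>" >&2
    return 1
  fi
  lr_target="$1"
  lr_value="$2"
  lr_mode="$3"
  shift 3

  case "$lr_target" in
    new)      lr_flag="-c" ;;  # create the repository from a config .ttl file
    existing) lr_flag="-i" ;;  # load into an existing repository id
    *) echo "target must be 'new' or 'existing'" >&2; return 1 ;;
  esac

  case "$lr_mode" in
    serial|parallel) ;;
    *) echo "mode must be 'serial' or 'parallel'" >&2; return 1 ;;
  esac

  # The remaining arguments are files, archives or directories to load.
  echo "bin/loadrdf $lr_flag $lr_value -m $lr_mode $*"
}

# Hypothetical examples; the paths and the repository id are placeholders.
loadrdf_cmd new my-repo-config.ttl parallel data/part1.nt.gz data/part2.ttl
loadrdf_cmd existing my-repo serial data/part1.nt.gz
```

To actually run a printed command, copy it into the shell. Paths containing spaces would need quoting, which this sketch does not handle.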

Java -D cmdline options

The LoadRDF tool accepts Java command line options, passed as -D system properties. To change them, edit the command line script.

The following options can tune the behaviour of the parallel loading:

  • -Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
    • a smaller buffer reduces the memory required;
    • a bigger buffer reduces the overhead, because threads are less likely to wait for the operations they depend on, so the CPU stays busy most of the time.
  • -Dinfer.pool.size - the number of inference threads in parallel mode. The default value is the number of processor cores on the machine, or 4, as set in the command line scripts. A bigger pool theoretically means a faster load, provided there are enough unoccupied cores and the inference does not have to wait for the other load stages to complete.
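
As an illustration of the two options above, the following sketch composes the -D flags as a string that could then be pasted into the command line script. The function name loadrdf_java_opts is hypothetical. The defaults mirror the text: 200,000 statements for the buffer, and the number of cores (read here with getconf, which is an assumption about the platform) with a fallback of 4 for the inference pool. The name=value form is the standard Java system property syntax.

```shell
# Sketch: build the -D options that tune parallel loading.
# Arguments: [buffer size in statements] [number of inference threads]
loadrdf_java_opts() {
  lo_buffer="${1:-200000}"   # default from the text: 200,000 statements
  lo_threads="${2:-}"

  # No thread count given: use the core count, falling back to 4.
  if [ -z "$lo_threads" ]; then
    lo_threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
  fi

  # Both values must be plain non-negative integers.
  for lo_v in "$lo_buffer" "$lo_threads"; do
    case "$lo_v" in
      ''|*[!0-9]*) echo "not a number: $lo_v" >&2; return 1 ;;
    esac
  done

  echo "-Dpool.buffer.size=$lo_buffer -Dinfer.pool.size=$lo_threads"
}

# Hypothetical tuning: a larger buffer and 8 inference threads.
loadrdf_java_opts 500000 8
```

Called without arguments, the helper prints the default buffer size together with the detected core count, which is a reasonable starting point before experimenting with other values.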