The LoadRDF tool is designed for fast loading of large data sets. It resides in the bin/ folder of the GraphDB distribution. The LoadRDF tool loads data either into a new repository, which is created from the standard configuration Turtle file found in configs/templates, or into an existing one. In both cases, the repository data directory is automatically overwritten. Note also that the GraphDB plugins are ignored during the bulk load.

## Procedure

To create a new repository from the config file and load data into it, execute:

bin/loadrdf -c <config.ttl> -m <serial|parallel> <files...>


To load data into an existing repository, execute:

bin/loadrdf -i <repository-id> -m <serial|parallel> <files...>


Usage manual:

usage: loadrdf [OPTION]... [FILE]...
Loads data in newly created or existing empty repository.
-c,--configFile <file_path>   repo definition .ttl file
-f,--force                    overwrite existing repo
-i,--id <repository-id>       existing repository id
-s,--stopOnFirstError         stop process if the dataset contains a corrupt file


As input, the LoadRDF tool accepts a standard config TTL file (or an existing repository ID), the mode, and a list of files for loading.

The mode specifies the way the data is loaded in the repository. It can be:

• serial - parsing is followed by entity resolution, which is then followed by load, optionally followed by inference, all done in a single thread.
• parallel - parsing, entity resolution, load, and inference run in multiple threads. This gives a significant boost when loading large data sets with inference enabled.

Note

A zip archive of files is supported.
Gzipped files (.gz) are also supported, e.g. file.nt.gz.
If directories are specified in addition to files, they are processed recursively.
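
The two invocation forms and the mode argument described above can be sketched as a small dry-run helper. This is only an illustration, not part of the GraphDB distribution: the function name `loadrdf_cmd` and all file, directory, and repository names are hypothetical. It prints the command line instead of running it, so it can be tried without a GraphDB installation.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with GraphDB): prints a loadrdf command
# line for review instead of executing it.
# Usage: loadrdf_cmd <config|id> <config-file-or-repo-id> <mode> <inputs...>
loadrdf_cmd() {
    kind=$1
    target=$2
    mode=$3
    shift 3

    case "$kind" in
        config) flag="-c" ;;   # create a new repository from a config TTL file
        id)     flag="-i" ;;   # load into an existing repository
        *) echo "kind must be 'config' or 'id'" >&2; return 1 ;;
    esac

    case "$mode" in
        serial|parallel) ;;    # the only two modes the text describes
        *) echo "mode must be 'serial' or 'parallel'" >&2; return 1 ;;
    esac

    if [ "$#" -eq 0 ]; then
        echo "at least one input file or directory is required" >&2
        return 1
    fi

    # Inputs are joined with spaces for display only; paths containing
    # spaces would need quoting before the line is actually executed.
    echo "bin/loadrdf $flag $target -m $mode $*"
}

# Example inputs (all names are made up): a directory, a gzipped file, a zip archive.
loadrdf_cmd config repo-config.ttl parallel data/ dump.nt.gz archive.zip
loadrdf_cmd id my-repo serial statements.ttl
```

Printing the command first is a cautious choice here, because in both forms the repository data directory is overwritten when the tool actually runs.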

## Java -D cmdline options

The LoadRDF tool accepts Java command line options, passed with -D. To change them, edit the command line script.

• -Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data.
• -Dinfer.pool.size - the number of inference threads in parallel mode. The default value is the number of processor cores on the machine, or 4, as set in the command line scripts. A bigger pool theoretically means a faster load, provided there are enough unoccupied cores and the inference does not wait for the other load stages to complete.
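
As a rough sketch of how the two options might be assembled before being pasted into the command line script, the snippet below builds the `-D` string from a buffer size and a pool size. The function names `detect_cores` and `loadrdf_java_opts` are hypothetical. Using `getconf _NPROCESSORS_ONLN` to count cores, and writing the buffer size as a plain integer without separators, are assumptions; the shipped scripts may compute their defaults differently.

```shell
#!/bin/sh
# Hypothetical helpers: assemble the -D options described above.

# Report the number of online cores, falling back to 4 when it cannot be
# determined (mirrors the "number of cores or 4" default only loosely).
detect_cores() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || n=""
    case "$n" in
        ''|*[!0-9]*|0) n=4 ;;   # empty, non-numeric, or zero: use the fallback
    esac
    echo "$n"
}

# Usage: loadrdf_java_opts [buffer-size] [inference-threads]
loadrdf_java_opts() {
    buffer_size=${1:-200000}          # statements per stage buffer (documented default)
    pool_size=${2:-$(detect_cores)}   # inference threads, used in parallel mode
    echo "-Dpool.buffer.size=$buffer_size -Dinfer.pool.size=$pool_size"
}

loadrdf_java_opts             # defaults
loadrdf_java_opts 100000 8    # smaller buffers, eight inference threads
```

The printed string is meant to be added by hand to the java invocation inside the command line script; where exactly it goes depends on that script and is not shown here.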