LoadRDF tool

The LoadRDF tool is designed for fast loading of large data sets. It resides in the bin/ folder of the GraphDB distribution. The LoadRDF tool creates a new repository from a standard Turtle configuration file or uses an existing one. In either case, the repository data directory is overwritten. It is also important to note that the GraphDB plugins are ignored during the bulk load.


To create a new repository from a configuration file and load data into it:

bin/loadrdf -c <config.ttl> -m <serial|parallel> <files...>

To load data into an existing repository:

bin/loadrdf -i <repository-id> -m <serial|parallel> <files...>

Usage manual:

usage: loadrdf [OPTION]... [FILE]...
Loads data in newly created or exiting empty repository.
 -c,--configFile <file_path>   repo definition .ttl file
 -f,--force                    overwrite existing repo
 -i,--id <repository-id>       existing repository id
 -m,--mode <serial|parallel>   singlethread | multithread parse/load/infer
 -p,--partialLoad              allow partial load of file that contains
                               corrupt line
 -s,--stopOnFirstError         stop process if the dataset contains a
                               corrupt file

On the command line, the LoadRDF tool accepts a standard configuration file in Turtle format (or, alternatively, the ID of an already defined repository), the mode, and a list of files to load.

The mode specifies the way the data is loaded in the repository:

  • serial - parsing is followed by entity resolution, then by loading, and optionally by inference, all in a single thread.

  • parallel - parsing and loading are multi-threaded, and inference runs in parallel during the load instead of starting at the end. This gives a significant boost when loading large data sets with inference enabled.


    At the moment, this mode does not support the owl:sameAs optimisation and will abort the load if the disable-sameAs repository parameter is not set to true.


A zip archive of files is supported.
Gzipped files (.gz) are also supported, e.g. file.nt.gz.
In addition to files, if specified, whole directories can be processed recursively.

Java -D cmdline options

The LoadRDF tool accepts Java command line options, using -D. To change them, edit the command line script.

The following options can tune the behaviour of the parallel loading:

  • -Dpool.size - how many sorting threads to use per index, after resolving entities into IDs and prior to loading the data in the repository. Sorting accelerates data loading. There are separate thread pools for each of the indices: PSO (sorts in the pred-subj-obj order), POS (sorts in the pred-obj-subj order), and if context indices are enabled: PCSO, PCOS.


    The value of this parameter defaults to 1. For the moment, it is more of an experimental option, as our experience shows that using more than one sorting thread does not improve performance. This is because the resolving stage takes much more time, while sorting is an in-memory operation. When the quicksort algorithm is used, sorting is very fast even for large buffers.

  • -Dpool.buffer.size - the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:

    • a smaller buffer reduces the memory required;
    • a larger buffer reduces the overhead, because threads are less likely to wait for the operations they depend on, so the CPU stays busy most of the time.
  • -Dlru.cache.type - the valid values are synch/lockfree, the default value is synch. It determines which type of the ‘least recently used’ cache will be used. The recommended value for LoadRDF is lockfree as it performs better in the parallel mode than the serial one. The option is set in the command line scripts.

  • -Dinfer.pool.size - the number of inference threads in parallel mode. The default value is the number of cores of the machine processor or 4, as set in the command line scripts. A bigger pool theoretically means faster load, if there are enough unoccupied cores and the inference does not wait for the other load stages to complete.
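The per-index sorting described for -Dpool.size can be illustrated with a minimal sketch. It is written in Python for brevity and is not the tool's Java implementation: the tuple layout (subject ID, predicate ID, object ID), the function names and the sample IDs are all assumptions made for the illustration. Each index gets its own pool with a single worker, mirroring the default pool size of 1.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical buffer of resolved statements: (subject_id, predicate_id, object_id).
buffer = [(3, 10, 7), (1, 11, 5), (2, 10, 9), (1, 10, 8)]

def sort_pso(statements):
    """Order by predicate, then subject, then object."""
    return sorted(statements, key=lambda s: (s[1], s[0], s[2]))

def sort_pos(statements):
    """Order by predicate, then object, then subject."""
    return sorted(statements, key=lambda s: (s[1], s[2], s[0]))

# One pool per index, each with one sorting thread (the documented default).
with ThreadPoolExecutor(max_workers=1) as pso_pool, \
        ThreadPoolExecutor(max_workers=1) as pos_pool:
    pso_future = pso_pool.submit(sort_pso, buffer)
    pos_future = pos_pool.submit(sort_pos, buffer)
    pso_sorted = pso_future.result()
    pos_sorted = pos_future.result()

print(pso_sorted)
print(pos_sorted)
```

The same buffer yields a different order for each index, which is why every index needs its own sorted copy before loading.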

Using loadrdf class programmatically

Because the LoadRDF tool uses parallel loading, you can also use the loadrdf class programmatically. For example, you can write your own Java tool that uses parallel loading internally.

One option is to pass an InputStream to the parallel loader. It parses the file internally and fills buffers of statements for the next stage (resolving), which in turn prepares the resolved statements for the following stage (sorting). Finally, the sorted statements are loaded asynchronously into PSO and POS.
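The staged hand-off described above (parse, resolve, sort, load) can be sketched as a small pipeline of threads connected by queues. This is a conceptual Python illustration, not the loadrdf class or its API: every name in it (BUFFER_SIZE, parse, resolve, sort_and_load, run_pipeline) is hypothetical, the input is a simplified N-Triples-like list of lines, and both indices are filled by one thread for brevity.

```python
import queue
import threading

BUFFER_SIZE = 2   # stand-in for the statement buffer size (hypothetical value)
DONE = object()   # sentinel marking the end of the stream

def parse(lines, out):
    """Stage 1: split each line into (s, p, o) terms and emit full buffers."""
    buf = []
    for line in lines:
        s, p, o = line.rstrip(" .").split(" ", 2)
        buf.append((s, p, o))
        if len(buf) == BUFFER_SIZE:
            out.put(buf)
            buf = []
    if buf:
        out.put(buf)
    out.put(DONE)

def resolve(inp, out, ids):
    """Stage 2: replace every term with an integer ID."""
    while (buf := inp.get()) is not DONE:
        out.put([tuple(ids.setdefault(t, len(ids) + 1) for t in st) for st in buf])
    out.put(DONE)

def sort_and_load(inp, pso, pos):
    """Stages 3 and 4: sort each buffer per index, then append it."""
    while (buf := inp.get()) is not DONE:
        pso.extend(sorted(buf, key=lambda s: (s[1], s[0], s[2])))
        pos.extend(sorted(buf, key=lambda s: (s[1], s[2], s[0])))

def run_pipeline(lines):
    parsed, resolved = queue.Queue(), queue.Queue()
    ids, pso, pos = {}, [], []
    stages = [
        threading.Thread(target=parse, args=(lines, parsed)),
        threading.Thread(target=resolve, args=(parsed, resolved, ids)),
        threading.Thread(target=sort_and_load, args=(resolved, pso, pos)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return ids, pso, pos

ids, pso, pos = run_pipeline(["<a> <p> <b> .", "<b> <p> <c> .", "<a> <q> <c> ."])
```

Because each stage works on whole buffers, a stage can start on the next buffer while the following stage is still busy with the previous one.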

Another option is to supply an Iterator<Statement> (parsed from other sources or possibly generated) or a File instead of an InputStream. Both constructors require the context into which the statements will be loaded. Only statements without a context of their own go to the specified one. If you use a format that supports contexts (.trig, .trix, .nq), statements that carry a context keep it rather than using the one you specified.
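The context rule can be stated as a short sketch. The Statement type and the assign_context function below are hypothetical Python stand-ins for illustration only, not the Java classes used by the loader, and the example IRIs are made up.

```python
from typing import Iterable, Iterator, NamedTuple, Optional

class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: Optional[str] = None  # None means the statement has no context

def assign_context(statements: Iterable[Statement],
                   default_context: str) -> Iterator[Statement]:
    """Apply the default context only to statements that have none."""
    for st in statements:
        if st.context is None:
            yield st._replace(context=default_context)
        else:
            yield st  # a context from the data (e.g. TriG, N-Quads) is kept

parsed = [
    Statement("ex:a", "ex:p", "ex:b"),                # no context of its own
    Statement("ex:c", "ex:p", "ex:d", "ex:graph1"),   # carries its own context
]
loaded = list(assign_context(parsed, "ex:default"))
```

After this step the first statement is in ex:default, while the second stays in ex:graph1.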

Parallel loading accepts two -D command line options used for testing purposes - to measure the overhead of parsing and resolving vs loading data in the repository:

  • -Ddo.resolve.entities=false - only parses the data, does not proceed to resolving;
  • -Ddo.load.data=false - parses and resolves the data, but does not proceed to loading in the repository.


By default, the data is parsed, resolved and loaded in the repository.

If any of these options is specified, a descriptive message is shown in the console.
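One way to use these two switches is to time three runs over the same data (parse only, parse and resolve, full load) and subtract. The sketch below shows only the arithmetic. The function name and the timings are hypothetical, not measurements, and because the stages overlap in a pipeline the differences are rough wall-clock estimates rather than exact per-stage costs.

```python
def stage_breakdown(parse_only_s, parse_resolve_s, full_s):
    """Estimate per-stage wall-clock time from three runs of the same load.

    parse_only_s    - run with -Ddo.resolve.entities=false
    parse_resolve_s - run with -Ddo.load.data=false
    full_s          - run with neither option set
    """
    return {
        "parse": parse_only_s,
        "resolve": parse_resolve_s - parse_only_s,
        "load": full_s - parse_resolve_s,
    }

# Hypothetical timings in seconds, for illustration only.
timings = stage_breakdown(120.0, 300.0, 450.0)
print(timings)
```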

Parallel inference

The diagram below reflects the design of the parallel reasoning mechanism, which works in combination with the fast bulk load.



The workflow of parse and load is indicated with blue arrows. The workflow of reasoning is indicated with red arrows.

  1. The thread that stores statements in the PSO collection is modified to put the statements that were actually added (i.e., the ones that did not exist in the repository) in a buffer. This is where the statements that should be used for the input of the inference process are collected. The buffer is attached to the PSO collection, because (a) unlike the Context-related collections, it is always there and (b) the POS collection already takes care to update the statistics, so, it is a way of distributing the load between the POS and PSO storing threads.
  2. There are N identical inference threads, each of which consumes statements from the buffer populated in the previous step. This process starts when the storing of all statements from the buffer that is currently being loaded is completed in all collections. If the inference starts before that, there is a chance of missing valid inferences.
  3. Each of the inferencers feeds newly inferred statements into a queue (possibly a blocking queue) - a simple structure that should allow multiple threads to put statements into it with minimal contention.
  4. There is a separate thread that consumes statements from the inference results queue and produces a sorted buffer of statements, without duplicates.
  5. One ‘epoch’ of inference is done when (a) the inferencers exhaust all the statements from the buffer of newly added statements, (b) all the inference threads are done, and (c) the sorter and deduplication thread exhausts the inference results queue and completes the generation of the buffer of sorted unique inference results.
  6. The Storage Controller starts a new iteration of storage and inference epoch as it starts to consume statements from the buffer of sorted unique inferred statements.
  7. The inference process is done when, at the end of an inference epoch, the buffer of sorted unique inferred statements is empty.


The Storage Controller does not start processing the next “buffer of resolved statements” until the inference process is finished for the previous one. In other words, storage and inference work in shifts and never run in parallel.
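The alternation of storage and inference epochs can be sketched as a loop that runs until an epoch produces no inferred statements. This is a simplified Python model under stated assumptions, not the GraphDB implementation: run_epochs, the transitive rule, the subClassOf predicate name and the in-memory set standing in for the repository are all hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

SUB = "subClassOf"  # hypothetical predicate used by the demo rule

def transitive(st, stored):
    """Demo rule: chain SUB statements through the already stored ones."""
    s, p, o = st
    if p != SUB:
        return
    for s2, p2, o2 in stored:
        if p2 == SUB and s2 == o:
            yield (s, SUB, o2)
        if p2 == SUB and o2 == s:
            yield (s2, SUB, o)

def run_epochs(resolved_buffer, stored, rules, n_threads=4):
    """Process one buffer: store, infer, repeat until nothing is inferred."""
    pending = sorted(set(resolved_buffer))
    while pending:
        # Step 1: store the buffer and keep only statements actually added.
        newly_added = [st for st in pending if st not in stored]
        stored.update(newly_added)

        # Steps 2-3: N workers consume the new statements and feed one queue.
        # The stored set is only read here, since storage has finished.
        results = queue.Queue()

        def infer(st):
            for rule in rules:
                for inferred in rule(st, stored):
                    results.put(inferred)

        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(infer, newly_added))

        # Step 4: drain the queue into a sorted buffer without duplicates.
        unique = set()
        while not results.empty():
            unique.add(results.get())
        pending = sorted(unique)  # steps 5-7: an empty buffer ends the loop
    return stored

repository = set()
buffers = [[("A", SUB, "B"), ("B", SUB, "C")], [("C", SUB, "D")]]
for resolved in buffers:  # the next buffer waits for the previous one to finish
    run_epochs(resolved, repository, [transitive])
```

In this model the stored set is never modified while the inference workers run, which corresponds to storage and inference taking turns.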

There is room for further parallelisation, but we consider the implementation of this algorithm the right starting point, as it seems to strike a good balance between parallelisation and simplicity. It is therefore manageable to implement and debug within a few days. It should serve as a formal proof that this type of parallel inference can generate correct results. It will also provide a baseline for speed.