Loading data using the LoadRDF tool¶
LoadRDF is a tool designed for offline loading of datasets; it cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data, by serializing them directly into GraphDB's internal indexes and producing a ready-to-use repository.
The LoadRDF tool resides in the bin/ folder of the GraphDB distribution. It loads data in a new repository created from the Workbench or from the standard configuration Turtle file found in configs/templates, or uses an existing repository. In the latter case, the repository data is automatically overwritten.
Warning
During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.
Note
For loading datasets bigger than several billion RDF statements, consider using the Preload tool.
Command line options¶
usage: loadrdf [OPTION]... [FILE]...
Loads data in a newly created repository or overwrites an existing one.
-c,--configFile <file_path> repo definition .ttl file
-f,--force overwrite existing repo
-i,--id <repository-id> existing repository id
-m,--mode <serial|parallel> singlethread | multithread parse/load/infer
-p,--partialLoad allow partial load of file that contains
corrupt line
-s,--stopOnFirstError stop process if the dataset contains a
corrupt file
-v,--verbose print metrics during load
The mode specifies the way the data is loaded in the repository:
serial
: parsing is followed by entity resolution, which is then followed by load, optionally followed by inference, all done in a single thread.
parallel
: uses multi-threaded parse, entity resolution, load, and inference. This gives a significant boost when loading large datasets with inference enabled.
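For example, the same load can be run in either mode; the repository id and file name below are placeholders:
$ <graphdb-dist>/bin/loadrdf -f -i my-repo -m serial dataset.ttl
$ <graphdb-dist>/bin/loadrdf -f -i my-repo -m parallel dataset.ttl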
Note
The LoadRDF tool supports .zip and .gz files, as well as directories. If directories are specified, they can be processed recursively.
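For example, a directory of gzipped RDF files (the path is a placeholder) can be passed directly, relying on the recursive processing described above:
$ <graphdb-dist>/bin/loadrdf -f -i <repo-name> -m parallel /path/to/rdf-files/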
Note
To load data using the methods below, make sure to set up a valid GraphDB license after installing GraphDB.
There are two common cases for loading data with the LoadRDF tool:
Load data in a repository created from the Workbench¶
1. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties (see the example after these steps). If no property is set, the default repositories location will be <graphdb_dist>/data.
2. Start GraphDB.
3. Start a browser and go to the Workbench web application using a URL of this form: http://localhost:7200. Substitute localhost and the 7200 port number as appropriate.
4. Go to Setup ‣ Repositories and create a new repository.
5. Shut down GraphDB.
6. Start the bulk load with the following command:
$ <graphdb-dist>/bin/loadrdf -f -i <repo-name> -m parallel <RDF data file(s)>
7. Start GraphDB.
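For example, the relevant line in <graphdb_dist>/conf/graphdb.properties might look like this (the path is an arbitrary illustration):
# Custom repositories location (example path)
graphdb.home.data = /var/opt/graphdb/data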
Load data in a new repository initialized by a config file¶
1. Stop GraphDB.
2. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If no property is set, the default repositories location will be <graphdb_dist>/data.
3. Create a configuration file (a sample is provided below).
4. Start the bulk load with the following command:
$ <graphdb-dist>/bin/loadrdf -c <repo-config.ttl> -m parallel <RDF data file(s)>
5. Start GraphDB.
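For instance, combining the -v option from the usage listing with a configuration file and several compressed files (all file names are placeholders):
$ <graphdb-dist>/bin/loadrdf -c repo-config.ttl -m parallel -v dataset-part1.nt.gz dataset-part2.nt.gz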
A GraphDB repository configuration sample¶
The following configuration template uses a minimal set of parameters. You can add more of the optional parameters from the configs/templates examples:
#
# Configuration template for a GraphDB-EE repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "owlim:ReplicationClusterWorker" ;
        rep:delegate [
            rep:repositoryType "owlim:MonitorRepository" ;
            sr:sailImpl [
                sail:sailType "owlimClusterWorker:Sail" ;
                # ruleset to use
                owlim:ruleset "rdfsplus-optimized" ;
                # disable the context index (because the data does not use contexts)
                owlim:enable-context-index "false" ;
                # indexes that speed up read queries
                owlim:enablePredicateList "true" ;
                owlim:enable-literal-index "true" ;
                owlim:in-memory-literal-properties "true" ;
            ]
        ]
    ].
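Assuming the template above is saved as repo-test-1.ttl (a hypothetical file name), the bulk load would create and populate the repo-test-1 repository:
$ <graphdb-dist>/bin/loadrdf -c repo-test-1.ttl -m parallel <RDF data file(s)>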
Tuning LoadRDF¶
The LoadRDF tool accepts Java command line options using -D. To change them, edit the command line script.
The following options can tune the behavior of the parallel loading (see the example after this list):
-Dpool.buffer.size
: the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
- a smaller buffer size reduces the memory required;
- a bigger buffer size reduces the overhead, as the threads are less likely to wait for the operations they depend on, and the CPU is intensively used most of the time.
-Dinfer.pool.size
: the number of inference threads in parallel mode. The default value is the number of cores of the machine's processor, or 4, as set in the command line scripts. A bigger pool theoretically means a faster load, if there are enough unoccupied cores and the inference does not wait for the other load stages to complete.
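As an illustration only, the JVM options line in the edited loadrdf script might include values like these (both numbers are arbitrary examples):
# Larger stage buffers and eight inference threads (hypothetical values)
-Dpool.buffer.size=500000 -Dinfer.pool.size=8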