Loading data using the LoadRDF tool¶
LoadRDF is a tool designed for offline loading of data sets. It cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data by serializing them directly into GraphDB’s internal indexes and producing a ready-to-use repository.
The LoadRDF tool resides in the bin/ folder of the GraphDB distribution. It loads data into a new repository, created from the Workbench or from the standard configuration Turtle file found in configs/templates, or into an existing repository. In the latter case, the repository data is automatically overwritten.
Warning
During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.
Note
For loading datasets bigger than several billion RDF statements, consider using the Preload tool.
Command Line Options¶
usage: loadrdf [OPTION]... [FILE]...
Loads data in a newly created repository or overwrites an existing one.
 -c,--configFile <file_path>     repo definition .ttl file
 -f,--force                      overwrite existing repo
 -i,--id <repository-id>         existing repository id
 -m,--mode <serial|parallel>     singlethread | multithread parse/load/infer
 -p,--partialLoad                allow partial load of file that contains
                                 corrupt line
 -s,--stopOnFirstError           stop process if the dataset contains a
                                 corrupt file
 -v,--verbose                    print metrics during load
The mode specifies the way the data is loaded in the repository:
- serial: parsing is followed by entity resolution, which is then followed by load, optionally followed by inference, all done in a single thread.
- parallel: multi-threaded parse, entity resolution, load and inference. This gives a significant boost when loading large data sets with inference enabled.
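As an illustration of how the options and modes above combine, here is a minimal Python sketch that assembles a loadrdf argument list. Only flags from the usage text are used. The function name `build_loadrdf_command`, the default `bin/loadrdf` path, and the rule that exactly one of `-i` or `-c` is passed are assumptions of this sketch, mirroring the two use cases described in this document; the tool itself may accept other combinations.

```python
import shlex
from typing import List, Optional, Sequence


def build_loadrdf_command(
    files: Sequence[str],
    loadrdf: str = "bin/loadrdf",        # assumed path inside the GraphDB distribution
    repo_id: Optional[str] = None,       # -i: existing repository id
    config_file: Optional[str] = None,   # -c: repository definition .ttl file
    mode: str = "parallel",              # -m: serial | parallel
    force: bool = False,                 # -f: overwrite existing repo
    verbose: bool = False,               # -v: print metrics during load
) -> List[str]:
    """Assemble a loadrdf argument list from the documented options."""
    if mode not in ("serial", "parallel"):
        raise ValueError("mode must be 'serial' or 'parallel'")
    # Assumption of this sketch: target either an existing repository (-i)
    # or a new one defined by a config file (-c), never both.
    if (repo_id is None) == (config_file is None):
        raise ValueError("pass exactly one of repo_id (-i) or config_file (-c)")

    cmd = [loadrdf]
    if force:
        cmd.append("-f")
    if repo_id is not None:
        cmd += ["-i", repo_id]
    else:
        cmd += ["-c", config_file]
    cmd += ["-m", mode]
    if verbose:
        cmd.append("-v")
    cmd += list(files)
    return cmd


if __name__ == "__main__":
    # Hypothetical repository id and file name, for illustration only.
    cmd = build_loadrdf_command(["data.ttl"], repo_id="my-repo", force=True)
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps file names with spaces intact if the result is later passed to `subprocess.run`.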
Note
The LoadRDF tool supports .zip and .gz files, and directories. If specified, the directories can be processed recursively.
There are two common cases for loading data with the LoadRDF tool:
Load data in a repository created from the Workbench¶
1. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If no property is set, the default repositories location will be <graphdb_dist>/data.
2. Start GraphDB.
3. Start a browser and go to the Workbench web application using a URL of this form: http://localhost:7200, substituting localhost and the 7200 port number as appropriate.
4. Set up a valid license for GraphDB.
5. Go to Setup -> Repositories and create the repository to load the data into.
6. Shut down GraphDB.
7. Start the bulk load with the following command:
   $ <graphdb-dist>/bin/loadrdf -f -i <repo-name> -m parallel <RDF data file(s)>
8. Start GraphDB.
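To check which repositories location applies before running the load, the lookup described in the first step can be sketched in Python. This is a simplified reading of the properties file, not how GraphDB itself parses it: the function name `repositories_location` is hypothetical, only `key = value` lines are handled, and line continuations or escapes are ignored.

```python
from pathlib import Path


def repositories_location(graphdb_dist: str) -> Path:
    """Return the value of graphdb.home.data, or <graphdb_dist>/data if unset."""
    dist = Path(graphdb_dist)
    props = dist / "conf" / "graphdb.properties"
    if props.is_file():
        for raw in props.read_text(encoding="utf-8").splitlines():
            line = raw.strip()
            # Skip blank lines and comment lines.
            if not line or line.startswith(("#", "!")):
                continue
            # Assumption: the property is written as "key = value".
            key, sep, value = line.partition("=")
            if sep and key.strip() == "graphdb.home.data":
                return Path(value.strip())
    # No property set: fall back to the default location named above.
    return dist / "data"
```

For example, `repositories_location("/opt/graphdb")` returns `/opt/graphdb/data` when the property is not set (the `/opt/graphdb` path is illustrative).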
Load data in a new repository initialized by a config file¶
1. Stop GraphDB.
2. Configure the LoadRDF repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If no property is set, the default repositories location will be <graphdb_dist>/data.
3. Create a configuration file.
4. Make sure that a valid license has been configured for the LoadRDF tool.
5. Start the bulk load with the following command:
   $ <graphdb-dist>/bin/loadrdf -c <repo-config.ttl> -m parallel <RDF data file(s)>
6. Start GraphDB.
A GraphDB Repository Configuration Sample¶
The following example configuration template uses a minimal set of parameters. You can add more optional parameters from the examples in configs/templates:
#
# Configuration template for a GraphDB-Free repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;
            # ruleset to use
            owlim:ruleset "rdfsplus-optimized" ;
            # disable the context index (because my data does not use contexts)
            owlim:enable-context-index "false" ;
            # indexes to speed up the read queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ].
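Before passing such a file to loadrdf with -c, it can help to double-check the rep:repositoryID value it declares. The following Python sketch does this with a regular expression; it is a quick text check under the assumption that the id is written as a plain quoted string, as in the sample above, and it is not a Turtle parser. The function name `repository_id` is hypothetical.

```python
import re
from typing import Optional

# Matches: rep:repositoryID "some-id"
_REPO_ID = re.compile(r'rep:repositoryID\s+"([^"]+)"')


def repository_id(config_text: str) -> Optional[str]:
    """Return the first rep:repositoryID value found outside comment lines."""
    for line in config_text.splitlines():
        # Ignore full-line Turtle comments.
        if line.strip().startswith("#"):
            continue
        match = _REPO_ID.search(line)
        if match:
            return match.group(1)
    return None
```

The function returns None when no id is found, which makes a missing or commented-out rep:repositoryID easy to detect.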
Tuning LoadRDF¶
The LoadRDF tool accepts Java command line options, passed using -D. To change them, edit the command line script.
The following options can tune the behaviour of the parallel loading:
- -Dpool.buffer.size: the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
  - a smaller buffer reduces the memory required;
  - a bigger buffer reduces the overhead, because threads are less likely to wait for the operations they depend on, so the CPU is used intensively most of the time.
- -Dinfer.pool.size: the number of inference threads in parallel mode. The default value is the number of cores of the machine processor or 4, as set in the command line scripts. A bigger pool theoretically means a faster load if there are enough unoccupied cores and the inference does not wait for the other load stages to complete.
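The two options follow the usual Java system property syntax, -Dname=value. As a minimal sketch, the Python helper below formats them as strings that could then be added to the command line script. The function name `tuning_options` is hypothetical, and where exactly the script expects these options is not covered here. Options left as None are omitted, so the defaults set in the script still apply.

```python
from typing import List, Optional


def tuning_options(
    pool_buffer_size: Optional[int] = None,  # statements per stage buffer
    infer_pool_size: Optional[int] = None,   # inference threads in parallel mode
) -> List[str]:
    """Format the parallel-load tuning options as -Dname=value strings."""
    opts: List[str] = []
    if pool_buffer_size is not None:
        if pool_buffer_size <= 0:
            raise ValueError("pool_buffer_size must be positive")
        opts.append(f"-Dpool.buffer.size={pool_buffer_size}")
    if infer_pool_size is not None:
        if infer_pool_size <= 0:
            raise ValueError("infer_pool_size must be positive")
        opts.append(f"-Dinfer.pool.size={infer_pool_size}")
    return opts
```

For instance, `tuning_options(pool_buffer_size=100000)` halves the buffer relative to the documented default of 200,000 statements, trading some overhead for lower memory use.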