Loading Data Using the ImportRDF Tool¶
ImportRDF is a tool designed for offline loading of datasets; it cannot be used against a running server. The rationale for an offline tool is to achieve optimal performance when loading large amounts of RDF data by serializing them directly into GraphDB's internal indexes and producing a ready-to-use repository.
The ImportRDF tool resides in the bin folder of the GraphDB distribution. It loads data into a new repository created from the Workbench or from the standard configuration Turtle file found in configs/templates, or uses an existing repository. In the latter case, the repository data is automatically overwritten.
Note
Before using the methods below, make sure you have set up a valid GraphDB license.
Important
The ImportRDF tool cannot be used in a cluster setup as it would break the cluster consistency.
Load vs Preload¶
The ImportRDF tool supports two sub-commands: Load and Preload (provided as separate commands in GraphDB versions 9.x and older).
Despite the many similarities between Load and Preload (both perform a parallel offline transformation of RDF files into a GraphDB image), there are substantial differences in their implementation. Load uses an algorithm very similar to online data loading. As the data variety grows, the loading speed starts to drop because of page splits and tree rebalancing. After a continuous data load, the disk image becomes fragmented in the same way as it would if the RDF files were imported into a running engine.
Preload eliminates this performance drop by implementing a two-phase load. In the first phase, all RDF statements are processed in-memory in chunks, which are later flushed to disk as many GraphDB images. Then, all sorted chunks are merged into a single non-fragmented repository image with a merge join algorithm. As a result, the Preload sub-command requires almost twice as much disk space to complete the import.
Preload does not perform inference on the data.
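For comparison, both sub-commands are invoked in the same way; only the algorithm behind them differs. In the sketch below, the repository ID myrepo and the file dataset.ttl are placeholders:

# Load: algorithm close to online loading; applies the repository's ruleset
$ <graphdb-dist>/bin/importrdf load -f -i myrepo -m parallel dataset.ttl

# Preload: two-phase sort and merge; no inference; needs roughly twice the disk space
$ <graphdb-dist>/bin/importrdf preload -f -i myrepo dataset.ttl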
Warning
During the bulk load, the GraphDB plugins are ignored in order to speed up the process. Afterwards, when the server is started, the plugin data can be rebuilt.
Note
The ImportRDF tool supports valid RDF, .zip and .gz files, and directories.
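For example, compressed archives and whole directories can be mixed in a single invocation (the file and directory names below are placeholders):

# A gzipped N-Quads file, a ZIP archive, and a directory of RDF files loaded in one run
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> dataset.nq.gz archive.zip /data/rdf-dumps/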
Command line options¶
See the supported Load command line options.
See the supported Preload command line options.
Loading data¶
There are two ways to load data with the ImportRDF tool:
Into a repository created from the Workbench¶
Configure the ImportRDF repository location by setting the graphdb.home.data property in conf/graphdb.properties. If no property is set, the default repository location will be the data directory of the GraphDB distribution. (A sample properties snippet is shown after this procedure.)
Start GraphDB.
In a browser, open the Workbench web application at http://localhost:7200. If necessary, substitute localhost and the 7200 port number as appropriate.
Create the new repository from the Workbench.
Stop GraphDB.
Start the bulk load with the following command:
$ <graphdb-dist>/bin/importrdf load -f -i <repo-id> -m parallel <RDF data file(s)>
or if using the preload sub-command:
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> <RDF data file(s)>
Start GraphDB.
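For reference, the repository location from the first step could be set in conf/graphdb.properties like this (the path is only an illustrative example):

# conf/graphdb.properties
# Illustrative location; by default, the data directory of the GraphDB distribution is used.
graphdb.home.data = /var/lib/graphdb/data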
Into a repository created from a config file¶
Stop GraphDB.
Configure the ImportRDF repository location by setting the graphdb.home.data property in conf/graphdb.properties. If no property is set, the default repository location will be the data directory of the GraphDB distribution.
Create a configuration file.
Start the bulk load with the following command:
$ <graphdb-dist>/bin/importrdf load -c <repo-config.ttl> -m parallel <RDF data file(s)>
or if using the preload sub-command:
$ <graphdb-dist>/bin/importrdf preload -f -c <repo-config.ttl> <RDF data file(s)>
Start GraphDB.
Repository configuration template¶
This is an example configuration template using a minimal set of parameters. You can add more optional parameters from the configs/templates example:
# Configuration template for a GraphDB repository
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix graphdb: <http://www.ontotext.com/trree/graphdb#>.
[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:SailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:Sail" ;
            # ruleset to use
            graphdb:ruleset "empty" ;
            # disable the context index (because this data does not use contexts)
            graphdb:enable-context-index "false" ;
            # indexes to speed up the read queries
            graphdb:enablePredicateList "true" ;
            graphdb:enable-literal-index "true" ;
            graphdb:in-memory-literal-properties "true" ;
        ]
    ].
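Assuming the template above is saved as repo-config.ttl, it can be passed to either sub-command via the -c option. After the import, the repository is available under the ID defined in the template (repo-test-1 in this example):

# File and data names are placeholders
$ <graphdb-dist>/bin/importrdf preload -f -c repo-config.ttl statements.ttl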
Tuning Load¶
The ImportRDF tool accepts Java command line options using -D. Supply them before the sub-command as follows:
$ <graphdb-dist>/bin/importrdf -Dgraphdb.inference.concurrency=6 load -c <repo-config.ttl> -m parallel <RDF data file(s)>
The following options are used to fine-tune the behavior of the Load sub-command:
-Dgraphdb.inference.buffer: the buffer size (the number of statements) for each stage. Defaults to 200,000 statements. You can use this parameter to tune the memory usage and the overhead of inserting data:
a smaller buffer size reduces the required memory;
a bigger buffer size reduces the overhead, as the operations performed by threads have a lower probability of waiting for the operations on which they rely, and the CPU is intensively used most of the time.
-Dgraphdb.inference.concurrency: the number of inference threads in parallel mode. The default value is the number of cores of the machine processor. A bigger pool theoretically means a faster load if there are enough unoccupied cores and the inference does not have to wait for the other load stages to complete.
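As an illustration, both options can be combined in a single invocation. The values below are placeholders rather than recommendations, and assume a machine with enough free cores and RAM:

$ <graphdb-dist>/bin/importrdf -Dgraphdb.inference.buffer=400000 -Dgraphdb.inference.concurrency=8 load -f -i <repo-id> -m parallel <RDF data file(s)>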
Tuning Preload¶
The Preload sub-command accepts the following options to fine-tune its operation:
--chunk: the size of the in-memory buffer used to sort RDF statements before flushing them to disk. A bigger chunk consumes additional RAM and reduces the number of chunks to merge. We recommend the default value of 20 million for datasets of up to 20 billion RDF statements.
--iterator-cache: the number of triples to cache from each chunk during the merge phase. A bigger value is likely to eliminate the I/O wait time at the cost of more RAM. We recommend the default value of 64,000 for datasets of up to 20 billion RDF statements.
--parsing-tasks: controls how many parallel threads parse the input files.
--queue-folder: controls the file system location where all temporary chunks are stored.
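A sketch of a Preload invocation combining these options; the values and the temporary folder path are illustrative only, and space-separated option values are assumed:

$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> --chunk 40000000 --iterator-cache 128000 --parsing-tasks 4 --queue-folder /mnt/fast-disk/preload-tmp <RDF data file(s)>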
Resuming data loading with Preload¶
Loading a huge dataset is a long batch process, and every run may take many hours. Preload supports resuming the process if something goes wrong (insufficient disk space, out of memory, etc.) and the loading is terminated abnormally. In this case, the data processing restarts from an intermediate restore point instead of from the beginning. The data collected for the restore points is sufficient to initialize all internal components correctly and to continue the load normally from that moment, thus saving time. The following options can be used to configure data resuming:
--interval: sets the recovery point interval in seconds. The default is 3,600 s (60 min).
--restart: if set to true, the loading will start from the beginning, ignoring any existing recovery point. The default is false.
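For example (placeholder values; both options are passed after the preload sub-command):

# Write a recovery point every 30 minutes instead of the default 60
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> --interval 1800 <RDF data file(s)>

# Ignore any existing recovery point and start the load from scratch
$ <graphdb-dist>/bin/importrdf preload -f -i <repo-id> --restart true <RDF data file(s)>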