Loading data using the Preload tool

Preload is a tool for converting RDF files into GraphDB indices at a very low level. A common use case is the initial load of datasets larger than several billion RDF statements with no inference. Preload performs only an initial load; it supports stop requests, resuming, and consistent output even after a failure. On a standard server with an NVMe drive or fast SSD disks, it can sustain a data loading speed of over 130,000 RDF statements per second with no speed degradation.

Preload vs LoadRDF

Despite the many similarities between LoadRDF and Preload (both tools perform a parallel offline transformation of RDF files into a GraphDB image), there are also substantial differences in their implementation. LoadRDF uses an algorithm very similar to online data loading. As the data variety grows, the loading speed starts to drop because of page splits and tree rebalancing. After a continuous data load, the disk image becomes fragmented in the same way as it would if the RDF files were imported into a running engine.

The Preload tool eliminates this performance drop by implementing a two-phase load. In the first phase, all RDF statements are processed in memory in chunks, which are sorted and flushed to disk as many partial GraphDB images. In the second phase, all sorted chunks are merged into a single non-fragmented repository image with a merge-join algorithm. As a result, the Preload tool requires almost twice as much disk space to complete the import.
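The chunk sort-and-merge strategy is the classic external merge sort pattern. As a rough analogy only (using GNU coreutils on plain text lines, not anything Preload-specific), the two phases look like this:

```shell
# Phase 1: sort each fixed-size chunk independently and flush it to disk.
printf 'c\na\nd\nb\n' > chunk1.txt
printf 'h\ne\ng\nf\n' > chunk2.txt
sort chunk1.txt -o chunk1.sorted
sort chunk2.txt -o chunk2.sorted

# Phase 2: merge the pre-sorted chunks in one sequential pass.
# (sort -m performs a merge join over already-sorted inputs, so the
# combined output is written without any re-sorting or fragmentation.)
sort -m chunk1.sorted chunk2.sorted > merged.txt
cat merged.txt
```

Because each chunk is already sorted, the merge phase reads every input exactly once and writes the final image sequentially, which is why the resulting repository image is not fragmented.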

Command line options

usage: PreloadData [OPTION]... [FILE]...
    Loads data in newly created repository or overwrites existing one.
     -a,--iter.cache <arg>         chunk iterator cache size. The value will be multiplied by 1024, default is 'auto' e.g. calculated by the tool
     -b,--chunk <arg>              chunk size for partial sorting of the queues. Use 'm' for millions or 'k' for thousands, default is 'auto' e.g. calculated by the tool
     -c,--configFile <file_path>   repo definition .ttl file
     -f,--force                    overwrite existing repo
     -i,--id <repository-id>       existing repository id
     -p,--partialLoad              allow partial load of file that contains corrupt line
     -q,--queue.folder <arg>       where to store temporary data
     -r,--recursive                walk folders recursively
     -s,--stopOnFirstError         stop process if the dataset contains a corrupt file
     -t,--parsing.tasks <arg>      number of rdf parsers
     -x,--restart                  restart load, ignoring an existing recovery point
     -y,--interval <arg>           recover point interval in seconds

There are two common cases for loading data with the Preload tool:

Loading data in a repository created from the Workbench

  1. Configure the Preload repositories location by setting the graphdb.home.data property in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  2. Start GraphDB.

  3. Start a browser and go to the Workbench web application using a URL of this form: http://localhost:7200. Substitute localhost and the port number 7200 as appropriate.

  4. Go to Setup -> Repositories.

  5. Create and configure a repository.

  6. Shut down GraphDB.

  7. Start the bulk load with the following command:

    $ <graphdb-dist>/bin/preload -f -i <repo-name> <RDF data file(s)>
    
  8. Start GraphDB.

Loading data in a new repository initialized by a config file

  1. Stop GraphDB.

  2. Configure the Preload repositories location by setting the graphdb.home.data property in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  3. Create a configuration file.

  4. Start the bulk load with the following command:

    $ <graphdb-dist>/bin/preload -c <repo-config.ttl> <RDF data file(s)>
    
  5. Start GraphDB.

A GraphDB repository configuration sample

This is an example configuration template using a minimal set of parameters. You can add more optional parameters from the configs/templates examples.

#
# Configuration template for a GraphDB-Free repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;

            # ruleset to use
            owlim:ruleset "empty" ;

            # disable the context index (the data does not use contexts)
            owlim:enable-context-index "false" ;

            # indexes to speed up read queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ].

Tuning Preload

The Preload tool accepts command line options to fine-tune its operation.

  • chunk - the size of the in-memory buffer used to sort RDF statements before flushing them to disk. A bigger chunk consumes more RAM but reduces the number of chunks to merge. We recommend the default value of 20 million for datasets of up to 20 billion RDF statements.

  • iter.cache - the number of triples to cache from each chunk during the merge phase. A bigger value is likely to eliminate I/O wait time at the cost of more RAM. We recommend the default value of 64,000 for datasets of up to 20 billion RDF statements.

  • parsing.tasks - the number of parallel threads that parse the input files.

  • queue.folder - the file system location where all temporary chunks are stored.
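As an illustration, the tuning options above might be combined as follows. The repository id, file paths, and chosen values here are hypothetical placeholders, not recommended settings:

```shell
# Hypothetical tuning example: 30-million-statement chunks, a larger
# iterator cache (128 x 1024 triples per chunk), 4 parsing threads, and
# temporary chunks stored on a separate fast disk.
$ <graphdb-dist>/bin/preload -f -i repo-test-1 \
      -b 30m -a 128 -t 4 \
      -q /mnt/fast-disk/preload-tmp \
      /data/rdf/*.nt
```

Placing queue.folder on a different physical disk from the repository image can reduce I/O contention during the merge phase.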

Resuming data loading with Preload

Loading a huge dataset is a long-running batch process, and every run may take many hours. Preload supports resuming the process if something goes wrong (insufficient disk space, out of memory, etc.) and the loading terminates abnormally. In this case, data processing restarts from an intermediate restore point instead of from the beginning. The data collected for each restore point is sufficient to initialize all internal components correctly and to continue the load from that moment, thus saving time. The following options configure data resuming:

  • interval - sets the recovery point interval in seconds. The default is 3600 seconds (60 minutes).

  • restart - if set to true, the loading will start from the beginning, ignoring an existing recovery point. The default is false.
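For example (the repository id and paths are hypothetical placeholders), an interrupted load can be continued by simply re-running the same command, or started over from scratch with -x:

```shell
# Re-running the same command resumes from the last recovery point;
# here recovery points are written every 30 minutes instead of hourly.
$ <graphdb-dist>/bin/preload -i repo-test-1 -y 1800 /data/rdf/*.nt

# Ignore any existing recovery point and restart the load from scratch.
$ <graphdb-dist>/bin/preload -i repo-test-1 -x /data/rdf/*.nt
```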