Loading data using the Preload tool

Preload is a tool for converting RDF files into GraphDB indexes at a very low level. A common use case is the initial load of datasets bigger than several billion RDF statements with no inference. Preload performs only an initial load; the load is transactional, supports stop requests, and can be resumed with consistent output even after a failure. On a standard server with an NVMe drive or fast SSDs, it can sustain a data loading speed of over 130K RDF statements per second with no speed degradation.

Preload vs LoadRDF

Although there are many similarities between LoadRDF and Preload (both tools perform a parallel, offline transformation of RDF files into a GraphDB image), their implementations differ substantially. LoadRDF uses an algorithm very similar to online data loading. As the data variety grows, the loading speed starts to drop because of page splits and tree rebalancing. After a continuous data load, the disk image becomes fragmented in the same way as it would if the RDF files were imported into a running engine.

The Preload tool eliminates this performance drop by implementing a two-phase load. In the first phase, all RDF statements are processed in memory in chunks, which are then flushed to disk as many small GraphDB images. In the second phase, all sorted chunks are merged into a single, non-fragmented repository image using a merge-join algorithm. As a result, the Preload tool requires nearly twice as much disk space to complete the import.
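Because of this temporary space overhead, it helps to point the temporary queue folder (the queue.folder option described below) at a volume with enough free capacity. A minimal sketch, assuming a hypothetical repository ID and paths:

    $ <graphdb-dist>/bin/preload -f -i my-repo -q /mnt/big-disk/preload-tmp /data/dataset.nt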

Command line options

usage: PreloadData [OPTION]... [FILE]...
    Loads data in newly created repository or overwrites existing one.
     -a,--iter.cache <arg>         chunk iterator cache size. The value will be multiplied by 1024, default is 'auto' e.g. calculated by the tool
     -b,--chunk <arg>              chunk size for partial sorting of the queues. Use 'm' for millions or 'k' for thousands, default is 'auto' e.g. calculated by the tool
     -c,--configFile <file_path>   repo definition .ttl file
     -f,--force                    overwrite existing repo
     -i,--id <repository-id>       existing repository id
     -p,--partialLoad              allow partial load of file that contains corrupt line
     -q,--queue.folder <arg>       where to store temporary data
     -r,--recursive                walk folders recursively
     -s,--stopOnFirstError         stop process if the dataset contains a corrupt file
     -t,--parsing.tasks <arg>      number of rdf parsers
     -x,--restart                  restart load, ignoring an existing recovery point
     -y,--interval <arg>           recover point interval in seconds
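For example, the following command loads all files found by walking a folder recursively into an existing repository, overwriting its current contents and using eight parsing tasks. The repository ID, thread count, and data path are illustrative placeholders:

    $ <graphdb-dist>/bin/preload -f -r -i my-repo -t 8 /data/rdf/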

There are two common cases for loading data with the Preload tool:

Loading data in a repository created from the Workbench

  1. Configure the Preload repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  2. Start GraphDB.

  3. Start a browser and go to the Workbench web application using a URL of this form: http://localhost:7200, substituting localhost and the port number 7200 as appropriate.

  4. Set up a valid license for GraphDB.

  5. Go to Setup -> Repositories.

  6. Create and configure a repository.

  7. Shut down GraphDB.

  8. Start the bulk load with the following command (a concrete example is shown after this list):

    $ <graphdb-dist>/bin/preload -f -i <repo-name> <RDF data file(s)>
    
  9. Start GraphDB.
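As a concrete illustration of step 8, assuming a repository named wines created from the Workbench and Turtle files located in /data/wines (both names are hypothetical), the command could look like this:

    $ <graphdb-dist>/bin/preload -f -i wines /data/wines/*.ttl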

Loading data in a new repository initialized by a config file

  1. Stop GraphDB.

  2. Configure the Preload repositories location by setting the property graphdb.home.data in <graphdb_dist>/conf/graphdb.properties. If the property is not set, the default repositories location is <graphdb_dist>/data.

  3. Create a configuration file.

  4. Make sure that a valid license has been configured for the Preload tool.

  5. Start the bulk load with the following command (an example invocation follows this list):

    $ <graphdb-dist>/bin/preload -c <repo-config.ttl> <RDF data file(s)>
    
  6. Start GraphDB.
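For illustration, with a hypothetical configuration file repo-config.ttl (such as the sample in the next section) and N-Triples files under /data/dump, the invocation could be:

    $ <graphdb-dist>/bin/preload -c repo-config.ttl /data/dump/part-1.nt /data/dump/part-2.nt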

A GraphDB Repository Configuration Sample

An example configuration template using a minimal set of parameters. You can add more optional parameters from the configs/templates examples:

#
# Configuration template for a GraphDB-Free repository
#
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rep: <http://www.openrdf.org/config/repository#>.
@prefix sr: <http://www.openrdf.org/config/repository/sail#>.
@prefix sail: <http://www.openrdf.org/config/sail#>.
@prefix owlim: <http://www.ontotext.com/trree/owlim#>.

[] a rep:Repository ;
    rep:repositoryID "repo-test-1" ;
    rdfs:label "My first test repo" ;
    rep:repositoryImpl [
        rep:repositoryType "graphdb:FreeSailRepository" ;
        sr:sailImpl [
            sail:sailType "graphdb:FreeSail" ;

            # ruleset to use
            owlim:ruleset "empty" ;

            # disable the context index (the data does not use contexts)
            owlim:enable-context-index "false" ;

            # indexes that speed up read queries
            owlim:enablePredicateList "true" ;
            owlim:enable-literal-index "true" ;
            owlim:in-memory-literal-properties "true" ;
        ]
    ].
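To use this sample, save it to a file (the name repo-test-1.ttl below is arbitrary) and pass it to Preload with the -c option:

    $ <graphdb-dist>/bin/preload -c repo-test-1.ttl <RDF data file(s)>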

Tuning Preload

The Preload tool accepts command-line options to fine-tune its operation. An example invocation combining these options follows the list below.

  • chunk - the size of the in-memory buffer used to sort RDF statements before they are flushed to disk. A bigger chunk consumes more RAM but reduces the number of chunks to merge. We recommend the default value of 20M for datasets of up to 20B RDF statements.
  • iter.cache - the number of triples to cache from each chunk during the merge phase. A bigger value is likely to eliminate the I/O wait time at the cost of more RAM. We recommend the default value of 64K for datasets of up to 20B RDF statements.
  • parsing.tasks - the number of parallel threads that parse the input files.
  • queue.folder - the file system location where all temporary chunks are stored.
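For instance, a hypothetical run that raises the chunk size to 40 million statements, caches 128 x 1024 triples per chunk iterator, uses four parsing tasks, and keeps temporary data on a dedicated disk could look like this (all values and paths are illustrative, not recommendations):

    $ <graphdb-dist>/bin/preload -f -i my-repo -b 40m -a 128 -t 4 -q /mnt/tmp/preload-queue /data/dataset.nt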

Resuming data loading with Preload

Loading a huge dataset is a long batch process, and a single run may take many hours. Preload supports resuming the process if something goes wrong (insufficient disk space, out of memory, etc.) and the work is terminated abnormally. In that case, data processing restarts from an intermediate restore point instead of from the beginning. The data collected for each restore point is sufficient to initialize all internal components correctly and continue the load from that moment on, thus saving time. The following options configure resuming (an example follows the list):

  • interval - sets the recovery point interval in seconds. The default is 3600 s (60 min).
  • restart - if set to true, the load starts from the beginning, ignoring any existing recovery point. The default is false.
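For example, to take a recovery point every 30 minutes, or to discard an existing recovery point and start over, the invocations could look like this (the repository ID, interval, and file names are placeholders):

    $ <graphdb-dist>/bin/preload -f -i my-repo -y 1800 /data/dataset.nt
    $ <graphdb-dist>/bin/preload -f -i my-repo -x /data/dataset.nt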