Data Loading & Query Optimizations

The life cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. Loading a large dataset can take a long time: up to 12 hours for one billion statements with inference. During loading, it is therefore often helpful to use a different configuration than the one used for normal operation.

Furthermore, if you frequently reload a dataset because it gradually changes over time, the loading configuration can evolve as you become more familiar with how GraphDB behaves with this dataset. Many dataset properties (such as the number of unique entities) only become apparent after the initial load, and this information can be used to optimize the next load or to improve the configuration for normal operation.

Dataset loading

The following is a typical initialization life cycle:

  1. Configure a repository for best loading performance, estimating many of the parameters.

  2. Load data.

  3. Examine dataset properties.

  4. Refine loading configuration.

  5. Reload data and measure improvement.

Unless the repository has to handle queries during the initialization phase, it can be configured with the minimum number of options and indexes:

enablePredicateList = false (unless the dataset has a large number of predicates)
enable-context-index = false
in-memory-literal-properties = false

Normal operation

The size of the data structures used to index entities is directly related to the number of unique entities in the loaded dataset. These data structures are always kept in memory. To get an upper bound on the number of unique entities loaded and to find the actual amount of RAM used to index them, inspect the contents of the repository's storage folder.

The total amount of memory needed to index entities is equal to the sum of the sizes of the files entities.index and entities.hash. This value can be used to determine how much memory is used and therefore how to divide the remaining memory between the cache memory, etc.

An upper bound on the number of unique entities is given by the size of entities.hash divided by 12 (memory is allocated in pages and therefore the last page will likely not be full).

The entities.index file is used to look up entries in the file entities.hash, and its size is equal to the value of the entity-index-size parameter multiplied by 4. Therefore, the entity-index-size parameter has less to do with efficient use of memory and more to do with the performance of entity indexing and lookup. The larger this value, the fewer collisions occur in the entities.hash table. A reasonable size for this parameter is at least half the number of unique entities. However, the size of this data structure never changes once the repository is created, so this knowledge can only be used to adjust the value for the next clean load of the dataset into a new (empty) repository.
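
The arithmetic above can be sketched in Python. This is only an illustration: the function name, the dictionary keys, and the idea of passing the storage folder path as an argument are hypothetical and not part of GraphDB. The constants 12 and 4 and the "at least half the number of unique entities" guideline are taken from the text.

```python
from pathlib import Path

HASH_BYTES_PER_ENTITY = 12  # from the text: size of entities.hash divided by 12
INDEX_BYTES_PER_SLOT = 4    # from the text: entities.index = entity-index-size * 4


def entity_index_report(storage_dir):
    """Summarize entity index sizes for a storage folder (hypothetical helper)."""
    storage = Path(storage_dir)
    hash_bytes = (storage / "entities.hash").stat().st_size
    index_bytes = (storage / "entities.index").stat().st_size

    # Upper bound only: the last allocated page is likely not full.
    max_entities = hash_bytes // HASH_BYTES_PER_ENTITY

    return {
        # Memory needed to index entities: sum of the two file sizes.
        "entity_memory_bytes": hash_bytes + index_bytes,
        "max_unique_entities": max_entities,
        # entity-index-size the repository was created with.
        "current_entity_index_size": index_bytes // INDEX_BYTES_PER_SLOT,
        # Guideline: at least half the number of unique entities (rounded up).
        "suggested_min_entity_index_size": (max_entities + 1) // 2,
    }
```

If the current entity-index-size is below the suggested minimum, the larger value can only take effect on the next clean load into a new repository.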

The following parameters can be adjusted:

Parameter                       Description
entity-index-size               Set to a large enough value.
enablePredicateList             Can speed up queries (and loading).
enable-context-index            Provides better performance when executing queries that use contexts.
in-memory-literal-properties    Defines whether to keep the properties of each literal in-memory.

Furthermore, the inference semantics can be adjusted by choosing a different ruleset. However, this will require a reload of the whole repository, otherwise some inferences may remain in the wrong location.

Note

The optional indexes can be built at a later point when the repository is used for query answering. You need to experiment using typical query patterns from the user environment.

GraphDB’s optional indexes

Predicate lists

Predicate lists are two indexes (SP and OP) that can improve performance in the following situations:

  • When loading/querying datasets that have a large number of predicates;

  • When executing queries or retrieving statements that use a wildcard in the predicate position, e.g., the statement pattern: dbpedia:Human ?predicate dbpedia:Land.

As a rough guideline, a dataset with more than about 1,000 predicates will benefit from using these indexes for both loading and query answering. Predicate list indexes are not enabled by default, but can be switched on using the enablePredicateList configuration parameter.
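
The rule of thumb above can be expressed as a small Python check. This is a sketch under stated assumptions: the function and its wildcard_predicate_queries flag are hypothetical, and the threshold of 1,000 is the rough guideline from the text, not a hard limit.

```python
PREDICATE_THRESHOLD = 1000  # rough guideline from the text, not a hard limit


def suggest_predicate_list(predicate_count, wildcard_predicate_queries=False):
    """Return True if setting enablePredicateList to true looks worthwhile.

    predicate_count: number of distinct predicates in the dataset.
    wildcard_predicate_queries: whether typical queries leave the
        predicate position unbound (e.g. dbpedia:Human ?predicate dbpedia:Land).
    """
    return predicate_count > PREDICATE_THRESHOLD or wildcard_predicate_queries
```

For example, suggest_predicate_list(250) returns False, while suggest_predicate_list(250, wildcard_predicate_queries=True) returns True. Actual benefit should still be confirmed by measuring with typical queries.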

Context index

To provide better performance when executing queries that use contexts, you can use the context index CPSO. It is enabled by using the enable-context-index configuration parameter.

Cache/index monitoring and optimizations

Statistics are kept for the main index data structures, and include information such as cache hits/misses, file reads/writes, etc. This information can be used to fine-tune GraphDB memory configuration, and can be useful for ‘debugging’ certain situations, such as understanding why load performance changes over time or with particular datasets.

[Figure: global cache metrics]

For each index, a CollectionStatistics MBean is published, which shows the cache and file I/O values updated in real time:

Package       com.ontotext
MBean name    CollectionStatistics

The following information is displayed for each MBean/index:

Attribute                Description
CacheHits                The number of operations completed without accessing the storage system.
CacheMisses              The number of operations completed that needed to access the storage system.
FlushInvocations
FlushReadItems
FlushReadTimeAverage
FlushReadTimeTotal
FlushWriteItems
FlushWriteTimeAverage
FlushWriteTimeTotal
PageDiscards             The number of times a non-dirty page's memory was reused to read in another page.
PageSwaps                The number of times a page was written to the disk, so its memory could be used to load another page.
Reads                    The total number of times an index was searched for a statement or a range of statements.
Writes                   The total number of times a statement was added to a collection.

The following operations are available:

Operation        Description
resetCounters    Resets all the counters for this index.

Ideally, the system should be configured to keep the number of cache misses to a minimum. If the ratio of hits to misses is low, consider increasing the memory available to the index (if other factors permit this).
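
As a minimal sketch, the CacheHits and CacheMisses counters can be turned into a hit rate in Python. The helper names and the 0.9 threshold are assumptions chosen for illustration; the text does not specify what counts as a "low" ratio. The sketch uses hits divided by total accesses rather than the raw hits-to-misses ratio, so the result always lies between 0 and 1.

```python
def cache_hit_ratio(cache_hits, cache_misses):
    """Fraction of operations served from the cache, or None if no operations yet."""
    total = cache_hits + cache_misses
    if total == 0:
        return None
    return cache_hits / total


def needs_more_cache(cache_hits, cache_misses, min_ratio=0.9):
    """True if the hit rate is below min_ratio (an arbitrary, illustrative threshold)."""
    ratio = cache_hit_ratio(cache_hits, cache_misses)
    return ratio is not None and ratio < min_ratio
```

Reading the counters after calling resetCounters and then running a representative workload gives a ratio that reflects that workload alone.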

Page swaps tend to occur much more often during large scale data loading. Page discards occur more frequently during query evaluation.

Query optimizations

GraphDB uses a number of query optimization techniques by default. They can be disabled by setting the enable-optimization configuration parameter to false, but there is rarely any need to do this. See GraphDB's Explain Plan for a way to view query plans and applied optimizations.

Caching literal language tags

This optimization applies when the repository contains a large number of literals with language tags, and it is necessary to execute queries that filter based on language, e.g., using the following SPARQL query construct:

FILTER ( langMatches(lang(?name), "es") )

In this situation, the in-memory-literal-properties configuration parameter can be set to true, causing the data values with language tags to be cached.

Not enumerating sameAs

During query answering, all URIs from each equivalence class produced by the sameAs optimization are enumerated. You can use the onto:disable-sameAs pseudo-graph (see Other special query behavior) to significantly reduce these duplicate results (by returning a single representative from each equivalence class).

Consider these example queries executed against the FactForge combined dataset. Here, the default is to enumerate:

PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE { ?c rdfs:subClassOf dbpedia:Airport}

producing many results:

dbpedia:Air_strip
http://sw.cyc.com/concept/Mx4ruQS1AL_QQdeZXf-MIWWdng
umbel-sc:CommercialAirport
opencyc:Mx4ruQS1AL_QQdeZXf-MIWWdng
dbpedia:Jetport
dbpedia:Airstrips
dbpedia:Airport
fb:guid.9202a8c04000641f800000000004ae12
opencyc-en:CommercialAirport

If you specify the onto:disable-sameAs pseudo-graph:

PREFIX onto: <http://www.ontotext.com/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * FROM onto:disable-sameAs
WHERE {?c rdfs:subClassOf dbpedia:Airport}

only two results are returned:

dbpedia:Air_strip
opencyc-en:CommercialAirport

The Expand results over equivalent URIs checkbox in the GraphDB Workbench SPARQL editor plays a similar role, but with the opposite sense: checking it turns enumeration on, whereas the pseudo-graph turns enumeration off.

Warning

If the query uses a filter over the textual representation of a URI, e.g., filter(strstarts(str(?x),"http://dbpedia.org/ontology")), this may omit some valid solutions, as not all URIs within an equivalence class are matched against the filter.
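
The effect described in this warning can be illustrated with a toy Python model. It is not GraphDB's implementation: the two URIs are hypothetical members of one equivalence class, and picking the first member as the representative is an assumption made only for the example.

```python
def filter_by_prefix(uris, prefix):
    """Keep only the URIs whose string form starts with the given prefix."""
    return [u for u in uris if u.startswith(prefix)]


# Hypothetical equivalence class produced by owl:sameAs.
equivalence_class = [
    "http://dbpedia.org/resource/Airport",
    "http://dbpedia.org/ontology/Airport",
]
# Assumption: the engine returns this member as the single representative.
representative = equivalence_class[0]

prefix = "http://dbpedia.org/ontology"

# Enumerating the whole class: the ontology URI passes the filter.
expanded = filter_by_prefix(equivalence_class, prefix)

# Only the representative is tested: the matching member is never seen.
collapsed = filter_by_prefix([representative], prefix)

print(expanded)   # ['http://dbpedia.org/ontology/Airport']
print(collapsed)  # []
```

In this model the filter removes the whole equivalence class when only the representative is tested, even though one of its members would have matched.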

Index compacting

In some cases, database indexes become fragmented over time as updates accumulate. This may lead to a slowdown in data import.

Index compacting is a useful method to tackle this. To trigger it, run the following SPARQL update:

INSERT DATA {
    [] <http://www.ontotext.com/compactIndexes> [] .
}

This will:

  1. Shut down the repository internally.

  2. Scan the indexes.

  3. Rebuild them.

  4. Reinitialize the repository.
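
The update above can also be sent over HTTP. The following Python sketch only builds the request. It assumes the repository is reachable through the RDF4J-style REST endpoint /repositories/<id>/statements, which accepts a form-encoded update parameter. The function name, the base URL, and the repository ID are placeholders, not values from this document.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The SPARQL update from the text that triggers index compacting.
COMPACT_UPDATE = """INSERT DATA {
    [] <http://www.ontotext.com/compactIndexes> [] .
}"""


def build_compact_request(base_url, repository_id):
    """Build (but do not send) a POST request carrying the compacting update."""
    url = f"{base_url.rstrip('/')}/repositories/{repository_id}/statements"
    body = urlencode({"update": COMPACT_UPDATE}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

To send it, pass the result to urllib.request.urlopen, for example urlopen(build_compact_request("http://localhost:7200", "my-repo")), where both arguments are assumed values. Because the repository is shut down and reinitialized internally, it is sensible to run this only when no other clients depend on the repository.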

Warning

Index compacting is only suitable for specific cases.