Data loading & query optimisations
The life cycle of a repository instance typically starts with the initial loading of datasets, followed by the processing of queries and updates. Loading a large dataset can take a long time: up to 12 hours for a billion statements with inference. It is therefore often helpful to load with a different configuration from the one used for normal operation.
Furthermore, if you frequently reload a dataset because it gradually changes over time, the loading configuration can evolve as you become more familiar with how GraphDB behaves with this dataset. Many dataset properties, such as the number of unique entities, only become apparent after the initial load. This information can be used to optimise the loading step for the next round or to improve the configuration for normal operation.
Dataset loading
The following is a typical initialisation life-cycle:
- Configure a repository for the best loading performance, using estimates for the parameters whose ideal values are not yet known.
- Load data.
- Examine dataset properties.
- Refine loading configuration.
- Reload data and measure improvement.
Unless the repository has to answer queries during the initialisation phase, it can be configured with the minimum number of options and indices:
enablePredicateList = false (unless the dataset has a large number of predicates)
enable-context-index = false
in-memory-literal-properties = false
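As a minimal sketch of the settings above, the following Python helper returns the loading-phase overrides for a dataset. The function name and the returned dictionary are illustrative assumptions, not part of GraphDB. The 1000-predicate threshold is the rough guideline given in the Predicate lists section, and how the values are applied to a repository configuration is not shown here.

```python
# Rough guideline from this document: datasets with more than about
# 1000 predicates benefit from predicate lists, even while loading.
PREDICATE_LIST_THRESHOLD = 1000


def loading_parameters(predicate_count: int) -> dict[str, str]:
    """Return parameter overrides for the initial loading phase.

    Hypothetical helper: it only mirrors the three settings listed above.
    """
    many_predicates = predicate_count > PREDICATE_LIST_THRESHOLD
    return {
        "enablePredicateList": "true" if many_predicates else "false",
        "enable-context-index": "false",
        "in-memory-literal-properties": "false",
    }


if __name__ == "__main__":
    # Print the overrides in the same "name = value" form used above.
    for name, value in loading_parameters(predicate_count=250).items():
        print(f"{name} = {value}")
```

If the repository has to answer queries while loading, these minimal settings may not be appropriate.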
Normal operation
The size of the data structures used to index entities is directly related to the number of unique entities in the loaded dataset. These data structures are always kept in memory. To get an upper bound on the number of unique entities loaded and to find the actual amount of RAM used to index them, it is useful to examine the contents of the storage folder.
The total amount of memory needed to index entities is equal to the sum of the sizes of the files entities.index and entities.hash. This value can be used to determine how much memory is used and therefore how to divide the remaining memory between the cache-memory, etc.
An upper bound on the number of unique entities is given by the size of entities.hash divided by 12 (memory is allocated in pages and therefore the last page will likely not be full).
The file entities.index is used to look up entries in the file entities.hash and its size is equal to the value of the entity-index-size parameter multiplied by 4. Therefore, the entity-index-size parameter has less to do with efficient use of memory and more with the performance of entity indexing and lookup. The larger this value, the fewer collisions occur in the entities.hash table. A reasonable size for this parameter is at least half the number of unique entities. However, the size of this data structure is never changed once the repository is created, so this knowledge can only be used to adjust this value for the next clean load of the dataset with a new (empty) repository.
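The arithmetic above can be sketched in Python as follows. The function names and the storage folder path are assumptions made for illustration. The constants 12 and 4 and the "at least half" rule come from the text above.

```python
import math
from pathlib import Path

BYTES_PER_HASH_ENTRY = 12  # size of entities.hash / 12 bounds the entity count
BYTES_PER_INDEX_SLOT = 4   # size of entities.index = entity-index-size * 4


def entity_index_report(hash_bytes: int, index_bytes: int) -> dict[str, int]:
    """Derive entity-index figures from the sizes of the two files."""
    max_entities = hash_bytes // BYTES_PER_HASH_ENTRY
    return {
        # Memory used to index entities: sum of both file sizes.
        "memory_bytes": hash_bytes + index_bytes,
        # Upper bound only, since the last allocated page is rarely full.
        "max_unique_entities": max_entities,
        # The entity-index-size the repository was created with.
        "current_entity_index_size": index_bytes // BYTES_PER_INDEX_SLOT,
        # Guideline: at least half the number of unique entities.
        "suggested_min_entity_index_size": math.ceil(max_entities / 2),
    }


def report_for_storage(storage_dir: str) -> dict[str, int]:
    """Read the file sizes from a storage folder (the path is an assumption)."""
    folder = Path(storage_dir)
    return entity_index_report(
        hash_bytes=(folder / "entities.hash").stat().st_size,
        index_bytes=(folder / "entities.index").stat().st_size,
    )


if __name__ == "__main__":
    # Example with made-up file sizes rather than a real repository.
    for key, value in entity_index_report(1_200_000, 400_000).items():
        print(f"{key}: {value}")
```

Because entity-index-size cannot be changed after the repository is created, the suggested value is only useful as input to the next clean load.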
The following parameters can be adjusted:
- entity-index-size: set to a large enough value (at least half the number of unique entities).
- enablePredicateList: can speed up queries (and loading).
- enable-context-index: provides better performance when executing queries that use contexts.
- in-memory-literal-properties: whether to keep the properties of each literal in memory.
Furthermore, the inference semantics can be adjusted by choosing a different ruleset. However, this requires a reload of the whole repository; otherwise, some inferences can remain when they should not.
Note
The optional indices can be built at a later time when the repository is used for query answering. You need to experiment using typical query patterns from the user environment.
GraphDB’s optional indices
Predicate lists
Predicate lists are two indices (SP and OP) that can improve performance in the following situations:
- When loading/querying datasets that have a large number of predicates;
- When executing queries or retrieving statements that use a wildcard in the predicate position, e.g., the statement pattern: dbpedia:Human ?predicate dbpedia:Land.
As a rough guideline, a dataset with more than about 1000 predicates will benefit from using these indices for both loading and query answering. Predicate list indices are not enabled by default, but can be switched on using the enablePredicateList configuration parameter.
Context indices
To provide better performance when executing queries that use contexts, you can use two other indices, PCSO and PSOC. They are enabled by using the enable-context-index configuration parameter.
Cache/index monitoring and optimisations
Statistics are kept for the main index data structures and include information such as cache hits/misses, file reads/writes, etc. This information can be used to fine-tune GraphDB memory configuration and can be useful for ‘debugging’ certain situations, such as understanding why load performance changes over time or with particular data sets.

For each index, a CollectionStatistics MBean is published, which shows the cache and file I/O values updated in real time:
Package | com.ontotext |
---|---|
MBean name | CollectionStatistics |
The following information is displayed for each MBean/index:
Attribute | Description |
---|---|
CacheHits | The number of operations completed without accessing the storage system. |
CacheMisses | The number of operations that needed to access the storage system to complete. |
FlushInvocations | |
FlushReadItems | |
FlushReadTimeAvarage | |
FlushReadTimeTotal | |
FlushWriteItems | |
FlushWriteTimeAvarage | |
FlushWriteTimeTotal | |
PageDiscards | The number of times a non-dirty page’s memory was reused to read in another page. |
PageSwaps | The number of times a page was written to the disk, so its memory could be used to load another page. |
Reads | The total number of times an index was searched for a statement or a range of statements. |
Writes | The total number of times a statement was added to a collection. |
The following operations are available:
Operation | Description |
---|---|
resetCounters | Resets all the counters for this index. |
Ideally, the system should be configured to keep the number of cache misses to a minimum. If the ratio of hits to misses is low, consider increasing the memory available to the index (if other factors permit this).
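As a small sketch of how the CacheHits and CacheMisses counters might be interpreted, the following Python computes the fraction of operations served from the cache. The function names and the 0.9 threshold are arbitrary assumptions for illustration, not GraphDB recommendations. Reading the counters from the MBean (for example through a JMX client) is not shown.

```python
def hit_rate(cache_hits: int, cache_misses: int) -> float:
    """Fraction of operations completed without touching storage."""
    total = cache_hits + cache_misses
    # No operations recorded yet (e.g. right after resetCounters).
    return cache_hits / total if total else 0.0


def needs_more_memory(cache_hits: int, cache_misses: int,
                      min_hit_rate: float = 0.9) -> bool:
    """Flag an index whose hit rate falls below an assumed threshold."""
    return hit_rate(cache_hits, cache_misses) < min_hit_rate


if __name__ == "__main__":
    # Made-up counter values for two indices.
    for index_name, hits, misses in [("index-a", 900, 100), ("index-b", 500, 500)]:
        rate = hit_rate(hits, misses)
        flag = needs_more_memory(hits, misses)
        print(f"{index_name}: hit rate {rate:.2f}, more memory? {flag}")
```

Calling resetCounters before a representative workload makes the counters reflect that workload rather than the whole uptime of the repository.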
Query optimisations
GraphDB uses a number of query optimisation techniques by default. They can be disabled by setting the enable-optimization configuration parameter to false; however, there is rarely any need to do this. See GraphDB’s Explain Plan for a way to view query plans and applied optimisations.
Caching literal language tags
This optimisation applies when the repository contains a large number of literals with language tags and it is necessary to execute queries that filter based on language, e.g., using the following SPARQL query construct:
FILTER ( lang(?name) = "ES" )
In this situation, the in-memory-literal-properties configuration parameter can be set to true, causing the data values with language tags to be cached.
Not enumerating sameAs
During query answering, all URIs from each equivalence class produced by the sameAs optimisation are enumerated. You can use the onto:disable-sameAs pseudo-graph (see Other special query behaviour) to significantly reduce these duplicate results (by returning a single representative from each equivalence class).
Consider these example queries executed against the FactForge combined dataset. Here, the default is to enumerate:
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE { ?c rdfs:subClassOf dbpedia:Airport}
producing many results:
dbpedia:Air_strip
http://sw.cyc.com/concept/Mx4ruQS1AL_QQdeZXf-MIWWdng
umbel-sc:CommercialAirport
opencyc:Mx4ruQS1AL_QQdeZXf-MIWWdng
dbpedia:Jetport
dbpedia:Airstrips
dbpedia:Airport
fb:guid.9202a8c04000641f800000000004ae12
opencyc-en:CommercialAirport
If you specify the onto:disable-sameAs pseudo-graph:
PREFIX onto: <http://www.ontotext.com/>
PREFIX dbpedia: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * FROM onto:disable-sameAs
WHERE {?c rdfs:subClassOf dbpedia:Airport}
only two results are returned:
dbpedia:Air_strip
opencyc-en:CommercialAirport
The Expand results over equivalent URIs checkbox in the GraphDB Workbench SPARQL editor plays a similar role, but with the opposite meaning: selecting it enumerates the equivalent URIs, whereas the pseudo-graph suppresses them.
Warning
If the query uses a filter over the textual representation of a URI, e.g., filter(strstarts(str(?x),"http://dbpedia.org/ontology")), this may skip some valid solutions as not all URIs within an equivalence class are matched against the filter.
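The effect described in the warning can be illustrated with a toy Python model. It is not GraphDB's implementation: the URIs are made up, and treating the first member of each class as the representative is an assumption made purely for the example.

```python
PREFIX = "http://dbpedia.org/ontology"

# One equivalence class of URIs linked by sameAs (made-up values).
equivalence_classes = [
    ["http://example.org/vocab/Airport", "http://dbpedia.org/ontology/Airport"],
]


def filter_enumerated(classes, prefix):
    """Every member of each class is tested against the filter."""
    return [uri for cls in classes for uri in cls if uri.startswith(prefix)]


def filter_representatives(classes, prefix):
    """Only one representative per class is tested against the filter."""
    representatives = [cls[0] for cls in classes]  # assumed choice
    return [uri for uri in representatives if uri.startswith(prefix)]


if __name__ == "__main__":
    print("enumerated:     ", filter_enumerated(equivalence_classes, PREFIX))
    print("representatives:", filter_representatives(equivalence_classes, PREFIX))
```

In this model, the class has a member that satisfies the filter, but the representative does not, so the whole class drops out of the result when only representatives are tested.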