# Storage¶

## What is GraphDB’s persistence strategy¶

GraphDB stores all of its data (statements, indexes, entity pool, etc.) in files in the configured storage directory, usually called storage. The content and names of these files is not defined and is subject to change between versions.

There are several types of indices available, all of which apply to all triples, whether explicit or implicit. These indices are maintained automatically.

In general, the index structures used in GraphDB are chosen and optimised to allow for efficient:

• handling of billions of statements under reasonable RAM constraints;
• query optimisation;
• transaction management.

GraphDB maintains two main indices on statements for use in inference and query evaluation: the predicate-object-subject (POS) index and the predicate-subject-object (PSO) index. There are many other additional data structures that are used to enable the efficient manipulation of RDF data, but these are not listed, since these internal mechanisms cannot be configured.

## GraphDB’s indexing options¶

There are indexing options that offer considerable advantages for specific datasets, retrieval patterns and query loads. Most of them are disabled by default, so you need to enable them as necessary.

Note

Unless stated otherwise, GraphDB allows you to switch indices on and off against an already populated repository. The repository has to be shut down before the change of the configuration is specified. The next time the repository is started, GraphDB will create or remove the corresponding index. If the repository is already loaded with a large volume of data, switching on a new index can lead to considerable delays during initialisation – this is the time required for building the new index.

### Transaction control¶

Transaction support is exposed via RDF4J’s RepositoryConnection interface. The three methods of this interface that give you control when updates are committed to the repository are as follows:

Method Effect
void begin() Begins a transaction. Subsequent changes effected through update operations will only become permanent after commit() is called.
void commit() Commits all updates that have been performed through this connection since the last call to begin().
void rollback() Rolls back all updates that have been performed through this connection since the last call to begin().

GraphDB supports the so called ‘read committed’ transaction isolation level, which is well-known to relational database management systems - i.e., pending updates are not visible to other connected users, until the complete update transaction has been committed. It guarantees that changes will not impact query evaluation before the entire transaction they are part of is successfully committed. It does not guarantee that execution of a single transaction is performed against a single state of the data in the repository. Regarding concurrency:

• Multiple update/modification/write transactions can be initiated and stay open simultaneously, i.e., one transaction does not need to be committed in order to allow another transaction to complete;
• Update transactions are processed internally in sequence, i.e., GraphDB processes the commits one after another;
• Update transactions do not block read requests in any way, i.e., hundreds of SPARQL queries can be evaluated in parallel (the processing is properly multi-threaded) while update transactions are being handled on separate threads.

Note

GraphDB performs materialisation, ensuring that all statements that can be inferred from the current state of the repository are indexed and persisted (except for those compressed due to the Optimisation of owl:sameAs). When the commit method is completed, all reasoning activities related to the changes in the data introduced by the corresponding transaction will have already been performed.

Note

An uncommitted transaction will not affect the ‘view’ of the repository through any connection, including the connection used to do the modification. This is perhaps not in keeping with most relational database implementations. However, committing a modification to a semantic repository involves considerably more work, specifically the computation of the changes to the inferred closure resulting from the addition or removal of explicit statements. This computation is only carried out at the point where the transaction is committed and so to be consistent, neither the inferred statements nor the modified statements related to the transaction are ‘visible’.

### Predicate lists¶

Certain datasets and certain kinds of query activities, for example, queries that use wildcard patterns for predicates, benefit from another type of index called a ‘predicate list’, i.e.:

• subject-predicate (SP)
• object-predicate (OP)

This index maps from entities (subject or object) to their predicates. It is not switched on by default (see the enablePredicateList configuration parameter), because it is not always necessary. Indeed, for most datasets and query loads, the performance of GraphDB without such an index is good enough even with wildcard-predicate queries, and the overhead of maintaining this index is not justified. You should consider using this index for datasets that contain a very large number (greater than around 1000) of different predicates.

### Context indices¶

There are two more optional indices that can be used to speed up query evaluation when searching statements via their context identifier. These indices are the PCSO and the PCOS indices and they are switched on together (see the enable-context-index configuration parameter).

### Literal index¶

GraphDB automatically builds a literal index allowing faster look-ups of numeric and date/time object values. The index is used during query evaluation only if a query or a subquery (e.g., union) has a filter that is comprised of a conjunction of literal constraints using comparisons and equality (not negation or inequality), e.g., FILTER(?x = 100 && ?y <= 5 && ?start > "2001-01-01"^^xsd:date).

Other patterns will not use the index, i.e., filters will not be re-written into usable patterns.

For example, the following FILTER patterns will all make use of the literal index:

FILTER( ?x = 7 )
FILTER( 3 < ?x )
FILTER( ?x >= 3 && ?y <= 5 )
FILTER( ?x > "2001-01-01"^^xsd:date )


whereas these FILTER patterns will not:

FILTER( ?x > (1 + 2) )
FILTER( ?x < 3 || ?x > 5 )
FILTER( (?x + 1) < 7 )
FILTER( ! (?x < 3) )


The decision of the query-optimiser whether to make use of this index is statistics-based. If the estimated number of matches for a filter constraint is large relative to the rest of the query, e.g., a constraint with large or one-sided range, then the index might not be used at all.

To disable this index during query evaluation, use the enable-literal-index configuration parameter. The default value is true.

Note

Because of the way the literals are stored, the index with dates far in the future and far in the past (approximately 200,000,000 years) as well as numbers beyond the range of 64-bit floating-point representation (i.e., above approximately 1e309 and below -1e309) will not work properly.

### Handling of explicit and implicit statements¶

As already described, GraphDB applies the inference rules at load time in order to compute the full closure. Therefore, a repository will contain some statements that are explicitly asserted and other statements that exist through implication. In most cases, clients will not be concerned with the difference, however there are some scenarios when it is useful to work with only explicit or only implicit statements. These two groups of statements can be isolated during programmatic statement retrieval using the RDF4J API and during (SPARQL) query evaluation.

#### Retrieving statements with the RDF4J API¶

The usual technique for retrieving statements is to use the RepositoryConnection method:

RepositoryResult<Statement> getStatements(
Resource subj,
URI pred,
Value obj,
boolean includeInferred,
Resource... contexts)


The method retrieves statements by ‘triple pattern’, where any or all of the subject, predicate and object parameters can be null to indicate wildcards.

To retrieve explicit and implicit statements, the includeInferred parameter must be set to true. To retrieve only explicit statements, the includeInferred parameter must be set to false.

However, the RDF4J API does not provide the means to enable only the retrieval of implicit statements. In order to allow clients to do this, GraphDB allows the use of the special ‘implicit’ pseudo-graph with this API, which can be passed as the context parameter.

The following example shows how to retrieve only implicit statements:

RepositoryResult<Statement> statements =
repositoryConnection.getStatements(
null, null, null, true,
SimpleValueFactory.getInstance().createIRI("http://www.ontotext.com/implicit"));

while (statements.hasNext()) {
Statement statement = statements.next();
// Process statement
}
statements.close();


The above example uses wildcards for subject, predicate and object and will therefore return all implicit statements in the repository.

#### SPARQL query evaluation¶

GraphDB also provides mechanisms to differentiate between explicit and implicit statements during query evaluation. This is achieved by associating statements with two pseudo-graphs (explicit and implicit) and using special system URIs to identify these graphs.

Tip