# Elasticsearch GraphDB connector

Note

This feature requires a GraphDB Enterprise license.

## Overview and features

The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches. Such searches are typically implemented by an external component or service such as Elasticsearch; the Connectors add the benefit of staying automatically up to date with the GraphDB repository data.

The Connectors provide synchronization at the entity level, where an entity is defined as having a unique identifier (an IRI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.

The main features of the GraphDB Connectors are:

• maintaining an index that is always in sync with the data stored in GraphDB;

• multiple independent instances per repository;

• the entities for synchronization are defined by:

  • a list of fields (on the Elasticsearch side) and property chains (on the GraphDB side) whose values will be synchronized;

  • a list of rdf:type’s of the entities for synchronization;

  • a list of languages for synchronization (the default is all languages);

  • additional filtering by property and value;

• full-text search using native Elasticsearch queries;

• snippet extraction: highlighting of search terms in the search result;

• faceted search;

• sorting by any preconfigured field;

• paging of results using OFFSET and LIMIT;

• custom mapping of RDF types to Elasticsearch types.

Each feature is described in detail below.

## Usage

All interactions with the Elasticsearch GraphDB Connector are done through SPARQL queries.

There are three types of SPARQL queries:

• INSERT for creating and deleting connector instances;

• SELECT for listing connector instances and querying their configuration parameters;

• INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.

In general, INSERT adds or modifies data, while SELECT queries existing data.

Each connector implementation defines its own IRI prefix to distinguish it from other connectors. For the Elasticsearch GraphDB Connector, this is http://www.ontotext.com/connectors/elasticsearch#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/elasticsearch#createConnector to create a connector instance for Elasticsearch.

Individual instances of a connector are distinguished by unique names that are also IRIs. They have their own prefix to avoid clashing with any of the command predicates. For Elasticsearch, the instance prefix is http://www.ontotext.com/connectors/elasticsearch/instance#.

Sample data

All examples use the following sample data that describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito, and Rozova, as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix wine: <http://www.ontotext.com/example/wine#> .

wine:RedWine rdfs:subClassOf wine:Wine .
wine:WhiteWine rdfs:subClassOf wine:Wine .
wine:RoseWine rdfs:subClassOf wine:Wine .

wine:Merlo
rdf:type wine:Grape ;
rdfs:label "Merlo" .

wine:CabernetSauvignon
rdf:type wine:Grape ;
rdfs:label "Cabernet Sauvignon" .

wine:CabernetFranc
rdf:type wine:Grape ;
rdfs:label "Cabernet Franc" .

wine:PinotNoir
rdf:type wine:Grape ;
rdfs:label "Pinot Noir" .

wine:Chardonnay
rdf:type wine:Grape ;
rdfs:label "Chardonnay" .

wine:Yoyowine
rdf:type wine:RedWine ;
wine:madeFromGrape wine:CabernetSauvignon ;
wine:hasSugar "dry" ;
wine:hasYear "2013"^^xsd:integer .

wine:Franvino
rdf:type wine:RedWine ;
wine:madeFromGrape wine:Merlo ;
wine:madeFromGrape wine:CabernetFranc ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .

wine:Noirette
rdf:type wine:RedWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2012"^^xsd:integer .

wine:Blanquito
rdf:type wine:WhiteWine ;
wine:madeFromGrape wine:Chardonnay ;
wine:hasSugar "dry" ;
wine:hasYear "2012"^^xsd:integer .

wine:Rozova
rdf:type wine:RoseWine ;
wine:madeFromGrape wine:PinotNoir ;
wine:hasSugar "medium" ;
wine:hasYear "2013"^^xsd:integer .


## Setup and maintenance

### Prerequisites

Third-party component versions

This version of the Elasticsearch GraphDB Connector uses Elasticsearch version 7.16.3.

Tip

Since version 2.0, Elasticsearch by default commits the translog at the end of every index, delete, update, or bulk request. This may cause a massive slowdown of the Elasticsearch connector, so we highly recommend changing the index.translog.durability value to async. For more information, see Elasticsearch’s transaction log settings.

Tip

In Elasticsearch 7.x.x, the default value for the wait_for_active_shards parameter of the open index command has been changed from 0 to 1. This means that the command will now by default wait for all primary shards of the opened index to be allocated. See the Elasticsearch documentation for more information. Depending on your specific case, you can experiment with different values to find the optimal ones for you, for example: "indexCreateSettings": {"number_of_shards" : 5, "number_of_replicas" : 1, "write.wait_for_active_shards" : 0}.
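As a rough sketch of how the two tips above might be applied together, the create statement below passes both settings through the indexCreateSettings parameter, which the connector hands to Elasticsearch when it creates the index. It assumes that index.translog.durability can be supplied as an index-creation setting in this way. The instance name my_tuned_index is hypothetical, and the shard and replica counts are copied from the tip rather than being tuned recommendations.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: my_tuned_index and the settings values are illustrative.
INSERT DATA {
    elastic-index:my_tuned_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "types": ["http://www.ontotext.com/example/wine#Wine"],
  "fields": [
    {
      "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
      "analyzed": false
    }
  ],
  "indexCreateSettings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "write.wait_for_active_shards": 0,
    "index.translog.durability": "async"
  }
}
''' .
}
```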

### Creating a connector instance

Creating a connector instance is done by sending a SPARQL query with the following configuration data:

• the name of the connector instance (e.g., my_index);

• an Elasticsearch instance to synchronize to;

• classes to synchronize;

• properties to synchronize.

The configuration data has to be provided as a JSON string representation and passed together with the create command.

Whichever way you create the connector in the Workbench, a pop-up screen shows the connector creation progress.

#### Using the Workbench

1. Go to Setup ‣ Connectors.

2. Click New Connector in the tab of the respective Connector type you want to create.

3. Fill in the configuration form.

4. Execute the CREATE statement from the form by clicking OK. Alternatively, you can view its SPARQL query by clicking View SPARQL Query, and then copy it to execute it manually or integrate it in automation scripts.

#### Using the create command

The create command is triggered by a SPARQL INSERT with the createConnector predicate. For example, the following statement creates a connector instance called my_index, which synchronizes the wines from the sample data above:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"analyzed": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
],
"analyzed": false
}
]
}
''' .
}


The above command creates a new Elasticsearch connector instance that connects to the Elasticsearch instance accessible at port 9200 on the localhost as specified by the elasticsearchNode key.

The "types" key defines the RDF type of the entities to synchronize and, in the example, it is only entities of the type http://www.ontotext.com/example/wine#Wine (and its subtypes). The "fields" key defines the mapping from RDF to Elasticsearch. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three bits of information are mapped - the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: “grape”, “sugar”, and “year”. The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine’s madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.

The fields sugar and year contain discrete values, such as medium, dry, 2012, 2013, and thus it is best to specify the option analyzed: false as well. See analyzed in Defining fields for more information.

#### Mapping and index management

By default, GraphDB manages (creates, deletes, or updates if needed) the Elasticsearch index and the Elasticsearch mapping. This makes it easier to use Elasticsearch as everything is done automatically. This behavior can be changed by the following options:

• manageIndex: if true, GraphDB manages the index. True by default.

• manageMapping: if true, GraphDB manages the mapping. True by default.

Note

If either of the options is set to false, you have to create, update, or remove the index/mapping yourself. If Elasticsearch is misconfigured, the connector instance will not function correctly.

#### Using a non-managed schema

The present version provides no support for changing some advanced options, such as stop words, on a per-field basis. The recommended way to do this for now is to manage the mapping yourself and tell the connector to just sync the object values in the appropriate fields. Here is an example:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "localhost:9200",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"analyzed": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
],
"analyzed": false
}
],
"manageMapping": false
}
''' .
}


This creates the same connector instance as above but it expects fields with the specified field names to be already present in the index mapping, as well as some internal GraphDB fields. For the example, you must have the following fields:

| Field name | Elasticsearch config |
|---|---|
| _graphdb_id | "type": "long", "index": true, "store": true |
| grape | "type": "text", "index": true, "store": true |
| sugar | "type": "keyword", "index": true, "store": true |
| year | "type": "keyword", "index": true, "store": true |

_graphdb_id is used internally by GraphDB and is always required.

#### Working with secured Elasticsearch

GraphDB can access a secured Elasticsearch instance if you pass the elasticsearchBasicAuthUser and elasticsearchBasicAuthPassword parameters when creating the connector instance.

Instead of supplying the username and password as part of the connector instance configuration, you can also implement a custom authenticator class and set it via the authenticationConfiguratorClass option. See these connector authenticator examples for more information and example projects that implement such a custom class.

See the List of creation parameters for more information.
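The following is a minimal sketch of a create statement that supplies basic authentication credentials. The instance name my_secured_index, the user name, and the password are placeholders, and the https:// node address assumes that your Elasticsearch instance is served over TLS.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: the instance name and credentials below are placeholders.
INSERT DATA {
    elastic-index:my_secured_index elastic:createConnector '''
{
  "elasticsearchNode": "https://localhost:9200",
  "elasticsearchBasicAuthUser": "connector_user",
  "elasticsearchBasicAuthPassword": "change-me",
  "types": ["http://www.ontotext.com/example/wine#Wine"],
  "fields": [
    {
      "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
      "analyzed": false
    }
  ]
}
''' .
}
```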

### Dropping a connector instance

Dropping a connector instance removes all references to its external store from GraphDB as well as the Elasticsearch index associated with it.

The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the connector instance has to be in the subject position, e.g., this removes the connector my_index:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:dropConnector [] .
}


You can also force drop a connector in case a normal delete does not work. The force delete will remove the connector even if part of the operation fails. Go to Setup ‣ Connectors where you will see the already existing connectors that you have created. Click the delete icon, and check Force delete in the dialog box.

### Retrieving the create options for a connector instance

You can view the options string that was used to create a particular connector instance with the following query:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?createString {
elastic-index:my_index elastic:listOptionValues ?createString .
}


### Listing available connector instances

#### In the Connectors management view

Existing Connector instances show under Existing connectors (below the New Connector button). Click the name of an instance to view its configuration and SPARQL query, or click the repair / delete icons to perform these operations.

#### With a SPARQL query

Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors predicate:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>

SELECT ?cntUri ?cntStr {
?cntUri elastic:listConnectors ?cntStr .
}


?cntUri is bound to the prefixed IRI of the connector instance that was used during creation, e.g., http://www.ontotext.com/connectors/elasticsearch/instance#my_index, while ?cntStr is bound to a string, representing the part after the prefix, e.g., "my_index".

### Instance status check

The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>

SELECT ?cntUri ?cntStatus {
?cntUri elastic:connectorStatus ?cntStatus .
}


?cntUri is bound to the prefixed IRI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this IRI. The status is key-value based.
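To look at a single instance only, one option is to narrow the result with a standard SPARQL FILTER. The sketch below uses nothing beyond the connectorStatus predicate shown above and plain SPARQL; the instance name my_index comes from the earlier examples.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?cntUri ?cntStatus {
    ?cntUri elastic:connectorStatus ?cntStatus .
    # Keep only the status row of the my_index instance.
    FILTER (?cntUri = elastic-index:my_index)
}
```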

## Working with data

### Adding, updating and deleting data

From the user point of view, all synchronization happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you must simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
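As an illustration, the update request below adds, modifies, and then removes a wine using only standard SPARQL. The wine wine:Testovino is hypothetical and is not part of the sample data; it is removed again at the end so that the results shown elsewhere stay unchanged. No connector predicate appears anywhere in the request.

```sparql
PREFIX wine: <http://www.ontotext.com/example/wine#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Add a new (hypothetical) wine; with the RDFS ruleset it is also a wine:Wine.
INSERT DATA {
    wine:Testovino a wine:RedWine ;
        wine:hasSugar "dry" ;
        wine:hasYear "2014"^^xsd:integer .
} ;

# Change its sugar content.
DELETE { wine:Testovino wine:hasSugar ?old }
INSERT { wine:Testovino wine:hasSugar "medium" }
WHERE  { wine:Testovino wine:hasSugar ?old } ;

# Remove the wine again.
DELETE WHERE { wine:Testovino ?p ?o }
```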

### Simple queries

Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a SELECT and providing the Elasticsearch query as the object of the query predicate:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query "grape:cabernet" ;
elastic:entities ?entity .
}


The result binds ?entity to the two wines made from grapes that have “cabernet” in their name, namely :Yoyowine and :Franvino.

Note

You must use the field names you chose when you created the connector instance. They can be identical to the property IRIs but you must escape any special characters according to what Elasticsearch expects.

The query above works in three steps:

1. Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance IRI. X is bound to a query instance of the connector instance.

2. Assign a query to the query instance by using the system predicate :query.

3. Request the matching entities through the :entities predicate.

It is also possible to provide per-query search options by using one or more option predicates. The option predicates are described in detail below.
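The three steps can be annotated directly in a query. The sketch below also combines two fields in one query string, which assumes that the query predicate accepts the usual Elasticsearch query string syntax, including the AND operator. In the sample data, the wines that are both dry and from 2012 are Franvino and Blanquito.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
    # Step 1: obtain a query instance of the connector instance my_index.
    ?search a elastic-index:my_index ;
        # Step 2: assign the query (two field conditions joined with AND).
        elastic:query "sugar:dry AND year:2012" ;
        # Step 3: request the matching entities.
        elastic:entities ?entity .
}
```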

### Raw queries

To access an Elasticsearch query parameter that is not exposed through a special predicate, use a raw query. Instead of providing a full-text query in the :query part, specify raw Elasticsearch parameters. For example, to boost some parts of your full-text query as described here, execute the following query:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query '''
{
"query" : {
"bool" : {
"should" : [ {
"query_string" : {
"query" : "<full-text-query-not-boosted>"
}
}, {
"query_string" : {
"query" : "<full-text-query-boosted>",
"boost" : 4.0
}
} ]
}
}
}
''' ;
elastic:entities ?entity .
}
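To make the template above concrete, here is a sketch that fills the two placeholders with queries against the sample wine fields. The choice of queries and the boost value are purely illustrative: dry wines match the first clause, while wines made from a Cabernet grape match the second clause with four times the weight.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
    ?search a elastic-index:my_index ;
        elastic:query '''
{
  "query": {
    "bool": {
      "should": [
        { "query_string": { "query": "sugar:dry" } },
        { "query_string": { "query": "grape:cabernet", "boost": 4.0 } }
      ]
    }
  }
}
''' ;
        elastic:entities ?entity .
}
```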


#### Combining Elasticsearch results with GraphDB data

The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?year {
?search a elastic-index:my_index ;
elastic:query "grape:cabernet" ;
elastic:entities ?entity .
?entity wine:madeFromGrape ?grape .
?entity wine:hasYear ?year
}


The result looks like this:

| ?entity | ?grape | ?year |
|---|---|---|
| :Yoyowine | :CabernetSauvignon | 2013 |
| :Franvino | :Merlo | 2012 |
| :Franvino | :CabernetFranc | 2012 |

Note

:Franvino is returned twice because it is made from two different grapes, both of which are returned.

#### Entity match score

It is possible to access the match score returned by Elasticsearch with the score predicate. As each entity has its own score, the predicate should come at the entity level. For example:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?score {
?search a elastic-index:my_index ;
elastic:query "grape:cabernet" ;
elastic:entities ?entity .
?entity elastic:score ?score
}


The result looks like this but the actual score might be different as it depends on the specific Elasticsearch version:

| ?entity | ?score |
|---|---|
| :Yoyowine | 0.9442660212516785 |
| :Franvino | 0.7554128170013428 |

### Basic facet queries

Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?facetName ?facetValue ?facetCount WHERE {
# note empty query is allowed and will just match all documents, hence no elastic:query
?r a elastic-index:my_index ;
elastic:facetFields "year,sugar" ;
elastic:facets _:f .
_:f elastic:facetName ?facetName .
_:f elastic:facetValue ?facetValue .
_:f elastic:facetCount ?facetCount .
}


It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the facets predicate. As each facet has three components (name, value, and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates facetName, facetValue, and facetCount.

The resulting bindings will look like this:

| facetName | facetValue | facetCount |
|---|---|---|
| year | 2012 | 3 |
| year | 2013 | 2 |
| sugar | dry | 3 |
| sugar | medium | 2 |

You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines as each facet is computed independently.

Tip

Faceting by analyzed textual field works but might produce unexpected results. Analyzed textual fields are composed of tokens and faceting uses each token to create a faceting bucket. For example, “North America” and “Europe” produce three buckets: “north”, “america”, and “europe”, corresponding to each token in the two values. If you need to facet by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed": false. For more information, see Copy fields.
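The comment in the facet query above notes that an empty query matches all documents, which suggests that a query can also be supplied to restrict the documents being counted. The sketch below assumes that facets are then computed over the matching documents only. In the sample data, the dry wines are Yoyowine (2013), Franvino (2012), and Blanquito (2012), so under that assumption the year facet would count two wines for 2012 and one for 2013.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?facetName ?facetValue ?facetCount WHERE {
    ?r a elastic-index:my_index ;
        # Restrict the faceted documents to dry wines (assumed behavior).
        elastic:query "sugar:dry" ;
        elastic:facetFields "year" ;
        elastic:facets _:f .
    _:f elastic:facetName ?facetName ;
        elastic:facetValue ?facetValue ;
        elastic:facetCount ?facetCount .
}
```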

### Advanced facet and aggregation queries

While basic faceting allows for simple counting of documents based on the discrete values of a particular field, there are more complex faceted or aggregation searches in Elasticsearch. The Elasticsearch GraphDB Connector provides a mapping from Elasticsearch results to RDF results but no mechanism for specifying the queries other than executing Raw queries.

#### Supported Elasticsearch facets and aggregations

The Elasticsearch GraphDB Connector supports mapping of the following facets and aggregations:

• Facets: terms, histogram, date histogram;

• Aggregations: terms, histogram, date histogram, range, min, max, sum, avg, stats, extended stats, value count.

For aggregations, the connector also supports sub-aggregations.

Tip

For more information on each supported facet or aggregation type, refer to the Elasticsearch documentation.

#### RDF mapping of the results

The results are accessed through the predicate aggregations (much like the basic facets are accessed through facets). The predicate binds multiple blank nodes, each of which contains a single aggregation bucket. The individual bucket items can be accessed through these predicates:

| Predicate | Meaning | Elasticsearch counterpart |
|---|---|---|
| :name | Bucket name | getName() |
| :key | Key or value associated with the bucket | getValue() or getKey() |
| :count | Count of documents in the bucket | getDocCount(), getValue() |
| :from | Start of range | getFrom(), getFromAsDate() |
| :to | End of range (RangeFacet) | getTo(), getToAsDate() |
| :min | Minimum value | getMin(), getValue() |
| :max | Maximum value | getMax(), getValue() |
| :sum | Sum value | getSum(), getValue() |
| :avg | Average value | getAvg(), getValue() |
| :sum_of_squares | Sum of squares value | getSumOfSquares() |
| :variance | Variance value | getVariance() |
| :std_deviation | Standard deviation value | getStdDeviation() |
| :parent | Sub-aggregations: points to the parent (upper level) blank node | |
| :level | Sub-aggregations: level number where 1 is the uppermost level and the following levels are 2, 3 and so on | |
| :levelName | Sub-aggregations: level name | getKey() or getValue() |
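As a hedged sketch of how these predicates might be used, the query below sends a raw query containing a terms aggregation on the sugar field and reads the resulting buckets through the aggregations predicate. The aggregation name by_sugar is hypothetical, and the sketch assumes that the raw query accepts an "aggs" section alongside "query" in the usual Elasticsearch request format.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?name ?key ?count {
    ?search a elastic-index:my_index ;
        elastic:query '''
{
  "query": { "match_all": {} },
  "aggs": {
    "by_sugar": { "terms": { "field": "sugar" } }
  }
}
''' ;
        elastic:aggregations _:a .
    # Each blank node bound by aggregations describes one bucket.
    _:a elastic:name ?name ;
        elastic:key ?key ;
        elastic:count ?count .
}
```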

### Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query "year:2013" ;
elastic:orderBy "-sugar" ;
elastic:entities ?entity .
}


The result contains wines produced in 2013 sorted according to their sugar content in descending order:

| entity |
|---|
| Rozova |
| Yoyowine |

By default, entities are sorted according to their matching score in descending order.

Note

If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.

Tip

Sorting by an analyzed textual field works but might produce unexpected results. Analyzed textual fields are composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, “North America” will be sorted before “Europe” because the token “america” is lexicographically smaller than the token “europe”. If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed": false. For more information, see Copy fields.
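When connector results are joined with other triples, the note above recommends an explicit SPARQL ORDER BY. A minimal sketch, reusing the score predicate and the wine:madeFromGrape property from the earlier examples, restores the descending score order after the join; the secondary sort on ?entity is only there to make ties deterministic.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?score {
    ?search a elastic-index:my_index ;
        elastic:query "grape:cabernet" ;
        elastic:entities ?entity .
    ?entity elastic:score ?score ;
        wine:madeFromGrape ?grape .
}
# Re-establish the order on the SPARQL side after the join.
ORDER BY DESC(?score) ?entity
```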

### Limit and offset

Limit and offset are supported on the Elasticsearch side of the query. This is achieved through the predicates limit and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query "sugar:dry" ;
elastic:offset "1" ;
elastic:limit "1" ;
elastic:entities ?entity .
}


The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:

| entity |
|---|
| Yoyowine |
| Franvino |
| Blanquito |

Note

The specific order in which GraphDB returns the results depends on how Elasticsearch returns the matches, unless sorting is specified.
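For paging, it can therefore help to combine limit and offset with orderBy so that consecutive pages are cut from the same ordering. The sketch below requests the first page of two dry wines sorted by year; the page size is arbitrary. Wines with the same year may still appear in either order, so a second sort field can be added to the comma-delimited orderBy list if a fully deterministic order is needed.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
    ?search a elastic-index:my_index ;
        elastic:query "sugar:dry" ;
        elastic:orderBy "year" ;
        # First page: skip 0 results, return at most 2.
        elastic:offset "0" ;
        elastic:limit "2" ;
        elastic:entities ?entity .
}
```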

### Snippet extraction

Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate snippets. It binds a blank node that in turn provides the actual snippets via the predicates snippetField and snippetText. The predicate snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?snippetField ?snippetText {
?search a elastic-index:my_index ;
elastic:query "grape:cabernet" ;
elastic:entities ?entity .
?entity elastic:snippets ?snippet .
?snippet elastic:snippetField ?snippetField ;
elastic:snippetText ?snippetText .
}


the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:

| ?entity | ?snippetField | ?snippetText |
|---|---|---|
| :Yoyowine | grape | <em>Cabernet</em> Sauvignon |
| :Franvino | grape | <em>Cabernet</em> Franc |

Note

The actual snippets might be different as this depends on the specific Elasticsearch implementation.

It is possible to tweak how the snippets are collected/composed by using the following option predicates:

• :snippetSize - sets the maximum size of the extracted text fragment, 250 by default;

• :snippetSpanOpen - the text to insert before the highlighted text, <em> by default;

• :snippetSpanClose - the text to insert after the highlighted text, </em> by default.

The option predicates are set on the query instance, much like the :query predicate.
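A sketch of the snippet query with all three options set on the query instance follows. The option values are illustrative, and writing snippetSize as a string literal is an assumption that mirrors how limit and offset are written in the other examples.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?snippetField ?snippetText {
    ?search a elastic-index:my_index ;
        elastic:query "grape:cabernet" ;
        # Option predicates go on the query instance, next to the query.
        elastic:snippetSize "100" ;
        elastic:snippetSpanOpen "<b>" ;
        elastic:snippetSpanClose "</b>" ;
        elastic:entities ?entity .
    ?entity elastic:snippets ?snippet .
    ?snippet elastic:snippetField ?snippetField ;
        elastic:snippetText ?snippetText .
}
```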

#### Snippets from nested documents

Snippets extracted from nested documents (when a nested query is used) will be available through the same mechanism as snippets from non-nested fields. In addition, nested snippet results provide the nested search path via the snippetInnerField predicate. For example, in a nested search on the field “grandChildren” (specified by “path”) and a match query for “tylor” on the nested field “grandChildren.name”:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity ?snippetInnerField ?snippetField ?snippetText {
?search a elastic-index:my_index ;
elastic:query '''
{
"query":{
"nested":{
"path":"grandChildren",
"query":{
"bool":{
"must":[
{
"match":{
"grandChildren.name":"tylor"
}
}
]
}
}
}
}
}
''' ;
elastic:entities ?entity .
?entity elastic:snippets ?snippet .
?snippet elastic:snippetInnerField ?snippetInnerField ;
elastic:snippetField ?snippetField ;
elastic:snippetText ?snippetText .
}


the query returns all people who have a grandchild whose name matches “tylor”, as well as the highlighted snippets:

| ?entity | ?snippetInnerField | ?snippetField | ?snippetText |
|---|---|---|---|
| urn:Eva | grandChildren | grandChildren.name | John-<em>Tylor</em> |
| urn:John | grandChildren | grandChildren.name | John-<em>Tylor</em> |
| urn:John | grandChildren | grandChildren.name | <em>Tylor</em> |
| urn:Mary | grandChildren | grandChildren.name | <em>Tylor</em> |

Note that the matching field whose matching values are highlighted is provided via the snippetField predicate, just like extracting snippets with non-nested searches, while the predicate snippetInnerField provides the field on which the nested search was executed.

### Total hits

You can get the total number of hits by using the totalHits predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?totalHits {
?r a elastic-index:my_index ;
elastic:query "year:2012" ;
elastic:totalHits ?totalHits .
}


As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.

## List of creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface without any knowledge of JSON.

readonly (boolean), optional, read-only mode

A read-only connector will index all existing data in the repository at creation time, but, unlike non-read-only connectors, it will:

• Not react to updates. Changes will not be synced to the connector.

• Not keep any extra structures (such as the internal Lucene index for tracking updates to chains)

The only way to index changes in data after the connector has been created is to repair (or drop/recreate) the connector.
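A minimal sketch of creating a read-only connector is shown below. The instance name my_readonly_index is hypothetical, and the field definition is borrowed from the earlier examples.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: indexes the existing wines once and ignores later updates.
INSERT DATA {
    elastic-index:my_readonly_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "readonly": true,
  "types": ["http://www.ontotext.com/example/wine#Wine"],
  "fields": [
    {
      "fieldName": "sugar",
      "propertyChain": ["http://www.ontotext.com/example/wine#hasSugar"],
      "analyzed": false
    }
  ]
}
''' .
}
```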

importGraph (boolean), optional, a virtual graph containing data from which to create the connector

Creates a connector whose data will come from statements inserted into a special virtual graph instead of data contained in the repository. The virtual graph is elastic:graph, where elastic: is the control prefix declared as PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>. The data have to be inserted into this graph before the connector create statement is executed.

Both the insertion into the special graph and the create statement must be in the same transaction. In the GraphDB Workbench, this can be done by pasting them one after another in the SPARQL editor and putting a semicolon at the end of the first INSERT. This functionality requires read-only mode.

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
INSERT {
GRAPH elastic:graph {
...
}
} WHERE {
...
};
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index:<http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"readonly": true,
"importGraph": true,
"fields": [],
"languages": [],
"types": []
}
''' .
}

importFile (string), optional, an RDF file with data from which to create the connector

Creates a connector whose data will come from an RDF file on the file system instead of data contained in the repository. The value must be the full path to the RDF file. This functionality requires readonly mode.

detectFields (boolean), optional, detects fields

This mode introduces automatic field detection when creating a connector. Instead of providing fields: [ ... ] in the JSON, you can skip this and get automatic fields where each field will have a single predicate chain and its field name will be the same as the predicate.

In this mode, specifying types is optional too. If types are not provided, then all types will be indexed. This mode requires importGraph or importFile.

Once the connector is created, you can inspect the detected fields in the Connector management section of the Workbench.
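The sketch below combines detectFields with importFile, so neither fields nor types are listed. The instance name my_detected_index and the file path are hypothetical, and readonly is set because importFile requires it.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: the path must point to an RDF file readable by GraphDB.
INSERT DATA {
    elastic-index:my_detected_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "readonly": true,
  "importFile": "/path/to/wines.ttl",
  "detectFields": true
}
''' .
}
```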

elasticsearchNode (string), required, the Elasticsearch instance to sync to

As Elasticsearch is a third-party service, you have to specify the node where it is running. The node value has the form http://hostname.domain:port; https:// is allowed too. There is no default value. Can be updated at runtime without having to rebuild the index.

Note

Elasticsearch exposes two protocols – the native transport protocol over port 9300 and the RESTful API over port 9200. The Elasticsearch GraphDB Connector uses the RESTful API over port 9200.

indexCreateSettings (json), optional, the settings for creating the Elasticsearch index

This option is passed directly to Elasticsearch when creating the index.

elasticsearchBasicAuthUser (string), optional, the settings for supplying the authentication user

No default value. Can be updated at runtime without having to rebuild the index.

elasticsearchBasicAuthPassword (string), optional, the settings for supplying the authentication password

A password is a string with a single value that is not logged or printed. No default value. Can be updated at runtime without having to rebuild the index.

elasticsearchClusterSniff (boolean), controls whether to build the server address list by sniffing on the Elasticsearch cluster

Corresponds to the Elasticsearch client.transport.sniff option. True by default. Can be updated at runtime without having to rebuild the index.

bulkUpdateBatchSize (integer), controls the maximum number of documents sent per bulk request

Default value is 5,000. Can be updated at runtime without having to rebuild the index.

bulkUpdateRequestSize (integer), controls the maximum size in bytes per bulk request

Defaults to 5,242,880 bytes (5 MiB). Can be updated at runtime without having to rebuild the index.

The limits of bulkUpdateBatchSize and bulkUpdateRequestSize are combined, and a bulk request is sent once either limit is hit.
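
The connection-related options above can be combined in a single create statement. The following is only a sketch: the host name, credentials, instance name, type, and field are placeholders, and the bulk limits are arbitrary example values.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    # secured_index is a hypothetical instance name
    elastic-index:secured_index elastic:createConnector '''
{
    "elasticsearchNode": "https://es.example.com:9200",
    "elasticsearchBasicAuthUser": "graphdb",
    "elasticsearchBasicAuthPassword": "change-me",
    "elasticsearchClusterSniff": false,
    "bulkUpdateBatchSize": 1000,
    "bulkUpdateRequestSize": 1048576,
    "types": [
        "http://www.ontotext.com/example/wine#Wine"
    ],
    "fields": [
        {
            "fieldName": "label",
            "propertyChain": [
                "http://www.w3.org/2000/01/rdf-schema#label"
            ]
        }
    ]
}
''' .
}
```

With these values, a bulk request would be sent as soon as 1,000 documents or 1,048,576 bytes (1 MiB) have accumulated, whichever happens first.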

authenticationConfiguratorClass (string), optional, provides custom authentication behavior

types (list of IRIs), required, specifies the types of entities to sync

The RDF types of entities to sync are specified as a list of IRIs. At least one type IRI is required.

Use the pseudo-IRI $any to sync entities that have at least one RDF type. Use the pseudo-IRI $untyped to sync entities regardless of whether they have any RDF type. See also the examples in General full-text search with the connectors.

languages (list of strings), optional, valid languages for literals

RDF data is often multilingual, but only some of the languages represented in the literal values can be mapped. This can be done by specifying a list of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.
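
For instance, a connector restricted to English and German labels plus untagged literals might be declared as sketched below. The instance name, type, and field are placeholders, and writing the empty range as an empty JSON string is an assumption. Under RFC 4647 basic filtering, the range de matches the tag de as well as tags such as de-CH.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    # lang_index is a hypothetical instance name
    elastic-index:lang_index elastic:createConnector '''
{
    "elasticsearchNode": "http://localhost:9200",
    "types": [
        "http://www.ontotext.com/example/wine#Wine"
    ],
    "languages": [
        "en",
        "de",
        ""
    ],
    "fields": [
        {
            "fieldName": "label",
            "propertyChain": [
                "http://www.w3.org/2000/01/rdf-schema#label"
            ]
        }
    ]
}
''' .
}
```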

fields (list of field objects), required, defines the mapping from RDF to Elasticsearch

The fields specify exactly which parts of each entity will be synchronized as well as the specific details on the connector side. The field is the smallest synchronization unit and it maps a property chain from GraphDB to a field in Elasticsearch. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.

• fieldName (string), required, the name of the field in Elasticsearch

The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name, but to avoid unnecessary escaping (which depends on how Elasticsearch parses its queries), we recommend keeping field names simple.

• fieldNameTransform (one of none, predicate or predicate.localName), optional, none by default

The field name transform defines an optional transformation of the field name.

• none: The field name is supplied via the fieldName option.

• predicate: The field name is derived from the full IRI of the last predicate of the chain, e.g., if the last predicate was http://www.w3.org/2000/01/rdf-schema#label, then the field name will be http://www.w3.org/2000/01/rdf-schema#label too.

• predicate.localName: The field name is derived from the local name of the IRI of the last predicate of the chain, e.g., if the last predicate was http://www.w3.org/2000/01/rdf-schema#comment, then the field name will be comment.

See Indexing all literals in distinct fields for an example.

• propertyChain (list of IRI), required, defines the property chain to reach the value

The property chain defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity IRI is the subject of the first triple, its object is the subject of the next triple, etc. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of IRIs where at least one IRI must be provided.

The IRI of the document will be synchronized to the special field id in Elasticsearch. You may use it to query Elasticsearch directly and to retrieve the matching entity IRI.

See Copy fields for defining multiple fields with the same property chain.

See Multiple property chains per field for defining a field whose values are populated from more than one property chain.

See Indexing language tags for defining a field whose values are populated with the language tags of literals.

See Indexing the IRI of an entity for defining a field whose values are populated with the IRI of the indexed entity.

See Wildcard literal indexing for defining a field whose values are populated with literals regardless of their predicate.

• valueFilter (string), optional, specifies the value filter for the field

• documentFilter (string), optional, specifies the nested document filter for the field

Applies only to fields that define nested documents. See also Entity filtering.

• defaultValue (string), optional, specifies a default value for the field

The default value (defaultValue) specifies the value to use for the field when the property chain has no matching values in GraphDB. It can be a plain literal, a literal with a datatype (the xsd: prefix is supported), a literal with a language tag, or an IRI. There is no default for this option.

• indexed (boolean), optional, default true

If indexed, a field is available for Elasticsearch queries. true by default.

If true, this option corresponds to "index" = true. If false, it corresponds to "index" = false.

• stored (boolean), optional, default true

Fields can be stored in Elasticsearch, and this is controlled by the Boolean option stored. Stored fields are required for retrieving snippets. true by default.

This option corresponds to the property "store" in the Elasticsearch mapping.

• analyzed (boolean), optional, default true

When literal fields are indexed in Elasticsearch, they will be analyzed according to the analyzer settings. If a given field should not be analyzed, set "analyzed" to false. This option has no effect for IRIs (they are never analyzed). true by default.

If true, this option will use automatic or manual (datatype option) type for the Elasticsearch mapping. If false, it corresponds to "type" = "keyword" (i.e., the default type will be changed to keyword).

• multivalued (boolean), optional, default true

RDF properties and synchronized fields may have more than one value. If multivalued is set to true, all values will be synchronized to Elasticsearch. If set to false, only a single value will be synchronized. true by default.

• ignoreInvalidValues (boolean), optional, default false

Per-field option that controls what happens when a value cannot be converted to the requested (or previously detected) type. False by default.

Example use: when an invalid date literal like “2021-02-29”^^xsd:date (2021 is not a leap year) needs to be indexed as a date, or when an IRI needs to be indexed as a number.

Note that some conversions are always valid: any literal to an FTS field, any non-literal (IRI, blank node, embedded triple) to a non-analyzed field. When true, such values will be skipped with a note in the logs. When false, such values will break the transaction.

• array (boolean), optional, default false

Normally, Elasticsearch creates an array only if more than one value is present for a given field. If array is set to true, Elasticsearch will always create an array even for single values. If set to false, Elasticsearch will create arrays for multiple values only. False by default.

• fielddata (boolean), optional, default false

Allows fielddata to be built in memory for text fields. Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. False by default.

• datatype (string), optional, the manual datatype override

By default, the Elasticsearch GraphDB Connector uses the datatype of literal values to determine how they should be mapped to Elasticsearch types. For more information on the supported datatypes, see Datatype mapping.

The mapping can be overridden through the property "datatype", which can be specified per field. The value of datatype can be any of the xsd: types supported by the automatic mapping or a native Elasticsearch type prefixed by native:, e.g., both xsd:long and native:long map to the long type in Elasticsearch.

• nativeSettings (json), optional, custom field settings

The setting for the Elasticsearch mapping parameters of the respective field, for example the format of the datatype. Native field settings require an explicit native datatype.

nativeSettings are not allowed for the following parameters so as to avoid conflicts with the existing way to specify them: type, index, store, analyzer, fielddata.

• objectFields (objects array), optional, nested object mapping

When native:object, native:nested, or native:geo_point is used as a datatype value, provide a mapping for the nested objects fields. If datatype is not provided, then native:object will be assumed.

For the difference between object and nested, refer to the Elastic nested field type. The geo_point type must have exactly two fields named lat and lon (required by Elastic, see geo-point field type).

Nested objects support further nested objects with a limit of five levels of nesting. See Nested objects for an example.

• startFromParent (integer), optional, default 0

Start processing the property chain from the N-th parent instead of the root of the current nested object. 0 is the root of the current nested object, 1 is the parent of the nested object, 2 is the parent of the parent and so on.

• analyzer (string), optional, per field analyzer

The Elasticsearch analyzer that is used for indexing the field can be specified with the parameter analyzer. It will be passed directly to Elasticsearch’s property analyzer when creating the mapping (see Custom Analyzers in the Elasticsearch documentation). For example:

{
...
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.w3.org/2000/01/rdf-schema#label"
],
"analyzer": "my_analyzer"
},
...
]
}
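
Several of the per-field options described above can be seen together in the following sketch. All example.com IRIs, the instance name, and the field names are hypothetical. Keeping a placeholder fieldName next to fieldNameTransform assumes that the key is still expected when a transform is set. ignore_above is a standard Elasticsearch mapping parameter for keyword fields.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    # options_index is a hypothetical instance name
    elastic-index:options_index elastic:createConnector '''
{
    "elasticsearchNode": "http://localhost:9200",
    "types": [
        "http://example.com/Product"
    ],
    "fields": [
        {
            "fieldName": "literal",
            "propertyChain": [
                "$literal"
            ],
            "fieldNameTransform": "predicate.localName"
        },
        {
            "fieldName": "status",
            "propertyChain": [
                "http://example.com/status"
            ],
            "defaultValue": "unknown",
            "analyzed": false
        },
        {
            "fieldName": "released",
            "propertyChain": [
                "http://example.com/released"
            ],
            "datatype": "xsd:date",
            "ignoreInvalidValues": true
        },
        {
            "fieldName": "code",
            "propertyChain": [
                "http://example.com/code"
            ],
            "datatype": "native:keyword",
            "nativeSettings": {
                "ignore_above": 256
            }
        }
    ]
}
''' .
}
```

In this sketch, each literal would be expected to land in a field named after the local name of its predicate, status falls back to "unknown" when no value exists, invalid dates in released are skipped with a log note instead of breaking the transaction, and code passes an extra mapping parameter to Elasticsearch alongside its explicit native datatype.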

valueFilter (string), optional, specifies the top-level value filter for the document

documentFilter (string), optional, specifies the top-level document filter for the document

### Updating parameters at runtime¶

As mentioned above, the following connector parameters can be updated at runtime without having to rebuild the index:

• elasticsearchNode

• elasticsearchClusterSniff

• elasticsearchBasicAuthUser

• elasticsearchBasicAuthPassword

• bulkUpdateBatchSize

• bulkUpdateRequestSize

This can be done by executing the following SPARQL update, here with examples for changing the user and password:

PREFIX conn:<http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst:<http://www.ontotext.com/connectors/elasticsearch/instance#>
INSERT DATA {
inst:proper_index conn:updateConnector '''
{
"elasticsearchBasicAuthUser": "foo",
"elasticsearchBasicAuthPassword": "bar"
}
''' .
}
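
The same pattern applies to the other runtime-updatable parameters. As a sketch, the bulk limits of the instance proper_index used above could be lowered like this (the numbers are arbitrary example values):

```sparql
PREFIX conn:<http://www.ontotext.com/connectors/elasticsearch#>
PREFIX inst:<http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    # Only the listed keys are changed; the index is not rebuilt
    inst:proper_index conn:updateConnector '''
{
    "bulkUpdateBatchSize": 1000,
    "bulkUpdateRequestSize": 1048576
}
''' .
}
```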


### Special field definitions¶

#### Nested objects¶

Nested objects are Elasticsearch documents that are used as values in the main document or other nested objects (up to five levels of nesting is possible). They are defined with the objectFields option.

Given the following data describing child and grandchild relations:

<urn:John>
a <urn:Person> ;
<urn:name> "John" ;
<urn:gender> <urn:Male> ;
<urn:age> 60 ;
<urn:hasSpouse> <urn:Mary> ;
<urn:hasChild> <urn:Billy> ;
<urn:hasChild> <urn:Annie> .

<urn:Mary>
a <urn:Person> ;
<urn:name> "Mary" ;
<urn:gender> <urn:Female> ;
<urn:age> 58 ;
<urn:hasSpouse> <urn:John> ;
<urn:hasChild> <urn:Billy> .

<urn:Eva>
a <urn:Person> ;
<urn:name> "Eva" ;
<urn:gender> <urn:Female> ;
<urn:age> 45 ;
<urn:hasChild> <urn:Annie> .

<urn:Billy>
a <urn:Person> ;
<urn:name> "Billy" ;
<urn:gender> <urn:Male> ;
<urn:age> 35 ;
<urn:hasChild> <urn:Tylor> ;
<urn:hasChild> <urn:Melody> .

<urn:Annie>
a <urn:Person> ;
<urn:name> "Annie" ;
<urn:gender> <urn:Female> ;
<urn:age> 28 ;
<urn:hasChild> <urn:Sammy> .

<urn:Tylor>
a <urn:Person> ;
<urn:name> "Tylor" ;
<urn:gender> <urn:Male> ;
<urn:age> 5 .

<urn:Melody>
a <urn:Person> ;
<urn:name> "Melody" ;
<urn:gender> <urn:Female> ;
<urn:age> 2 .

<urn:Sammy>
a <urn:Person> ;
<urn:name> "Sammy" ;
<urn:gender> <urn:Male> ;
<urn:age> 10 .

<urn:Male> <urn:label> "male" .

<urn:Female> <urn:label> "female" .


We can create a nested objects index that consists of children and grandchildren with their corresponding fields defining their gender and age. We use the native:nested type as we want to query the nested objects independently of each other:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"fields": [
{
"fieldName": "name",
"propertyChain": [
"urn:name"
]
},
{
"fieldName": "age",
"propertyChain": [
"urn:age"
],
"datatype": "xsd:long"
},
{
"fieldName": "hasSpouse",
"propertyChain": [
"urn:hasSpouse"
]
},
{
"fieldName": "gender",
"propertyChain": [
"urn:gender",
"urn:label"
]
},
{
"fieldName": "children",
"propertyChain": [
"urn:hasChild"
],
"datatype": "native:nested",
"objectFields": [
{
"fieldName": "id",
"propertyChain": [
"$self"
]
},
{
"fieldName": "name",
"propertyChain": [
"urn:name"
]
},
{
"fieldName": "age",
"propertyChain": [
"urn:age"
],
"datatype": "xsd:long"
},
{
"fieldName": "gender",
"propertyChain": [
"urn:gender",
"urn:label"
]
},
{
"fieldName": "children",
"propertyChain": [
"urn:hasChild"
],
"objectFields": [
{
"fieldName": "id",
"propertyChain": [
"$self"
]
},
{
"fieldName": "name",
"propertyChain": [
"urn:name"
]
},
{
"fieldName": "age",
"propertyChain": [
"urn:age"
],
"datatype": "xsd:long"
}
]
}
]
},
{
"fieldName": "grandChildren",
"valueFilter": "$this -> type in (<urn:Person>)",
"propertyChain": [
"urn:hasChild",
"urn:hasChild"
],
"datatype": "native:nested",
"objectFields": [
{
"fieldName": "id",
"propertyChain": [
"$self"
]
},
{
"fieldName": "name",
"propertyChain": [
"urn:name"
]
},
{
"fieldName": "age",
"propertyChain": [
"urn:age"
],
"datatype": "xsd:long"
},
{
"fieldName": "gender",
"propertyChain": [
"urn:gender",
"urn:label"
]
}
]
}
],
"types": [
"urn:Person"
],
"elasticsearchNode": "http://localhost:9200"
}
'''
}


To find entities that have male grandchildren older than 5 years, we will use the following query:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

SELECT ?entity {
?search a elastic-index:my_index ;
elastic:query '''
{
"query" : {
"nested" : {
"path" : "grandChildren",
"query" : {
"bool" : {
"must" : [
{
"match" : {
"grandChildren.gender" : "male"
}
},
{
"range" : {
"grandChildren.age" : {
"gt" : 5
}
}
}
]
}
}
}
}
}
''' ;
elastic:entities ?entity .
}
ORDER BY ?entity


The result looks like this:

?entity

urn:Eva

urn:John
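
A nested field can also read values from an enclosing document through startFromParent. The statement below is a sketch based on the same family data; the instance name family_index and the field name parentName are hypothetical.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    # family_index is a hypothetical instance name
    elastic-index:family_index elastic:createConnector '''
{
    "elasticsearchNode": "http://localhost:9200",
    "types": [
        "urn:Person"
    ],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": [
                "urn:name"
            ]
        },
        {
            "fieldName": "children",
            "propertyChain": [
                "urn:hasChild"
            ],
            "datatype": "native:nested",
            "objectFields": [
                {
                    "fieldName": "name",
                    "propertyChain": [
                        "urn:name"
                    ]
                },
                {
                    "fieldName": "parentName",
                    "propertyChain": [
                        "urn:name"
                    ],
                    "startFromParent": 1
                }
            ]
        }
    ]
}
''' .
}
```

Because startFromParent is 1, the chain for parentName would be evaluated from the parent of the nested object rather than from the child, so each nested child of urn:John would be expected to carry "John" in parentName next to its own name.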

#### Copy fields¶

Often, it is convenient to synchronize the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Elasticsearch GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:

...
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.w3.org/2000/01/rdf-schema#label"
],
"analyzed": true
},
{
"fieldName": "grapeFacet",
"propertyChain": [
"@grape"
],
"analyzed": false
}
]
...


The snippet creates an analyzed field “grape” and a non-analyzed field “grapeFacet”. Both fields are populated with the same values, and “grapeFacet” is defined as a copy field that refers to the field “grape”.

Note

The connector handles copy fields in a more optimal way than specifying a field with exactly the same property chain as another field.

#### Multiple property chains per field¶

Sometimes, you have to work with data models that define the same concept (in terms of what you want to index in Elasticsearch) with more than one property chain, e.g., the concept of “name” could be defined as a single canonical name, multiple historical names and some unofficial names. If you want to index these together as a single field in Elasticsearch, you can define this as a multiple property chains field.

Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single physical field when indexed. Virtual fields are distinguished by the suffix $xyz, where xyz is any alphanumeric sequence of convenience.

For example, we can define the fields name$1 and name$2 like this:

...
"fields": [
{
"fieldName": "name$1",
"propertyChain": [
"http://www.ontotext.com/example#canonicalName"
],

#### Indexing language tags¶

The language tag of an RDF literal can be indexed by specifying a property chain, where the last element is the pseudo-IRI lang(). The property preceding lang() must lead to a literal value. For example:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "http://localhost:9200",
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameLanguage",
"propertyChain": [
"http://www.ontotext.com/example#name",
"lang()"
]
}
]
}
''' .
}


The above connector will index the language tag of each literal value of the property http://www.ontotext.com/example#name into the field nameLanguage.

#### Indexing named graphs¶

The named graph of a given value can be indexed by ending a property chain with the special pseudo-IRI graph(). Indexing the named graph of the value instead of the value itself allows searching by named graph.

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "http://localhost:9200",
"fields": [
{
"fieldName": "name",
"propertyChain": [
"http://www.ontotext.com/example#name"
]
},
{
"fieldName": "nameGraph",
"propertyChain": [
"http://www.ontotext.com/example#name",
"graph()"
]
}
]
}
''' .
}


The above connector will index the named graph of each value of the property http://www.ontotext.com/example#name into the field nameGraph.

#### Wildcard literal indexing¶

In this mode, the last element of a property chain is a wildcard that will match any predicate that leads to a literal value. Use the special pseudo-IRI $literal as the last element of the property chain to activate it.

Note

Currently, it really means any literal, including literals with data types.

For example:

{
"fields" : [ {
"propertyChain" : [ "$literal" ],
"fieldName" : "name"
}, {
"propertyChain" : [ "http://example.com/description", "$literal" ],
"fieldName" : "description"
}
...
}

See Indexing all literals for a detailed example.

#### Indexing the IRI of an entity¶

Sometimes you may need the IRI of each entity (e.g., http://www.ontotext.com/example/wine#Franvino from our small example dataset) indexed as a regular field. This can be achieved by specifying a property chain with a single property referring to the pseudo-IRI $self. For example:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
elastic-index:my_index elastic:createConnector '''
{
"elasticsearchNode": "http://localhost:9200",
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "entityId",
"propertyChain": [
"$self"
]
},
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
}
]
}
''' .
}

The above connector will index the IRI of each wine into the field entityId.

Note

GraphDB will also use the IRI of each entity as the ID of each document in Elasticsearch, which is represented by the field id.

## Datatype mapping¶

The Elasticsearch GraphDB Connector maps different types of RDF values to different types of Elasticsearch values according to the basic type of the RDF value (IRI or literal) and the datatype of literals. The auto-detection uses the following mapping:

| RDF value | RDF datatype | Elasticsearch type |
|---|---|---|
| IRI | n/a | keyword |
| literal | any type not explicitly mentioned below | text |
| literal | xsd:boolean | boolean |
| literal | xsd:double | double |
| literal | xsd:float | float |
| literal | xsd:long | long |
| literal | xsd:int | integer |
| literal | xsd:dateTime | date with format: strict_date_time |
| literal | xsd:date | date with format: strict_date |
| literal | xsd:time | date with format: strict_time_no_millis\|\|strict_time |
| literal | xsd:gYear | date with format: strict_year |
| literal | xsd:gYearMonth | date with format: strict_year_month |

Note

For any given field, the automatic mapping uses the first value it sees. This works fine for clean datasets but might lead to problems if your dataset has non-normalized data, e.g., the first value has no datatype but other values do. It is therefore recommended to set datatype to a fixed value, e.g., xsd:date.

### Date and time conversion¶

RDF and Elasticsearch use slightly different models for representing dates and times, even though the values might look very similar.

Years in RDF values use the XSD format and are era years, where positive values denote the common era and negative values denote years before the common era. There is no year zero.

Years in Elasticsearch use the ISO format and are proleptic years, i.e., positive values denote years from the common era with any previous eras just going down by one mathematically so there is year zero.

In short:

• year 2020 CE = year 2020 in XSD = year 2020 in ISO.

• year 1 CE = year 1 in XSD = year 1 in ISO.

• year 1 BCE = year -1 in XSD = year 0 in ISO.

• year 2 BCE = year -2 in XSD = year -1 in ISO.

All years coming from RDF literals will be converted to ISO before indexing in Elasticsearch.

Both XSD and ISO date and time values support timezones. In addition to that, XSD defines the lack of a timezone as undetermined. Since we do not want to have any undetermined state in the indexing system, we define the undetermined time zone as UTC, i.e., "2020-02-14T12:00:00"^^xsd:dateTime is equivalent to "2020-02-14T12:00:00Z"^^xsd:dateTime (Z is the UTC timezone, also known as +00:00).

Also note that XSD dates and partial dates, e.g., xsd:gYear values, may have a timezone, which leads to additional complications. E.g., "2020+02:00"^^xsd:gYear (the year 2020 in the +02:00 timezone) will be normalized to 2019-12-31T22:00:00Z (the previous year!) if strict timezone adherence is followed. We have chosen to ignore the timezone on any values that do not have an associated time value, e.g.:

• "2020-02-15+02:00"^^xsd:date

• "2020-02+02:00"^^xsd:gYearMonth

• "2020+02:00"^^xsd:gYear

All of the above will be treated as if they specified UTC as their timezone.

## Entity filtering¶

The Elasticsearch connector supports four kinds of entity filters used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronized to Elasticsearch if, and only if, they pass the filter. The filters are similar to a FILTER() inside a SPARQL query but not exactly the same. In them, each configured field can be referred to by prefixing it with a ?, much like referring to a variable in SPARQL.

### Types of filters¶

Top-level value filter

The top-level value filter is specified via valueFilter. It is evaluated prior to anything else when only the document ID is known and it may not refer to any field names but only to the special field $this that contains the current document ID. Failing to pass this filter removes the entire document early in the indexing process and it can be used to introduce more restrictions similar to the built-in filtering by type via the types property.

Top-level document filter

The top-level document filter is specified via documentFilter. This filter is evaluated last when all of the document has been collected and it decides whether to include the document in the index. It can be used to enforce global document restrictions, e.g., certain fields are required or a document needs to be indexed only if a certain field value meets specific conditions.

Per-field value filter

The per-field value filter is specified via valueFilter inside the field definition of the field whose values are to be filtered. The filter is evaluated while collecting the data for the field when each field value becomes available.

The variable that contains the field value is $this. Other field names can be used to filter the current field’s value based on the value of another field, e.g., $this > ?age will compare the current field value to the value of the field age (see also Two-variable filtering). Failing to pass the filter will remove the current field value.

On nested documents, the per-field value filter can be used to remove the entire nested document early in the indexing process, e.g., by checking the type of the nested document via next hop with rdf:type.

Nested document filter

The nested document filter is specified via documentFilter inside the field definition of the field that defines the root of a nested document. The filter is evaluated after the entire nested document has been collected. Failing to pass this filter removes the entire nested document.

Inside a nested document filter, the field names are within the context of the nested document and not within the context of the top-level document. For example, if we have a field children that defines a nested document, and we use a filter like ?age < "10"^^xsd:int, we will be referring to the field children.age. We can use the prefix $outer. one or more times to refer to field values from the outer document (from the viewpoint of the nested document). For example, $outer.age > "25"^^xsd:int will refer to the age field that is a sibling of the children field.

Other than the above differences, the nested document filter is equivalent to the top-level document filter from the viewpoint of the nested document.
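
To show where each of the four filters goes, here is a sketch based on the family data from the Nested objects section. The instance name filtered_index is hypothetical, and the filter expressions use only operators and modifiers described in the following sections.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    # filtered_index is a hypothetical instance name
    elastic-index:filtered_index elastic:createConnector '''
{
    "elasticsearchNode": "http://localhost:9200",
    "types": [
        "urn:Person"
    ],
    "valueFilter": "$this -> <urn:gender> in (<urn:Female>)",
    "documentFilter": "bound(?name) && bound(?age)",
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": [
                "urn:name"
            ]
        },
        {
            "fieldName": "age",
            "propertyChain": [
                "urn:age"
            ],
            "datatype": "xsd:long"
        },
        {
            "fieldName": "children",
            "propertyChain": [
                "urn:hasChild"
            ],
            "datatype": "native:nested",
            "valueFilter": "$this -> type in (<urn:Person>)",
            "documentFilter": "bound(?name)",
            "objectFields": [
                {
                    "fieldName": "name",
                    "propertyChain": [
                        "urn:name"
                    ]
                }
            ]
        }
    ]
}
''' .
}
```

Read top to bottom, the intent is as follows: the top-level valueFilter drops every person whose gender is not urn:Female before any fields are collected, the top-level documentFilter keeps only documents that ended up with both a name and an age, the valueFilter on children discards nested candidates that are not of type urn:Person, and the documentFilter on children removes nested documents that have no name.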

### Filter operators¶

The filter operators are used to test if the value of a given field satisfies some condition:

Operator

Meaning

?var in (value1, value2, ...)

Tests if the field var’s value is one of the specified values. Unlike the similar SPARQL operator, values are compared strictly, i.e., for literals to match, their datatype must be exactly the same (similar to how SPARQL sameTerm works). Values that do not match are treated as if they were not present in the repository.

Example:
?status in ("active", "new")

?var not in (value1, value2, ...)

The negated version of the in-operator.

Example:
?status not in ("archived")

bound(?var)

Tests if the field var has a valid value. This can be used to make the field compulsory.

Example:
bound(?name)

isExplicit(?var)

Tests if the field var’s value came from an explicit statement. This will use the last element of the property chain. If you need to assert the explicit status of a previous element in the property chain, use parent(?var) as many times as needed.

Example:
isExplicit(?name)

?var = value (equal to)
?var != value (not equal to)
?var > value (greater than)
?var >= value (greater than or equal to)
?var < value (less than)
?var <= value (less than or equal to)

RDF value comparison operators that compare RDF values similarly to the equivalent SPARQL operators. The field var’s value will be compared to the specified RDF value. When comparing RDF values that are literals, their datatypes must be compatible, e.g., xsd:integer and xsd:long but not xsd:string and xsd:date. Values that do not match are treated as if they were not present in the repository.

Examples:
Given that height’s value is "150"^^xsd:int and dateOfBirth’s value is "1989-12-31"^^xsd:date, then:
?height = "150"^^xsd:int is true
?height = "150"^^xsd:long is true
?height = "150" is false

?height != "151"^^xsd:int is true
?height != "150" is true

?height > "150"^^xsd:int is false
?height >= "150"^^xsd:int is true
?dateOfBirth < "1990-01-01"^^xsd:date is true

regex(?var, "pattern")

or

regex(?var, "pattern", "i")

Tests if the field var’s value matches the given regular expression pattern.
If the “i” flag option is present, this indicates that the match operates in case-insensitive mode.
Values that do not match are treated as if they were not present in the repository.
Example:
regex(?name, "^mrs?", "i")

expr1 || expr2

or

expr1 or expr2

Logical disjunction of expressions expr1 and expr2.

Examples:
bound(?name) || bound(?company)
bound(?name) or bound(?company)

expr1 && expr2

or

expr1 and expr2

Logical conjunction of expressions expr1 and expr2.

Examples:
bound(?status) && ?status in ("active", "new")
bound(?status) and ?status in ("active", "new")

!expr

Logical negation of expression expr.

Example:
!bound(?company)

( expr )

Grouping of expressions

Example:
(bound(?name) or bound(?company)) && bound(?address)

### Filter modifiers¶

In addition to the operators, there are some constructions that can be used to write filters based not on the values of a field but on values related to them:

Accessing the previous element in the chain

The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var) can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the bound operator like this: bound(parent(?var)).

Accessing an element beyond the chain

The construction ?var -> uri (alternatively, ?var o uri or just ?var uri) is used to access additional values that are accessible through the property uri. In essence, this construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this: ?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this: bound(parent(?company) -> <urn:hasGroup>).

The IRI parameter can be a full IRI within < > or the special string rdf:type (alternatively, just type), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
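
The sketch below shows where such expressions might appear in a connector definition. It assumes a hypothetical two-step property chain (example#hasBranch followed by rdfs:label) for a field named branchName, so that parent(?branchName) refers to the branch resource itself. The field name, the example#hasBranch property, the placement in the document filter, and the placeholder IRIs <urn:c> and <urn:d> are all illustrative assumptions.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: branchName and example#hasBranch are hypothetical.
# parent(?branchName) steps back from the label to the branch resource,
# and "-> rdf:type" then follows rdf:type from that resource.
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "fields": [
    {
      "fieldName": "branchName",
      "propertyChain": [
        "http://www.ontotext.com/example#hasBranch",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    }
  ],
  "documentFilter": "parent(?branchName) -> rdf:type in (<urn:c>, <urn:d>)"
}
''' .
}
```

Under these assumptions, the filter would be expected to keep a document only if at least one of its branches has the type <urn:c> or <urn:d>.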

Filtering by RDF graph

The construction graph(?var) is used for accessing the RDF graph of a field’s value. A typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/implicit>). However, using isExplicit(?a) is the recommended way to do this.

The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).

Filtering by language tags

The construction lang(?var) is used for accessing the language tag of a field’s value (only RDF literals can have a language tag). The typical use case is to sync only values written in a given language: lang(?a) in ("de", "it", "no"). The construction can be combined with parent() and an element beyond the chain like this: lang(parent(?a) -> <http://www.w3.org/2000/01/rdf-schema#label>) in ("en", "bg"). Literal values without language tags can be filtered by using an empty tag: "".

Current context variable $this

The special field variable $this (and not ?this, ?$this, $?this) is used to refer to the current context. In the top-level value filter and the top-level document filter it refers to the document ID. In the per-field value filter it refers to the currently filtered field value. In the nested document filter it refers to the nested document ID.

ALL() quantifier

In the context of document-level filtering, a match is true if at least one of potentially many field values match, e.g., ?location = <urn:Europe> would return true if the document contains { "location": ["<urn:Asia>", "<urn:Europe>"] }.

In addition to this, you can also use the ALL() quantifier when you need all values to match, e.g., ALL(?location) = <urn:Europe> would not match the above document because <urn:Asia> is not equal to <urn:Europe>.
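
To make the difference concrete, here is a hedged sketch of a connector definition that keeps only documents whose location values are all <urn:Europe>. The example#location property IRI is an assumption for illustration.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: the example#location property IRI is hypothetical.
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "fields": [
    {
      "fieldName": "location",
      "propertyChain": ["http://www.ontotext.com/example#location"]
    }
  ],
  "documentFilter": "ALL(?location) = <urn:Europe>"
}
''' .
}
```

Replacing the filter with ?location = <urn:Europe> would also keep a document that has both <urn:Asia> and <urn:Europe> as locations, since one matching value is enough without the quantifier.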

Entity filters and default values

Entity filters can be combined with default values in order to get more flexible behavior.

A typical use case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.
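
A soft-delete setup could be sketched as follows. It assumes a hypothetical property example#deleted whose plain string value "true" marks an entity as deleted. Entities without the property receive the default value "false" and therefore pass the document filter.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: example#deleted and its "true"/"false" string values are hypothetical.
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": ["http://www.ontotext.com/example#name"]
    },
    {
      "fieldName": "deleted",
      "propertyChain": ["http://www.ontotext.com/example#deleted"],
      "defaultValue": "false"
    }
  ],
  "documentFilter": "?deleted in (\\"false\\")"
}
''' .
}
```

In this sketch, marking an entity with example#deleted "true" would remove it from the index on the next synchronization without deleting it from the repository.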

### Two-variable filtering¶

Besides comparing a field value to one or more constants or running an existential check on the field value, some use cases also require comparing the field value to the value of another field in order to produce the desired result. GraphDB solves this by supporting two-variable filtering in the per-field value filter, the top-level document filter, and the nested document filter.

Note

This type of filtering is not possible in the top-level value filter because the only variable that is available there is $this. In the top-level document filter and the nested document filter, there are no restrictions as all values are available at the time of evaluation. In the per-field value filter, two-variable filtering will reorder the defined fields such that values for other fields are already available when the current field’s filter is evaluated. For example, let’s say we defined a filter $this > ?salary for the field price. This will force the connector to process the field salary first, apply its per-field value filter if any, and only then start collecting and filtering the values for the field price.

Cyclic dependencies will be detected and reported as an invalid filter. For example, if in addition to the above we define a per-field value filter ?price > "1000"^^xsd:int for the field salary, a cyclic dependency will be detected because price and salary each require the other field to be indexed first.
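
The field ordering described in the note can be illustrated with a sketch based on the price and salary example. The example#price and example#salary property IRIs are assumptions; the filter expression is the one from the note.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: the example#price and example#salary property IRIs are hypothetical.
# price is listed first, but its filter refers to ?salary,
# so the connector is expected to process salary before price.
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "fields": [
    {
      "fieldName": "price",
      "propertyChain": ["http://www.ontotext.com/example#price"],
      "valueFilter": "$this > ?salary"
    },
    {
      "fieldName": "salary",
      "propertyChain": ["http://www.ontotext.com/example#salary"]
    }
  ]
}
''' .
}
```

Adding a per-field value filter that refers to ?price on the salary field would turn this definition into the cyclic case, which is reported as an invalid filter.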

### Basic entity filter example¶

Given the following RDF data:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix example: <http://www.ontotext.com/example#> .

# the entity below will be synchronized because it has a matching value for city: $this = "London"
example:alpha
    example:name "John Synced" ;
    example:city "London" .

# the entity below will not be synchronized because it lacks the property completely: bound(?city) is false
example:beta
    example:name "Peter Syncfree" .

# the entity below will not be synchronized because it has a different city value:
# $this = "London" will remove the value "Liverpool" so bound(?city) will be false
example:gamma
    example:name "Mary Syncless" ;
    example:city "Liverpool" .


If you create a connector instance such as:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "fields": [
    {
      "fieldName": "name",
      "propertyChain": ["http://www.ontotext.com/example#name"]
    },
    {
      "fieldName": "city",
      "propertyChain": ["http://www.ontotext.com/example#city"],
      "valueFilter": "$this = \\"London\\""
    }
  ],
  "documentFilter": "bound(?city)"
}
''' .
}

The entity :beta is not synchronized as it has no value for city.

To handle such cases, you can modify the connector configuration to specify a default value for city:

...
    {
      "fieldName": "city",
      "propertyChain": ["http://www.ontotext.com/example#city"],
      "defaultValue": "London"
    }
...
}

The default value is used for the entity :beta as it has no value for city in the repository. As the value is “London”, the entity is synchronized.

### Advanced entity filter example¶

Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix example2: <http://www.ontotext.com/example2#> .

example2:Berlin
    rdf:type example2:Location ;
    rdfs:label "Berlin" .

example2:Mozart
    rdf:type example2:Person ;
    rdfs:label "Wolfgang Amadeus Mozart" .

example2:Einstein
    rdf:type example2:Person ;
    rdfs:label "Albert Einstein" .

example2:Cannes-FF
    rdf:type example2:Event ;
    rdfs:label "Cannes Film Festival" .

example2:Article1
    rdf:type example2:Article ;
    rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
    example2:taggedWith example2:Berlin ;
    example2:taggedWith example2:Einstein ;
    example2:taggedWith example2:Cannes-FF .

example2:Article2
    rdf:type example2:Article ;
    rdfs:comment "An article about Berlin." ;
    example2:taggedWith example2:Berlin .

example2:Article3
    rdf:type example2:Article ;
    rdfs:comment "An article about Mozart's life." ;
    example2:taggedWith example2:Mozart .

example2:Article4
    rdf:type example2:Article ;
    rdfs:comment "An article about classical music in Berlin." ;
    example2:taggedWith example2:Berlin ;
    example2:taggedWith example2:Mozart .

example2:Article5
    rdf:type example2:Article ;
    rdfs:comment "A boring article that has no tags." .

example2:Article6
    rdf:type example2:Article ;
    rdfs:comment "An article about the Cannes Film Festival in 2013." ;
    example2:taggedWith example2:Cannes-FF .

Now, if you map this data to Elasticsearch so that the property :taggedWith x is mapped to separate fields taggedWithPerson and taggedWithLocation according to the type of x (we are not interested in events), you can map taggedWith twice to different fields and then use an entity filter to get the desired values:

PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "types": ["http://www.ontotext.com/example2#Article"],
  "fields": [
    {
      "fieldName": "comment",
      "propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
    },
    {
      "fieldName": "taggedWithPerson",
      "propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
      "valueFilter": "$this -> type in (<http://www.ontotext.com/example2#Person>)"
    },
    {
      "fieldName": "taggedWithLocation",
      "propertyChain": ["http://www.ontotext.com/example2#taggedWith"],
      "valueFilter": "$this -> type in (<http://www.ontotext.com/example2#Location>)"
    }
  ]
}
''' .
}

## Overview of connector predicates¶

The following diagram shows a summary of all predicates that can administrate (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity.

Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and IRIs in orange. The predicates are represented by labeled arrows.

## Caveats¶

### Order of control¶

Even though SPARQL per se is not sensitive to the order of triple patterns, the Elasticsearch GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in Overview of connector predicates provides a quick overview of the predicates.

## Upgrading from previous versions¶

### Migrating from GraphDB 9.x¶

GraphDB 10.0 introduces major changes to the filtering mechanism of the connectors. Existing connector instances will not be usable and attempting to use them for queries or updates will throw an error.

If your GraphDB 9.x (or older) connector definitions do not include an entity filter, you can simply repair them.

If your GraphDB 9.x (or older) connector definitions do include an entity filter with the entityFilter option, you need to rewrite the filter with one of the current filter types:

1. Save your existing connector definition.

2. Drop the connector instance.

3. In general, most older connector filters can be easily rewritten using the per-field value filter and top-level document filter.
Rewrite the filters as follows:

Rule of thumb:

• If you want to remove individual values, i.e., if the operand is not BOUND(), rewrite with a per-field value filter.

• If you want to remove entire documents, i.e., if the operand is BOUND(), rewrite with a top-level document filter.

So if we take the example:

?location = <urn:Europe> AND BOUND(?location) AND ?type IN (<urn:Foo>, <urn:Bar>)

It needs to be rewritten like this:

• Per-field rule on field location: $this = <urn:Europe>

• Per-field rule on field type: $this IN (<urn:Foo>, <urn:Bar>)

• Top-level document filter: BOUND(?location)

4. Recreate the connector instance using the new definition.
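
Putting the steps together, the migrated definition for the example filter might look like the sketch below. The property chains for location and type are assumptions (the original 9.x definition is not shown), as are the instance name and node address. Only the three filter expressions come from the rewrite rules in step 3.

```sparql
PREFIX elastic: <http://www.ontotext.com/connectors/elasticsearch#>
PREFIX elastic-index: <http://www.ontotext.com/connectors/elasticsearch/instance#>

# Sketch only: the property chains below are hypothetical stand-ins
# for whatever the saved 9.x definition contained.
INSERT DATA {
    elastic-index:my_index elastic:createConnector '''
{
  "elasticsearchNode": "localhost:9200",
  "fields": [
    {
      "fieldName": "location",
      "propertyChain": ["http://www.ontotext.com/example#location"],
      "valueFilter": "$this = <urn:Europe>"
    },
    {
      "fieldName": "type",
      "propertyChain": ["http://www.w3.org/1999/02/22-rdf-syntax-ns#type"],
      "valueFilter": "$this IN (<urn:Foo>, <urn:Bar>)"
    }
  ],
  "documentFilter": "BOUND(?location)"
}
''' .
}
```

The two value-level conditions of the old entity filter become per-field value filters, while the BOUND() check, which decides whether the whole document is kept, moves to the top-level document filter.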