Solr GraphDB connector

Overview and features

The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches that are typically implemented by an external component or service such as Solr, with the additional benefit of staying automatically up-to-date with the GraphDB repository data.

The Connectors provide synchronization at the entity level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.

The main features of the GraphDB Connectors are:

  • maintaining an index that is always in sync with the data stored in GraphDB;

  • multiple independent instances per repository;

  • the entities for synchronization are defined by:

    • a list of fields (on the Solr side) and property chains (on the GraphDB side) whose values will be synchronized;

    • a list of rdf:type’s of the entities for synchronization;

    • a list of languages for synchronization (the default is all languages);

    • additional filtering by property and value.

  • full-text search using native Solr queries;

  • snippet extraction: highlighting of search terms in the search result;

  • faceted search;

  • sorting by any preconfigured field;

  • paging of results using offset and limit;

  • custom mapping of RDF types to Solr types.

Each feature is described in detail below.

Usage

All interactions with the Solr GraphDB Connector are done through SPARQL queries.

There are three types of SPARQL queries:

  • INSERT for creating and deleting connector instances;

  • SELECT for listing connector instances and querying their configuration parameters;

  • INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.

In general, this follows the SPARQL convention: INSERT adds or modifies data, while SELECT queries existing data.

Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Solr GraphDB Connector, this is http://www.ontotext.com/connectors/solr#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/solr#createConnector to create a connector instance for Solr.

Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Solr, the instance prefix is http://www.ontotext.com/connectors/solr/instance#.

Sample data

All examples use the following sample data, which describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://www.ontotext.com/example/wine#> .

:RedWine rdfs:subClassOf :Wine .
:WhiteWine rdfs:subClassOf :Wine .
:RoseWine rdfs:subClassOf :Wine .

:Merlo
    rdf:type :Grape ;
    rdfs:label "Merlo" .

:CabernetSauvignon
    rdf:type :Grape ;
    rdfs:label "Cabernet Sauvignon" .

:CabernetFranc
    rdf:type :Grape ;
    rdfs:label "Cabernet Franc" .

:PinotNoir
    rdf:type :Grape ;
    rdfs:label "Pinot Noir" .

:Chardonnay
    rdf:type :Grape ;
    rdfs:label "Chardonnay" .

:Yoyowine
    rdf:type :RedWine ;
    :madeFromGrape :CabernetSauvignon ;
    :hasSugar "dry" ;
    :hasYear "2013"^^xsd:integer .

:Franvino
    rdf:type :RedWine ;
    :madeFromGrape :Merlo ;
    :madeFromGrape :CabernetFranc ;
    :hasSugar "dry" ;
    :hasYear "2012"^^xsd:integer .

:Noirette
    rdf:type :RedWine ;
    :madeFromGrape :PinotNoir ;
    :hasSugar "medium" ;
    :hasYear "2012"^^xsd:integer .

:Blanquito
    rdf:type :WhiteWine ;
    :madeFromGrape :Chardonnay ;
    :hasSugar "dry" ;
    :hasYear "2012"^^xsd:integer .

:Rozova
    rdf:type :RoseWine ;
    :madeFromGrape :PinotNoir ;
    :hasSugar "medium" ;
    :hasYear "2013"^^xsd:integer .

Setup and maintenance

Prerequisites

Solr core creation

To create new Solr cores on the fly, you have to use the custom admin handler provided with the Solr Connector.

  1. Copy the solr-core-admin-handler.jar file from the tools folder of the GraphDB distribution to your Solr home.

  2. Tell Solr to scan the .jar and use the GraphDB custom handler instead of the default one. Add the following to the root solr tag in solr.xml in your Solr home:

<str name="adminHandler">com.ontotext.solr.handler.admin.GraphDBConnectorAdminHandler</str>
<str name="sharedLib">${sharedLib:}</str>

To start Solr with a custom Solr home:

solr-7.2.1/bin/solr start -p 8934 -s /path_to_solr-home

Solr schema setup

To use the connector, the core’s schema from which the configuration will be copied (most of the time named collection1) must be configured to allow schema modifications. See “Managed Schema Definition in SolrConfig” on page 409 of the Apache Solr Reference Guide.

A good starting point is the configuration from example-schemaless in the Solr distribution.

Third-party component versions

This version of the Solr GraphDB Connector uses Solr version 8.6.3.

Creating a connector instance

Creating a connector instance is done by sending a SPARQL query with the following configuration data:

  • the name of the connector instance (e.g., my_index);

  • a Solr instance to synchronize to;

  • classes to synchronize;

  • properties to synchronize.

The configuration data has to be provided as a JSON string representation and passed together with the create command.

If you create the connector via the Workbench, regardless of the method you use, a pop-up screen shows the connector creation progress.

Using the Workbench

  1. Go to Setup -> Connectors.

  2. Click the New Connector button in the tab of the respective Connector type you want to create.

  3. Fill in the configuration form.

  4. Execute the CREATE statement from the form by clicking OK. Alternatively, you can view its SPARQL query by clicking View SPARQL Query, and then copy it to execute it manually or integrate it in automation scripts.

_images/create-connector-solr.png

Using the create command

The create command is triggered by a SPARQL INSERT with the createConnector predicate. For example, the following query creates a connector instance called my_index, which synchronizes the wines from the sample data above:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ],
      "analyzed": false,
      "multivalued": false
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ],
      "analyzed": false
    }
  ]
}
''' .
}

Note

One of the fields has "multivalued": false. This is explained further under Sorting.

The above command creates a new Solr connector instance that connects to the Solr instance accessible at port 8983 on the localhost as specified by the "solrUrl" key.

The "types" key defines the RDF type of the entities to synchronize and, in the example, it is only entities of the type http://www.ontotext.com/example/wine#Wine (and its subtypes). The "fields" key defines the mapping from RDF to Solr. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three bits of information are mapped - the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: “grape”, “sugar”, and “year”. The field names are later used in the queries.

Grape is an example of a property chain composed of more than one property. First, we take the wine’s madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.

The fields sugar and year contain discrete values, such as medium, dry, 2012, 2013, and thus it is best to specify the option analyzed: false as well. See analyzed in Defining fields for more information.

Schema and core management

By default, GraphDB manages (creates, deletes, or updates if needed) the Solr core and the Solr schema. This makes it easier to use Solr, as everything is done automatically. This behavior can be changed by the following options:

  • manageCore: if true, GraphDB manages the core. true by default.

  • manageSchema: if true, GraphDB manages the schema. true by default.

The automatic core management requires the custom Solr admin handler provided with the GraphDB distribution. For more information, see Solr core creation.

Note

If either of the options is set to false, you have to create, update or remove the core/schema manually and, in case Solr is misconfigured, the connector instance will not function correctly.

Using a non-managed schema

The present version provides no support for changing some advanced options, such as stopwords, on a per field basis. The recommended way to do this for now is to manage the schema yourself and tell the connector to just sync the object values in the appropriate fields. Here is an example:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ],
      "analyzed": false,
      "multivalued": false
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ],
      "analyzed": false
    }
  ],
  "manageSchema": "false"
}
''' .
}

This creates the same connector instance as above, but it expects the fields with the specified field names, as well as some internal GraphDB fields, to be already present in the core. For the example, you must have the following fields:

Field name    Solr config
_graphdb_id   <field name="_graphdb_id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
grape         <field name="grape" type="text_general" indexed="true" stored="true" multiValued="true"/>
sugar         <field name="sugar" type="text_general" indexed="true" stored="true" multiValued="false"/>
year          <field name="year" type="tints" indexed="true" stored="true" multiValued="true"/>

_graphdb_id is used internally by GraphDB and is always required.

Working with secured Solr

GraphDB can access a secured Solr instance by passing the respective authentication parameters.

To set up basic authentication in the GraphDB Solr Connector, configure the solrBasicAuthUser and solrBasicAuthPassword parameters.

...
      inst:my_index conn:createConnector '''
    {
        "hasProperty": "http://www.w3.org/2000/01/rdf-schema#comment",
        "solrUrl": "${validSolrUrl}",
        "solrUrl": "http://localhost:9090/solr",
        "solrBasicAuthUser": "solr",
        "solrBasicAuthPassword": "SolrRocks",
        "fields": [
...

When you create a new Solr Connector in the GraphDB Workbench, add values for the Solr basic auth user and Solr basic auth password options.

For more information about securing Solr, see the Solr documentation: Enable Basic Authentication.

Dropping a connector instance

Dropping a connector instance removes all references to its external store from GraphDB as well as the Solr core associated with it.

The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the connector instance has to be in the subject position, e.g., this removes the connector my_index:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
  inst:my_index :dropConnector "" .
}

You can also force drop a connector in case a normal delete does not work. The force delete will remove the connector even if part of the operation fails. Go to Setup -> Connectors where you will see the already existing connectors that you have created. Click the Delete icon, and check Force delete in the dialog box.

_images/connectors_force_delete.png

Retrieving the create options for a connector instance

You can view the options string that was used to create a particular connector instance with the following query:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?createString {
  inst:my_index :listOptionValues ?createString .
}

Listing available connector instances

In the Connectors management view

Existing Connector instances are shown under Existing connectors (below the New Connector button). Click the name of an instance to view its configuration and SPARQL query, or click the repair / delete icons to perform these operations.

_images/connectors.png

With a SPARQL query

Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors predicate:

PREFIX : <http://www.ontotext.com/connectors/solr#>

SELECT ?cntUri ?cntStr {
  ?cntUri :listConnectors ?cntStr .
}

?cntUri is bound to the prefixed URI of the connector instance that was used during creation, e.g., http://www.ontotext.com/connectors/solr/instance#my_index, while ?cntStr is bound to a string representing the part after the prefix, e.g., "my_index".

Instance status check

The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:

PREFIX : <http://www.ontotext.com/connectors/solr#>

SELECT ?cntUri ?cntStatus {
  ?cntUri :connectorStatus ?cntStatus .
}

?cntUri is bound to the prefixed URI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this URI. The status is key-value based.

Working with data

Adding, updating, and deleting data

From the user's point of view, all synchronization happens transparently, without using any additional predicates or naming a specific store explicitly, i.e., you simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
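
For illustration, here is a minimal sketch of a standard update that the my_index instance defined earlier would pick up automatically. The new wine :Novello is hypothetical, but it follows the sample data schema, so no connector-specific syntax is needed:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

INSERT DATA {
  # a new red wine; the connector re-indexes the affected document transparently
  wine:Novello a wine:RedWine ;
      wine:madeFromGrape wine:Merlo ;
      wine:hasSugar "dry" ;
      wine:hasYear "2014"^^xsd:integer .
}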

Simple queries

Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a SELECT and providing the Solr query as the object of the query predicate:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
}

The result binds ?entity to the two wines made from grapes that have “cabernet” in their name, namely :Yoyowine and :Franvino.

Note

You must use the field names you chose when you created the connector instance. They can be identical to the property URIs but you must escape any special characters according to what Solr expects.

The query above consists of the following parts:

  1. Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance URI. X is bound to a query instance of the connector instance.

  2. Assign a query to the query instance by using the system predicate :query.

  3. Request the matching entities through the :entities predicate.

It is also possible to provide per query search options by using one or more option predicates. The option predicates are described in detail below.
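
For instance, a query that combines :query with some of the option predicates described later (here :orderBy and :limit) might look like the following sketch against the my_index instance; the chosen option values are arbitrary:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "year:2012" ;    # full-text query
      :orderBy "-sugar" ;     # sort by sugar content, descending
      :limit "2" ;            # return at most two entities
      :entities ?entity .
}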

Raw queries

To access a Solr query parameter that is not exposed through a special predicate, use a raw query. Instead of providing a full-text query in the :query part, specify raw Solr parameters. For example, to sort the facets in a different order, set the facet.sort parameter by executing the following query:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query '''
      {
          "facet":"true",
          "indent":"true",
          "facet.sort":"index",
          "q":"*:*",
          "wt":"json"
      }
      ''' ;
      :entities ?entity .
}

You can get these parameters when you run your query from the Solr admin interface, or from the response payload (where they are included). The query parameters of Solr's select endpoint are also supported. Here is an example:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query '''q=*%3A*&wt=json&indent=true&facet=true&facet.sort=index''' ;
      :entities ?entity .
}

Note

You have to specify q= as the first parameter as it is used for detecting the raw query.

Combining Solr results with GraphDB data

The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?grape ?year {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity wine:madeFromGrape ?grape .
  ?entity wine:hasYear ?year
}

The result looks like this:

?entity     ?grape               ?year
:Yoyowine   :CabernetSauvignon   2013
:Franvino   :Merlo               2012
:Franvino   :CabernetFranc       2012

Note

:Franvino is returned twice because it is made from two different grapes, both of which are returned.

Entity match score

It is possible to access the match score returned by Solr with the score predicate. As each entity has its own score, the predicate should come at the entity level. For example:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity ?score {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :score ?score
}

The result looks like this but the actual score might be different as it depends on the specific Solr version:

?entity     ?score
:Yoyowine   0.9442660212516785
:Franvino   0.7554128170013428

Basic facet queries

Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?facetName ?facetValue ?facetCount WHERE {
  # note empty query is allowed and will just match all documents, hence no :query
  ?r a inst:my_index ;
    :facetFields "year,sugar" ;
    :facets _:f .
  _:f :facetName ?facetName .
  _:f :facetValue ?facetValue .
  _:f :facetCount ?facetCount .
}

It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the facets predicate. As each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates facetName, facetValue, and facetCount.

The resulting bindings look like the following:

facetName   facetValue   facetCount
year        2012         3
year        2013         2
sugar       dry          3
sugar       medium       2

You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines as each facet is computed independently.

Tip

Faceting by analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and faceting uses each token to create a faceting bucket. For example, “North America” and “Europe” produce three buckets: “north”, “america” and “europe”, corresponding to each token in the two values. If you need to facet by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed": false. For more information, see Copy fields.

Advanced facet and aggregation queries

While basic faceting allows for simple counting of documents based on the discrete values of a particular field, there are more complex faceted or aggregation searches in Solr. The Solr GraphDB Connector provides a mapping from Solr results to RDF results but no mechanism for specifying such queries other than executing them as Raw queries.

Supported Solr facets and aggregations

The Solr GraphDB Connector supports mapping of range, interval and pivot facets.

Tip

For more information, refer to the documentation of Solr.

RDF mapping of the results

The results are accessed through the predicate aggregations (much like the basic facets are accessed through facets). The predicate binds multiple blank nodes that each contain a single aggregation bucket. The individual bucket items can be accessed through these predicates:

predicate       meaning                                                        Solr counterpart
:name           Bucket name                                                    getName()
:key            Key or value associated with the bucket                        getValue() or getKey()
:count          Count of documents in the bucket                               getCount()
:from           Start of range (RangeFacet)                                    getStart()
:to             End of range (RangeFacet)                                      getEnd()
:rangeGap       Gap of range (RangeFacet)                                      getGap()
:beforeCount    Count of documents before the first range (RangeFacet)        getBefore()
:afterCount     Count of documents after the last range (RangeFacet)          getAfter()
:betweenCount   Count of documents within all ranges (RangeFacet)             getBetween()
:parent         Pivot facets: points to the parent (upper level) blank node   (none)
:level          Pivot facets: level number, where 1 is the uppermost level    (none)
                and the following levels are 2, 3, and so on
:levelName      Pivot facets: level name                                       getField()
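
As a sketch of how these predicates fit together, the following query runs a raw Solr range-facet request over the year field and reads the resulting buckets through the :aggregations predicate. It assumes a connector whose year field is indexed as a numeric Solr type (i.e., created without "analyzed": false); the facet.range parameters are standard Solr range faceting options:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?key ?count {
  ?search a inst:my_index ;
      :query '''q=*%3A*&facet=true&facet.range=year&facet.range.start=2010&facet.range.end=2015&facet.range.gap=1''' ;
      :aggregations _:a .
  # each blank node holds one bucket; :key and :count correspond to getValue()/getCount()
  _:a :key ?key ;
      :count ?count .
}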

Sorting

It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "year:2013" ;
      :orderBy "-sugar" ;
      :entities ?entity .
}

The result contains wines produced in 2013 sorted according to their sugar content in descending order:

entity
Rozova
Yoyowine

By default, entities are sorted according to their matching score in descending order.

Note

If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.

Tip

Sorting by an analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, “North America” will be sorted before “Europe” because the token “america” is lexicographically smaller than the token “europe”. If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed": false. For more information, see Copy fields.

Note

Solr imposes an additional requirement on fields used for sorting. They must be defined with multivalued = false.

Limit and offset

Limit and offset are supported on the Solr side of the query. This is achieved through the predicates limit and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity {
  ?search a inst:my_index ;
      :query "sugar:dry" ;
      :offset "1" ;
      :limit "1" ;
      :entities ?entity .
}

The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:

entity
Yoyowine
Franvino
Blanquito

Note

The specific order in which GraphDB returns the results depends on how Solr returns the matches, unless sorting is specified.

Snippet extraction

Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate snippets. It binds a blank node that in turn provides the actual snippets via the predicates snippetField and snippetText. The predicate snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity ?snippetField ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetField ?snippetField ;
     :snippetText ?snippetText .
}

the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:

?entity     ?snippetField   ?snippetText
:Yoyowine   grape           <em>Cabernet</em> Sauvignon
:Franvino   grape           <em>Cabernet</em> Franc

Note

The actual snippets might be different as this depends on the specific Solr implementation.

It is possible to tweak how the snippets are collected/composed by using the following option predicates:

  • :snippetSize - sets the maximum size of the extracted text fragment, 250 by default;

  • :snippetSpanOpen - text to insert before the highlighted text, <em> by default;

  • :snippetSpanClose - text to insert after the highlighted text, </em> by default.

The option predicates are set on the query instance, much like the :query predicate.
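
The following sketch shows these options alongside the snippet predicates, reusing the my_index search from above; the chosen size and tags are arbitrary illustrations:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?entity ?snippetText {
  ?search a inst:my_index ;
      :query "grape:cabernet" ;
      :snippetSize "100" ;        # limit extracted fragments to 100 characters
      :snippetSpanOpen "<b>" ;    # wrap highlighted terms in <b>...</b> instead of <em>...</em>
      :snippetSpanClose "</b>" ;
      :entities ?entity .
  ?entity :snippets _:s .
  _:s :snippetText ?snippetText .
}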

Total hits

You can get the total number of hits by using the totalHits predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?totalHits {
    ?r a inst:my_index ;
       :query "year:2012" ;
       :totalHits ?totalHits .
}

As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.

List of creation parameters

The creation parameters define how a connector instance is created by the :createConnector predicate. Some are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.

All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.

solrUrl (URL), required, Solr instance to sync to

As Solr is a third-party service, you have to specify the URL on which it is running. The URL has the form http://hostname.domain:port/. There is no default value.

bulkUpdateBatchSize (integer), controls the maximum number of documents sent per bulk request.

Default value is 1,000.

types (list of URI), required, specifies the types of entities to sync

The RDF types of entities to sync are specified as a list of URIs. At least one type URI is required.

languages (list of string), optional, valid languages for literals

RDF data is often multilingual but you can map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.

fields (list of field object), required, defines the mapping from RDF to Solr

The fields define exactly what parts of each entity will be synchronized as well as the specific details on the connector side. The field is the smallest synchronization unit and it maps a property chain from GraphDB to a field in Solr. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.

  • fieldName (string), required, the name of the field in Solr

    The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name, but to avoid unnecessary escaping (which depends on how Solr parses its queries), we recommend keeping the field names simple.

  • propertyChain (list of URI), required, defines the property chain to reach the value

    The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs where at least one URI must be provided.

    The URI of the document will be synchronized to the special field "id" in Solr. You may use it to query Solr directly and retrieve the matching entity URI.

    See Copy fields for defining multiple fields with the same property chain.

    See Multiple property chains per field for defining a field whose values are populated from more than one property chain.

    See Indexing language tags for defining a field whose values are populated with the language tags of literals.

    See Indexing the URI of an entity for defining a field whose values are populated with the URI of the indexed entity.

  • defaultValue (string), optional, specifies a default value for the field

    The default value (defaultValue) provides means for specifying a default value for the field when the property chain has no matching values in GraphDB. The default value can be a plain literal, a literal with a datatype (xsd: prefix supported), a literal with language, or a URI. It has no default value.

  • indexed (boolean), optional, default true

    If indexed, a field is available for Solr queries. true by default.

    This option corresponds to the property "indexed" in the Solr schema.

  • stored (boolean), optional, default true

    Fields can be stored in Solr and this is controlled by the Boolean option "stored". Stored fields are required for retrieving snippets. true by default.

    This option corresponds to the property "stored" in the Solr schema.

  • analyzed (boolean), optional, default true

    When literal fields are indexed in Solr, they will be analysed according to the analyser settings. Should you require that a given field is not analysed, you may set "analyzed" to false. This option has no effect for URIs (they are never analysed). true by default.

    This option affects the Solr type that is used for the field: true uses a type suitable for the values (i.e., text or numeric), while false uses the type "string", which is never analysed by Solr.

  • multivalued (boolean), optional, default true

    RDF properties and synchronized fields may have more than one value. If "multivalued" is set to true, all values will be synchronized to Solr. If set to false, only a single value will be synchronized. true by default.

    This option corresponds to the "multiValued" property in the Solr schema. Note that Solr cannot order results by multivalued fields so you need to adjust your options accordingly.

  • datatype (string), optional, the manual datatype override

    By default, the Solr GraphDB Connector uses the datatype of literal values to determine how they must be mapped to Solr types. For more information on the supported datatypes, see Datatype mapping.

    The mapping can be overridden through the property "datatype", which can be specified per field. The value of "datatype" can be any of the xsd: types supported by the automatic mapping or a native Solr type prefixed by native:, e.g., both xsd:long and native:tlongs map to the tlongs type in Solr. A combined example using several of the optional parameters above is shown right after this list.
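
The following sketch pulls several of the optional creation parameters above into a single create request. It extends the wine example, and the chosen values (batch size, language ranges, default value, datatype override) are illustrative only:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "bulkUpdateBatchSize": 500,
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "languages": ["en", ""],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ],
      "defaultValue": "unknown grape"
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ],
      "multivalued": false,
      "datatype": "xsd:long"
    }
  ]
}
''' .
}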

Special field definitions

Copy fields

Often, it is convenient to synchronize one and the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Solr GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:

...
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ],
      "analyzed": true,
    },
    {
      "fieldName": "grapeFacet",
      "propertyChain": [
        "@grape"
      ],
      "analyzed": false,
    }
  ]
...

The snippet creates an analysed field “grape” and a non-analysed field “grapeFacet”. Both fields are populated with the same values, and “grapeFacet” is defined as a copy field that refers to the field “grape”.

Note

The connector handles copy fields in a more efficient way than specifying a field with exactly the same property chain as another field.

Multiple property chains per field

Sometimes, you have to work with data models that define the same concept (in terms of what you want to index in Solr) with more than one property chain, e.g., the concept of “name” could be defined as a single canonical name, multiple historical names and some unofficial names. If you want to index these together as a single field in Solr you can define this as a multiple property chains field.

Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single physical field when indexed. Virtual fields are distinguished by the suffix $xyz, where xyz is any alphanumeric sequence of convenience. For example, we can define the fields name$1 and name$2 like this:

...
  "fields": [
    {
      "fieldName": "name$1",
      "propertyChain": [
        "http://www.ontotext.com/example#canonicalName"
      ]
    },
    {
      "fieldName": "name$2",
      "propertyChain": [
        "http://www.ontotext.com/example#historicalName"
      ]
    },
    ...

The values of the fields name$1 and name$2 will be merged and synchronized to the field name in Solr.

Note

You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField$new and myField$old, you cannot also have a field called just myField.

Filters and fields with multiple property chains

Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual virtual field values are available:

  • Physical fields are specified without the suffix, e.g., ?myField

  • Virtual fields are specified with the suffix, e.g., ?myField$2 or ?myField$alt.

Note

Physical fields cannot be combined with parent() as their values come from different property chains. If you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as parent(?myField$1) in (<urn:x>, <urn:y>) || parent(?myField$2) in (<urn:x>, <urn:y>) || parent(?myField$3) ... and surround it with parentheses if it is a part of a bigger expression.

Indexing language tags

The language tag of an RDF literal can be indexed by specifying a property chain, where the last element is the pseudo-URI lang(). The property preceding lang() must lead to a literal value. For example,

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
  inst:my_index :createConnector '''
    {
      "solrUrl": "http://localhost:8983/solr",
      "types": ["http://www.ontotext.com/example#gadget"],
      "fields": [
         {
           "fieldName": "name",
           "propertyChain": [
             "http://www.ontotext.com/example#name"
           ]
         },
         {
           "fieldName": "nameLanguage",
           "propertyChain": [
             "http://www.ontotext.com/example#name",
             "lang()"
           ]
         }
      ]
    }
  ''' .
}

The above connector will index the language tag of each literal value of the property http://www.ontotext.com/example#name into the field nameLanguage.

Indexing the URI of an entity

Sometimes you may need the URI of each entity (e.g., http://www.ontotext.com/example/wine#Franvino from our small example dataset) indexed as a regular field. This can be achieved by specifying a property chain with a single property referring to the pseudo-URI $self. For example,

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
    inst:my_index :createConnector '''
{
  "solrUrl": "http://localhost:8983/solr",
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "entityId",
      "propertyChain": [
        "$self"
      ]
    },
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    }
  ]
}
''' .
}

The above connector will index the URI of each wine into the field entityId.

Note

Note that GraphDB will also use the URI of each entity as the ID of each document in Solr, which is represented by the field id.

Datatype mapping

The Solr GraphDB Connector maps different types of RDF values to different types of Solr values according to the basic type of the RDF value (URI or literal) and the datatype of literals. The autodetection uses the following mapping:

RDF value   RDF datatype   Solr type
URI         n/a            string
literal     none           text_general or text_xx where xx is language dependent
literal     xsd:boolean    booleans
literal     xsd:double     tdoubles
literal     xsd:float      tfloats
literal     xsd:long       tlongs
literal     xsd:int        tints
literal     xsd:dateTime   tdates
literal     xsd:date       tdates

The datatype mapping can be affected by the synchronization options, too. For example, a non-analysed field that has xsd:long values does not use "tlongs" but "string" instead.

Note

For any given field, the automatic mapping uses the first value it sees. This works fine for clean datasets but might lead to problems if your dataset has non-normalised data, e.g., if the first value has no datatype but other values do.

Advanced filtering and fine tuning

entityFilter (string)

The entityFilter parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronized to Solr if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to, in the entity filter, by prefixing it with a ?, much like referring to a variable in SPARQL. Several operators are supported:

Operator: ?var in (value1, value2, ...)
Meaning: Tests if the field var's value is one of the specified values. Values are compared strictly, unlike the similar SPARQL operator, i.e., for literals to match, their datatype must be exactly the same (similar to how SPARQL sameTerm works). Values that do not match are treated as if they were not present in the repository.
Example:
?status in ("active", "new")

Operator: ?var not in (value1, value2, ...)
Meaning: The negated version of the in-operator.
Example:
?status not in ("archived")

Operator: bound(?var)
Meaning: Tests if the field var has a valid value. This can be used to make the field compulsory.
Example:
bound(?name)

Operators: ?var = value (equal to), ?var != value (not equal to), ?var > value (greater than), ?var >= value (greater than or equal to), ?var < value (less than), ?var <= value (less than or equal to)
Meaning: RDF value comparison operators that compare RDF values similarly to the equivalent SPARQL operators. The field var's value will be compared to the specified RDF value. When comparing RDF values that are literals, their datatypes must be compatible, e.g., xsd:integer and xsd:long but not xsd:string and xsd:date. Values that do not match are treated as if they were not present in the repository.
Examples:
Given that height's value is "150"^^xsd:int and dateOfBirth's value is "1989-12-31"^^xsd:date, then:
?height = "150"^^xsd:int is TRUE
?height = "150"^^xsd:long is TRUE
?height = "150" is FALSE
?height != "151"^^xsd:int is TRUE
?height != "150" is TRUE
?height > "150"^^xsd:int is FALSE
?height >= "150"^^xsd:int is TRUE
?dateOfBirth < "1990-01-01"^^xsd:date is TRUE

Operator: regex(?var, "pattern") or regex(?var, "pattern", "i")
Meaning: Tests if the field var's value matches the given regular expression pattern. If the "i" flag option is present, the match operates in case-insensitive mode. Values that do not match are treated as if they were not present in the repository.
Example:
regex(?name, "^mrs?", "i")

Operator: expr1 || expr2 or expr1 or expr2
Meaning: Logical disjunction of expressions expr1 and expr2.
Examples:
bound(?name) || bound(?company)
bound(?name) or bound(?company)

Operator: expr1 && expr2 or expr1 and expr2
Meaning: Logical conjunction of expressions expr1 and expr2.
Examples:
bound(?status) && ?status in ("active", "new")
bound(?status) and ?status in ("active", "new")

Operator: !expr
Meaning: Logical negation of expression expr.
Example:
!bound(?company)

Operator: ( expr )
Meaning: Grouping of expressions.
Example:
(bound(?name) or bound(?company)) && bound(?address)

Note

  • ?var in (...) filters the values of ?var and leaves only the matching values, i.e., it will modify the actual data that will be synchronized to Solr

  • bound(?var) checks if there is any valid value left after filtering operators such as ?var in (...) have been applied

In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:

Accessing the previous element in the chain

The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var) can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or with the bound operator like this: bound(parent(?var)).
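
As a hedged sketch, the configuration fragment below (reusing the grape field from the wine example, whose chain is madeFromGrape followed by rdfs:label) keeps only label values whose parent, i.e., the grape instance itself, is one of the listed URIs; the listed grapes are taken from the sample data:

...
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    }
  ],
  "entityFilter": "parent(?grape) in (<http://www.ontotext.com/example/wine#Merlo>, <http://www.ontotext.com/example/wine#CabernetFranc>)"
...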

Accessing an element beyond the chain

The construction ?var -> uri (alternatively, ?var o uri or just ?var uri) is used for accessing additional values that are accessible through the property uri. In essence, this construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this: ?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this: bound(parent(?company) -> <urn:hasGroup>).

The URI parameter can be a full URI within < > or the special string rdf:type (alternatively, just type), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.

Filtering by RDF graph

The construction graph(?var) is used for accessing the RDF graph of a field’s value. The typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/implicit>). The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).

Filtering by language tags

The construction lang(?var) is used for accessing the language tag of a field's value (only RDF literals can have a language tag). The typical use case is to sync only values written in a given language: lang(?a) in ("de", "it", "no"). The construction can be combined with parent() and an element beyond the chain like this: lang(parent(?a) -> <http://www.w3.org/2000/01/rdf-schema#label>) in ("en", "bg"). Literal values without language tags can be filtered by using an empty tag: "".

Entity filters and default values

Entity filters can be combined with default values in order to get more flexible behavior.

A typical use-case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.

Basic entity filter example

Given the following RDF data:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example#> .

# the entity below will be synchronized because it has a matching value for city: ?city in ("London")
:alpha
    rdf:type :gadget ;
    :name "John Synced" ;
    :city "London" .

# the entity below will not be synchronized because it lacks the property completely: bound(?city)
:beta
    rdf:type :gadget ;
    :name "Peter Syncfree" .

# the entity below will not be synchronized because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
:gamma
    rdf:type :gadget ;
    :name "Mary Syncless" ;
    :city "Liverpool" .

If you create a connector instance such as:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
  inst:my_index :createConnector '''
    {
      "solrUrl": "http://localhost:8983/solr",
      "types": ["http://www.ontotext.com/example#gadget"],
      "fields": [
         {
           "fieldName": "name",
           "propertyChain": ["http://www.ontotext.com/example#name"]
         },
         {
           "fieldName": "city",
           "propertyChain": ["http://www.ontotext.com/example#city"]
         }
      ],
      "entityFilter":"bound(?city) && ?city in (\\"London\\")"
    }
  ''' .
}

The entity :beta is not synchronized as it has no value for city.

To handle such cases, you can modify the connector configuration to specify a default value for city:

...
         {
           "fieldName": "city",
           "propertyChain": ["http://www.ontotext.com/example#city"],
           "defaultValue": "London"
         }
...
}

The default value is used for the entity :beta as it has no value for city in the repository. As the value is “London”, the entity is synchronized.

Advanced entity filter example

Sometimes, data represented in RDF is not well suited to map directly to non-RDF. For example, if you have news articles and they can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example2#> .

:Berlin
    rdf:type :Location ;
    rdfs:label "Berlin" .

:Mozart
    rdf:type :Person ;
    rdfs:label "Wolfgang Amadeus Mozart" .

:Einstein
    rdf:type :Person ;
    rdfs:label "Albert Einstein" .

:Cannes-FF
    rdf:type :Event ;
    rdfs:label "Cannes Film Festival" .

:Article1
    rdf:type :Article ;
    rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
    :taggedWith :Berlin ;
    :taggedWith :Einstein ;
    :taggedWith :Cannes-FF .

:Article2
    rdf:type :Article ;
    rdfs:comment "An article about Berlin." ;
    :taggedWith :Berlin .

:Article3
    rdf:type :Article ;
    rdfs:comment "An article about Mozart's life." ;
    :taggedWith :Mozart .

:Article4
    rdf:type :Article ;
    rdfs:comment "An article about classical music in Berlin." ;
    :taggedWith :Berlin ;
    :taggedWith :Mozart .

:Article5
    rdf:type :Article ;
    rdfs:comment "A boring article that has no tags." .

:Article6
    rdf:type :Article ;
    rdfs:comment "An article about the Cannes Film Festival in 2013." ;
    :taggedWith :Cannes-FF .

Now, if you map this data to Solr so that the property :taggedWith x is mapped to separate fields taggedWithPerson and taggedWithLocation according to the type of x (we are not interested in events), you can map taggedWith twice to different fields and then use an entity filter to get the desired values:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
  inst:my_index :createConnector '''
    {
      "solrUrl": "http://localhost:8983/solr",
      "types": ["http://www.ontotext.com/example2#Article"],
      "fields": [
         {
            "fieldName": "comment",
            "propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
         },
         {
           "fieldName": "taggedWithPerson",
           "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
         },
         {
           "fieldName": "taggedWithLocation",
           "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
         }
      ],
      "entityFilter": "?taggedWithPerson type in (<http://www.ontotext.com/example2#Person>)
                        && ?taggedWithLocation type in (<http://www.ontotext.com/example2#Location>)"
    }
  ''' .
}

Note

type is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.

The six articles in the RDF data above will be mapped as follows:

:Article1
    taggedWithPerson: :Einstein
    taggedWithLocation: :Berlin
    Explanation: :taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter.

:Article2
    taggedWithPerson: (none)
    taggedWithLocation: :Berlin
    Explanation: :taggedWith has the value :Berlin. After the filter is applied, only taggedWithLocation is populated.

:Article3
    taggedWithPerson: :Mozart
    taggedWithLocation: (none)
    Explanation: :taggedWith has the value :Mozart. After the filter is applied, only taggedWithPerson is populated.

:Article4
    taggedWithPerson: :Mozart
    taggedWithLocation: :Berlin
    Explanation: :taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields.

:Article5
    taggedWithPerson: (none)
    taggedWithLocation: (none)
    Explanation: :taggedWith has no values. The filter is not relevant.

:Article6
    taggedWithPerson: (none)
    taggedWithLocation: (none)
    Explanation: :taggedWith has the value :Cannes-FF. The filter removes it as it does not match.

This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

SELECT ?facetName ?facetValue ?facetCount {
  ?search a inst:my_index ;
      :facetFields "taggedWithLocation,taggedWithPerson" ;
      :facets _:f .
  _:f :facetName ?facetName ;
      :facetValue ?facetValue ;
      :facetCount ?facetCount .
}

If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:

?facetName           ?facetValue                                 ?facetCount
taggedWithLocation   http://www.ontotext.com/example2#Berlin     3
taggedWithPerson     http://www.ontotext.com/example2#Mozart     2
taggedWithPerson     http://www.ontotext.com/example2#Einstein   1

Overview of connector predicates

The following diagram shows a summary of all predicates that can administer (create, drop, check status) connector instances or issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance, and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labeled arrows. The diagram is given below as PlantUML source.

scale 0.85
left to right direction

skinparam activity {
  BackgroundColor<<BNode>> #D1E0FF
  BackgroundColor<<Var>> #D1FFD1
  BackgroundColor<<URI>> #FFCC80
  BackgroundColor #FFE3E3
}

partition "Instance level" {
  "instance URI" <<URI>> -->[:createConnector] "JSON params"
  "instance URI" -->[:dropConnector] "dummy value"
  "instance URI" -->[:repairConnector] "dummy value"
  "instance URI" -->[:connectorStatus] "?status" <<Var>>
  "_:search" <<BNode>> -->[rdf:type] "instance URI"
}

partition "Search level: query and options" {
  "_:search" -->[:query] "query value"
  "_:search" -->[:limit] "limit value"
  "_:search" -->[:offset] "offset value"
  "_:search" -->[:orderBy] "order by expression"
  "_:search" -->[:facetFields] "field name list"
  "_:search" -->[:snippetSize] "snippet size value"
  "_:search" -->[:snippetSpanOpen] "string"
  "_:search" -->[:snippetSpanClose] "string"
}

partition "Search level: results"
  "_:search" -->[:entities] "?entity" <<Var>>
  "_:search" -->[:totalHits] "?totalHits" <<Var>>
  "_:search" -->[:facets] "_:facet" <<BNode>>
  "_:search" -->[:aggregations] "_:aggregation" <<BNode>>
}

partition "Entity level" {
  "?entity" -->[:score] "?score" <<Var>>
  "?entity" -->[:snippets] "_:snippet" <<BNode>>
}

partition "Snippet level" {
  "_:snippet" -->[:snippetField] "?snippetField" <<Var>>
  "_:snippet" -->[:snippetText] "?snippetText" <<Var>>
}

partition "Facet level" {
  "_:facet" -->[:facetName] "?facetName" <<Var>>
  "_:facet" -->[:facetValue] "?facetValue" <<Var>>
  "_:facet" -->[:facetCount] "?facetCount" <<Var>>
}

partition "Aggregation level" {
  "_:aggregation" -->[:name] "?aggrName" <<Var>>
  "_:aggregation" -->[:key] "?aggrKey" <<Var>>
  "_:aggregation" -->[:count] "?aggrCount" <<Var>>
  "_:aggregation" -->[:from] "?aggrFrom" <<Var>>
  "_:aggregation" -->[:to] "?aggrTo" <<Var>>
  "_:aggregation" -->[:rangeGap] "?aggrGap" <<Var>>
  "_:aggregation" -->[:beforeCount] "?aggrBefore" <<Var>>
  "_:aggregation" -->[:afterCount] "?aggrAfter" <<Var>>
  "_:aggregation" -->[:betweenCount] "?aggrBetween" <<Var>>
  "_:aggregation" -->[:parent] "?aggrParent" <<Var>>
  "_:aggregation" -->[:level] "?aggrLevel" <<Var>>
  "_:aggregation" -->[:levelName] "?aggrLevelName" <<Var>>
}

SolrCloud support

From GraphDB 8.0/Connectors 6.0, the Solr connector has SolrCloud support. SolrCloud is the distributed version of Solr, which offers index sharding, better scaling, fault tolerance, etc. It uses Apache Zookeeper for distributed synchronization and central configuration of the Solr nodes. In SolrCloud, the indexes are called collections, which are the sharded equivalent of cores.

Zookeeper instances

Creating a SolrCloud connector is the same as creating a Solr connector, the only difference being the syntax of the solrUrl parameter:

"solrUrl":"zk://localhost:2181|numShards=2|replicationFactor=2|maxShardsPerNode=3"

zk://localhost:2181 is the host and port of the started Zookeeper instance and the rest are the parameters for creating the SolrCloud collection, delimited with pipes. The supported cluster parameters are:

  • numShards

  • replicationFactor

  • maxShardsPerNode

  • autoAddReplicas

  • router.name

  • router.field

  • shards

Note

numShards and replicationFactor are mandatory parameters. maxShardsPerNode is set to the numShards value when absent.

For more information on how to use these options, check the SolrCloud Collection API documentation.

You can also have multiple Zookeeper instances orchestrating the Solr nodes. They all have to be listed in the connection string:

"solrUrl":"zk://localhost:2181,zk://localhost:2182|numShards=2|replicationFactor=2|maxShardsPerNode=3"

Note

The Zookeeper instances must be running on the hosts specified in the solrUrl parameter.

For more information on how to set up a SolrCloud cluster, refer to the SolrCloud documentation.

SolrCloud collection configsets

Unlike the standard Solr cores, where each core has a /conf directory containing all of its configurations, SolrCloud collections decouple the configuration from the data. The configurations are called configsets and they reside in the Zookeeper instances. Before you can create a new collection, you have to upload all your default or custom configurations to Zookeeper under specific names.

Note

Check Command Line Utilities and ConfigSets API from SolrCloud documentation on how to upload configsets.

When creating a SolrCloud connector, you have to specify the configset name in the copyConfigsFrom parameter. If you do not specify it, the connector searches for the default configset name, which is collection1. As a good practice, upload your default configuration under the name collection1; then, when you create a new connector with the default index configuration, you will not have to specify this parameter. For other custom configsets, set the parameter to the name of the custom configset, e.g., customConfigset.

Example: Create SolrCloud connector query using a custom configset

PREFIX : <http://www.ontotext.com/connectors/solr#>
PREFIX inst: <http://www.ontotext.com/connectors/solr/instance#>

INSERT DATA {
    inst:my_collection :createConnector '''
{
  "solrUrl": "zk://localhost:2181|numShards=2|replicationFactor=2|maxShardsPerNode=3",
  "copyConfigsFrom": "customConfigset"
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ]
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ],
      "multivalued": false
    },
    {
      "fieldName": "year",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasYear"
      ]
    }
  ]
}
''' .
}

Caveats

Order of control

Even though SPARQL per se is not sensitive to the order of triple patterns, the Solr GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.

The diagram in Overview of connector predicates provides a quick overview of the predicates.

Upgrading from previous versions

Migrating from GraphDB 6.2 to 6.6

There are no new connector options in GraphDB 7.

The Solr Connector in GraphDB 6.2 to 6.6 uses Solr 4.x, and the Solr Connector in GraphDB 7 uses Solr 5.x. Connector instances created with GraphDB 6.2 to 6.6 are compatible with GraphDB 7, but since this is a major version change for Solr, we recommend dropping and recreating all connector instances.

Migrating from a pre-6.2 version

GraphDB prior to 6.2 shipped with version 3.x of the Solr GraphDB Connector, which had different options and slightly different behavior and internals. Unfortunately, it is not possible to migrate existing connector instances automatically. To prevent any data loss, the Solr GraphDB Connector will not initialize if it detects an existing connector in the old format. The recommended way to migrate your existing instances is:

  1. Backup the INSERT statement used to create the connector instance.

  2. Drop the connector.

  3. Deploy the new GraphDB version.

  4. Modify the INSERT statement according to the changes described below.

  5. Re-create the connector instance with the modified INSERT statement.

You might also need to change your queries to reflect any changes in field names or extra fields.

Changes in field configuration and synchronization

Prior to 6.2, a single field in the config could produce up to three individual fields on the Solr side, based on the field options. For example, for the field "firstName":

field              note
firstName          produced if the option "index" was true; used explicitly in queries
_facet_firstName   produced if the option "facet" was true; used implicitly for facet search
_sort_firstName    produced if the option "sort" was true; used implicitly for ordering connector results

The current version always produces a single Solr field per field definition in the configuration. This means that you have to create all appropriate fields based on your needs. See more in List of creation parameters.

Tip

To mimic the functionality of the old _facet_fieldName fields, you can either create a non-analysed copy field (see Copy fields) for textual fields, or just use the normal field for non-textual fields.

Tip

To mimic the functionality of the old _sort_fieldName fields, you can either create a non-analysed copy field (see Copy fields) for textual fields, or just use the normal field for non-textual fields. A combined sketch is shown below.

Solr imposes an additional requirement that sort fields have to be non-multivalued.
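
A minimal sketch of that approach follows, assuming an old configuration with a single field "firstName"; the field names and property URI are illustrative only. The copies use the @otherFieldName syntax from Copy fields, with "analyzed": false for the facet copy and additionally "multivalued": false for the sort copy:

...
  "fields": [
    {
      "fieldName": "firstName",
      "propertyChain": [
        "http://www.ontotext.com/example#firstName"
      ]
    },
    {
      "fieldName": "firstNameFacet",
      "propertyChain": [
        "@firstName"
      ],
      "analyzed": false
    },
    {
      "fieldName": "firstNameSort",
      "propertyChain": [
        "@firstName"
      ],
      "analyzed": false,
      "multivalued": false
    }
  ]
...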

The option manageExternalIndex

Prior to 6.2, the option manageExternalIndex could be used to control the management of both the schema and the core. In the current implementation, there are separate options, manageSchema and manageCore. For more information, see Schema and core management.