GraphDB Free 7.1
Lucene GraphDB connector
Overview and features
The GraphDB Connectors provide extremely fast normal and faceted (aggregation) searches of the kind typically implemented by an external component or service such as Lucene, with the additional benefit of staying automatically up-to-date with the GraphDB repository data.
The Connectors provide synchronisation at the entity level, where an entity is defined as having a unique identifier (a URI) and a set of properties and property values. In terms of RDF, this corresponds to a set of triples that have the same subject. In addition to simple properties (defined by a single triple), the Connectors support property chains. A property chain is defined as a sequence of triples where each triple’s object is the subject of the following triple.
The main features of the GraphDB Connectors are:
- maintaining an index that is always in sync with the data stored in GraphDB;
- multiple independent instances per repository;
- the entities for synchronisation are defined by:
  - a list of fields (on the Lucene side) and property chains (on the GraphDB side) whose values will be synchronised;
  - a list of rdf:types of the entities for synchronisation;
  - a list of languages for synchronisation (the default is all languages);
  - additional filtering by property and value;
- full-text search using native Lucene queries;
- snippet extraction: highlighting of search terms in the search result;
- faceted search;
- sorting by any preconfigured field;
- paging of results using offset and limit;
- custom mapping of RDF types to Lucene types;
- specifying which Lucene analyzer to use (the default is Lucene’s StandardAnalyzer);
- stripping HTML/XML tags in literals (the default is not to strip markup);
- boosting an entity by the numeric value of one or more predicates;
- custom scoring expressions at query time to evaluate score based on Lucene score and entity boost.
Each feature is described in detail below.
Usage
All interactions with the Lucene GraphDB Connector are done through SPARQL queries.
There are three types of SPARQL queries:
- INSERT for creating and deleting connector instances;
- SELECT for listing connector instances and querying their configuration parameters;
- INSERT/SELECT for storing and querying data as part of the normal GraphDB data workflow.
In general, INSERT adds or modifies data and SELECT queries existing data.
Each connector implementation defines its own URI prefix to distinguish it from other connectors. For the Lucene GraphDB Connector, this is http://www.ontotext.com/connectors/lucene#. Each command or predicate executed by the connector uses this prefix, e.g., http://www.ontotext.com/connectors/lucene#createConnector to create a connector instance for Lucene.
Individual instances of a connector are distinguished by unique names that are also URIs. They have their own prefix to avoid clashing with any of the command predicates. For Lucene, the instance prefix is http://www.ontotext.com/connectors/lucene/instance#.
- Sample data
All examples use the following sample data, which describes five fictitious wines: Yoyowine, Franvino, Noirette, Blanquito and Rozova as well as the grape varieties required to make these wines. The minimum required ruleset level in GraphDB is RDFS.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix : <http://www.ontotext.com/example/wine#> .

:RedWine rdfs:subClassOf :Wine .
:WhiteWine rdfs:subClassOf :Wine .
:RoseWine rdfs:subClassOf :Wine .

:Merlo
    rdf:type :Grape ;
    rdfs:label "Merlo" .

:CabernetSauvignon
    rdf:type :Grape ;
    rdfs:label "Cabernet Sauvignon" .

:CabernetFranc
    rdf:type :Grape ;
    rdfs:label "Cabernet Franc" .

:PinotNoir
    rdf:type :Grape ;
    rdfs:label "Pinot Noir" .

:Chardonnay
    rdf:type :Grape ;
    rdfs:label "Chardonnay" .

:Yoyowine
    rdf:type :RedWine ;
    :madeFromGrape :CabernetSauvignon ;
    :hasSugar "dry" ;
    :hasYear "2013"^^xsd:integer .

:Franvino
    rdf:type :RedWine ;
    :madeFromGrape :Merlo ;
    :madeFromGrape :CabernetFranc ;
    :hasSugar "dry" ;
    :hasYear "2012"^^xsd:integer .

:Noirette
    rdf:type :RedWine ;
    :madeFromGrape :PinotNoir ;
    :hasSugar "medium" ;
    :hasYear "2012"^^xsd:integer .

:Blanquito
    rdf:type :WhiteWine ;
    :madeFromGrape :Chardonnay ;
    :hasSugar "dry" ;
    :hasYear "2012"^^xsd:integer .

:Rozova
    rdf:type :RoseWine ;
    :madeFromGrape :PinotNoir ;
    :hasSugar "medium" ;
    :hasYear "2013"^^xsd:integer .
Setup and maintenance
- Third-party component versions
- This version of the Lucene GraphDB Connector uses Lucene version 5.5.0.
Creating a connector instance
Creating a connector instance is done by sending a SPARQL query with the following configuration data:
- the name of the connector instance (e.g., my_index);
- classes to synchronise;
- properties to synchronise.
The configuration data has to be provided as a JSON string representation and passed together with the create command.
Tip
Use the GraphDB Connectors management interface provided by the GraphDB Workbench as it lets you create the configuration easily, and then create the connector instance directly or copy the configuration and execute it elsewhere.
The create command is triggered by a SPARQL INSERT with the createConnector predicate. For example, the following creates a connector instance called my_index, which synchronises the wines from the sample data above:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
inst:my_index :createConnector '''
{
"types": [
"http://www.ontotext.com/example/wine#Wine"
],
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
]
},
{
"fieldName": "sugar",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasSugar"
],
"multivalued": false
},
{
"fieldName": "year",
"propertyChain": [
"http://www.ontotext.com/example/wine#hasYear"
]
}
]
}
''' .
}
The above command creates a new Lucene connector instance.
The "types" key defines the RDF type of the entities to synchronise and, in the example, it is only entities of the type http://www.ontotext.com/example/wine#Wine (and its subtypes). The "fields" key defines the mapping from RDF to Lucene. The basic building block is the property chain, i.e., a sequence of RDF properties where the object of each property is the subject of the following property. In the example, three bits of information are mapped: the grape the wines are made of, sugar content, and year. Each chain is assigned a short and convenient field name: “grape”, “sugar”, and “year”. The field names are later used in the queries.
Grape is an example of a property chain composed of more than one property. First, we take the wine’s madeFromGrape property, the object of which is an instance of the type Grape, and then we take the rdfs:label of this instance. Sugar and year are both composed of a single property that links the value directly to the wine.
Dropping a connector instance
Dropping a connector instance removes all references to its external store from GraphDB as well as all Lucene files associated with it.
The drop command is triggered by a SPARQL INSERT with the dropConnector predicate where the name of the connector instance has to be in the subject position, e.g., this removes the connector my_index:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
inst:my_index :dropConnector "" .
}
Listing available connector instances
Listing connector instances returns all previously created instances. It is a SELECT query with the listConnectors predicate:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
SELECT ?cntUri ?cntStr {
?cntUri :listConnectors ?cntStr .
}
?cntUri is bound to the prefixed URI of the connector instance that was used during creation, e.g., <http://www.ontotext.com/connectors/lucene/instance#my_index>, while ?cntStr is bound to a string representing the part after the prefix, e.g., "my_index".
Instance status check
The internal state of each connector instance can be queried using a SELECT query and the connectorStatus predicate:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
SELECT ?cntUri ?cntStatus {
?cntUri :connectorStatus ?cntStatus .
}
?cntUri is bound to the prefixed URI of the connector instance, while ?cntStatus is bound to a string representation of the status of the connector represented by this URI. The status is key-value based.
Working with data
Adding, updating and deleting data
From the user's point of view, all synchronisation happens transparently without using any additional predicates or naming a specific store explicitly, i.e., you simply execute standard SPARQL INSERT/DELETE queries. This is achieved by intercepting all changes in the plugin and determining which abstract documents need to be updated.
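As a minimal sketch of this workflow, the update below adds one more wine using nothing but standard SPARQL. The wine :Tintoretto is hypothetical and not part of the sample data; the update assumes the my_index connector instance and the RDFS ruleset described earlier, so that the new entity is inferred to be a :Wine and picked up by the connector.

```sparql
PREFIX wine: <http://www.ontotext.com/example/wine#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# No connector predicates are needed: this is an ordinary update.
# wine:Tintoretto is a hypothetical entity used only for illustration.
INSERT DATA {
    wine:Tintoretto a wine:RedWine ;
        wine:madeFromGrape wine:CabernetFranc ;
        wine:hasSugar "dry" ;
        wine:hasYear "2014"^^xsd:integer .
}
```

Removing the same triples with a standard SPARQL DELETE would, by the same mechanism, remove the corresponding document from the index.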
Simple queries
Once a connector instance has been created, it is possible to query data from it through SPARQL. For each matching abstract document, the connector instance returns the document subject. In its simplest form, querying is achieved by using a SELECT and providing the Lucene query as the object of the query predicate:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
}
The result binds ?entity to the two wines made from grapes that have “cabernet” in their name, namely :Yoyowine and :Franvino.
Note
You must use the field names you chose when you created the connector instance. They can be identical to the property URIs but you must escape any special characters according to what Lucene expects.
- Get a query instance of the requested connector instance by using the RDF notation "X a Y" (= X rdf:type Y), where X is a variable and Y is a connector instance URI. X is bound to a query instance of the connector instance.
- Assign a query to the query instance by using the system predicate :query.
- Request the matching entities through the :entities predicate.
It is also possible to provide per query search options by using one or more option predicates. The option predicates are described in detail below.
Combining Lucene results with GraphDB data
The bound ?entity can be used in other SPARQL triples in order to build complex queries that fetch additional data from GraphDB, for example, to see the actual grapes in the matching wines as well as the year they were made:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>
SELECT ?entity ?grape ?year {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
?entity wine:madeFromGrape ?grape .
?entity wine:hasYear ?year
}
The result looks like this:
?entity | ?grape | ?year |
---|---|---|
:Yoyowine | :CabernetSauvignon | 2013 |
:Franvino | :Merlo | 2012 |
:Franvino | :CabernetFranc | 2012 |
Note
:Franvino is returned twice because it is made from two different grapes, both of which are returned.
Entity match score
It is possible to access the match score returned by Lucene with the score predicate. As each entity has its own score, the predicate should come at the entity level. For example:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity ?score {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
?entity :score ?score
}
The result looks like this but the actual score might be different as it depends on the specific Lucene version:
?entity | ?score |
---|---|
:Yoyowine | 0.9442660212516785 |
:Franvino | 0.7554128170013428 |
Basic facet queries
Consider the sample wine data and the my_index connector instance described previously. You can also query facets using the same instance:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?facetName ?facetValue ?facetCount WHERE {
# note empty query is allowed and will just match all documents, hence no :query
?r a inst:my_index ;
:facetFields "year,sugar" ;
:facets _:f .
_:f :facetName ?facetName .
_:f :facetValue ?facetValue .
_:f :facetCount ?facetCount .
}
It is important to specify the facet fields by using the facetFields predicate. Its value is a simple comma-delimited list of field names. In order to get the faceted results, use the facets predicate. As each facet has three components (name, value and count), the facets predicate binds a blank node, which in turn can be used to access the individual values for each component through the predicates facetName, facetValue, and facetCount.
The resulting bindings look like the following:
facetName | facetValue | facetCount |
---|---|---|
year | 2012 | 3 |
year | 2013 | 2 |
sugar | dry | 3 |
sugar | medium | 2 |
You can easily see that there are three wines produced in 2012 and two in 2013. You also see that three of the wines are dry, while two are medium. However, it is not necessarily true that the three wines produced in 2012 are the same as the three dry wines as each facet is computed independently.
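The comment in the facet query above notes that :query is optional, which suggests that facets can also be computed over a subset of documents. The sketch below assumes that adding :query to the same query instance restricts the facet counts to the matching documents; this behaviour is an assumption and should be verified against your GraphDB version.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetName ?facetValue ?facetCount WHERE {
    # Assumption: the Lucene query narrows the documents the facets are counted over.
    ?r a inst:my_index ;
        :query "sugar:dry" ;
        :facetFields "year" ;
        :facets _:f .
    _:f :facetName ?facetName .
    _:f :facetValue ?facetValue .
    _:f :facetCount ?facetCount .
}
```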
Sorting
It is possible to sort the entities returned by a connector query according to one or more fields. Sorting is achieved by the orderBy predicate, the value of which is a comma-delimited list of fields. Each field can be prefixed with a minus to indicate sorting in descending order. For example:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query "year:2013" ;
:orderBy "-sugar" ;
:entities ?entity .
}
The result contains wines produced in 2013 sorted according to their sugar content in descending order:
entity |
---|
Rozova |
Yoyowine |
By default, entities are sorted according to their matching score in descending order.
Note
If you join the entity from the connector query to other triples stored in GraphDB, GraphDB might scramble the order. To remedy this, use ORDER BY from SPARQL.
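A minimal sketch of that remedy, assuming the sample wine data and the my_index instance: the connector result is joined to the wine:hasSugar triple, and an explicit SPARQL ORDER BY restores the intended descending order.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
PREFIX wine: <http://www.ontotext.com/example/wine#>

SELECT ?entity ?sugar {
    ?search a inst:my_index ;
        :query "year:2013" ;
        :orderBy "-sugar" ;
        :entities ?entity .
    # The join below may change the order produced by the connector...
    ?entity wine:hasSugar ?sugar .
}
# ...so the order is stated again on the SPARQL side.
ORDER BY DESC(?sugar)
```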
Tip
Sorting by an analysed textual field works but might produce unexpected results. Analysed textual fields are composed of tokens and sorting uses the least (in the lexicographical sense) token. For example, “North America” will be sorted before “Europe” because the token “america” is lexicographically smaller than the token “europe”. If you need to sort by a textual field and still do full-text search on it, it is best to create a copy of the field with the setting "analyzed":false. For more information, see Copy fields.
Note
Unlike Lucene 4, which was used in GraphDB 6.x, Lucene 5 imposes an additional requirement on fields used for sorting. They must be defined with multivalued = false.
Limit and offset
Limit and offset are supported on the Lucene side of the query. This is achieved through the predicates limit and offset. Consider this example in which an offset of 1 and a limit of 1 are specified:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity {
?search a inst:my_index ;
:query "sugar:dry" ;
:offset "1" ;
:limit "1" ;
:entities ?entity .
}
The result contains a single wine, Franvino. If you execute the query without the limit and offset, Franvino will be second in the list:
entity |
---|
Yoyowine |
Franvino |
Blanquito |
Note
The specific order in which GraphDB returns the results depends on how Lucene returns the matches, unless sorting is specified.
Snippet extraction
Snippet extraction is used for extracting highlighted snippets of text that match the query. The snippets are accessed through the dedicated predicate snippets. It binds a blank node that in turn provides the actual snippets via the predicates snippetField and snippetText.
The predicate snippets must be attached to the entity, as each entity has a different set of snippets. For example, in a search for Cabernet:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?entity ?snippetField ?snippetText {
?search a inst:my_index ;
:query "grape:cabernet" ;
:entities ?entity .
?entity :snippets _:s .
_:s :snippetField ?snippetField ;
:snippetText ?snippetText .
}
the query returns the two wines made from Cabernet Sauvignon or Cabernet Franc grapes as well as the respective matching fields and snippets:
?entity | ?snippetField | ?snippetText |
---|---|---|
:Yoyowine | grape | <em>Cabernet</em> Sauvignon |
:Franvino | grape | <em>Cabernet</em> Franc |
Note
The actual snippets might be different as this depends on the specific Lucene implementation.
It is possible to tweak how the snippets are collected/composed by using the following option predicates:
- :snippetSize - sets the maximum size of the extracted text fragment, 250 by default;
- :snippetSpanOpen - text to insert before the highlighted text, <em> by default;
- :snippetSpanClose - text to insert after the highlighted text, </em> by default.
The option predicates are set on the query instance, much like the :query predicate.
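The following sketch shows how the three options might be combined with the earlier snippet query. The option values are illustrative only, and writing :snippetSize as a string literal mirrors the :limit and :offset examples; the exact literal type expected by the connector is an assumption.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
    ?search a inst:my_index ;
        :query "grape:cabernet" ;
        # Options go on the query instance, next to :query.
        :snippetSize "100" ;          # assumed literal form, see the note above
        :snippetSpanOpen "<b>" ;
        :snippetSpanClose "</b>" ;
        :entities ?entity .
    # The snippets themselves are still read from the entity.
    ?entity :snippets _:s .
    _:s :snippetField ?snippetField ;
        :snippetText ?snippetText .
}
```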
Total hits
You can get the total number of hits by using the totalHits predicate, e.g., for the connector instance my_index and a query that retrieves all wines made in 2012:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?totalHits {
?r a inst:my_index ;
:query "year:2012" ;
:totalHits ?totalHits .
}
As there are three wines made in 2012, the value 3 (of type xsd:long) binds to ?totalHits.
List of creation parameters
The creation parameters define how a connector instance is created by the :createConnector predicate. Some are required and some are optional. All parameters are provided together in a JSON object, where the parameter names are the object keys. Parameter values may be simple JSON values such as a string or a boolean, or they can be lists or objects.
All of the creation parameters can also be set conveniently from the Create Connector user interface in the GraphDB Workbench without any knowledge of JSON.
analyzer (string), optional, specifies Lucene analyser
The Lucene Connector supports custom Analyser implementations. They may be specified via the analyzer parameter whose value must be a fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer. The class requires either a default constructor or a constructor with exactly one parameter of type org.apache.lucene.util.Version. For example, these two classes are valid implementations:
package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;

public class FancyAnalyzer extends Analyzer {
    public FancyAnalyzer() {
        ...
    }
    ...
}

package com.ontotext.example;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.util.Version;

public class SmartAnalyzer extends Analyzer {
    public SmartAnalyzer(Version luceneVersion) {
        ...
    }
    ...
}
FancyAnalyzer and SmartAnalyzer can then be used by specifying their fully qualified names, for example:
...
"analyzer": "com.ontotext.example.SmartAnalyzer",
...
types (list of URI), required, specifies the types of entities to sync
The RDF types of entities to sync are specified as a list of URIs. At least one type URI is required.
languages (list of string), optional, valid languages for literals
RDF data is often multilingual but you can map only some of the languages represented in the literal values. This can be done by specifying a list of language ranges to be matched to the language tags of literals according to RFC 4647, Section 3.3.1. Basic Filtering. In addition, an empty range can be used to include literals that have no language tag. The list of language ranges maps all existing literals that have matching language tags.
fields (list of field object), required, defines the mapping from RDF to Lucene
The fields define exactly what parts of each entity will be synchronised as well as the specific details on the connector side. The field is the smallest synchronisation unit and it maps a property chain from GraphDB to a field in Lucene. The fields are specified as a list of field objects. At least one field object is required. Each field object has further keys that specify details.
fieldName (string), required, the name of the field in Lucene
The name of the field defines the mapping on the connector side. It is specified by the key fieldName with a string value. The field name is used at query time to refer to the field. There are few restrictions on the allowed characters in a field name but to avoid unnecessary escaping (which depends on how Lucene parses its queries), we recommend keeping the field names simple.
propertyChain (list of URI), required, defines the property chain to reach the value
The property chain (propertyChain) defines the mapping on the GraphDB side. A property chain is defined as a sequence of triples where the entity URI is the subject of the first triple, its object is the subject of the next triple and so on. In this model, a property chain with a single element corresponds to a direct property defined by a single triple. Property chains are specified as a list of URIs where at least one URI must be provided.
See Copy fields for defining multiple fields with the same property chain.
See Multiple property chains per field for defining a field whose values are populated from more than one property chain.
defaultValue (string), optional, specifies a default value for the field
The default value (defaultValue) provides a means of specifying a default value for the field when the property chain has no matching values in GraphDB. The default value can be a plain literal, a literal with a datatype (xsd: prefix supported), a literal with language, or a URI. It has no default value.
indexed (boolean), optional, default true
If indexed, a field is available for Lucene queries. true by default.
This option corresponds to Lucene’s field option "indexed".
stored (boolean), optional, default true
Fields can be stored in Lucene and this is controlled by the Boolean option "stored". Stored fields are required for retrieving snippets. true by default.
This option corresponds to Lucene’s property "stored".
analyzed (boolean), optional, default true
When literal fields are indexed in Lucene, they will be analysed according to the analyser settings. Should you require that a given field is not analysed, you may set "analyzed" to false. This option has no effect for URIs (they are never analysed). true by default.
This option corresponds to Lucene’s property “tokenized”.
multivalued (boolean), optional, default true
RDF properties and synchronised fields may have more than one value. If "multivalued" is set to true, all values will be synchronised to Lucene. If set to false, only a single value will be synchronised. true by default.
facet (boolean), optional, default true
Lucene needs to index data in a special way if it will be used for faceted search. This is controlled by the Boolean option "facet". true by default. Fields that are not synchronised for faceting are also not available for faceted search.
datatype (string), optional, the manual datatype override
By default, the Lucene GraphDB Connector uses the datatype of literal values to determine how they must be mapped to Lucene types. For more information on the supported datatypes, see Datatype mapping.
The datatype mapping can be overridden through the parameter "datatype", which can be specified per field. The value of "datatype" can be any of the xsd: types supported by the automatic mapping.
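To show how several of these parameters fit together, here is a hedged sketch of a create command for the sample wine data. The instance name my_index_tuned and the default value "unknown" are hypothetical; the language list keeps English literals plus literals without a language tag (the empty range), which matters here because the sample labels carry no language tag.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# my_index_tuned is a hypothetical instance name used only for illustration.
INSERT DATA {
    inst:my_index_tuned :createConnector '''
{
  "types": [
    "http://www.ontotext.com/example/wine#Wine"
  ],
  "languages": ["en", ""],
  "fields": [
    {
      "fieldName": "grape",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#madeFromGrape",
        "http://www.w3.org/2000/01/rdf-schema#label"
      ],
      "facet": false
    },
    {
      "fieldName": "sugar",
      "propertyChain": [
        "http://www.ontotext.com/example/wine#hasSugar"
      ],
      "analyzed": false,
      "multivalued": false,
      "defaultValue": "unknown"
    }
  ]
}
''' .
}
```

In this sketch "grape" stays analysed for full-text search but is excluded from faceting, while "sugar" is kept as a single non-analysed value, which makes it suitable for sorting.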
Special field definitions
Copy fields
Often, it is convenient to synchronise one and the same data multiple times with different settings to accommodate different use cases, e.g., faceting or sorting vs full-text search. The Lucene GraphDB Connector has explicit support for fields that copy their value from another field. This is achieved by specifying a single element in the property chain of the form @otherFieldName, where otherFieldName is another non-copy field. Take the following example:
...
"fields": [
{
"fieldName": "grape",
"propertyChain": [
"http://www.ontotext.com/example/wine#madeFromGrape",
"http://www.w3.org/2000/01/rdf-schema#label"
],
"analyzed": true
},
{
"fieldName": "grapeFacet",
"propertyChain": [
"@grape"
],
"analyzed": false
}
]
...
The snippet creates an analysed field “grape” and a non-analysed field “grapeFacet”. Both fields are populated with the same values, and “grapeFacet” is defined as a copy field that refers to the field “grape”.
Note
The connector handles copy fields in a more optimal way than specifying a field with exactly the same property chain as another field.
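As a usage sketch, a facet query can then target the non-analysed copy so that whole grape names, rather than individual tokens, are counted. It assumes a connector instance named my_index that was created with the "grape" and "grapeFacet" fields from the snippet above.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?facetValue ?facetCount WHERE {
    # Full-text search runs against the analysed field "grape",
    # while faceting uses the non-analysed copy "grapeFacet".
    ?r a inst:my_index ;
        :query "grape:cabernet" ;
        :facetFields "grapeFacet" ;
        :facets _:f .
    _:f :facetValue ?facetValue .
    _:f :facetCount ?facetCount .
}
```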
Multiple property chains per field
Sometimes, you have to work with data models that define the same concept (in terms of what you want to index in Lucene) with more than one property chain, e.g., the concept of “name” could be defined as a single canonical name, multiple historical names and some unofficial names. If you want to index these together as a single field in Lucene you can define this as a multiple property chains field.
Fields with multiple property chains are defined as a set of separate virtual fields that will be merged into a single physical field when indexed. Virtual fields are distinguished by the suffix /xyz, where xyz is any alphanumeric sequence of convenience. For example, we can define the fields name/1 and name/2 like this:
...
"fields": [
{
"fieldName": "name/1",
"propertyChain": [
"http://www.ontotext.com/example#canonicalName"
]
},
{
"fieldName": "name/2",
"propertyChain": [
"http://www.ontotext.com/example#historicalName"
]
...
},
...
The values of the fields name/1 and name/2 will be merged and synchronised to the field name in Lucene.
Note
You cannot mix suffixed and unsuffixed fields with the same name, e.g., if you defined myField/new and myField/old you cannot have a field called just myField.
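At query time the merged values are addressed through the physical field name. The sketch below assumes a hypothetical connector instance names_index created with the name/1 and name/2 fields shown above; the search term is illustrative.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity {
    # names_index is a hypothetical instance; "name" is the physical field
    # that receives the values of both name/1 and name/2.
    ?search a inst:names_index ;
        :query "name:alpha" ;
        :entities ?entity .
}
```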
Filters and fields with multiple property chains
Filters can be used with fields defined with multiple property chains. Both the physical field values and the individual virtual field values are available:
- Physical fields are specified without the suffix, e.g., ?myField.
- Virtual fields are specified with the suffix, e.g., ?myField/2 or ?myField/alt.
Note
Physical fields cannot be combined with parent() as their values come from different property chains. If you really need to filter the same parent level, you can rewrite parent(?myField) in (<urn:x>, <urn:y>) as parent(?myField/1) in (<urn:x>, <urn:y>) || parent(?myField/2) in (<urn:x>, <urn:y>) || parent(?myField/3) ... and surround it with parentheses if it is a part of a bigger expression.
Datatype mapping
The Lucene GraphDB Connector maps different types of RDF values to different types of Lucene values according to the basic type of the RDF value (URI or literal) and the datatype of literals. The autodetection uses the following mapping:
RDF value | RDF datatype | Lucene type |
---|---|---|
URI | n/a | StringField |
literal | none | TextField |
literal | xsd:boolean | StringField with values “true” and “false” |
literal | xsd:double | DoubleField |
literal | xsd:float | FloatField |
literal | xsd:long | LongField |
literal | xsd:int | IntField |
literal | xsd:dateTime | DateTools.timeToString(), second precision |
literal | xsd:date | DateTools.timeToString(), day precision |
The datatype mapping can be affected by the synchronisation options too, e.g., a non-analysed field that has xsd:long values is indexed with a StringField.
Note
For any given field the automatic mapping uses the first value it sees. This works fine for clean datasets but might lead to problems, if your dataset has non-normalised data, e.g., the first value has no datatype but other values have.
Advanced filtering and fine tuning
entityFilter (string)
The entityFilter parameter is used to fine-tune the set of entities and/or individual values for the configured fields, based on the field value. Entities and field values are synchronised to Lucene if, and only if, they pass the filter. The entity filter is similar to a FILTER() inside a SPARQL query but not exactly the same. Each configured field can be referred to, in the entity filter, by prefixing it with a ?, much like referring to a variable in SPARQL. Several operators are supported:
Operator | Meaning | Example |
---|---|---|
?var in (value1, value2, ...) | Tests if the field var’s value is one of the specified values. Values that do not match are treated as if they were not present in the repository. | ?status in ("active", "new") |
?var not in (value1, value2, ...) | The negated version of the in-operator. | ?status not in ("archived") |
bound(?var) | Tests if the field var has a valid value. This can be used to make the field compulsory. | bound(?name) |
expr1 or expr2 | Logical disjunction of expressions expr1 and expr2. | bound(?name) or bound(?company) |
expr1 && expr2 | Logical conjunction of expressions expr1 and expr2. | bound(?status) && ?status in ("active", "new") |
!expr | Logical negation of expression expr. | !bound(?company) |
( expr ) | Grouping of expressions | (bound(?name) or bound(?company)) && bound(?address) |
Note
- ?var in (...) filters the values of ?var and leaves only the matching values, i.e., it will modify the actual data that will be synchronised to Lucene;
- bound(?var) checks if there is any valid value left after filtering operators such as ?var in (...) have been applied.
In addition to the operators, there are some constructions that can be used to write filters based not on the values but on values related to them:
- Accessing the previous element in the chain
The construction parent(?var) is used for going to a previous level in a property chain. It can be applied recursively as many times as needed, e.g., parent(parent(parent(?var))) goes back in the chain three times. The effective value of parent(?var) can be used with the in or not in operator like this: parent(?company) in (<urn:a>, <urn:b>), or in the bound operator like this: bound(parent(?var)).
- Accessing an element beyond the chain
The construction ?var -> uri (alternatively, ?var o uri or just ?var uri) is used for accessing additional values that are accessible through the property uri. In essence, this construction corresponds to the triple pattern ?value uri ?effectiveValue, where ?value is a value bound by the field var. The effective value of ?var -> uri can be used with the in or not in operator like this: ?company -> rdf:type in (<urn:c>, <urn:d>). It can be combined with parent() like this: parent(?company) -> rdf:type in (<urn:c>, <urn:d>). The same construction can be applied to the bound operator like this: bound(?company -> <urn:hasBranch>), or even combined with parent() like this: bound(parent(?company) -> <urn:hasGroup>).
The URI parameter can be a full URI within < > or the special string rdf:type (alternatively, just type), which will be expanded to http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
- Filtering by RDF graph
The construction graph(?var) is used for accessing the RDF graph of a field’s value. The typical use case is to sync only explicit values: graph(?a) not in (<http://www.ontotext.com/implicit>). The construction can be combined with parent() like this: graph(parent(?a)) in (<urn:a>).
- Entity filters and default values
Entity filters can be combined with default values in order to get more flexible behaviour.
A typical use case for an entity filter is having soft deletes, i.e., instead of deleting an entity, it is marked as deleted by the presence of a specific value for a given property.
Basic entity filter example¶
Given the following RDF data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example#> .
# the entity below will be synchronised because it has a matching value for city: ?city in ("London")
:alpha
    rdf:type :gadget ;
    :name "John Synced" ;
    :city "London" .
# the entity below will not be synchronised because it lacks the property completely: bound(?city)
:beta
    rdf:type :gadget ;
    :name "Peter Syncfree" .
# the entity below will not be synchronised because it has a different city value:
# ?city in ("London") will remove the value "Liverpool" so bound(?city) will be false
:gamma
    rdf:type :gadget ;
    :name "Mary Syncless" ;
    :city "Liverpool" .
If you create a connector instance such as:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
    inst:my_index :createConnector '''
{
    "types": ["http://www.ontotext.com/example#gadget"],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": ["http://www.ontotext.com/example#name"]
        },
        {
            "fieldName": "city",
            "propertyChain": ["http://www.ontotext.com/example#city"]
        }
    ],
    "entityFilter":"bound(?city) && ?city in (\\"London\\")"
}
''' .
}
The entity :beta is not synchronised as it has no value for city.
To handle such cases, you can modify the connector configuration to specify a default value for city:
...
        {
            "fieldName": "city",
            "propertyChain": ["http://www.ontotext.com/example#city"],
            "defaultValue": "London"
        }
...
}
The default value is used for the entity :beta as it has no value for city in the repository. As the value is “London”, the entity is synchronised.
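For reference, the complete connector definition with the default value in place might look like the sketch below. It only merges the fragment above into the earlier definition. The instance name inst:my_index_default is hypothetical and is used here on the assumption that the original inst:my_index still exists; alternatively, the original instance can be dropped and recreated under the same name.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# Same definition as in the basic example, plus "defaultValue" on the city field.
# inst:my_index_default is a hypothetical instance name.
INSERT DATA {
    inst:my_index_default :createConnector '''
{
    "types": ["http://www.ontotext.com/example#gadget"],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": ["http://www.ontotext.com/example#name"]
        },
        {
            "fieldName": "city",
            "propertyChain": ["http://www.ontotext.com/example#city"],
            "defaultValue": "London"
        }
    ],
    "entityFilter":"bound(?city) && ?city in (\\"London\\")"
}
''' .
}
```

With this definition, :alpha and :beta pass the filter, while :gamma is still excluded because it has an explicit city value other than "London".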
Advanced entity filter example¶
Sometimes, data represented in RDF is not well suited to being mapped directly to a non-RDF model. For example, if you have news articles that can be tagged with different concepts (locations, persons, events, etc.), one possible way to model this is a single property :taggedWith. Consider the following RDF data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix : <http://www.ontotext.com/example2#> .
:Berlin
rdf:type :Location ;
rdfs:label "Berlin" .
:Mozart
rdf:type :Person ;
rdfs:label "Wolfgang Amadeus Mozart" .
:Einstein
rdf:type :Person ;
rdfs:label "Albert Einstein" .
:Cannes-FF
rdf:type :Event ;
rdfs:label "Cannes Film Festival" .
:Article1
rdf:type :Article ;
rdfs:comment "An article about a film about Einstein's life while he was a professor in Berlin." ;
:taggedWith :Berlin ;
:taggedWith :Einstein ;
:taggedWith :Cannes-FF .
:Article2
rdf:type :Article ;
rdfs:comment "An article about Berlin." ;
:taggedWith :Berlin .
:Article3
rdf:type :Article ;
rdfs:comment "An article about Mozart's life." ;
:taggedWith :Mozart .
:Article4
rdf:type :Article ;
rdfs:comment "An article about classical music in Berlin." ;
:taggedWith :Berlin ;
:taggedWith :Mozart .
:Article5
rdf:type :Article ;
rdfs:comment "A boring article that has no tags." .
:Article6
rdf:type :Article ;
rdfs:comment "An article about the Cannes Film Festival in 2013." ;
:taggedWith :Cannes-FF .
Now, suppose you want to map this data to Lucene so that the property :taggedWith x is mapped to separate fields taggedWithPerson and taggedWithLocation according to the type of x (we are not interested in events). You can map taggedWith twice to different fields and then use an entity filter to get the desired values:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
INSERT DATA {
    inst:my_index :createConnector '''
{
    "types": ["http://www.ontotext.com/example2#Article"],
    "fields": [
        {
            "fieldName": "comment",
            "propertyChain": ["http://www.w3.org/2000/01/rdf-schema#comment"]
        },
        {
            "fieldName": "taggedWithPerson",
            "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
        },
        {
            "fieldName": "taggedWithLocation",
            "propertyChain": ["http://www.ontotext.com/example2#taggedWith"]
        }
    ],
    "entityFilter": "?taggedWithPerson type in (<http://www.ontotext.com/example2#Person>) && ?taggedWithLocation type in (<http://www.ontotext.com/example2#Location>)"
}
''' .
}
Note
type is the short way to write <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
The six articles in the RDF data above will be mapped as follows:
Article URI | Value in taggedWithPerson | Value in taggedWithLocation | Explanation |
---|---|---|---|
:Article1 | :Einstein | :Berlin | :taggedWith has the values :Einstein, :Berlin and :Cannes-FF. The filter leaves only the correct values in the respective fields. The value :Cannes-FF is ignored as it does not match the filter. |
:Article2 | | :Berlin | :taggedWith has the value :Berlin. After the filter is applied, only taggedWithLocation is populated. |
:Article3 | :Mozart | | :taggedWith has the value :Mozart. After the filter is applied, only taggedWithPerson is populated. |
:Article4 | :Mozart | :Berlin | :taggedWith has the values :Berlin and :Mozart. The filter leaves only the correct values in the respective fields. |
:Article5 | | | :taggedWith has no values. The filter is not relevant. |
:Article6 | | | :taggedWith has the value :Cannes-FF. The filter removes it as it does not match. |
This can be checked by issuing a faceted search for taggedWithLocation and taggedWithPerson:
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>
SELECT ?facetName ?facetValue ?facetCount {
    ?search a inst:my_index ;
        :facetFields "taggedWithLocation,taggedWithPerson" ;
        :facets _:f .
    _:f :facetName ?facetName ;
        :facetValue ?facetValue ;
        :facetCount ?facetCount .
}
If the filter was applied, you should get only :Berlin for taggedWithLocation and only :Einstein and :Mozart for taggedWithPerson:
?facetName | ?facetValue | ?facetCount |
---|---|---|
taggedWithLocation | http://www.ontotext.com/example2#Berlin | 3 |
taggedWithPerson | http://www.ontotext.com/example2#Mozart | 2 |
taggedWithPerson | http://www.ontotext.com/example2#Einstein | 1 |
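The facet counts can be cross-checked against the repository itself, without going through the connector. The following plain SPARQL query is a sketch that counts, for each tag of type :Person or :Location, the articles tagged with it in the sample data above. It uses only standard SPARQL 1.1 aggregates and the example2 namespace from the sample data; the variable names are arbitrary.

```sparql
PREFIX : <http://www.ontotext.com/example2#>

SELECT ?type ?tag (COUNT(?article) AS ?articleCount) {
    ?article a :Article ;
        :taggedWith ?tag .
    ?tag a ?type .
    # Keep only the tag types that the entity filter lets through;
    # events such as :Cannes-FF are excluded.
    FILTER (?type IN (:Person, :Location))
}
GROUP BY ?type ?tag
ORDER BY ?type ?tag
```

For the sample data, this returns 3 for :Berlin, 2 for :Mozart and 1 for :Einstein, which matches the facet counts reported by the connector.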
Overview of connector predicates¶
The following diagram shows a summary of all predicates that can be used to administer (create, drop, check status) connector instances or to issue queries and retrieve results. It can be used as a quick reference of what a particular predicate needs to be attached to. For example, to retrieve entities, you need to use :entities on a search instance and to retrieve snippets, you need to use :snippets on an entity. Variables that are bound as a result of a query are shown in green, blank helper nodes are shown in blue, literals in red, and URIs in orange. The predicates are represented by labelled arrows.
Caveats¶
Order of control¶
Even though SPARQL per se is not sensitive to the order of triple patterns, the Lucene GraphDB Connector expects to receive certain predicates before others so that queries can be executed properly. In particular, predicates that specify the query or query options need to come before any predicates that fetch results.
The diagram in Overview of connector predicates provides a quick overview of the predicates.
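As an illustration of the expected ordering, the sketch below puts the query specification first, then fetches entities, and only then asks for snippets on each entity. The predicates :entities and :snippets are mentioned in the overview above. The predicates :query, :snippetField and :snippetText, as well as the Lucene query string, are assumptions based on the querying part of the connector documentation and are not shown in this section.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

SELECT ?entity ?snippetField ?snippetText {
    # 1. The query and its options come first (:query is assumed here) ...
    ?search a inst:my_index ;
        :query "comment:Berlin" ;
    # 2. ... followed by the predicate that fetches results.
        :entities ?entity .
    # 3. Snippets are attached to an entity, so they come after :entities.
    ?entity :snippets _:s .
    _:s :snippetField ?snippetField ;
        :snippetText ?snippetText .
}
```

Writing the patterns in this order keeps the query valid SPARQL while giving the connector the predicates in the sequence it expects.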
Upgrading from previous versions¶
Migrating from GraphDB 6.2 to 6.6¶
There are no new connector options in GraphDB 7.
The Lucene Connector in GraphDB 6.2 to 6.6 uses Lucene 4.x and the Lucene Connector in GraphDB 7 uses Lucene 5.x. GraphDB 7 can use connector instances created with GraphDB 6.2 to 6.6 with the following exception:
- Fields used for sorting (orderBy predicate) need to be declared with multivalued = false now. If you use orderBy you have to recreate your connector instances.
We recommend dropping any existing instances and recreating them to benefit from any performance improvements in Lucene 5.x, even if you do not use orderBy in your queries.
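A recreated instance with a sortable field might be declared as in the sketch below. The instance name inst:my_sorted_index and the property http://www.ontotext.com/example#price are hypothetical, and the JSON key "multivalued" is assumed to be spelled exactly like the option named above.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# Hypothetical instance with one field intended for sorting.
# The "price" field is declared single-valued so that it can be used with orderBy.
INSERT DATA {
    inst:my_sorted_index :createConnector '''
{
    "types": ["http://www.ontotext.com/example#gadget"],
    "fields": [
        {
            "fieldName": "name",
            "propertyChain": ["http://www.ontotext.com/example#name"]
        },
        {
            "fieldName": "price",
            "propertyChain": ["http://www.ontotext.com/example#price"],
            "multivalued": false
        }
    ]
}
''' .
}
```

Only the fields that are actually used for sorting need this declaration; other fields can keep their existing settings.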
Migrating from a pre-6.2 version¶
GraphDB prior to 6.2 shipped with version 3.x of the Lucene GraphDB Connector that had different options and slightly different behaviour and internals. Unfortunately, it is not possible to migrate existing connector instances automatically. To prevent any data loss, the Lucene GraphDB Connector will not initialise if it detects an existing connector in the old format. The recommended way to migrate your existing instances is:
- Back up the INSERT statement used to create the connector instance.
- Drop the connector.
- Deploy the new GraphDB version.
- Modify the INSERT statement according to the changes described below.
- Re-create the connector instance with the modified INSERT statement.
You might also need to change your queries to reflect any changes in field names or extra fields.
Changes in field configuration and synchronisation¶
Prior to 6.2, a single field in the config could produce up to three individual fields on the Lucene side, based on the field options. For example, for the field “firstName”:
field | note |
---|---|
firstName | produced, if the option “index” was true; used explicitly in queries |
_facet_firstName | produced, if the option “facet” was true; used implicitly for facet search |
_sort_firstName | produced, if the option “sort” was true; used implicitly for ordering connector results |
The current version always produces a single Lucene field per field definition in the configuration. This means that you have to create all appropriate fields based on your needs. See more in List of creation parameters.
Tip
To mimic the functionality of the old _sort_fieldName fields, you can either create a non-analysed copy field (for textual fields; see Copy fields) or just use the normal field (for non-textual fields).
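A possible replacement for the old _sort_firstName field is sketched below: the textual field firstName is kept for searching and a second, non-analysed field copies its value for sorting. The instance name, the type and property URIs, and the field name firstNameSort are hypothetical. The copy syntax "@firstName" and the "analyzed" option are assumptions taken from the Copy fields and List of creation parameters sections and should be checked there.

```sparql
PREFIX : <http://www.ontotext.com/connectors/lucene#>
PREFIX inst: <http://www.ontotext.com/connectors/lucene/instance#>

# Hypothetical instance: one searchable field plus a non-analysed copy for sorting.
# "@firstName" (copy of another field) and "analyzed" are assumed option spellings.
INSERT DATA {
    inst:my_people_index :createConnector '''
{
    "types": ["http://www.ontotext.com/example#person"],
    "fields": [
        {
            "fieldName": "firstName",
            "propertyChain": ["http://www.ontotext.com/example#firstName"]
        },
        {
            "fieldName": "firstNameSort",
            "propertyChain": ["@firstName"],
            "analyzed": false,
            "multivalued": false
        }
    ]
}
''' .
}
```

Queries would then search on firstName and order results by firstNameSort, so the sort key is the whole original string rather than individual analysed tokens.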