Full-text Search¶
Full-text search (FTS) indexing enables very fast queries over textual data. Typically, FTS is used to retrieve data that represents text written in a human language such as English, Spanish, or French.
GraphDB supports various mechanisms for performing full-text search depending on the use case and the needs of a given project.
FTS using the GraphDB connectors¶
The GraphDB connectors index, search, and retrieve entire documents composed of a set of RDF statements:
They need a predefined data model that describes how every indexed document is constructed from a template of RDF statements.
Queries search in one or more document fields.
Results return the document ID.
See more about the full-text search with the GraphDB connectors, as well as the Lucene connector, the Solr connector, and the Elasticsearch connector.
Simple FTS index¶
GraphDB 10.1 introduced a simple FTS index that covers some basic FTS use cases. This index contains literals and IRIs:
There is no data model, so it is easy to set up.
Queries search in literals and IRIs.
Results return the matching literals and IRIs.
How the search works¶
In general, searching is performed via SPARQL using a pattern like this:
?value onto:fts (query index limit)
There are three search arguments:
The query: string or language-tagged string, required
The index to search: string, optional
The limit of the search: integer, optional
The matching values will be returned as bindings of the provided variable, ?value in the pattern above.
When no index is supplied as a parameter, the index is determined as follows:
If the query is a plain string without a language tag, then the index will be the configured index for string literals (via the FTS index for xsd:string literals repository configuration parameter).
If the query is a language-tagged string, then the language tag will be used to determine the index name.
Note
When an index is supplied as a parameter, the language tag of the query string will be ignored.
When only the query is provided (the only required argument), it is possible and recommended to provide it directly without constructing an RDF list. Thus, the pattern can be simplified to:
?value onto:fts query
Some examples:
("query" "en" 10): Search for “query” in the “en” index and limit results to 10.
("query" 15): Search for “query” in the index configured via fts-string-literals-index and limit results to 15.
("query"@de 20): Search for “query” in the “de” index and limit results to 20.
("query"@de-CH 20): Search for “query” in the “de” index and limit results to 20. Note that only the language part of the tag de-CH determines the index.
("query" "fr"): Search for “query” in the “fr” index and do not apply a limit.
"query"@fr: Search for “query” in the “fr” index and do not apply a limit; when a sole argument is provided, it does not need to be inside an RDF list.
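As a rough sketch of how these argument forms fit into a complete query, the following assumes a repository in which the “en” index has been enabled. The onto: prefix is the one used by the use-case examples in this document.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # RDF list form: query string, index name, and result limit, in that order
    ?value onto:fts ("query" "en" 10)
}
```

Dropping the limit leaves a two-element list, and dropping the index as well allows the bare literal form.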
Query syntax¶
The queries are parsed using Lucene’s StandardQueryParser class.
A query consists of clauses, field specifications, grouping and Boolean operators, and interval functions.
Note
Keep in mind these details in particular:
Field specifications: The default field is the only field, so there is never a need to specify a field name.
Escaping in SPARQL: All query syntax examples specify the expected Lucene query string. If you provide these strings as SPARQL literals, you may need to escape " and \ as required by SPARQL.
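To illustrate the escaping point, here is a minimal sketch of a phrase query whose Lucene form includes double quotes. It assumes that an index is configured for plain string queries; the search phrase itself is only an example.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # Lucene query string: "test equipment" (a phrase, quotes included).
    # Inside a double-quoted SPARQL literal, each inner quote is written as \"
    # and a Lucene backslash would be written as \\
    ?value onto:fts "\"test equipment\""
}
```

An alternative is to delimit the SPARQL literal with single quotes, as in '"test equipment"', which leaves the inner double quotes unescaped.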
Note
Some of the specialized query types are not text-analyzed. Lexical analysis is only run on complete terms, i.e., a term/phrase query. Query types containing incomplete terms (e.g., prefix/wildcard/regex/fuzzy query) skip the analysis stage and are directly added to the query tree. The only transformation applied to partial query terms is lowercasing.
This may lead to surprising results if you expect stemming or lemmatization. For example, searching for “resti*” and expecting to find “resting” will not work when using the English analyzer since the word “resting” was analyzed and indexed as “rest”.
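One way to work with this behavior is to build the prefix on the indexed form of the word rather than on the surface form. The sketch below assumes an “en” index containing a literal with the word “resting”, which the English analyzer stores as “rest” as described above.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # The prefix is matched against indexed tokens as-is (only lowercased),
    # so it is written against the stored form "rest", not "resti"
    ?value onto:fts "rest*"@en
}
```

Keep in mind that such a prefix also matches any other indexed token that starts with “rest”. When the goal is simply to find forms of “resting”, the plain term query "resting"@en goes through analysis and is likely the better choice.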
Basic clauses¶
A query must contain one or more clauses. A clause can be a literal term, a phrase, a wildcard expression, or any supported expression.
The following are some examples of simple one-clause queries:
Query | Description |
---|---|
test | Selects documents containing the word “test” (term clause). |
"test equipment" | Phrase search; selects documents containing the phrase “test equipment” (phrase clause). |
"test failure"~4 | Proximity search; selects documents containing the words “test” and “failure” within 4 words (positions) from each other. The provided “proximity” is technically translated into “edit distance” (maximum number of atomic word-moving operations required to transform the document’s phrase into the query phrase). |
tes* | Prefix wildcard matching; selects documents containing words starting with “tes”, such as: “test”, “testing” or “testable”. |
/.[aeiou]st/ | Documents containing word roots matching the provided regular expression, such as “post” or “nest”. |
nest~2 | Fuzzy term matching; documents containing words within an edit distance of 2 (2 additions, removals, or replacements of a letter) from “nest”, such as “test”, “net”, or “rests”. |
Boolean operators and grouping¶
You can combine clauses using Boolean AND, OR, and NOT operators to form more complex expressions, for example:
Query | Description |
---|---|
test AND results | Selects documents containing both the word “test” and the word “results”. |
test OR suite OR results | Selects documents with at least one of “test”, “suite”, or “results”. |
test AND NOT complete | Selects documents containing “test” and not containing “complete”. |
test AND (pass* OR fail*) | Grouping; use parentheses to specify the precedence of terms in a Boolean clause. Query will match documents containing “test” and a word starting with “pass” or “fail”. |
(pass fail skip) | Shorthand notation; documents containing at least one of “pass”, “fail”, or “skip”. |
Note
The Boolean operators must be written in all caps, otherwise they are parsed as regular terms.
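As an illustration, a grouped Boolean query can be passed to the search pattern like any other query string. This is a minimal sketch that assumes an enabled “en” index; the search terms are placeholders.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # Matches values containing "test" together with a word starting with
    # "pass" or "fail". AND and OR are uppercase so that they are parsed as
    # operators rather than as search terms.
    ?value onto:fts "test AND (pass* OR fail*)"@en
}
```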
Range operators¶
To search for ranges of textual or numeric values, use square or curly brackets, for example:
Query | Description |
---|---|
[Jones TO Smith] | Inclusive range; selects documents that contain any value between “Jones” and “Smith”, including boundaries. |
{Jones TO Smith} | Exclusive range; selects documents that contain any value between “Jones” and “Smith”, excluding boundaries. |
[Jones TO *] | One-sided range; selects documents that contain any value larger than (i.e., sorted after) “Jones”. |
Note
These will work intuitively only with the “iri” index. For example, "[http://www.w3.org/2000/01/rdf-schema#comment TO http://www.w3.org/2000/01/rdf-schema#range]" will retrieve all IRIs that are alphabetically ordered between http://www.w3.org/2000/01/rdf-schema#comment and http://www.w3.org/2000/01/rdf-schema#range, inclusive. With any of the other indexes, range queries will return matches, but it will not be intuitive what they match.
Term boosting¶
Terms, quoted terms, term range expressions, and grouped clauses can have a floating-point weight boost applied to them to increase their score relative to other clauses. For example:
Query | Description |
---|---|
jones^2 OR smith^0.5 | Prioritize documents with “jones” term over matches on the “smith” term. |
(a OR b NOT c)^2.5 OR d | Apply the boost to a sub-query. |
Special character escaping¶
Most search terms can be put in double quotes, which makes special character escaping unnecessary. If the search term contains the quote character (or cannot be quoted for some reason), any character can be escaped with a backslash. For example:
Query | Description |
---|---|
\:\(quoted\+term\)\: | A single search term, :(quoted+term):, written with escape sequences. |
Minimum-should-match constraint for Boolean disjunction groups¶
A minimum-should-match operator can be applied to a disjunction Boolean query (a query with only “OR”-subclauses) and forces the query to match documents with at least the provided number of these subclauses. For example:
Query | Description |
---|---|
(blue crab fish)@2 | Matches all documents with at least two terms from the set [blue, crab, fish] (in any order). |
((yellow OR blue) crab fish)@2 | Sub-clauses of a Boolean query can themselves be complex queries; here the min-should-match selects documents that match at least two of the provided three sub-clauses. |
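The constraint is part of the Lucene query string, so it travels inside the SPARQL literal. The sketch below assumes Lucene's @N suffix notation for minimum-should-match and an enabled “en” index.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # At least two of the three terms must occur in a matching value.
    # The @2 inside the string is Lucene syntax; the @en outside it is the
    # SPARQL language tag that selects the index.
    ?value onto:fts "(blue crab fish)@2"@en
}
```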
Interval function clauses¶
Interval functions are a powerful tool for expressing search needs in terms of one or more contiguous fragments of text and their relationship to one another. All interval clauses start with the fn: prefix. For example:
Query | Description |
---|---|
fn:ordered(quick brown fox) | Matches all documents with at least one ordered sequence of “quick”, “brown”, and “fox” terms. |
fn:maxwidth(5 fn:atLeast(2 quick brown fox)) | Matches all documents where at least two of the three terms “quick”, “brown”, and “fox” occur within five positions of each other. |
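An interval clause is passed to the search pattern in the same way as any other query string. This sketch assumes Lucene's fn:ordered interval function and an enabled “en” index; the terms are placeholders.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # Matches values where "quick", "brown", and "fox" appear in this order
    ?value onto:fts "fn:ordered(quick brown fox)"@en
}
```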
Common use cases¶
The first thing we need to do in order to perform full-text search is to enable the FTS index. This can be done at repository creation by setting Enable full-text search (FTS) index to true, or at a later stage by editing the repository configuration.

Single language¶
Let’s say that our data is in a single supported language and we want to perform full-text search in order to find literals that match. Literals may or may not have a language tag, for example:
“This is a literal in English without a language tag”
“This is another literal in English with a language tag for the language only”@en
“This is yet another literal tagged for English in Canada”@en-CA
To configure the search:
Create a repository.
In its configuration menu, enable the "en" index by setting FTS indexes to build to “en”.
The literals without a language tag need to go into the "en" index too, so we will set FTS index for xsd:string literals to “en”.

Important
After each change applied to any of the FTS parameters, you need to restart the repository.
In the Workbench SPARQL editor, let’s insert the following sample data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
<urn:d1> rdfs:label "This is a literal in English without a language tag",
"This is another literal in English with a language tag for the language only"@en,
"This is yet another literal tagged for English in Canada"@en-CA,
"Let's pretend this literal isn't in English by tagging it as German"@de
}
We can now run the following example query against it:
PREFIX onto: <http://www.ontotext.com/>
select * {
# Note that this exploits the fact that we haven’t enabled the default index,
# so the index for indexing string literals (en) is the default query index
?value onto:fts "english literal"
}
Or this one:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The language tag of the query literal supplies the index to query
?value onto:fts "english literal"@en
}
Or this one:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The query string and the index to query are supplied as two separate values
# inside an RDF list
?value onto:fts ("english literal" "en")
}
They will all return the first three literals (i.e., without the one tagged as German).
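Because the results are the matching literals themselves, a follow-up triple pattern can connect them back to the resources that carry them. The sketch below assumes the sample data above, where the literals are attached via rdfs:label; the variable name ?resource is only illustrative.

```sparql
PREFIX onto: <http://www.ontotext.com/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?resource ?value {
    # Full-text match: binds ?value to each matching literal in the "en" index
    ?value onto:fts "english literal"@en .
    # Ordinary triple pattern: finds the resources labeled with that literal
    ?resource rdfs:label ?value
}
```

With only the sample data loaded, ?resource should be bound to urn:d1 in every row.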

Multiple languages¶
Here, our data is in several supported languages (e.g., English and German) and we want to perform full-text search in order to find literals that match. Literals without a language tag are in one of the desired languages (e.g., English). The data may look like this:
“This is a literal in English without a language tag”
“This is another literal in English with a language tag for the language only”@en
“This is yet another literal tagged for English in Canada”@en-CA
“Das ist ein schönes deutsches Literal”@de
“Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz”@de-CH
To configure the search:
Create a repository.
In its configuration menu, enable the "en" and "de" indexes by setting FTS indexes to build to “en, de”. This can be extended with additional languages by adding them to the list.
The literals without a language tag need to go into the "en" index too, so we will set FTS index for xsd:string literals to “en”.

We will use the following sample data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
<urn:d2> rdfs:label "This is a literal in English without a language tag",
"This is another literal in English with a language tag for the language only"@en,
"This is yet another literal tagged for English in Canada"@en-CA,
"Das ist ein schönes deutsches Literal"@de,
"Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz"@de-CH
}
Searching in English is exactly the same as in the first use case. To search the additional German index, we must always specify it like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The language tag of the query literal supplies the index to query
?value onto:fts "deutsch literal"@de
}
Or this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The query string and the index to query are supplied as two separate values
# inside an RDF list
?value onto:fts ("deutsch literal" "de")
}
Both of these queries will return the two German literals.
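Since only the language part of a tag selects the index, a query tagged with a regional variant should behave the same way. A small sketch, assuming the “de” index configured above:

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # "de-CH" resolves to the "de" index; the region subtag is not used
    # when choosing the index
    ?value onto:fts "deutsch literal"@de-CH
}
```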

Note
Keep in mind that if you have other data in the repository, it may affect the results.
Ignore untagged literals¶
In this case, our data is in one or more supported languages (e.g., English and German) and we want to perform full-text search in order to find literals that match. Literals without a language tag should not be treated as any of those languages and need not be searched. Data may look like this:
“This is another literal in English with a language tag for the language only”@en
“This is yet another literal tagged for English in Canada”@en-CA
“Das ist ein schönes deutsches Literal”@de
“Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz”@de-CH
“This is a literal in English without a language tag” (this must not be indexed)
To configure the search:
Create a repository.
In its configuration menu, enable the "en" and "de" indexes by setting FTS indexes to build to “en, de”. This can be extended with additional languages by adding them to the list.
The literals without a language tag must not be indexed, so we will set FTS index for xsd:string literals to “none”.

Let’s insert the following sample data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
<urn:d3> rdfs:label "This is another literal in English with a language tag for the language only"@en,
"This is yet another literal tagged for English in Canada"@en-CA,
"Das ist ein schönes deutsches Literal"@de,
"Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz"@de-CH,
"This is a literal in English without a language tag"
}
Searching in any of the languages requires specifying the index, because there is no default search index when FTS index for xsd:string literals is set to “none”. For example:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The language tag of the query literal supplies the index to query
?value onto:fts "english literal"@en
}
Or this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The query string and the index to query are supplied as two separate values
# inside an RDF list
?value onto:fts ("english literal" "en")
}
Both queries will return the two literals that are tagged for English but not the untagged one.

Untagged literals not treated as any language but still searchable¶
Here, our data is in one or more supported languages (e.g., English and German) and we want to perform full-text search in order to find literals that match.
Literals without a language tag should not be treated as any of those languages but should provide language-agnostic full-text search. These literals may be data like UUIDs or anything else that has a textual representation that we may want to search. Data may look like this:
“This is another literal in English with a language tag for the language only”@en
“This is yet another literal tagged for English in Canada”@en-CA
“Das ist ein schönes deutsches Literal”@de
“Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz”@de-CH
“96ac1c60-7997-45a3-8dfe-b57b24c1cb62” (this will be indexed separately)
To configure the search:
Create a repository.
In its configuration menu, enable the "default" index, as well as the indexes "en" and "de", by setting FTS indexes to build to “default, en, de”. This can be extended with additional languages by adding them to the list.
The literals without a language tag need to be indexed in a language-agnostic manner, so we will set FTS index for xsd:string literals to “default” (which is also the default value for that repository configuration property).

Important
The values of FTS indexes to build must contain the values for FTS index for xsd:string literals and FTS index for full-text indexing of IRIs, unless those are set to “none”.
Let’s import the following data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
<urn:d4> rdfs:label "This is another literal in English with a language tag for the language only"@en,
"This is yet another literal tagged for English in Canada"@en-CA,
"Das ist ein schönes deutsches Literal"@de,
"Dies hier ist ebenso ein hübsches deutsches Literal, aber aus der Schweiz"@de-CH,
"96ac1c60-7997-45a3-8dfe-b57b24c1cb62"
}
Note
The “default” index provides language-agnostic search.
Searching in any of the languages is like in the third example related to ignoring untagged literals, i.e., you need to provide the index to search.
Searching in the untagged literals can be done like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "b57*"
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The language tag of the query literal supplies the index to query
?value onto:fts "b57*"@default
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# The query string and the index to query are supplied as two separate values
# inside an RDF list
?value onto:fts ("b57*" "default")
}
All of these queries will return the single untagged literal, where "b57*" was matched to one of the hyphenated components.

Treat IRIs as keywords and search them¶
In this case, regardless of our need to search literals, we also want to search within IRIs treating them as keywords (the entire IRI is considered a single searchable token). These can be any IRIs, such as:
<http://www.w3.org/2000/01/rdf-schema#domain>
<http://example.com/data/john>
<http://example.com/data/mary>
<http://example.com/data/william>
To configure the search:
Create a repository.
In its configuration menu, enable a special index called "iri" by adding it to the FTS indexes to build property. For example, if we also want English literals to be indexed, we will set FTS indexes to build to “en, iri”.
Set FTS index for xsd:string literals to “en” so that the literals without a language tag will go to the “en” index.

Let’s insert the following data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
<http://example.com/data/john> rdfs:label "John" .
<http://example.com/data/mary> rdfs:label "Mary" .
<http://example.com/data/william> rdfs:label "William" .
}
To search the IRIs, you need to query the "iri" index like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# Finds all IRIs that start with "http://example.com/"
?value onto:fts "http://example.com/*"@iri
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# Finds all IRIs that start with "http://example.com/"
?value onto:fts ("http://example.com/*" "iri")
}
Both of these will return the http://example.com/xxx IRIs from the sample data.

When the entire search string is a single keyword, which is the case for the "iri" index, you can also use range searches to find IRIs that sort between two IRIs:
PREFIX onto: <http://www.ontotext.com/>
select * {
# Finds all IRIs that sort between http://example.com/data/kelly and http://example.com/data/william,
# including the boundaries
?value onto:fts "[http://example.com/data/kelly TO http://example.com/data/william]"@iri
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
# Finds all IRIs that sort between http://example.com/data/kelly and http://example.com/data/william,
# including the boundaries
?value onto:fts ("[http://example.com/data/kelly TO http://example.com/data/william]" "iri")
}
Both of these queries should return http://example.com/data/mary and http://example.com/data/william.
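For comparison, here is a sketch of the same range with exclusive boundaries, using the curly-bracket form from the Range operators section. It assumes the "iri" index and the sample data above.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # Curly brackets exclude both boundaries, so an IRI equal to either
    # endpoint is left out of the results
    ?value onto:fts "{http://example.com/data/kelly TO http://example.com/data/william}"@iri
}
```

With only the sample data loaded, this should leave http://example.com/data/mary as the single match, since http://example.com/data/william is now excluded as a boundary.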

Treat IRIs as regular text and search them¶
In this scenario, regardless of our need to search literals, we also need to search within IRIs, treating them as regular text (the IRI is split into multiple searchable tokens). These are typically IRIs that are readable and are composed of words:
<http://example.com/data/john>
<http://example.com/data/mary>
<http://example.com/data/william>
To configure the search:
Create a repository.
In its configuration menu, enable the index for the language we want by adding it to FTS indexes to build – for English, we will set FTS indexes to build to “en”.
The value of FTS index for xsd:string literals must also be set to “en”.
We also need IRIs to be indexed for full-text search in the language we enabled, so we will set FTS index for full-text indexing of IRIs to “en”.

Let’s insert the sample data:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
<http://example.com/data/john> rdfs:label "John" .
<http://example.com/data/mary> rdfs:label "Mary" .
<http://example.com/data/william> rdfs:label "William" .
}
IRIs are then searchable in the "en" index just like literals:
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "john"@en
}
Or like this:
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts ("john" "en")
}
Both of these queries will return the IRI http://example.com/data/john, as well as the literal "John".
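When IRIs and literals share an index, a standard SPARQL filter can narrow the results to one kind of value. The following is a minimal sketch that assumes the configuration and sample data above.

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    ?value onto:fts "john"@en .
    # Keep only the IRI matches; use isLiteral(?value) to keep literals instead
    FILTER(isIRI(?value))
}
```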

Star Wars dataset examples¶
These examples use the Star Wars dataset from starwars-data.ttl.
Create the repository as follows:
Enable full-text search (FTS) index: true
FTS indexes to build: en, de, fr, es, it
FTS index for xsd:string literals: en
FTS index for full-text indexing of IRIs: none (but “en” would also make sense for this dataset)

Let’s look at some example queries below.
All literals where “luke” and “vader” are near each other¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts '"luke vader"~5'
}
It returns a single literal.

Note that the above searches in the "en" index since the default index is disabled and we requested xsd:string literals to go to the "en" index.
Note that we use single quotes for the query literal to avoid escaping the double quotes that are part of the full-text search query.
All literals containing “skywalker” but not “luke”¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "skywalker -luke"
}
It returns several results, some of which are Luke’s grandmother Shmi Skywalker and Luke’s father Anakin Skywalker (before he became Darth Vader).

All literals corresponding to a simple FTS query¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "striking jedis"
}
It returns many results, some of which are “The Empire Strikes Back” and “Return of the Jedi”. This illustrates how full-text search tuned to a specific language (in this case English) is able to match “striking” to “strikes” and “jedis” to “jedi”.

Note that a query written this way does not require every token to appear in a matched result. In other words, the query is equivalent to “striking OR jedis”.
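To require both tokens instead, the terms can be joined with an explicit AND, as described in the Boolean operators section. A sketch under the same repository configuration:

```sparql
PREFIX onto: <http://www.ontotext.com/>
select * {
    # Both terms must match, so only literals containing a form of "striking"
    # and a form of "jedis" are returned
    ?value onto:fts "striking AND jedis"
}
```

This is expected to return a subset of the results of the previous query.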
All literals corresponding to a simple FTS query in German¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "das beste"@de
}
It returns matches like “Ahmed Best”, “Oscar für den besten Film” and “Oscar für die beste Regie”, again illustrating the ability of FTS to match different word forms in German.

All literals corresponding to a simple FTS query in French¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "oscar acteur"@fr
}
It returns matches like “Oscar de la meilleure actrice” and “Oscar du meilleur acteur”, again illustrating the ability of FTS to match different word forms in French.

All literals corresponding to a simple FTS query in Italian¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "migliori"@it
}
It returns matches like “Oscar al miglior film”, “Oscar ai migliori costumi” and “Oscar alla migliore scenografia”, again illustrating the ability of FTS to match different word forms in Italian.

All literals corresponding to a simple FTS query in Spanish¶
PREFIX onto: <http://www.ontotext.com/>
select * {
?value onto:fts "peliculas"@es
}
It returns matches like “Película del 2005” and “personaje de ficción el las películas de Star Wars”, again illustrating the ability of FTS to match different word forms in Spanish but also the ability to ignore diacritics when searching.
