Semantic similarity searches

Why do I need the similarity plugin?

The similarity plugin allows exploring and searching semantic similarity in RDF resources.

Users often need to solve cases where statistical semantics queries are highly valuable, for example: for a given text (encoded as a literal in the database), return the closest texts based on a vector space model.

Another use case is clustering news (from a news feed) into groups according to the events they discuss.

What does the similarity plugin do?

Humans determine the similarity between texts based on the similarity of the composing words and their abstract meaning. Documents composed of similar words are semantically related, and words that frequently co-occur are also considered close. The plugin supports document and term searches. A document is a literal or an aggregation of multiple literals, and a term is a word from the documents.

There are four types of similarity searches:

  • Term to term - returns the closest semantically related terms
  • Term to document - returns the most representative documents for a specific searched term
  • Document to term - returns the most representative terms for a specific document
  • Document to document - returns the closest related texts
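The four search types can be read as choosing which vector store supplies the query and which supplies the candidates. The following is a minimal sketch under that assumption; the `search` function, the toy vectors, and the `doc:` identifiers are hypothetical and are not part of the plugin.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(key, search_type, result_type, term_vectors, doc_vectors, limit=3):
    """Rank entries of the result store by similarity to the vector of `key`."""
    stores = {"term": term_vectors, "document": doc_vectors}
    query = stores[search_type][key]
    scored = [
        (name, cosine(query, vector))
        for name, vector in stores[result_type].items()
        if not (search_type == result_type and name == key)  # skip the query itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

# Toy, hand-made vectors (hypothetical data, not produced by the plugin).
term_vectors = {
    "novichok": [1.0, 0.0, 0.0],
    "poison": [0.9, 0.1, 0.0],
    "football": [0.0, 0.0, 1.0],
}
doc_vectors = {
    "doc:attack": [1.0, 0.2, 0.0],
    "doc:match": [0.0, 0.1, 1.0],
}

print(search("novichok", "term", "term", term_vectors, doc_vectors))
print(search("novichok", "term", "document", term_vectors, doc_vectors))
print(search("doc:attack", "document", "term", term_vectors, doc_vectors))
print(search("doc:attack", "document", "document", term_vectors, doc_vectors))
```

Each call differs only in the search type and result type arguments, which mirrors the two drop-downs used in the search examples.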

How does the similarity plugin work?

The similarity plugin integrates the Semantic Vectors library and the underlying Random Indexing algorithm. The algorithm uses a tokenizer to translate documents into sequences of words (terms) and represents them in a vector space model that captures their abstract meaning. A distinctive feature of the algorithm is its dimensionality reduction approach based on Random Projection, where the initial vector state is generated randomly. As each document is indexed, the term vectors are adjusted based on the contextual words. This approach makes the algorithm highly scalable for very large text corpora, and research papers have shown that its efficiency is comparable to more rigorous dimensionality reduction algorithms such as singular value decomposition.
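To make the idea concrete, here is a minimal sketch of one common document-based variant of Random Indexing. It assumes a naive whitespace tokenizer and no term weighting, and the function names are hypothetical; the procedure the library actually runs may differ in its details.

```python
import random
from collections import defaultdict

def elemental_vector(dimension, seed_length, rng):
    """Sparse random vector with seed_length nonzero entries, alternating +1 and -1."""
    vector = [0.0] * dimension
    for i, position in enumerate(rng.sample(range(dimension), seed_length)):
        vector[position] = 1.0 if i % 2 == 0 else -1.0
    return vector

def tokenize(text):
    """Naive tokenizer (an assumption): lowercase and split on whitespace."""
    return text.lower().split()

def random_indexing(documents, dimension=200, seed_length=10, seed=42):
    """Build term and document vectors from a {doc_id: text} mapping."""
    rng = random.Random(seed)
    tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    # 1. Every document starts from a randomly generated sparse vector.
    elemental = {d: elemental_vector(dimension, seed_length, rng) for d in documents}
    # 2. A term vector accumulates the elemental vectors of the documents it occurs in.
    term_vectors = defaultdict(lambda: [0.0] * dimension)
    for doc_id, words in tokens.items():
        for word in words:
            term_vector = term_vectors[word]
            for i, value in enumerate(elemental[doc_id]):
                term_vector[i] += value
    # 3. A document vector is the sum of the vectors of its terms.
    doc_vectors = {}
    for doc_id, words in tokens.items():
        doc_vector = [0.0] * dimension
        for word in words:
            for i, value in enumerate(term_vectors[word]):
                doc_vector[i] += value
        doc_vectors[doc_id] = doc_vector
    return dict(term_vectors), doc_vectors

terms, docs = random_indexing({
    "d1": "nerve agent novichok",
    "d2": "novichok nerve agent attack",
    "d3": "football match result",
})
print(len(terms), "term vectors,", len(docs), "document vectors")
```

Terms that occur in exactly the same documents end up with identical vectors in this sketch, which is the co-occurrence effect described above in its simplest form.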

Search similar terms

The example shows terms similar to “novichok” in the search index allNews. The term “novichok” is used in the search field. The selected option for both Search type and Result type is Term. Sample results of terms similar to “novichok” - listed by their score - are given below.


Search documents for which selected term is specific

The term “novichok” is used as an example again. The selected option for Search type is Term, and for Result type is Document. Sample results of the most representative documents for a specific searched term - listed by their score - are given below.


Search specific terms in selected document

The result with the highest score from the previous search is used in the new search. The selected option for Search type is Document, and for Result type is Term. Sample results of the most representative terms - listed by their score - are given below.


Search for closest documents

A search for the texts closest to the selected document is also possible. The same document is used in the search field. Sample results of the documents with the closest texts to the selected document - listed by their score - are given below. The titles of the documents prove that their content is similar though the sources are different.


Download data

To reproduce the sample results listed above, you need to download data and create an index. Data from is used in the following example. News from January to April 2018 is downloaded, together with the content, creationDate, and mentionsEntity triples.

Go to the SPARQL editor at and write the following query:

PREFIX pubo: <>
PREFIX pub: <>
PREFIX dbr: <>
PREFIX xsd: <>
PREFIX ff-map: <>

CONSTRUCT {
    ?document ff-map:mentionsEntity ?entity .
    ?document pubo:content ?content .
    ?document pubo:creationDate ?date .
} WHERE {
    ?document a pubo:Document .
    ?document ff-map:mentionsEntity ?entity .
    ?document pubo:content ?content .
    ?document pubo:creationDate ?date .
    FILTER (?p NOT IN (pubo:containsMention, pubo:hasFeature, pubo:hasImage))
    FILTER ( (?date > "2018-01-01"^^xsd:dateTime) && (?date < "2018-04-30"^^xsd:dateTime))
}

Download the data using the Download As button and choose the Turtle option. Exporting the data to the query-result.ttl file will take some time.

Go to your local GraphDB instance and create a new repository “news”.

Move the downloaded file to the <HOME>/graphdb-import folder so that it is visible in Import -> RDF -> Server files.

Import the query-result.ttl file into your new repository “news”.

Go to Settings, enable the Autocomplete index, and create an index for allNews using the Build Now button.

The Autocomplete index is used for autocompletion of URLs in the SPARQL editor and the View resource page.


Create similarity index

Create index for allNews. Similarity indexes help you look up semantically similar entities and texts.

Go to Explore -> Similarity -> Create similarity index and change the query to:

PREFIX pubo: <>
SELECT ?documentID ?documentText {
    ?documentID pubo:content ?documentText .
}

Please note that there are default parameters:

-termweight idf

This will index allNews content where the ID of a document is the news’ IRI and the text is the news’ content.

Name the index ‘allNews’, save it, and wait until it is ready. Using the {...} button, you can review or copy the SPARQL query for this index.

Create index parameters

  • -vectortype - real, complex, binary - Real, Complex and Binary Semantic Vectors
  • -dimension - dimension of the semantic vector space, default value 200. Recommended values are in the hundreds for real and complex and in the thousands for binary, since binary dimensions are single bits. Smaller dimensions make both indexing and queries faster, but if the dimension is too low, the orthogonality of the element vectors is compromised, leading to poorer results. An intuition for the optimal values is given by the Johnson–Lindenstrauss lemma.
  • -seedlength - Number of nonzero entries in a sparse random vector. The default value is 10 for real and complex vectors; when vectortype is binary, a default of dimension / 2 is enforced. It is a good idea to use a higher value when the vector dimension is higher than 200. The simplest thing to do is to preserve this ratio, i.e., to divide the dimension by 20. It is worth mentioning that in the original implementation of random indexing, the ratio of non-zero elements was 1/3.
  • -trainingcycles - Number of training cycles used for Reflective Random Indexing
  • -termweight - Term weighting used when constructing document vectors. Values can be none, idf, logentropy, sqrt. It is a good idea to use term weighting when building indexes so we add -termweight idf as a default when creating an index. It uses inverse document frequency when building the vectors. See LuceneUtils for more details.
  • -minfrequency - Minimum number of times that a term has to occur in order to be indexed. The default value is 0, but using it is a bad idea, as it adds many large numbers, odd terms, and misspelled words to the list of word vectors. The best approach is to set it as a fraction of the total word count in the corpus, for example 40 per million as a frequency threshold. Another approach is to start with an intuitive single-digit value, such as 3 or 4, and fine-tune from there.
  • -maxfrequency - Maximum number of times that a term can occur before getting removed from indexes. The default value is Integer.MAX_VALUE. Again, a better approach is to calculate it as a percentage of the total word count. Otherwise, you can use the default value and add the most common English words to the stoplist.
  • -maxnonalphabetchars - Maximum number of non alphabet characters a term can contain in order to be indexed. Default value is Integer.MAX_VALUE. Recommended values depend on the dataset and the type of terms it contains, but setting it to 0 works pretty well for most basic cases, as it takes care of punctuation (if data has not been preprocessed), malformed terms, and weird codes and abbreviations.
  • -filternumbers - true/false, index numbers or not
  • -mintermlength - Minimum number of characters in a term
  • -porterstemmer - Whether to stem words using the Porter Stemmer. Note that the same parameter cannot be passed during search, so you need to stem your search terms yourself before searching.
  • -indexfileformat - Format used for serializing / deserializing vectors from disk, default lucene. The other option is text, which can be used for debugging to see the actual vectors, but it is too slow on real data.
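The rules of thumb for -seedlength and -minfrequency, and the effect of the term filters, can be sketched as small helper functions. This is only an illustration of the arithmetic described in the list above; the function names are hypothetical, and the inclusive comparison bounds are an assumption rather than documented behavior.

```python
def suggest_seedlength(dimension, vectortype="real"):
    """Keep the 10 / 200 ratio for real and complex vectors; binary uses dimension / 2."""
    if vectortype == "binary":
        return dimension // 2
    return max(10, dimension // 20)

def suggest_minfrequency(total_word_count, per_million=40):
    """Scale a per-million frequency threshold to the corpus size (at least 1)."""
    return max(1, round(total_word_count * per_million / 1_000_000))

def keep_term(term, count, minfrequency, maxfrequency, maxnonalphabetchars, mintermlength):
    """Apply the frequency, non-alphabet-character, and length limits to one term.

    Assumption: all bounds are treated as inclusive.
    """
    non_alphabet = sum(1 for ch in term if not ch.isalpha())
    return (
        minfrequency <= count <= maxfrequency
        and non_alphabet <= maxnonalphabetchars
        and len(term) >= mintermlength
    )

# Example: a 400-dimensional real index over a corpus of 2.5 million words.
print(suggest_seedlength(400))          # 400 / 20
print(suggest_minfrequency(2_500_000))  # 40 per million, scaled to the corpus
print(keep_term("novichok", 12, 3, 10_000, 0, 3))
print(keep_term("x-ray", 12, 3, 10_000, 0, 3))  # rejected: one non-alphabet character
```

The resulting numbers are starting points for values such as -seedlength and -minfrequency, to be tuned against the actual dataset.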

Disabled parameters

  • -luceneindexpath - Currently, you are not allowed to build your own lucene index and create vectors from it since index + vectors creation is all done in one step.
  • -stoplistfile - replaced by <> predicate. Stop words are passed as a string literal, not a file.
  • -elementalmethod
  • -docindexing

Search in the index

Go to the list of indexes and click on the allNews index. For search options, select Search type to be either Term or Document. Result type can also be either Term or Document.


Search parameters

  • -searchtype - Different types of searches that can be performed. Most involve processing combinations of vectors in different ways, in building a query expression, scoring candidates against these query expressions, or both. The default is sum, which builds a query by adding together the (weighted) vectors of each of the query terms and searches using cosine similarity. See SearchType in
  • -matchcase - If true, matching of query terms is case-sensitive; otherwise case-insensitive, default false.
  • -numsearchresults - number of search results
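The default sum search can be pictured as follows: add up the (weighted) vectors of the query terms, then rank candidates by cosine similarity. This is a minimal sketch of that description only. The function names are hypothetical, and treating case-insensitive matching as lowercasing is an assumption, not the library's documented behavior.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def sum_query(terms, term_vectors, weights=None, matchcase=False):
    """Build a query vector as the weighted sum of the vectors of the query terms."""
    weights = weights or {}
    # Assumption: case-insensitive matching is modelled by lowercasing both sides.
    lookup = term_vectors if matchcase else {t.lower(): v for t, v in term_vectors.items()}
    dimension = len(next(iter(term_vectors.values())))
    query = [0.0] * dimension
    for term in terms:
        vector = lookup.get(term if matchcase else term.lower())
        if vector is None:
            continue  # unknown terms contribute nothing to the query
        weight = weights.get(term, 1.0)
        for i, value in enumerate(vector):
            query[i] += weight * value
    return query

def top_results(query, candidates, numsearchresults):
    """Return the best-scoring (name, score) pairs, highest score first."""
    scored = sorted(((cosine(query, v), name) for name, v in candidates.items()), reverse=True)
    return [(name, score) for score, name in scored[:numsearchresults]]

# Toy vectors (hypothetical data).
term_vectors = {"nerve": [1.0, 0.0], "agent": [0.0, 1.0]}
doc_vectors = {"doc:attack": [1.0, 1.0], "doc:match": [-1.0, 0.5]}

query = sum_query(["Nerve", "agent"], term_vectors)
print(top_results(query, doc_vectors, numsearchresults=1))
```

With matchcase set to true, the query term "Nerve" would not match the stored term "nerve" in this sketch and would be skipped.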

Additional info at:

Search similar news within days

The search can be extended by using the power of SPARQL to find the same news reported by different sources. This can be done by keeping only the results that fall within a given period.

Click on View SPARQL Query, copy the query, and go to the SPARQL editor to paste it there. Now you can integrate statistical similarity with RDF to obtain the following query:

# Search similar news (SPARQL) within days

PREFIX inst:<>
PREFIX pubo: <>
PREFIX xsd: <>
SELECT ?entity ?score ?matchDate ?searchDate {
    BIND (<> as ?searchDocumentID)
    ?search a inst:allNews ;
        :searchDocumentID ?searchDocumentID ;
        :searchParameters "" ;
        :documentResult ?result .
    ?result :value ?entity ;
        :score ?score .
    ?entity pubo:creationDate ?matchDate .
    ?searchDocumentID pubo:creationDate ?searchDate .
    FILTER (?matchDate > ?searchDate - "P2D"^^xsd:duration && ?matchDate < ?searchDate + "P2D"^^xsd:duration)
}

The query searches for similar news, gets their creationDate, and keeps only the news created within two days of the searched document.
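The date condition in the FILTER can also be checked outside SPARQL, for example when post-processing exported results. The sketch below mirrors the strict two-day window of the query; the function names, the row layout, and the `news:` identifiers are hypothetical.

```python
from datetime import datetime, timedelta

def within_days(search_date, match_date, days=2):
    """True if match_date lies strictly inside search_date +/- `days`, as in the FILTER."""
    window = timedelta(days=days)
    return search_date - window < match_date < search_date + window

def filter_results(results, search_date, days=2):
    """Keep (entity, score, match_date) rows whose date falls inside the window."""
    return [row for row in results if within_days(search_date, row[2], days)]

# Hypothetical result rows: (entity, score, creation date).
search_date = datetime(2018, 3, 10, 12, 0)
results = [
    ("news:a", 0.91, datetime(2018, 3, 11, 9, 30)),
    ("news:b", 0.88, datetime(2018, 3, 14, 8, 0)),
    ("news:c", 0.80, datetime(2018, 3, 9, 18, 45)),
]

for entity, score, match_date in filter_results(results, search_date):
    print(entity, score, match_date.isoformat())
```

Because both comparisons are strict, a document created exactly two days before or after the searched one is excluded, just as in the query.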