# Semantic similarity searches¶

## Why do I need the similarity plugin?¶

The similarity plugin allows exploring and searching semantic similarity in RDF resources.

As a user, you may want to solve cases where statistical semantics queries will be highly valuable, for example:

For this text (encoded as a literal in the database), return the closest texts based on a vector space model.

Another type of use case is the clustering of news (from a news feed) into groups by discussing events.

## What the similarity plugin does?¶

Humans determine the similarity between texts based on the similarity of the composing words and their abstract meaning. Documents containing by similar words are semantically related, and words frequently co-occurring are also considered close. The plugin supports document and term searches. A document is a literal or an aggregation of multiple literals, and a term is a word from a document.

There are four types of similarity searches:

• Term to term - returns the closest semantically related terms

• Term to document - returns the most representative documents for a specific searched term

• Document to term - returns the most representative terms for a specific document

• Document to document - returns the closest related texts

## How the similarity plugin works?¶

The similarity plugin integrates the semantic vectors library and the underlying Random Indexing algorithm. The algorithm uses a tokenizer to translate documents to sequences of words (terms) and to represent them into a vector space model representing their abstract meaning. A distinctive feature of the algorithm is the dimensionality reduction approach based on Random Projection, where the initial vector state is generated randomly. With the indexing of each document, the term vectors are adjusted based on the contextual words. This approach makes the algorithm highly scalable for very large text corpora of documents, and research papers have proven that its efficiency is comparable to more sound dimensionality reduction algorithms like singular value decomposition.

### Search similar terms¶

The example shows terms similar to “novichok” in the search index allNews. The term “novichok” is used in the search field. The selected option for both Search type and Result type is Term. Sample results of terms similar to “novichok”, listed by their score, are given below.

### Search documents for which selected term is specific¶

The term “novichok” is used as an example again. The selected option for Search type is Term, and for Result type is Document. Sample results of the most representative documents for a specific searched term, listed by their score, are given below.

### Search specific terms in selected document¶

The result with the highest score from the previous search is used in the new search. The selected option for Search type is Document, and for Result type is Term. Sample results of the most representative terms, listed by their score, are given below.

### Search for closest documents¶

A search for the texts closest to the selected document is also possible. The same document is used in the search field. Sample results of the documents with the closest texts to the selected document - listed by their score - are given below. The titles of the documents prove that their content is similar, even though the sources are different.

To obtain the sample results listed above, you need to download data and create an index. Data from factforge.net is used in the following example. News from January to April 2018, together with their content, creationDate, and mentionsEntity triples, are downloaded.

Go to the SPARQL editor at http://factforge.net/sparql and insert the following query:

PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

CONSTRUCT {
?document ff-map:mentionsEntity ?entity .
?document pubo:content ?content .
?document pubo:creationDate ?date .
} WHERE {
?document a pubo:Document .
?document ff-map:mentionsEntity ?entity .
?document pubo:content ?content .
?document pubo:creationDate ?date .
FILTER (?p NOT IN (pubo:containsMention, pubo:hasFeature, pubo:hasImage))
FILTER ( (?date > "2018-01-01"^^xsd:dateTime) && (?date < "2018-04-30"^^xsd:dateTime))
}


Download the data using the Download As button, choosing the Turtle option. It will take some time to export the data to the query-result.ttl file.

Go to your local GraphDB instance and create a new repository called “news”.

Move the downloaded file to the <HOME>/graphdb-import folder so that it is visible in Import ‣ RDF ‣ Server files (see here how to import server files).

Import the query-result.ttl file in your “news” repository.

Go to Setup and enable the Autocomplete index, and create an index for the “news” repository.

The Autocomplete index is used for autocompletion of URLs in the SPARQL editor and the View resource page.

## Text based similarity searches¶

### Create text similarity index¶

Create an index in the following way:

Go to Explore –> Similarity –> Create similarity index and change the search query to:

PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?documentID ?documentText
{
?documentID pubo:content ?documentText .
FILTER(isLiteral(?documentText))
}


Please note that there are default parameters: -termweight idf.

This will index the allNews content, where the ID of a document is the news piece’s IRI, and the text is the content.

Name the index allNews, save it and wait until it is ready.

Once the index has been created, you can see the following options on the right:

• With the {…} button, you can review or copy the SPARQL query for this index;

• The Edit icon allows you to modify the search query without having to build an index;

• You can also create a new index from an existing one;

• Rebuild the index;

• As well as delete it.

### Create index parameters¶

• -vectortype: Real, Complex, and Binary Semantic Vectors

• -dimension: Dimension of semantic vector space, default value 200. Recommended values are in the hundreds for real and complex, and in the thousands for binary, since binary dimensions are single bits. Smaller dimensions make both indexing and queries faster, but if the dimension is too low, then the orthogonality of the element vectors will be compromised leading to poorer results. An intuition for the optimal values is given by the Johnson–Lindenstrauss lemma.

• -seedlength: Number of nonzero entries in a sparse random vector, default value 10 except for when vectortype is binary, in which case default of dimension / 2 is enforced. For real and complex vectors default value is 10, but it is a good idea to use a higher value when the vector dimension is higher than 200. Simplest thing to do is to preserve this ratio, i.e., to divide the dimension by 20. It is worth mentioning that in the original implementation of random indexing, the ratio of non-zero elements was 1/3.

• -trainingcycles: Number of training cycles used for Reflective Random Indexing.

• -termweight: Term weighting used when constructing document vectors. Values can be none, idf, logentropy, sqrt. It is a good idea to use term weighting when building indexes so we add -termweight idf as a default when creating an index. It uses inverse document frequency when building the vectors. See LuceneUtils for more details.

• -minfrequency: Minimum number of times that a term has to occur in order to be indexed. Default value is set to 0, but it would be a bad idea to use it, as that would add a lot of big numbers/weird terms/misspelled words to the list of word vectors. Best approach would be to set it as a fraction of the total word count in the corpus. For example 40 per million as a frequency threshold. Another approach is to start with an intuitive value, a single digit number like 3-4, and start fine tuning from there.

• -maxfrequency: Maximum number of times that a term can occur before getting removed from indexes. Default value is Integer.MAX_VALUE. Again, a better approach is to calculate it as a percentage of the total word count. Otherwise, you can use the default value and add most common English words to the stop list.

• -maxnonalphabetchars: Maximum number of non alphabet characters a term can contain in order to be indexed. Default value is Integer.MAX_VALUE. Recommended values depend on the dataset and the type of terms it contains, but setting it to 0 works pretty well for most basic cases, as it takes care of punctuation (if data has not been preprocessed), malformed terms, and weird codes and abbreviations.

• -filternumbers: true/false, index numbers or not.

• -mintermlength: Minimum number of characters in a term.

• -indexfileformat: Format used for serializing/deserializing vectors from disk, default lucene. Another option is text, may be used for debug to see the actual vectors. Too slow on real data.

### Disabled parameters¶

• -luceneindexpath: Currently, you are not allowed to build your own Lucene index and create vectors from it since index + vectors creation is all done in one step.

• -stoplistfile: Replaced by the <http://www.ontotext.com/graphdb/similarity/stopList> predicate. Stop words are passed as a string literal as opposed to a file.

• -elementalmethod

• -docindexing

### Stop words and Lucene Analyzer¶

In the Stop words field, add a custom list of stop words to be passed to the Semantic Vector plugin. If left empty, the default Lucene stop words list will be used.

In the Analyzer class field, set a Lucene analyzer to be used during Semantic Vector indexing and query time tokenization. The default is org.apache.lucene.analysis.en.EnglishAnalyzer, but it can be any from the supported list as well.

Additionally, the Lucene connector also supports custom Analyzer implementations. This way you can create your own analyzer and add it to a classpath. The value of the Analyzer Class parameter must be a fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer.

### Search in the index¶

Go to the list of indexes and click on allNews. For search options, select Search type to be either Term or Document. Result type can be either Term or Document.

### Search parameters¶

• -searchtype: Different types of searches can be performed. Most involve processing combinations of vectors in different ways, in building a query expression, scoring candidates against these query expressions, or both. Default is sum that builds a query by adding together (weighted) vectors for each of the query terms, and search using cosine similarity. See more about SearchType here.

• -matchcase: If true, matching of query terms is case-sensitive; otherwise case-insensitive, default value is false.

• -numsearchresults: Number of search results.

• -searchresultsminscore: Search results with similarity scores below this threshold will not be returned, default value is -1.

See more about it at https://github.com/semanticvectors/semanticvectors/wiki/SearchOptions.

### Delete or rebuild an index using a SPARQL query¶

To delete an index, use the following SPARQL query:

PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX :<http://www.ontotext.com/graphdb/similarity/>

INSERT DATA {
inst:my_index :deleteIndex "" .
}


To rebuild an index, simply create it again following the steps shown above.

### Search in the index during rebuild with no downtime¶

GraphDB enables you to use the similarity index with no downtime while the database is being modified. While rebuilding the index, its last successfully built version is preserved until the new index is ready. This way, when you search in it during rebuild, the retrieved results will be from this last version. The following message will notify you of this:

The outdated image is then replaced.

### Locality-sensitive hashing¶

Note

As locality-sensitive hashing does not guarantee the retrieval of the most similar results, this hashing is not the most suitable option if precision is essential. Hashing with the same configuration over the same data does not guarantee the same search results.

Locality-sensitive hashing is introduced in order to reduce the searching times. Without a hashing algorithm, a search consists of the following steps:

1. A search vector is generated.

2. All vectors in store are compared to this search vector, and the most similar ones are returned as matches.

While this approach is complete and accurate, it is also time-consuming. In order to speed up the process, hashing can be used to reduce the number of candidates for most similar vectors. This is where Locality-sensitive hashing can be very useful.

The Locality-sensitive hashing algorithm has two parameters that can be passed either during index creation, or as search option:

• -lsh_hashes_num: The number of n random vectors used for hashing, default value is 0.

• -lsh_max_bits_diff: The m number of bits by which two hashes can differ and still be considered similar, default value is 0.

The hashing workflow is as follows:

1. An n number of random orthogonal vectors are generated.

2. Each vector in store is compared to each of those vectors (checking whether their scalar product is positive or not).

3. Given this data, a hash is generated for each of the vectors in store.

During a search, the workflow is as follows:

1. A search vector is generated.

2. A hash is generated for this search vector by comparing it to the n number of random vectors used during the initial hashing.

3. All similar hashes like the one of the searched vector are found. (a hash is considered similar when it has up to m bits difference from the original one).

4. All vectors with such hash are collected and compared to the generated vector in order to get the closest ones, based on the assumption that the vectors with similar hashes will be close to each other.

Note

If both parameters have the same value, then all possible hashes are considered similar and therefore no filtering is done. For optimization purposes in this scenario, the entire hashing logic has been bypassed.

If one of the parameters is specified during the index creation, then its value will be used as the default one for searching.

Depending on its configuration, the hash can perform in different ways.

A higher number of -lsh_hashes_num leads to more hash buckets with fewer elements in them. Conversely, a lower number of hashes would mean fewer but bigger buckets. The n number of hashes leads to 2^n potential buckets.

A higher number of -lsh_max_bits_diff leads to more buckets being checked, and vice versa. More precisely, an m number of -lsh_max_bits_diff with an n number of hashes leads to m-combinations of n + (m - 1) -combination of n + ... + 0 -combinations of n checked buckets.

By modifying these parameters, you can control the number of checked vectors. A lower number of checked vectors leads to higher performance, but also increases the chance of missing a similar vector.

Different settings perform well for different vector store sizes. A reasonable initial configuration is (3, 1). If you want to slightly increase the precision, you can change it to (3, 2). However this will substantially increase the number of checked vectors and reduce performance.

To make finer calibration, you would need a higher number of hashes - for instance, (6, 2) is also a possible configuration.

If you are looking to increase the performance, you could change the configuration to (6, 1) or (8, 2), but this will reduce precision.

If increasing the precision at the cost of performance is an acceptable option for you, you could use the configuration of (6, 3).

Note

If -lsh_max_bits_diff is too close to -lsh_hashes_num, the performance can be poorer compared to the default one because of the computational overhead.

### Search similar news within days¶

The search can be extended by using the advantages of SPARQL to find the same news by different sources. This can be done by filtering all the news from the results related to a given period.

Click on View SPARQL Query, copy the query, and open the SPARQL editor to paste it there. Now you can integrate statistic similarity with RDF to obtain the following query:

# Search similar news (SPARQL) within days

PREFIX similarity:<http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?entity ?score ?matchDate ?searchDate  {
BIND (<http://www.uawire.org/merkel-and-putin-discuss-syria-and-nord-stream-2> as ?searchDocumentID)
?search a similarity-index:allNews ;
similarity:searchDocumentID ?searchDocumentID;
similarity:searchParameters [];
similarity:documentResult ?result .
?result similarity:value ?entity ;
similarity:score ?score .
?entity pubo:creationDate ?matchDate  .
?searchDocumentID pubo:creationDate ?searchDate .
FILTER (?matchDate > ?searchDate - "P2D"^^xsd:duration && ?matchDate < ?searchDate + "P2D"^^xsd:duration)
}


Search for similar news, get their creationDate and filter only the news within the time period of two days.

### Boosting a term’s weight¶

It is possible to boost the weight of a given term in the text-based similarity index for term-based searches (Term to term or Term to document). Boosting a term’s weight can be done by using the caret symbol ^ followed by a boosting factor - a positive decimal number term^factor.

For example, UK Brexit^3 EU will perform a search in which the term “Brexit” will have 3 times more weight than “UK” and “EU”, and the results will be expected to be mainly related to “Brexit”.

The default boosting factor is 1. Setting a boosting factor of 0 will completely ignore the given term. Escaping the caret symbol ^ is done with a double backslash \\^.

Note

The boosting will not work in document-based searches (Document to term or Document to document), meaning that the caret following by a number will not be treated as a weight boosting symbol.

## Predication-based Semantic Indexing¶

Predication-based Semantic Indexing, or PSI, is an application of distributional semantic techniques for reasoning and inference. PSI starts with a collection of known facts or observations, and combines them into a single semantic vector model, in which both concepts and relationships are represented. This way, the usual ways for constructing query vectors and searching for results in SemanticVectors can be used to suggest similar concepts based on the knowledge graph.

The predication-based semantic search examples are based on Person data from the DBpedia dataset. The sample dataset contains over 730,000 triples for over 101,000 persons born between 1960 and 1970.

Import the provided persons-1960-1970.ttl.

Create an Autocomplete index by enabling it from the Setup ‣ Autocomplete page.

For ease of use, you may add the following namespaces for the example dataset (done from Setup ‣ Namespaces):

• dbo: http://dbpedia.org/ontology/

• dbr: http://dbpedia.org/resource/

• foaf: http://xmlns.com/foaf/0.1/

### Create predication-based index¶

Create a new predication-based similarity index from Explore ‣ Similarity ‣ Create similarity index. Select Create predication index.

Fill in the index name, and add the desired Semantic Vectors create index parameters. For example, it is a good idea to use term weighting when building indexes, so we will add -termweight idf. Also, for better results, set -dimension to higher than 200 which is the default.

Configure the Data query. This SPARQL SELECT query determines the data that will be indexed. The query must SELECT the following bindings:

• ?subject

• ?predicate

• ?object

The Data query is executed during index creation to obtain the actual data for the index. When data in your repo changes, you need to also rebuild the index. It is a subquery of a more complicated query that you can see with the View Index Query button.

For the given example, leave the default Data query. This will create an index with all triples in the repo:

SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object .
}


Set the Search query. This SELECT query determines the data that will be fetched on search. The Search query is executed during search. Add more bindings by modifying this query to see more data in the results table.

For the first example, set the Search query to:

PREFIX similarity:<http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?entity ?name ?description ?birthDate ?birthPlace ?gender ?score {
?search a ?index ;
?searchType ?query;
psi:searchPredicate ?psiPredicate;
similarity:searchParameters ?parameters;
?resultType ?result .
?result similarity:value ?entity ;
similarity:score ?score .
?entity foaf:name ?name .
OPTIONAL { ?entity <http://purl.org/dc/terms/description> ?description . }
OPTIONAL { ?entity dbo:birthPlace ?birthPlace . }
OPTIONAL { ?entity dbo:birthDate ?birthDate . }
OPTIONAL { ?entity foaf:gender ?gender . }
}


Click Create to start index creation.

Once the index has been built, you have the same options as for the text similarity index: View SPARQL query, Edit query, Create index from existing one, Rebuild, and Delete index. Additionally, if you want to edit an index query, you can do it for both the Search and the Analogical queries:

### Search predication-based index¶

From the Existing indexes, select the index you want to search in.

In our example, we will be looking for individuals similar to Hristo Stoichkov - the most famous Bulgarian football player.

In the results, you can see Bulgarian football players born in the same town, other Bulgarian athletes born in the same place, as well as other people with the same birth date.

### Analogical searches¶

Along with searching explicit relations and similarities, PSI can also be used for analogical search.

Suppose you have a dataset with currencies and countries, and want to know the following: “If I use dollars in the USA, what do I use in Mexico?” By using the predicate index, you do not need to know the predicate (“has currency”).

After importing the Nations.ttl sample dataset into a repository, build an Autocomplete index and a predication index following the steps in the documentation above.

Once your predication index is built, you can use the Analogical search option of your index. In logical terms, your query will translate to “If USA implies dollars, what does Mexico imply?”

As you can see, the first result is peso, the Mexican currency. The rest of the results are not relevant in this situation since they are part of a very small dataset.

### Why is this important?¶

PSI supplements traditional tools for artificial inference by giving “nearby” results. In cases where there is a single clear winner, this is essentially the behavior of giving “one right answer”. But in cases where there are several possible plausible answers, having robust approximate answers can be greatly beneficial.

## Hybrid indexing¶

When building a Predication index, it creates a random vector for each entity in the database, and uses these random vectors to generate the similarity vectors to be used later on for similarity searches. This approach does not take into consideration the similarity between the literals themselves. Let’s examine the following example, using the FactForge data from the previous parts of the page:

<express:donald-tusk-eu-poland-leave-european-union-polexit> <pubo:formattedDate> 1/11/2018
<telegraph:donald-tusk-warnspoland-could-hold-brexit-style-eu-referendum> <pubo:formattedDate> 1/11/2018
<express:POLAND-s-bid-for-World-War-2-reparations-is-bolstered-by-a-poll-which-found-that-a-majorit> <pubo:formattedDate> 1/6/2018


Naturally we would expect the first news article to be more similar to the second one than to the third one, not only based on their topics - Poland’s relationship with the EU - but also because of their dates. However, the normal Predication index would not take into account the similarity of the dates, and all news would have fairly close scores. In order to handle this type of scenario, we can first create a Text similarity index. It will find that the dates of the three articles are similar, and will then use this information when building the Predication index.

In order to do so, you need to:

### Edit the FactForge data¶

Dates, as presented in FactForge, are not literals that the similarity plugin can handle easily. This is why you need to format them to something easier to parse.

PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
insert {
?x pubo:formattedDate ?displayDate
}
WHERE {
?x pubo:creationDate ?date.
BIND (CONCAT(STR(MONTH(?date)),
"/",
STR(DAY(?date)),
"/",
STR(YEAR(?date))) as ?displayDate)
}


Replacing dateTime with a simple string will enable you to create a Literal index.

At this stage, you should enable Autocomplete in case you have not enabled it yet, so as to make testing easier.

Go to Setup, and enable the Autocomplete index for allNews.

### Create a Literal index¶

The Literal index is a subtype of the Text index. To build it, create a normal Text index by ticking the Literal index checkbox from the More options menu. This type of indexes can only be used as input indexes for predication indexes, and will be indicated in the Similarity page. They can not be used for similarity searching. The index will include all literals returned by the ?documentText variable from the Data query.

Make sure to filter out the mentions, so the data in the Literal index only contains the news. When creating the index, use the following Data query:

SELECT ?documentID ?documentText {
?documentID ?p ?documentText .
filter(isLiteral(?documentText))
filter (?p != <http://factforge.net/ff2016-mapping/mentionsEntity>)
}


### Use the Literal index¶

When creating the predication index from the “more options” menu, select Input Literal Index -> the index created in the previous step.

Since you do not want to look at mentions, and in this sense the default data format is useless, you need to filter them out from the data used in the predication index. Add the following Data query:

SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object .
filter (?predicate != <http://factforge.net/ff2016-mapping/mentionsEntity>)
filter (?predicate != <http://ontology.ontotext.com/publishing#creationDate>)
}


For the purposes of the test, we want to also display the new formatted date when retrieving data. Go to the search query tab and add the following query:

PREFIX similarity:<http://www.ontotext.com/graphdb/similarity/>
PREFIX similarity-index:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>

SELECT ?entity ?score ?content ?date {
?search a ?index ;
?searchType ?query;
psi:searchPredicate ?psiPredicate;
similarity:searchParameters ?parameters;
?resultType ?result .
?result similarity:value ?entity ;
similarity:score ?score .
?entity pubo:content ?content .
?entity pubo:formattedDate ?date .
}


With those two queries in place, the data returned from the index should be more useful. Create your hybrid predication index and wait for the process to be completed. Then, open it and run a query for “donald tusk”, selecting the express article about “Polexit” from the Autocomplete suggest box. You will see that the first results are related to the Polexit and dated the same.

### Indexing behavior¶

When building the Literal index, it is a good idea to index all literals that will be indexed in the Predication index, or at least all literals of the same type. Continuing with the example above, let’s say that the Literal index you have created only returns these three news pieces. Add the following triple about a hypothetical Guardian article, and create a Predication index to index all news:

<guardian:poland-grain-exports> <pubo:formattedDate> 12/08/2017


Based on the triples, it would be expected that the first article will be equally similar to the third and the new one - their contents and dates have little in common. However, depending on the binding method used when creating the Predication index, you can get higher score for the third article compared to the new one only because the third article has been indexed by the Literal index. There are two ways to easily avoid this - either all literals, or at least all dates are indexed.

### Manual creation¶

If you are not using the Similarity page, you could pass the following options when creating the indexes:

• -literal_index true: passed to a Text index creates a Literal index

• -input_index <literaIndex> (replace <literalIndex> with the name of an existing Literal index): passed to a Predication index creates a hybrid index based on a Literal index

## Training cycles¶

When building Text and Predication indexes, training cycles could be used to increase the accuracy of the index. The number of training cycles can be set by passing the option:

• -trainingcycles <numOfCycles>: The default number of training cycles is 0.

Text and Predication indexes have quite different implementations of the training cycles.

Text indexes just repeat the same algorithm multiple times, which leads to algorithm convergence.

Predication indexes initially start the training with a random vector for each entity in the database. On each cycle, the initially random elemental vectors are replaced with the product of the previous cycle, and the algorithm is run again. In addition to the entity vectors, the predicate vectors get trained as well. This leads to higher computational time for a cycle compared to the initial run (with trainingcycles = 0).

Note

Each training cycle is time- and computationally consuming, and a higher number of cycles will greatly increase the building time.