# Semantic similarity searches

## Why do I need the similarity plugin?

The similarity plugin allows exploring and searching semantic similarity in RDF resources.

Users often also need to solve cases where statistical semantics queries are highly valuable, for example: for a given text (encoded as a literal in the database), return the closest texts based on a vector space model.

Another type of use case is clustering news (from a news feed) into groups according to the events they discuss.

## What does the similarity plugin do?

Humans determine the similarity between texts based on the similarity of the words they are composed of and their abstract meaning. Documents composed of similar words are semantically related, and words that frequently co-occur are also considered close. The plugin supports document and term searches. A document is a literal or an aggregation of multiple literals, and a term is a word from the documents.

There are four types of similarity searches:

• Term to term - returns the closest semantically related terms
• Term to document - returns the most representative documents for a specific searched term
• Document to term - returns the most representative terms for a specific document
• Document to document - returns the closest related texts

## How does the similarity plugin work?

The similarity plugin integrates the Semantic Vectors library and the underlying Random Indexing algorithm. The algorithm uses a tokenizer to translate documents into sequences of words (terms) and represents them in a vector space model that captures their abstract meaning. A distinctive feature of the algorithm is its dimensionality reduction approach based on Random Projection, where the initial vector state is generated randomly. As each document is indexed, the term vectors are adjusted based on the contextual words. This approach makes the algorithm highly scalable for very large text corpora, and research papers have shown that its efficiency is comparable to more rigorous dimensionality reduction algorithms like singular value decomposition.
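
The idea can be illustrated with a minimal Python sketch. It is a simplified, document-based variant of Random Indexing written for illustration only, not the Semantic Vectors implementation: the function names, the whitespace tokenizer, and the sample texts are all assumptions. Each document receives a sparse random vector, and each term vector is the sum of the vectors of the documents in which the term occurs, so terms that share documents end up close together.

```python
import math
import random
from collections import defaultdict


def elemental_vector(dimension, seed_length, rng):
    """Sparse random vector with seed_length nonzero entries, alternating +1 and -1."""
    vector = [0.0] * dimension
    for i, position in enumerate(rng.sample(range(dimension), seed_length)):
        vector[position] = 1.0 if i % 2 == 0 else -1.0
    return vector


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_term_vectors(documents, dimension=200, seed_length=10, seed=42):
    """Add each document's random vector to the vector of every term it contains."""
    rng = random.Random(seed)
    term_vectors = defaultdict(lambda: [0.0] * dimension)
    for text in documents:
        doc_vector = elemental_vector(dimension, seed_length, rng)
        for term in text.lower().split():  # naive tokenizer (assumption)
            term_vector = term_vectors[term]
            for i, value in enumerate(doc_vector):
                term_vector[i] += value
    return dict(term_vectors)


# Hypothetical mini-corpus, used only to exercise the sketch.
documents = [
    "nerve agent attack in salisbury",
    "nerve agent poisoning investigation",
    "football match result",
]
term_vectors = build_term_vectors(documents)

# Terms that occur in the same documents score higher than unrelated terms.
print(cosine(term_vectors["nerve"], term_vectors["agent"]))
print(cosine(term_vectors["nerve"], term_vectors["football"]))
```

The default values of `dimension` and `seed_length` in the sketch mirror the defaults of the -dimension and -seedlength index parameters.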

### Search similar terms

The example shows terms similar to “novichok” in the search index allNews. The term “novichok” is used in the search field. The selected option for both Search type and Result type is Term. Sample results of terms similar to “novichok” - listed by their score - are given below.

### Search documents for which selected term is specific

The term “novichok” is used as an example again. The selected option for Search type is Term, and for Result type is Document. Sample results of the most representative documents for a specific searched term - listed by their score - are given below.

### Search specific terms in selected document

The result with the highest score from the previous search is used in the new search. The selected option for Search type is Document, and for Result type is Term. Sample results of the most representative terms - listed by their score - are given below.

### Search for closest documents

A search for the texts closest to the selected document is also possible. The same document is used in the search field. Sample results of the documents with the closest texts to the selected document - listed by their score - are given below. The titles of the documents show that their content is similar though the sources are different.

To reproduce the sample results listed above, you need to download data and create an index. Data from factforge.net is used in the following example. News from January to April 2018, together with their content, creationDate and mentionsEntity triples, are downloaded.

Go to the SPARQL editor at http://factforge.net/sparql and write the following query:

PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

CONSTRUCT {
    ?document ff-map:mentionsEntity ?entity .
    ?document pubo:content ?content .
    ?document pubo:creationDate ?date .
} WHERE {
    ?document a pubo:Document .
    ?document ff-map:mentionsEntity ?entity .
    ?document pubo:content ?content .
    ?document pubo:creationDate ?date .
    FILTER ( (?date > "2018-01-01T00:00:00"^^xsd:dateTime) && (?date < "2018-04-30T00:00:00"^^xsd:dateTime))
}


Download the data using the Download As button and choose the Turtle option. Exporting the data to the query-result.ttl file will take some time.

Go to your local GraphDB instance and create a new repository “news”.

Move the downloaded file to the <HOME>/graphdb-import folder so that it is visible in Import -> RDF -> Server files.

Import the query-result.ttl file into your new repository “news”.

Go to Setup, enable the Autocomplete index, and create an index for allNews using the Build Now button.

The Autocomplete index is used for autocompletion of URLs in the SPARQL editor and the View resource page.

## Text based similarity searches

### Create text similarity index

Create an index for allNews. Similarity indexes help you look up semantically similar entities and texts.

Go to Explore -> Similarity -> Create similarity index and change the query to:

PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?documentID ?documentText
{
?documentID pubo:content ?documentText .
FILTER(isLiteral(?documentText))
}

Note that the index is created with the following default parameter:
-termweight idf

This will index allNews content where the ID of a document is the news’ IRI and the text is the news’ content.

Name the index ‘allNews’, save it, and wait until it is ready. Using the {...} button, you can review or copy the SPARQL query for this index.

### Create index parameters

• -vectortype - real, complex, or binary - real, complex, and binary semantic vectors
• -dimension - dimension of semantic vector space, default value 200. Recommended values are in the hundreds for real and complex and in the thousands for binary, since binary dimensions are single bits. Smaller dimensions make both indexing and queries faster, but if the dimension is too low, then the orthogonality of the element vectors will be compromised leading to poorer results. An intuition for the optimal values is given by the Johnson–Lindenstrauss lemma
• -seedlength - Number of nonzero entries in a sparse random vector. The default value is 10 for real and complex vectors; for binary vectors, a default of dimension / 2 is enforced. It is a good idea to use a higher value when the vector dimension is higher than 200. The simplest approach is to preserve the default ratio, i.e. to divide the dimension by 20. It is worth mentioning that in the original implementation of random indexing, the ratio of non-zero elements was 1/3.
• -trainingcycles - Number of training cycles used for Reflective Random Indexing
• -termweight - Term weighting used when constructing document vectors. Values can be none, idf, logentropy, sqrt. It is a good idea to use term weighting when building indexes so we add -termweight idf as a default when creating an index. It uses inverse document frequency when building the vectors. See LuceneUtils for more details.
• -minfrequency - Minimum number of times that a term has to occur in order to be indexed. Default value is set to 0, but it would be a bad idea to use it, as that would add a lot of big numbers/weird terms/misspelled words to the list of word vectors. Best approach would be to set it as a fraction of the total word count in the corpus. For example 40 per million as a frequency threshold. Another approach is to start with an intuitive value, a single digit number like 3-4, and start fine tuning from there.
• -maxfrequency - Maximum number of times that a term can occur before getting removed from indexes. Default value is Integer.MAX_VALUE. Again, a better approach is to calculate it as a percentage of the total word count. Otherwise, you can use the default value and add the most common English words to the stop list.
• -maxnonalphabetchars - Maximum number of non alphabet characters a term can contain in order to be indexed. Default value is Integer.MAX_VALUE. Recommended values depend on the dataset and the type of terms it contains, but setting it to 0 works pretty well for most basic cases, as it takes care of punctuation (if data has not been preprocessed), malformed terms, and weird codes and abbreviations.
• -filternumbers - true/false, index numbers or not
• -mintermlength - Minimum number of characters in a term
• -porterstemmer - Whether to stem words using the Porter stemmer. Note that this parameter cannot be passed at search time, so if the index was built with stemming, stem your search terms before searching.
• -indexfileformat - Format used for serializing / deserializing vectors from disk, default lucene. The other option is text, which may be used for debugging to see the actual vectors, but it is too slow on real data.
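
The rules of thumb above for -seedlength and -minfrequency can be turned into a small helper. The following Python sketch is illustrative only: the function name and its arguments are hypothetical, and joining the flags with spaces in a single string is an assumption about how the parameters are entered. It keeps the seed length at dimension / 20 for real and complex vectors (dimension / 2 for binary) and derives the minimum frequency from a per-million threshold.

```python
def suggest_index_parameters(dimension, corpus_word_count,
                             vector_type="real", min_per_million=40):
    """Build a parameter string from the rules of thumb described in the list above."""
    if vector_type == "binary":
        # For binary vectors, a seed length of dimension / 2 is enforced.
        seed_length = dimension // 2
    else:
        # Preserve the default 200:10 ratio, never going below the default of 10.
        seed_length = max(10, dimension // 20)
    # Minimum frequency as a fraction of the total word count (e.g. 40 per million).
    min_frequency = max(1, round(corpus_word_count * min_per_million / 1_000_000))
    return (f"-vectortype {vector_type} -dimension {dimension} "
            f"-seedlength {seed_length} -termweight idf "
            f"-minfrequency {min_frequency}")


# Hypothetical corpus of two million words indexed with 400 dimensions.
print(suggest_index_parameters(400, 2_000_000))
```

Treat the result as a starting point for tuning rather than as recommended values for a particular dataset.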

### Disabled parameters

• -luceneindexpath - Currently, you are not allowed to build your own Lucene index and create vectors from it, since index and vector creation is done in one step.
• -stoplistfile - replaced by <http://www.ontotext.com/graphdb/similarity/stopList> predicate. Stop words are passed as a string literal, not a file.
• -elementalmethod
• -docindexing

### Stop words and Lucene Analyzer

In Stop words, add a custom list of stop words to be passed to the Semantic Vectors plugin. The default Lucene stop words list will be used if it is empty.

In Analyzer class, set a Lucene analyzer to be used during Semantic Vectors indexing and query time tokenization. The default is org.apache.lucene.analysis.en.EnglishAnalyzer, but it can be any from the supported list.

Also, the Lucene connector supports custom Analyzer implementations, so you can create your own analyzer and add it to the classpath. The value of the Analyzer class parameter must be a fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer.

### Search in the index

Go to the list of indexes and click on allNews index. For search options, select Search type to be either Term or Document. Result type can be either Term or Document.

### Search parameters

• -searchtype - Different types of searches that can be performed. Most involve processing combinations of vectors in different ways, in building a query expression, scoring candidates against these query expressions, or both. The default is sum, which builds a query by adding together the (weighted) vectors for each of the query terms and searches using cosine similarity. See SearchType in pitt.search.semanticvectors.Search
• -matchcase - If true, matching of query terms is case-sensitive; otherwise case-insensitive, default false.
• -numsearchresults - number of search results
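
The default sum search type described above can be sketched in a few lines of Python. This is a hedged illustration of the idea, not the code in pitt.search.semanticvectors: the function names, the three-dimensional vectors, and the document identifiers are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sum_query_vector(terms, term_vectors, weights=None):
    """Add together the (optionally weighted) vectors of the query terms."""
    weights = weights or {}
    dimension = len(next(iter(term_vectors.values())))
    query = [0.0] * dimension
    for term in terms:
        vector = term_vectors.get(term)
        if vector is None:
            continue  # terms missing from the index contribute nothing
        weight = weights.get(term, 1.0)
        for i, value in enumerate(vector):
            query[i] += weight * value
    return query


def search(query, candidates, num_results=10):
    """Score every candidate against the query and return the best matches."""
    scored = [(name, cosine(query, vector)) for name, vector in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num_results]


# Toy vectors (assumptions); a real index has hundreds of dimensions.
term_vectors = {
    "nerve": [1.0, 0.0, 0.0],
    "agent": [0.8, 0.2, 0.0],
    "football": [0.0, 0.0, 1.0],
}
doc_vectors = {
    "doc:poisoning": [0.9, 0.1, 0.0],
    "doc:match": [0.0, 0.1, 0.9],
}

query = sum_query_vector(["nerve", "agent"], term_vectors)
results = search(query, doc_vectors, num_results=1)
print(results)
```

Here `num_results` plays the role of -numsearchresults. Passing `term_vectors` as the candidates instead of `doc_vectors` turns the same lookup into a term search.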

### Search similar news within days

The search can be extended by using the power of SPARQL to find the same news from different sources. This can be done by filtering the results to the news published within a given period.

Click on View SPARQL Query, copy the query and go to the SPARQL editor to paste it there. Now you can integrate statistical similarity with RDF to obtain the following query:

# Search similar news (SPARQL) within days

PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?entity ?score ?matchDate ?searchDate  {
BIND (<http://www.uawire.org/merkel-and-putin-discuss-syria-and-nord-stream-2> as ?searchDocumentID)
?search a inst:allNews ;
:searchDocumentID ?searchDocumentID;
:searchParameters "";
:documentResult ?result .
?result :value ?entity ;
:score ?score .
?entity pubo:creationDate ?matchDate  .
?searchDocumentID pubo:creationDate ?searchDate .
FILTER (?matchDate > ?searchDate - "P2D"^^xsd:duration && ?matchDate < ?searchDate + "P2D"^^xsd:duration)
}


The query searches for similar news, gets their creationDate, and keeps only the news created within 2 days of the searched document.
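
The date condition in the FILTER can be read as a symmetric window around the creation date of the searched document, with strict bounds on both sides. A minimal Python sketch of the same check follows; the function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def within_days(match_date, search_date, days=2):
    """Mirror the FILTER: strictly less than `days` before or after search_date."""
    window = timedelta(days=days)
    return search_date - window < match_date < search_date + window


# Hypothetical creation dates, used only to exercise the check.
search_date = datetime(2018, 3, 20, 12, 0)
candidates = {
    "news:a": datetime(2018, 3, 21, 9, 30),   # inside the window
    "news:b": datetime(2018, 3, 22, 12, 0),   # exactly 2 days later, excluded
    "news:c": datetime(2018, 3, 10, 8, 0),    # too old
}

kept = [name for name, date in candidates.items() if within_days(date, search_date)]
print(kept)
```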

## Predication-based Semantic Indexing

Predication-based Semantic Indexing, or PSI, is an application of distributional semantic techniques to reasoning and inference. PSI starts with a collection of known facts or observations, and combines them into a single semantic vector model in which concepts and relationships are both represented. Then the usual ways for constructing query vectors and searching for results in SemanticVectors can be used to suggest similar concepts based on the knowledge graph.
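
A toy Python sketch can make the idea more concrete. It is an illustration under stated assumptions and not the Semantic Vectors implementation: random ±1 vectors stand in for concepts and predicates, elementwise multiplication is assumed as the operator that combines a predicate with its object (the library may use different operators depending on the vector type), and the triples are invented. The vector of each subject is the sum of its combined predicate-object vectors, so subjects that share facts end up close together.

```python
import math
import random

DIMENSION = 1024


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bind(a, b):
    """Combine a predicate vector with an object vector (assumed operator)."""
    return [x * y for x, y in zip(a, b)]


def build_semantic_vectors(triples, seed=7):
    """Sum bound predicate-object vectors into one semantic vector per subject."""
    rng = random.Random(seed)
    elemental = {}

    def vector_for(key):
        # Each predicate and each object gets its own fixed random +1/-1 vector.
        if key not in elemental:
            elemental[key] = [rng.choice((-1.0, 1.0)) for _ in range(DIMENSION)]
        return elemental[key]

    semantic = {}
    for subject, predicate, obj in triples:
        bound = bind(vector_for(("predicate", predicate)), vector_for(("concept", obj)))
        target = semantic.setdefault(subject, [0.0] * DIMENSION)
        for i, value in enumerate(bound):
            target[i] += value
    return semantic


# Invented facts: A and B share a birth place, C shares nothing with A.
triples = [
    ("person:A", "birthPlace", "place:X"),
    ("person:A", "occupation", "job:Footballer"),
    ("person:B", "birthPlace", "place:X"),
    ("person:B", "occupation", "job:Coach"),
    ("person:C", "birthPlace", "place:Y"),
    ("person:C", "occupation", "job:Singer"),
]
semantic = build_semantic_vectors(triples)

print(cosine(semantic["person:A"], semantic["person:B"]))
print(cosine(semantic["person:A"], semantic["person:C"]))
```

In this sketch the first score is noticeably higher than the second because A and B share one of their two facts, while A and C share none.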

The predication-based semantic search examples are based on Persons data from the DBpedia dataset. The sample dataset contains over 730,000 triples for more than 101,000 persons born between 1960 and 1970.

Import the provided persons-1960-1970.ttl file.

Create an Autocomplete index by switching ON the autocomplete from Setup->Autocomplete page.

For ease of use you may add the following namespaces for the example dataset (from Setup->Namespaces page):

### Create predication-based index

Create a new predication-based similarity index from Explore -> Similarity -> Create similarity index. Select the tab “Create predication index”.

Fill in the index name.

Add the desired Semantic Vectors create index parameters. For example, it is a good idea to use term weighting when building indexes, so we will add -termweight idf. Also, for better results, set -dimension to more than 200, which is the default.

Set the Data query. This SPARQL SELECT query determines the data that will be indexed. The query must SELECT the following bindings:

• ?subject
• ?predicate
• ?object

The Data query is executed during index creation to obtain the actual data for the index. When data in your repo changes, you should rebuild the index. It is a subquery of a more complicated query that you can see from the ‘View Index Query’ link.

For the given example, leave the default Data query. This will create an index with all triples in the repo:

SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object .
}


Set the Search query. This SELECT query determines the data that will be fetched on search. The Search query is executed during search. Add more bindings by modifying this query to see more data in the results table.

For the first example, set the Search query to:

PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?entity ?name ?description ?birthDate ?birthPlace ?gender ?score {
?search a ?index ;
?searchType ?query;
psi:searchPredicate ?psiPredicate;
:searchParameters ?parameters;
?resultType ?result .
?result :value ?entity ;
:score ?score .
?entity foaf:name ?name .
OPTIONAL { ?entity <http://purl.org/dc/terms/description> ?description . }
OPTIONAL { ?entity dbo:birthPlace ?birthPlace . }
OPTIONAL { ?entity dbo:birthDate ?birthDate . }
OPTIONAL { ?entity foaf:gender ?gender . }
}


Click the Create button to start index creation.

### Search predication-based index

From the Existing indexes select the index you want to search in.

In our example, we are looking for people similar to Hristo Stoichkov - the most famous Bulgarian football player.

In the results, you may see Bulgarian football players born in the same town, other Bulgarian sportsmen born in the same place, or other persons with the same birth date.

### Why is this important?

PSI supplements traditional tools for artificial inference by giving “nearby” results. In cases where there is a single clear winner, this recovers the behavior of giving “one right answer”. But in cases where there are several possible plausible answers, it can be a great benefit to have robust approximate answers.