Semantic similarity searches

Why do I need the similarity plugin?

The similarity plugin lets you explore and search for semantic similarity in RDF resources.

Users also often need to solve cases where statistical semantics queries are highly valuable, such as: for this text (encoded as a literal in the database), return the closest texts based on a vector space model.

Another use case is clustering news (from a news feed) into groups by the events they discuss.

What does the similarity plugin do?

Humans determine the similarity between texts based on the similarity of the words they are composed of and their abstract meaning. Documents composed of similar words are semantically related, and words that frequently co-occur are also considered close. The plugin supports document and term searches. A document is a literal or an aggregation of multiple literals, and a term is a word from the documents.

There are four types of similarity searches:

  • Term to term - returns the closest semantically related terms
  • Term to document - returns the most representative documents for a specific searched term
  • Document to term - returns the most representative terms for a specific document
  • Document to document - returns the closest related texts
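All four search types come down to nearest-neighbor lookups in the same vector space. As a minimal illustration of the comparison involved (toy vectors and names are hypothetical; the plugin performs this internally):

```python
import math

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy 4-dimensional vectors for two terms and an unrelated one.
novichok = [0.9, 0.1, 0.0, 0.2]
poisoning = [0.8, 0.2, 0.1, 0.3]
weather = [0.0, 0.9, 0.1, 0.0]

# "novichok" is closer to "poisoning" than to "weather".
assert cosine(novichok, poisoning) > cosine(novichok, weather)
```

The score columns shown in the search results below are similarity values of exactly this kind, sorted in descending order.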

How does the similarity plugin work?

The similarity plugin integrates the semantic vectors library and the underlying Random Indexing algorithm. The algorithm uses a tokenizer to translate documents into sequences of words (terms) and represents them in a vector space model capturing their abstract meaning. A distinctive feature of the algorithm is its dimensionality reduction approach based on Random Projection, where the initial vector state is generated randomly. As each document is indexed, the term vectors are adjusted based on the contextual words. This approach makes the algorithm highly scalable for very large text corpora, and research has shown that its effectiveness is comparable to that of more sound dimensionality reduction algorithms such as singular value decomposition.
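A schematic sketch of the Random Indexing idea, assuming a simplified bag-of-words notion of context (pure Python; this is an illustration, not the semantic vectors library's actual implementation):

```python
import random

def elemental_vector(dimension=200, seedlength=10, rng=None):
    # Sparse ternary random vector: seedlength nonzero entries, alternating +1/-1.
    rng = rng or random.Random()
    vec = [0.0] * dimension
    for i, pos in enumerate(rng.sample(range(dimension), seedlength)):
        vec[pos] = 1.0 if i % 2 == 0 else -1.0
    return vec

def index_documents(docs, dimension=200, seedlength=10, seed=42):
    rng = random.Random(seed)
    elemental, context = {}, {}
    for doc in docs:
        terms = doc.lower().split()
        for term in terms:
            if term not in elemental:
                elemental[term] = elemental_vector(dimension, seedlength, rng)
                context[term] = [0.0] * dimension
        # Each term's context vector accumulates the elemental vectors
        # of the other terms in the same document (its context).
        for term in terms:
            for other in terms:
                if other != term:
                    context[term] = [c + e
                                     for c, e in zip(context[term], elemental[other])]
    return context

vectors = index_documents(["nerve agent novichok poisoning",
                           "novichok nerve agent attack"])
```

Terms that co-occur in similar contexts end up with similar context vectors, which is what the similarity searches then compare.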

Search similar terms

The example shows terms similar to “novichok” in the search index allNews. The term “novichok” is used in the search field. The selected option for both Search type and Result type is Term. Sample results of terms similar to “novichok” - listed by their score - are given below.

_images/Novichok1.png

Search documents for which selected term is specific

The term “novichok” is used as an example again. The selected option for Search type is Term, and for Result type is Document. Sample results of the most representative documents for a specific searched term - listed by their score - are given below.

_images/Novichok2.png

Search specific terms in selected document

The result with the highest score from the previous search is used in the new search. The selected option for Search type is Document, and for Result type is Term. Sample results of the most representative terms - listed by their score - are given below.

_images/Novichok3.png

Search for closest documents

A search for the texts closest to the selected document is also possible. The same document is used in the search field. Sample results of the documents with the closest texts to the selected document - listed by their score - are given below. The titles of the documents prove that their content is similar though the sources are different.

_images/Novichok4.png

Download data

To reproduce the sample results listed above, you need to download data and create an index. Data from factforge.net is used in the following example. News from January to April 2018, together with their content, creationDate, and mentionsEntity triples, are downloaded.

Go to the SPARQL editor at http://factforge.net/sparql and write the following query:

PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ff-map: <http://factforge.net/ff2016-mapping/>

CONSTRUCT {
        ?document ff-map:mentionsEntity ?entity .
        ?document pubo:content ?content .
        ?document pubo:creationDate ?date .
} WHERE {
        ?document a pubo:Document .
        ?document ff-map:mentionsEntity ?entity .
        ?document pubo:content ?content .
        ?document pubo:creationDate ?date .
    FILTER ( (?date > "2018-01-01"^^xsd:dateTime) && (?date < "2018-04-30"^^xsd:dateTime))
}
_images/FactForgeSPARQL.png

Download the data using the Download As button, choosing the Turtle option. It will take some time to export the data to the query-result.ttl file.

Go to your local GraphDB instance and create a new repository “news”.

Move the downloaded file to the <HOME>/graphdb-import folder so it is visible in Import -> RDF -> Server files.

Import the query-result.ttl file into your new repository “news”.

Go to Setup, enable the Autocomplete index, and create an index for allNews using the Build Now button.

The Autocomplete index is used for autocompletion of URLs in the SPARQL editor and on the View resource page.

_images/Autocomplete.png

Text based similarity searches

Create text similarity index

Create an index for allNews. Similarity indexes help you look up semantically similar entities and texts.

Go to Explore -> Similarity -> Create similarity index and change the query to:

PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?documentID ?documentText
{
    ?documentID pubo:content ?documentText .
    FILTER(isLiteral(?documentText))
}
Please note that there are default parameters:
-termweight idf

This will index allNews content where the ID of a document is the news’ IRI and the text is the news’ content.

Name the index ‘allNews’, save it, and wait until it is ready. Using the {...} button, you can review or copy the SPARQL query for this index.

Create index parameters

  • -vectortype - real, complex, binary - Real, Complex, and Binary Semantic Vectors
  • -dimension - dimension of semantic vector space, default value 200. Recommended values are in the hundreds for real and complex and in the thousands for binary, since binary dimensions are single bits. Smaller dimensions make both indexing and queries faster, but if the dimension is too low, then the orthogonality of the element vectors will be compromised leading to poorer results. An intuition for the optimal values is given by the Johnson–Lindenstrauss lemma
  • -seedlength - Number of nonzero entries in a sparse random vector, default value 10, except when vectortype is binary, in which case a default of dimension / 2 is enforced. For real and complex vectors the default value is 10, but it is a good idea to use a higher value when the vector dimension is higher than 200. The simplest thing to do is to preserve this ratio, i.e., to divide the dimension by 20. It is worth mentioning that in the original implementation of Random Indexing, the ratio of non-zero elements was 1/3.
  • -trainingcycles - Number of training cycles used for Reflective Random Indexing
  • -termweight - Term weighting used when constructing document vectors. Values can be none, idf, logentropy, sqrt. It is a good idea to use term weighting when building indexes so we add -termweight idf as a default when creating an index. It uses inverse document frequency when building the vectors. See LuceneUtils for more details.
  • -minfrequency - Minimum number of times that a term has to occur in order to be indexed. The default value is 0, but keeping it is a bad idea, as that would add a lot of rare numbers, weird terms, and misspelled words to the list of word vectors. The best approach is to set it as a fraction of the total word count in the corpus, for example 40 per million as a frequency threshold. Another approach is to start with an intuitive value - a single-digit number like 3 or 4 - and fine-tune from there.
  • -maxfrequency - Maximum number of times that a term can occur before being removed from the index. The default value is Integer.MAX_VALUE. Again, a better approach is to calculate it as a percentage of the total word count. Otherwise, you can use the default value and add the most common English words to the stoplist.
  • -maxnonalphabetchars - Maximum number of non-alphabetic characters a term can contain in order to be indexed. The default value is Integer.MAX_VALUE. Recommended values depend on the dataset and the type of terms it contains, but setting it to 0 works well for most basic cases, as it takes care of punctuation (if the data has not been preprocessed), malformed terms, and weird codes and abbreviations.
  • -filternumbers - true/false, index numbers or not
  • -mintermlength - Minimum number of characters in a term
  • -indexfileformat - Format used for serializing/deserializing vectors from disk, default lucene. The other option is text, which may be used for debugging to see the actual vectors, but is too slow on real data.
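For instance, the “40 per million” heuristic suggested for -minfrequency translates into a concrete threshold as follows (the corpus size here is hypothetical):

```python
# Hypothetical total word count of the corpus.
corpus_word_count = 50_000_000

# 40 occurrences per million words as the minimum frequency threshold.
min_frequency = corpus_word_count * 40 // 1_000_000

print(min_frequency)  # a term must occur at least 2000 times to be indexed
```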

Disabled parameters

  • -luceneindexpath - Currently, you are not allowed to build your own Lucene index and create vectors from it, since the index and the vectors are created in one step.
  • -stoplistfile - replaced by <http://www.ontotext.com/graphdb/similarity/stopList> predicate. Stop words are passed as a string literal, not a file.
  • -elementalmethod
  • -docindexing

Stop words and Lucene Analyzer

In Stop words, add a custom list of stop words to be passed to the Semantic Vectors plugin. The default Lucene stop words list will be used if it is empty.

In Analyzer class, set a Lucene analyzer to be used during Semantic Vectors indexing and query-time tokenization. The default is org.apache.lucene.analysis.en.EnglishAnalyzer, but it can be any from the supported list.

Also, the Lucene connector supports custom Analyzer implementations, so you can create your own analyzer and add it to the classpath. The value of the Analyzer class parameter must be the fully qualified name of a class that extends org.apache.lucene.analysis.Analyzer.

Search in the index

Go to the list of indexes and click on allNews index. For search options, select Search type to be either Term or Document. Result type can be either Term or Document.

_images/SearchInAllNews.png

Search parameters

  • -searchtype - Different types of searches that can be performed. Most involve processing combinations of vectors in different ways, in building a query expression, scoring candidates against these query expressions, or both. The default is sum, which builds a query by adding together the (weighted) vectors for each of the query terms and searches using cosine similarity. See SearchType in pitt.search.semanticvectors.Search.
  • -matchcase - If true, matching of query terms is case-sensitive; otherwise case-insensitive, default false.
  • -numsearchresults - number of search results

Additional info at: https://github.com/semanticvectors/semanticvectors/wiki/SearchOptions

Locality-Sensitive Hashing

Note

Locality-Sensitive Hashing does NOT guarantee that the most similar results will be returned! If precision is essential this hashing is not an option! Hashing with the same configuration over the same data does NOT guarantee the same search results!

The Locality-Sensitive Hashing option is introduced in order to reduce search times. Without a hashing algorithm, a search is performed in the following steps:

  • A search vector is generated
  • All vectors in store are compared to this search vector and the most similar ones are returned as matches

This approach is complete and accurate but time-consuming. To speed up the process, hashing can be used to reduce the number of candidates for most-similar vectors, and this is where Locality-Sensitive Hashing comes in.

The Locality-Sensitive Hashing algorithm has two parameters which can be passed either during index creation or as search option:

  • -lsh_hashes_num - n, the number of random vectors used for hashing. Default value: 0.
  • -lsh_max_bits_diff - m, the number of bits by which two hashes can differ and still be considered similar. Default value: 0.

The hashing workflow is as follows:

  • n random orthogonal vectors are generated
  • Each vector in the store is compared to each of those vectors (whether their scalar product is positive or not)
  • Based on this data, a hash is generated for each vector in the store

During a search the workflow is:

  • A search vector is generated
  • A hash is generated for this search vector, by comparing it to the n random vectors used during the initial hashing
  • All hashes similar to that of the search vector are found (a hash is considered similar if it differs from the original by up to m bits)
  • All vectors with such hashes are gathered and compared to the search vector to find the closest ones, relying on the assumption that vectors with similar hashes are close to each other
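The two workflows above can be sketched as follows (a simplified Python illustration of the idea — using random rather than orthogonal hashing vectors — not the plugin's actual code):

```python
import random

def make_hyperplanes(n, dimension, seed=0):
    # n random vectors playing the role of the -lsh_hashes_num hashing vectors.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dimension)] for _ in range(n)]

def lsh_hash(vec, hyperplanes):
    # One bit per hashing vector: is the scalar product positive or not?
    bits = 0
    for plane in hyperplanes:
        dot = sum(a * b for a, b in zip(vec, plane))
        bits = (bits << 1) | (1 if dot > 0 else 0)
    return bits

def candidates(search_vec, store, hyperplanes, max_bits_diff):
    # Keep only vectors whose hash differs from the search vector's hash
    # by at most -lsh_max_bits_diff bits (Hamming distance).
    query_hash = lsh_hash(search_vec, hyperplanes)
    return [v for v in store
            if bin(lsh_hash(v, hyperplanes) ^ query_hash).count("1") <= max_bits_diff]

rng = random.Random(1)
planes = make_hyperplanes(n=3, dimension=8)
store = [[rng.random() for _ in range(8)] for _ in range(100)]
# The search vector always matches itself (0 bits difference), so it survives filtering.
subset = candidates(store[0], store, planes, max_bits_diff=1)
assert store[0] in subset
```

Only the vectors in `subset` would then be scored exactly, which is where the speed-up (and the risk of missing a true neighbor) comes from.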

Note

If both parameters have the same value, then all possible hashes are considered similar and therefore no filtering is done. For optimization purposes, the whole hashing logic is bypassed in this scenario.

If one of the parameters is specified during index creation, its value is used as the default for searching.

Depending on its configuration the hash can perform in completely different ways.

A higher -lsh_hashes_num leads to more hash buckets with fewer elements in each; a lower number of hashes means fewer but bigger buckets. n hashes lead to 2^n potential buckets.

A higher -lsh_max_bits_diff leads to more buckets being checked, and vice versa. To be precise, a -lsh_max_bits_diff of m with n hashes leads to C(n, m) + C(n, m-1) + ... + C(n, 0) checked buckets.
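In other words, the number of checked buckets is a sum of binomial coefficients, which can be computed directly:

```python
from math import comb

def checked_buckets(n_hashes, max_bits_diff):
    # C(n, m) + C(n, m-1) + ... + C(n, 0) buckets are checked.
    return sum(comb(n_hashes, k) for k in range(max_bits_diff + 1))

# With n = 6 hashes and m = 2 max bits difference:
print(checked_buckets(6, 2), "of", 2 ** 6)  # 22 of 64 buckets checked
```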

By modifying these parameters, one can essentially control the number of checked vectors. A lower number of checked vectors leads to higher performance but also increases the chance of missing a similar vector.

For different vector store sizes, different settings perform well. A reasonable initial configuration is (3, 1). To slightly increase precision, change the configuration to (3, 2), but this will greatly increase the number of checked vectors and reduce performance.

For finer calibration, a higher number of hashes is needed. For instance, (6, 2) is also a possible configuration.

To increase performance, change the configuration to (6, 1) or (8, 2), but this will reduce precision.

To increase precision at the cost of performance, the configuration (6, 3) can be used.

Note

If -lsh_max_bits_diff is too close to -lsh_hashes_num, performance can be worse than the default because of the computational overhead.

Search similar news within days

The search can be extended by using the power of SPARQL to find the same news covered by different sources. This is done by filtering the results down to news from a given period.

Click on View SPARQL Query, copy the query, and paste it in the SPARQL editor. Now you can integrate statistical similarity with RDF to obtain the following query:

# Search similar news (SPARQL) within days

PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?entity ?score ?matchDate ?searchDate  {
    BIND (<http://www.uawire.org/merkel-and-putin-discuss-syria-and-nord-stream-2> as ?searchDocumentID)
    ?search a inst:allNews ;
        :searchDocumentID ?searchDocumentID ;
        :searchParameters "" ;
        :documentResult ?result .
    ?result :value ?entity ;
        :score ?score .
        ?entity pubo:creationDate ?matchDate  .
        ?searchDocumentID pubo:creationDate ?searchDate .
    FILTER (?matchDate > ?searchDate - "P2D"^^xsd:duration && ?matchDate < ?searchDate + "P2D"^^xsd:duration)
}

Search for similar news, get their creationDate, and keep only the news within 2 days.

_images/filter2days.png

Boosting term’s weight

It is possible to boost the weight of a given term in the Text-based similarity index for term-based searches (Term to Term or Term to Document). Boosting a term’s weight is done using the caret symbol ^ followed by a boosting factor - a positive decimal number: term^factor.

For example, UK Brexit^3 EU will perform a search in which the term “Brexit” has 3 times more weight than “UK” and “EU”, and the results are expected to be mainly related to “Brexit”.

The default boosting factor is 1. Setting a boosting factor of 0 will completely ignore the given term. The caret symbol ^ is escaped with a double backslash: \\^.
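A sketch of how such a query string might be parsed into (term, weight) pairs (illustrative only; the plugin's own parsing may differ in edge cases):

```python
import re

def parse_boosts(query):
    # "term^factor" -> (term, factor); a bare term defaults to weight 1.0.
    # A caret preceded by a backslash ("\^") is treated as a literal caret.
    pairs = []
    for token in query.split():
        match = re.fullmatch(r"(.+?)\^(\d+(?:\.\d+)?)", token)
        if match and not match.group(1).endswith("\\"):
            pairs.append((match.group(1), float(match.group(2))))
        else:
            pairs.append((token.replace("\\^", "^"), 1.0))
    return pairs

print(parse_boosts("UK Brexit^3 EU"))
# [('UK', 1.0), ('Brexit', 3.0), ('EU', 1.0)]
```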

Note

The boosting will not work in document-based searches (Document to Term or Document to Document), meaning that a caret followed by a number will not be treated as a boosting weight symbol.

Predication-based Semantic Indexing

Predication-based Semantic Indexing, or PSI, is an application of distributional semantic techniques to reasoning and inference. PSI starts with a collection of known facts or observations, and combines them into a single semantic vector model in which concepts and relationships are both represented. Then the usual ways for constructing query vectors and searching for results in SemanticVectors can be used to suggest similar concepts based on the knowledge graph.

Load example data

The predication-based semantic search examples are based on Persons data from the DBpedia dataset. The sample dataset contains over 730,000 triples for more than 101,000 persons born between 1960 and 1970.

Import the provided persons-1960-1970.ttl file.

Create an Autocomplete index by switching ON autocomplete from the Setup -> Autocomplete page.

For ease of use, you may add the namespaces used by the example dataset (from the Setup -> Namespaces page).

Create predication-based index

Create a new predication-based similarity index from Explore -> Similarity -> Create similarity index. Select the “Create predication index” tab.

_images/similarity-PSI-create-index-e1.png

Fill in the index name.

Add the desired Semantic Vectors create index parameters. For example, it is a good idea to use term weighting when building indexes, so we will add -termweight idf. Also, for better results, set -dimension to more than the default of 200.

Set the Data query. This SPARQL SELECT query determines the data that will be indexed. The query must SELECT the following bindings:

  • ?subject
  • ?predicate
  • ?object

The Data query is executed during index creation to obtain the actual data for the index. When the data in your repository changes, you should rebuild the index. It is a subquery of a more complicated query, which you can see via the ‘View Index Query’ link.

For the given example, leave the default Data query; this will create an index over all triples in the repository:

SELECT ?subject ?predicate ?object
  WHERE {
      ?subject ?predicate ?object .
  }

Set the Search query. This SELECT query determines the data that will be fetched on search and is executed at search time. Add more bindings by modifying this query to see more data in the results table.

For the first example, set the Search query to:

PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?entity ?name ?description ?birthDate ?birthPlace ?gender ?score {
    ?search a ?index ;
        ?searchType ?query;
        psi:searchPredicate ?psiPredicate;
        :searchParameters ?parameters;
        ?resultType ?result .
    ?result :value ?entity ;
        :score ?score .
    ?entity foaf:name ?name .
    OPTIONAL { ?entity <http://purl.org/dc/terms/description> ?description . }
    OPTIONAL { ?entity dbo:birthPlace ?birthPlace . }
    OPTIONAL { ?entity dbo:birthDate ?birthDate . }
    OPTIONAL { ?entity foaf:gender ?gender . }
}

Click the Create button to start index creation.

Search predication-based index

From the Existing indexes, select the index you want to search in.

In our example, we are looking for people similar to Hristo Stoichkov - the most famous Bulgarian football player.

_images/similarity-PSI-search-index.png

In the results, you may see Bulgarian football players born in the same town, other Bulgarian sportsmen born in the same place, or other persons with the same birth date.

_images/similarity-PSI-example1-result.png

Analogical searches

As well as searching for explicit relations and similarities, PSI can be used for analogical search. For example, suppose you have a dataset of currencies and countries and want to know: “If I use dollars in the USA, what do I use in Mexico?” Using the predication index, you do not need to know the predicate that means “has currency”.

Sample dataset: nations.ttl <files/nations.ttl>. Given this dataset, build an Autocomplete index and a predication index following the steps in the documentation above.

Then, once your predication index is built, you can use the Analogical search option of your index. In logical terms, your query will translate to “If USA implies dollars, what does Mexico imply?”

_images/analogical.png

As you can see, the first result is the peso, the Mexican currency. The rest of the results are not relevant here, since they come from a very small dataset.

Why is this important?

PSI supplements traditional tools for artificial inference by giving “nearby” results. In cases where there is a single clear winner, this recovers the behavior of giving “one right answer”. But in cases where there are several possible plausible answers, it can be a great benefit to have robust approximate answers.

Hybrid indexing

When a Predication index is built, a random vector is created for each entity in the database, and these random vectors are used to generate the similarity vectors used later for similarity searches. This approach does not take into consideration the similarity between the literals themselves. Let’s examine the following example, using the FactForge data from the previous parts of this page:

<express:donald-tusk-eu-poland-leave-european-union-polexit> <pubo:formattedDate> 1/11/2018
<telegraph:donald-tusk-warnspoland-could-hold-brexit-style-eu-referendum> <pubo:formattedDate> 1/11/2018
<express:POLAND-s-bid-for-World-War-2-reparations-is-bolstered-by-a-poll-which-found-that-a-majorit> <pubo:formattedDate> 1/6/2018

Naturally, we would expect the first news article to be more similar to the second one than to the third, not only based on their topics - Poland’s relationship with the EU - but also due to their dates. However, the normal Predication index would not take the similarity of the dates into account, and all news would have fairly close scores. To handle this type of scenario, we can first create a Text index which will find that the dates of the three articles are similar, and then use this information when building the Predication index.

In order to do so you should:

Edit the FactForge data

Dates, as presented in FactForge, are not literals which the similarity plugin can handle easily, so we should format them into something easier to parse.

PREFIX pub: <http://ontology.ontotext.com/taxonomy/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
INSERT {
    ?x pubo:formattedDate ?displayDate
}
WHERE {
    ?x pubo:creationDate ?date .
    BIND (CONCAT(STR(MONTH(?date)), "/",
                 STR(DAY(?date)), "/",
                 STR(YEAR(?date))) AS ?displayDate)
}

Replacing the dateTime with a simple string enables us to create a Literal index.

At this stage, if you don’t have Autocomplete enabled, you should enable it to make testing easier.

Go to Setup, enable Autocomplete index and create an index for allNews, using Build Now button.

Create a Literal index

The Literal index is a subtype of the Text index. To build one, create a normal Text index with the “Literal index” checkbox (from the “more options” menu) checked. These types of indexes can only be used as input for Predication indexes, which is indicated on the Similarity page; they cannot be used for similarity searching. The index will include all literals returned by the ?documentText variable from the Data query.

_images/literalIndex.png

We want to filter out the mentions so that the data in our Literal index contains only the news. When creating the index, use the following Data query:

SELECT ?documentID ?documentText {
    ?documentID ?p ?documentText .
    FILTER (isLiteral(?documentText))
    FILTER (?p != <http://factforge.net/ff2016-mapping/mentionsEntity>)
}

Use the Literal index

When creating the Predication index, from the “more options” menu select Input Literal Index -> the index created in the previous step.

_images/predicateIndex.png

Since we don’t want to look at mentions, and the default date format is useless to us, we should filter them out from the data used in the Predication index. Add the following Data query:

SELECT ?subject ?predicate ?object
WHERE {
    ?subject ?predicate ?object .
    FILTER (?predicate != <http://factforge.net/ff2016-mapping/mentionsEntity>)
    FILTER (?predicate != <http://ontology.ontotext.com/publishing#creationDate>)
}

For the sake of the test, we also want to display the newly formatted date when retrieving data. Go to the Search query tab and add the following query:

PREFIX :<http://www.ontotext.com/graphdb/similarity/>
PREFIX inst:<http://www.ontotext.com/graphdb/similarity/instance/>
PREFIX psi:<http://www.ontotext.com/graphdb/similarity/psi/>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>

SELECT ?entity ?score ?content ?date {
    ?search a ?index ;
        ?searchType ?query;
        psi:searchPredicate ?psiPredicate;
        :searchParameters ?parameters;
        ?resultType ?result .
    ?result :value ?entity ;
            :score ?score .
    ?entity pubo:content ?content .
    ?entity pubo:formattedDate ?date .
}

With those two queries in place, the data returned from the index should be more useful. Create your hybrid Predication index and wait for the process to complete. Then open it and run a query for “donald tusk”, selecting the Express article about “Polexit” from the Autocomplete suggestion box. You will see that the first results are related to Polexit and dated the same.

Indexing behaviour

When building the Literal index, it is a good idea to index all literals which will be indexed in the Predication index, or at least all literals of the same type. Let’s continue the example above and imagine that the Literal index we created includes only the three news articles. After that, we add the following triple about an imaginary Guardian article and create a Predication index over all news:

<guardian:poland-grain-exports> <pubo:formattedDate> 12/08/2017

Based on the triples, we would expect the first article to be equally similar to the third article and the new one - their contents and dates have little in common. However, depending on the binding method used when creating the Predication index, we can get a higher score for the third article compared to the new one, only because the third article was indexed by the Literal index. This can easily be avoided if either all literals, or at least all dates, are indexed.

Manual creation

If you are not using the Similarity page, you can pass the following options when creating the indexes:

  • -literal_index true - passed to a Text index, creates a Literal index
  • -input_index <literalIndex> (replace <literalIndex> with the name of an existing Literal index) - passed to a Predication index, creates a hybrid index based on that Literal index

Training cycles

When building Text and Predication indexes, training cycles can be used to increase the accuracy of the index. The number of training cycles is set by passing the option:

  • -trainingcycles <numOfCycles> - the default number of training cycles is 0

Text and Predication indexes have quite different implementations of the training cycles.

Text indexes simply repeat the same algorithm multiple times, which leads to the algorithm’s convergence.

Predication indexes initially start training with a random vector for each entity in the database. On each cycle, the initially random elemental vectors are replaced with the product of the previous cycle, and the algorithm is run again. In addition to the entity vectors, the predicate vectors are trained as well, which leads to higher computational time per cycle compared to the initial run (with trainingcycles = 0).
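Schematically, the Predication index training loop described above could look like this (the helper callables are hypothetical placeholders for the actual vector generation and training steps):

```python
def train_predication_index(entities, triples, cycles, new_random_vector, run_training):
    # Cycle 0 (trainingcycles = 0): a single training run over random elemental vectors.
    elemental = {entity: new_random_vector() for entity in entities}
    trained = run_training(elemental, triples)
    # Each additional cycle feeds the previous result back in as the elemental vectors
    # (Reflective Random Indexing), so the total cost grows with the cycle count.
    for _ in range(cycles):
        elemental = trained
        trained = run_training(elemental, triples)
    return trained
```

With `cycles = 0` this performs exactly one training run, matching the default behavior; each extra cycle adds one full pass.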

Note

Each training cycle is computationally expensive and time-consuming, and a higher number of cycles will greatly increase the building time.