Text mining plugin

What the plugin does

The GraphDB text mining plugin allows you to consume the output of text mining APIs as SPARQL binding variables. Depending on the annotations returned by the concrete API, the plugin enables multiple use cases like:

  • Generate semantic annotations by linking fragments from texts to knowledge graph entities (entity linking)

  • Transform and filter the text annotations to a concrete RDF data model using SPARQL

  • Enrich the knowledge graph with additional information suggested by the information extraction, or use that information to invalidate existing data

  • Evaluate and control the quality of the text annotations by comparing different versions

  • Implement complex text mining use cases in combination with the Kafka GraphDB connector

The plugin readily supports the protocols of these services:

  • spaCy server

  • GATE Cloud

  • Ontotext’s Tag API

In addition, any text mining service that returns its response as JSON can be used, provided that you supply a JSLT transformation that remodels the service output into a form the plugin understands. See the examples below for querying the Google Cloud Natural Language API and the Refinitiv API using the generic client.

Usage examples

A typical use case is a piece of text (for example, news content) in which we want to recognize mentions of people, organizations, and locations. Ideally, we will link them to entity IRIs that are already known in the knowledge graph, e.g., Wikidata or PermID IRIs, which opens up many possibilities for graph enrichment.

Let’s say we have the following text that mentions Dyson as the company “Dyson Ltd.”, the person “James Dyson”, and also only as “Dyson”.

“Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company.”

Let’s find out what annotations the different services will find in the text.

spaCy server

The spaCy server is a containerized HTTP API that provides industrial-strength natural language processing. The plugin uses its named entity recognition (NER) component.

Currently, the NER pipeline is the only spaCy component supported by the text mining plugin.

Create a spaCy client

  1. Run the spaCy server through its Docker image with the following commands:

    • sudo docker pull neelkamath/spacy-server:2-en_core_web_sm-sense2vec
      
    • docker run --rm -p 8000:8000 neelkamath/spacy-server:2-en_core_web_sm-sense2vec
      
  2. In the Workbench SPARQL editor, execute the following query:

    PREFIX : <http://www.ontotext.com/textmining#>
    PREFIX inst: <http://www.ontotext.com/textmining/instance#>
    INSERT DATA {
        inst:localSpacy :connect :Spacy;
                        :service "http://localhost:8000" .
    }
    

    where http://localhost:8000 is the location of the spaCy server set up using the above Docker image.

Note that the sense2vec similarity feature is enabled by default. If your Docker image does not support it or you want to disable it when creating the client, set it to false in the SPARQL query:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:localSpacy :connect :Spacy;
                    :service "http://localhost:8000";
                    :sense2vec "false" .
}
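The client-creation updates above can also be sent from a script instead of the Workbench. The following is a minimal Python sketch, not part of the plugin. It assumes that GraphDB accepts SPARQL updates through the standard RDF4J REST endpoint (POST to /repositories/<id>/statements), that GraphDB runs at http://localhost:7200, and that the repository is called my-repo. The address and repository name are placeholders, and build_update_request is a hypothetical helper name. The function only builds the request and sends nothing.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The same update as in the Workbench example above.
CREATE_SPACY_CLIENT = """
PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:localSpacy :connect :Spacy;
                    :service "http://localhost:8000" .
}
"""


def build_update_request(update, graphdb_url="http://localhost:7200", repository="my-repo"):
    """Build (but do not send) a SPARQL update request for one repository."""
    # Assumption: updates go to /repositories/<id>/statements (RDF4J REST API).
    endpoint = f"{graphdb_url}/repositories/{repository}/statements"
    body = urlencode({"update": update}).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(endpoint, data=body, headers=headers, method="POST")


request = build_update_request(CREATE_SPACY_CLIENT)
print(request.get_method(), request.full_url)
```

To actually execute the update, pass the returned object to urllib.request.urlopen once the placeholder URL and repository name match your installation.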

Find spaCy entities through GraphDB

The simplest query returns all annotations with their types and offsets. Since spaCy also provides sentence grouping, we can also get the sentence in which each annotation is found.

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?sentence ?annotationType ?annotationStart ?annotationEnd
WHERE {
    ?searchDocument a inst:localSpacy;
                       :text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .
    graph inst:localSpacy {
        ?annotatedDocument :annotations ?annotation .
        ?annotation :annotationText ?annotationText ;
                    :annotationKey ?annotationKey ;
                    :annotationType ?annotationType ;
                    :annotationStart ?annotationStart ;
                    :annotationEnd ?annotationEnd .
        OPTIONAL {
            ?annotation :hasSentence/:sentenceText ?sentence .
        }
    }
}

We see that spaCy succeeds in assigning the correct types to each “Dyson” found in the text.

_images/text-mining-spacy.png

Each of the mentioned services attaches its own metadata to the annotations, which can be obtained through the :features predicate. In spaCy’s case, we can reach the sense2vec similarity using the following query:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?sentence ?annotationType ?annotationStart ?annotationEnd ?feature ?value ?featureItem ?featureValue
WHERE {
    ?searchDocument a inst:localSpacy;
                    :text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .
    graph inst:localSpacy {
        ?annotatedDocument :annotations ?annotation .
        ?annotation :annotationText ?annotationText ;
                    :annotationType ?annotationType ;
                    :annotationStart ?annotationStart ;
                    :annotationEnd ?annotationEnd .
        OPTIONAL {
            ?annotation :hasSentence/:sentenceText ?sentence .
        }
        optional {
            ?annotation :features ?item .
            ?item ?feature ?value .
            optional {
                ?value ?featureItem ?featureValue .
            }
        }
    }
}

The sense2vec similarity feature provides us with the additional knowledge that Dyson is somehow related to “vacuums” and “Miele”.

_images/text-mining-spacy-sense2vec.png

GATE Cloud

GATE Cloud is a text-analytics-as-a-service platform that provides various pipelines. Its ANNIE named entity recognizer, which the plugin uses, identifies basic entity types such as Person, Location, Organization, Money amounts, and Time and Date expressions.

Create a GATE client

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:gateService :connect :Gate;
                     :service "https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer?annotations=:Address&annotations=:Date&annotations=:Location&annotations=:Organization&annotations=:Person&annotations=:Money&annotations=:Percent&annotations=:Sentence" .
}

You can select the annotation types you are interested in by listing them in the annotations query parameters of the service URL.
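The service URL in the client definition repeats the annotations parameter once per type, each prefixed with a colon. The following minimal Python sketch assembles such a URL for a chosen subset of types. The base URL is taken from the example above, and build_gate_url is a hypothetical helper name, not part of GATE Cloud or the plugin.

```python
from urllib.parse import urlencode

GATE_ANNIE_URL = (
    "https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer"
)


def build_gate_url(annotation_types, base_url=GATE_ANNIE_URL):
    """Return the service URL with one annotations=:<Type> parameter per type."""
    params = [("annotations", ":" + name) for name in annotation_types]
    # safe=":" keeps the leading colon readable instead of encoding it as %3A.
    return base_url + "?" + urlencode(params, safe=":")


print(build_gate_url(["Person", "Organization", "Location"]))
```

The resulting string can be pasted as the value of the :service predicate when creating the client.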

Find GATE entities through GraphDB

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
        ?searchDocument a inst:gateService;
                           :text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .

    graph inst:gateService {
        ?annotatedDocument :annotations ?annotation .

        ?annotation :annotationText ?annotationText ;
                    :annotationType ?annotationType ;
                    :annotationStart ?annotationStart ;
                    :annotationEnd ?annotationEnd .
        OPTIONAL { ?annotation :features ?item . ?item ?feature ?value }
    }
}

In GATE, sentences are returned as annotations, so they will appear as annotations in the response.

_images/text-mining-gate.png

Tag

Ontotext’s Tag API provides the ability to semantically enrich content of your choice with annotations by discovering mentions of both known and novel concepts.

Based on data from DBpedia and Wikidata, and processed with smart machine learning algorithms, it recognizes mentions of entities such as Person, Organisation, and Location, various relationships between them, as well as general topics and key phrases mentioned. Visit the NOW demonstrator to explore such entities found in news.

Create a Tag client

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:tagService :connect :Ces;
                    :service "http://tag.ontotext.com/extractor-en/extract" .
}

Find Tag entities through GraphDB

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
    ?searchDocument a inst:tagService;
                       :text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company.''' .
    graph inst:tagService {
        ?annotatedDocument :annotations ?annotation .
        ?annotation :annotationText ?annotationText ;
                    :annotationType ?annotationType ;
                    :annotationStart ?annotationStart ;
                    :annotationEnd ?annotationEnd .
        {
            ?annotation :features ?item .
            ?item ?feature ?value
        }
    }
}

For some annotations, exact matches to one or more IRIs in the knowledge graph are found. They are accessible through the annotation features, along with other annotation metadata.

_images/text-mining-tag.png

Tag also succeeds in assigning the proper type “Person” for “Dyson”.

Here are some details about the features that Tag provides for each annotation:

  • :inst: The id of the concept from the knowledge graph which was assigned to this annotation, or an id of a generated concept in case it is not trusted (see isTrusted below).

    For example, http://ontology.ontotext.com/resource/9cafep – you can find a short description and news that mention this entity in the NOW web application at http://now.ontotext.com/#/concept&uri=http://ontology.ontotext.com/resource/9cafep, using the IRI value as uri parameter.

  • :class: The class of the concept from the knowledge graph which was assigned to this annotation.

  • :isTrusted: Has value true when the entity is mapped to an existing entity in the database.

  • :isGenerated: Has value true when the annotation has been generated by the pipeline itself, i.e., by NER taggers, and there is no suitable concept for it in the knowledge graph. Note that generated does not mean that the annotation is not trusted.

  • :relevanceScore: A float number that represents how relevant the annotation is to the target document.

  • :confidence: A float number that represents the confidence score of the annotation.

Extract Tag entities as web annotation model

The Tag service can serve entities and their features as RDF. The model is based on the Web Annotation Data Model. To use it, pass the following headers when creating the Tag client:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:tagInstJSONLD :connect :Ces;
                 :service "http://tag.ontotext.com/extractor-en/extract";
                 :header "Accept: application/vnd.ontotext.ces+json+ld";
                 :header "Content-type: application/vnd.ontotext.ces+json+ld".
}

The common model applied for all services is no longer used, because you get the Tag response in RDF exactly as formed by the service.

When using the JSON-LD, the following document features are required. Note that they should be passed using the :features predicate on ?annotatedDocument and in this order:

:features (?id ?title ?type ?author ?source ?category ?date).

Here is a sample query:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX resource: <http://ontology.ontotext.com/resource/>
PREFIX content: <http://data.ontotext.com/content/>
PREFIX onto: <http://www.ontotext.com/>
CONSTRUCT { ?subject ?predicate ?object }
WHERE {
        ?searchDocument a inst:tagInstJSONLD;
                           :features (resource:guid-for-the-annotated-document "Dyson Ltd. hires 450 people globally" "Article" "The author"  <https://the_doc_source_uri> content:My_Category "2019-03-01T00:11:15Z");
                           :text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company. ''' ;
    graph inst:tagInstJSONLD {
        ?subject ?predicate ?object
    }
}

_images/text-mining-tag-as-web-annotation.png

Extract annotations from another NER service

If we want to extract annotations using another named entity recognition provider, we can create a client for that service and supply a JSLT transformation. The transformation converts the JSON returned by the target service into a JSON model that the text mining plugin understands. The target JSON should look like this:

{
   "content":"",
   "sentences":[ ],
   "features":{ },
   "annotations":[
      {
         "text":"Google",
         "type":"Company",
         "startOffset":78,
         "endOffset":84,
         "confidence":0.0,
         "features":{  }
      }
   ]
}

where the only required part is:

{
   "annotations":[
      {
         "text":"Google",
         "type":"Company",
         "startOffset":78,
         "endOffset":84
      }
   ]
}
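While developing a transformation, it can help to check its output against the required part of the model before wiring it into a client. The following minimal Python sketch checks only the four required keys named above. The function name find_model_problems is hypothetical, and the check is an illustration rather than the plugin's own validation.

```python
# Keys that every annotation must carry, per the required part shown above.
REQUIRED_ANNOTATION_KEYS = ("text", "type", "startOffset", "endOffset")


def find_model_problems(document):
    """Return a list of problems; an empty list means the required parts exist."""
    annotations = document.get("annotations")
    if not isinstance(annotations, list):
        return ['"annotations" is missing or is not a list']
    problems = []
    for index, annotation in enumerate(annotations):
        for key in REQUIRED_ANNOTATION_KEYS:
            if key not in annotation:
                problems.append(f"annotation {index}: missing {key}")
    return problems


transformed = {
    "annotations": [
        {"text": "Google", "type": "Company", "startOffset": 78, "endOffset": 84},
        {"text": "Google", "type": "Company", "startOffset": 78},
    ]
}
print(find_model_problems(transformed))
```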

Google Cloud Natural Language API

Google Cloud Natural Language’s API associates information, such as salience and mentions, with annotations, where an annotation represents a phrase in the text that is a known entity, such as a person, an organization, or a location. It also requires a token to access the API.

Create a Google Cloud Natural Language API client

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:myGoogleService :connect :Provider;
                      :service "https://language.googleapis.com/v1/documents:annotateText";
                      :header "Authorization: Bearer <your API token>";
                      :transformation '''
                      {"annotations" : flatten([for (.entities)
                                let type = .type
                                let metadata = .metadata
                                let salience = .salience
                                let mentions = [for (.mentions) {
                                    "type" : $type,
                                    "text" : .text.content,
                                    "startOffset" : .text.beginOffset,
                                    "endOffset" : .text.beginOffset + size(.text.content),
                                    "features" : {
                                          "salience" : $salience,
                                          "metadata" : $metadata
                                    }
                                  }]
                                  $mentions
                      ])}
                      '''.
}
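To make the JSLT transformation above easier to follow, here is a rough Python equivalent of what it computes: one annotation per mention of each entity, with the entity's type, salience, and metadata copied onto every mention. This is an illustrative sketch only. The plugin runs the JSLT, not this code, and the sample response is hand-written to match the field names used by the transformation rather than captured from the API.

```python
def google_to_plugin_model(response):
    """Mirror the JSLT above: flatten entity mentions into plugin annotations."""
    annotations = []
    for entity in response.get("entities", []):
        for mention in entity.get("mentions", []):
            content = mention["text"]["content"]
            begin = mention["text"]["beginOffset"]
            annotations.append({
                "type": entity.get("type"),
                "text": content,
                "startOffset": begin,
                # Same arithmetic as the JSLT: start offset plus mention length.
                "endOffset": begin + len(content),
                "features": {
                    "salience": entity.get("salience"),
                    "metadata": entity.get("metadata"),
                },
            })
    return {"annotations": annotations}


# Hand-written sample in the shape the transformation expects.
sample_response = {
    "entities": [{
        "name": "Google",
        "type": "ORGANIZATION",
        "metadata": {},
        "salience": 0.3,
        "mentions": [{"text": {"content": "Google", "beginOffset": 72}}],
    }]
}
print(google_to_plugin_model(sample_response))
```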

Extract entities from the Google Cloud Natural Language API

Once created, you can list annotations using a model similar to the other services. Note that you need to provide the input in the way the service expects it. No transformation is applied to the request content.

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value ?featureItem ?featureValue
WHERE {
    ?searchDocument a inst:myGoogleService;
                       :text '''
        {
          "document": {
            "type": "PLAIN_TEXT",
            "content": "Net income was $9.4 million compared to the prior year of $2.7 million. Google is a big company. Revenue exceeded twelve billion dollars, with a loss of $1b"
          },
          "features": {"extractEntities": true, "extractSyntax": true},
          "encodingType": "UTF8"
        }
        ''' .
    graph inst:myGoogleService {
        ?annotatedDocument :annotations ?annotation .
        ?annotation :annotationText ?annotationText ;
                    :annotationType ?annotationType ;
                    :annotationStart ?annotationStart ;
                    :annotationEnd ?annotationEnd .
        OPTIONAL {
            ?annotation :features ?item .
            ?item ?feature ?value .
            OPTIONAL { ?value ?featureItem ?featureValue . }
        }
    }
}
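Because the plugin sends the :text value to the service unchanged, the request body has to be valid JSON once the SPARQL parser has processed the string literal. The following minimal Python sketch generates the body with json.dumps and escapes it for a SPARQL long string literal. The helper names are hypothetical. The payload fields are the ones used in the query above.

```python
import json


def google_request_body(text):
    """Serialize the request document for the given plain text."""
    payload = {
        "document": {"type": "PLAIN_TEXT", "content": text},
        "features": {"extractEntities": True, "extractSyntax": True},
        "encodingType": "UTF8",
    }
    # ensure_ascii=False keeps non-ASCII characters literal (no \uXXXX escapes).
    return json.dumps(payload, ensure_ascii=False)


def to_sparql_long_literal(value):
    """Wrap a string in '''...''' so that SPARQL unescaping restores it exactly."""
    # Backslashes first, so the JSON escapes (\n, \") survive the SPARQL parser.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'''" + escaped + "'''"


body = google_request_body('Google is a "big" company.\nRevenue grew.')
print(to_sparql_long_literal(body))
```

Escaping backslashes matters because SPARQL interprets sequences such as \n inside string literals. Without it, a JSON-escaped newline would turn into a raw line break inside a JSON string.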

The results will look like this:

_images/text-mining-google-nlp.png

Refinitiv API

Refinitiv’s PermIDs are open, permanent, and universal identifiers where underlying attributes capture the context of the identity they each represent.

Create a Refinitiv API client

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
        inst:refinitiv :connect :Provider;
                          :service "https://api-eit.refinitiv.com/permid/calais";
                          :header "X-AG-Access-Token: <your_access_token>";
                          :header "Content-Type: text/raw";
                          :header "x-calais-selectiveTags: company,person,industry,socialtags,topic";
                          :header "outputformat: application/json";
                          :transformation '''
                          {
                              "content" : string(.doc.info.document),
                              "rawSource" : string(.),
                              "language" : .doc.meta.language,
                              "features" : {for (.) .key : {for (.value) .key : .value }
                                   if (.value._typeGroup and .value._typeGroup != "entities" and .value._typeGroup != "relations"
                                   and .value._typeGroup != "language" and .value._typeGroup != "versions") },
                              "annotations" : flatten([for (.)
                               if (.value._typeGroup == "entities")
                                 let type = .value._type
                                 let text = .value.name
                                 let features = {for (.value) .key : .value
                                 if (.key != "_type" and .key != "name" and .key != "instances" and .key != "offset")}

                                 let instances = [for (.value.instances){
                                     "type" : $type,
                                     "text" : $text,
                                     "startOffset": .offset,
                                     "endOffset" : .offset + size($text),
                                     "features" : $features
                                   }]
                                   $instances
                                 else if (.value._typeGroup == "relations")
                                  let type = .value._type
                                  let features = {for (.value) .key : .value
                                  if (.key != "_type" and .key != "instances")}
                                    let instances = [for (.value.instances){
                                       "type" : $type,
                                       "text" : .exact,
                                       "startOffset": .offset,
                                       "endOffset" : .offset + size(.exact),
                                       "features" : $features
                                    }]
                                    $instances
                                 else
                                     []
                              ])
                          }
                        '''.

}

Extract Refinitiv PermID entities

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?searchDocument ?annotation ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value ?featureItem ?featureValue
WHERE {
       ?searchDocument a inst:refinitiv;
                        :text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company.''' .
  graph inst:refinitiv {
     ?annotatedDocument :annotations ?annotation .
     ?annotation :annotationText ?annotationText ;
                 :annotationType ?annotationType ;
                 :annotationStart ?annotationStart ;
                 :annotationEnd ?annotationEnd .
     OPTIONAL {
        ?annotation :features ?item . ?item ?feature ?value .
        OPTIONAL { ?value ?featureItem ?featureValue . }
     }
  }
}

_images/text-mining-refinitiv.png

The tricky part of integrating an arbitrary NER provider is writing the JSLT transformation. Once you are familiar with the language, you can enrich your text documents using any entity provider of your choice, and extend your knowledge graph using only SPARQL and GraphDB.

Compare annotations between services

The text mining plugin generates meaningful IRIs for the ?annotatedDocument and ?annotation variables. It also provides the :annotationKey predicate, which binds ?annotationKey to an IRI built from the document text and the annotation offsets. Regardless of the service that generated the annotation, the same piece of text therefore gets the same ?annotationKey IRI. This can be used to compare annotations over the same piece of text provided by different services.

The following query compares annotation types obtained from spaCy and Tag for annotations that have the same key and text, meaning that they refer to the same piece of text.

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?spacyDocument ?tagDocument ?spacyAnnotation ?tagAnnotation ?spacyType ?tagType ?annotationKey ?annotationText
WHERE {
    BIND ('''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' as ?text)
    ?searchDocument1 a inst:localSpacy;
                       :text ?text.
    graph inst:localSpacy {
        ?spacyDocument :annotations ?spacyAnnotation .

        ?spacyAnnotation :annotationText ?annotationText ;
                         :annotationKey ?annotationKey ;
                         :annotationType ?spacyType .
    }

    ?searchDocument2 a inst:tagService;
                       :text ?text .
    graph inst:tagService {
        ?tagDocument :annotations ?tagAnnotation .
        ?tagAnnotation :annotationText ?annotationText ;
                       :annotationKey ?annotationKey ;
                       :annotationType ?tagType .
    }
}

This returns:

_images/text-mining-compare-annotations.png

The IRIs generated by the text mining plugin have the following meaning:

  • ?annotatedDocument (?tagDocument or ?spacyDocument in the above query): <http://www.ontotext.com/textmining/document/<md5-content>> where md5-content is the MD5 hash of the document content.

    For example: <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847>.

    Note that document IRIs will be the same for the same pieces of text, regardless of the service.

  • ?annotation: <http://www.ontotext.com/textmining/document/<md5-content>/annotation/<start>/<end>/<service-name>/<index>>

    • <start>/<end>: The start/end offsets of the annotation in the text.

    • <service-name>: The name of the service that provided the annotation.

    • <index>: A unique number of the annotation within the document, meaning that if there are different annotations for the same piece of text, they will have different IRIs.

    For example:

    <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847/annotation/102/111/localSpacy/4>

  • ?annotationKey: <http://www.ontotext.com/textmining/document/<md5-content>/annotation/<start>/<end>>: The annotation key IRI marks only a piece of text in the document and can be used to find annotations over the same piece of text that were provided by different services.

    For example: <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847/annotation/102/111>
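The IRI patterns above can be reproduced outside GraphDB, for example to predict which IRIs two services will share. The following minimal Python sketch follows the documented patterns. It assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded text. The page does not state the encoding or any normalization, so computed IRIs may differ from the ones the plugin generates. The function names are hypothetical.

```python
import hashlib

DOCUMENT_BASE = "http://www.ontotext.com/textmining/document/"


def document_iri(content):
    """IRI of the annotated document, derived from its content."""
    # Assumption: hexadecimal MD5 digest of the UTF-8 bytes of the text.
    return DOCUMENT_BASE + hashlib.md5(content.encode("utf-8")).hexdigest()


def annotation_key_iri(content, start, end):
    """Service-independent key: identifies only the span in the document."""
    return f"{document_iri(content)}/annotation/{start}/{end}"


def annotation_iri(content, start, end, service_name, index):
    """Service-specific IRI: the key plus the service name and annotation index."""
    return f"{annotation_key_iri(content, start, end)}/{service_name}/{index}"


text = "Dyson Ltd. plans to hire 450 people globally."
print(annotation_key_iri(text, 0, 10))
print(annotation_iri(text, 0, 10, "localSpacy", 0))
```

Two annotations from different services refer to the same span exactly when their key IRIs are equal, which is the property the comparison query relies on.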

Enrich documents with mentions of known entities

Using the Tag :exactMatch feature and our own mentions predicate, we can generate the following triples and enrich our dataset with entities from DBpedia.

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX my-kg: <http://my.knowledge.graph.com/textmining#>
CONSTRUCT {
    ?tagDocument my-kg:mentions ?value
}
WHERE {
    BIND ('''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' as ?text)
    ?searchDocument a inst:tagService;
                       :text ?text .
    graph inst:tagService {
        ?tagDocument :annotations ?tagAnnotation .
        ?tagAnnotation :features ?item .
        ?item :exactMatch ?value
    }
}

This returns:

_images/text-mining-entity-linking.png

Of course, the power of RDF allows you to construct any graph you want based on the response from the named entity recognition service.

Error handling

Let’s say you have multiple documents with content that you want to send for annotation, for example documents from your own knowledge graph. For the example to work, insert the following documents in your repository:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX my-kg: <http://my.knowledge.graph.com/textmining#>

INSERT DATA
      {
      GRAPH <http://my.knowledge.graph.com> {
        my-kg:doc1 my-kg:content "SOFIA, March 14 (Reuters) - Bulgaria expects Azeri state energy company SOCAR to start investing in the Balkan country's retail gas distribution network this year, Prime Minister said on Thursday".
        my-kg:doc2 my-kg:content "Bulgaria is looking to secure gas supplies for its planned gas hub at the Black Sea port of Varna and Borissov said he had discussed the possibility of additional Azeri gas shipments for the plan.".
        my-kg:doc3 my-kg:content "In the Sunny Beach resort, this one-bedroom apartment is 150m from the sea. It is in the Yassen complex, which has a communal pool and gardens. On the third floor, the 66sq m (718sq ft) apartment has a livingroom, with kitchen, that opens to a balcony overlooking the pool. There are also a bedroom and bathroom. The property is being sold with furniture. The service charge is €8 a square metre, making it about €528. Burgas Airport is about 12km away. Varna is 40km away.".
      }
}

You can send all of them for annotation with a single query. By default, if the service fails for one document, the whole query fails, and you lose the results for the documents that were annotated successfully. To prevent this, use the :serviceErrors predicate, which defines the maximum number of errors allowed before the query fails; -1 allows an unlimited number of errors. With the following query, you get either an error or the annotations for each document.

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX my-kg: <http://my.knowledge.graph.com/textmining#>

SELECT ?content ?annotationText ?errorFeature
WHERE {
    ?myDocument my-kg:content ?content.

    ?searchDocument a inst:localSpacy;
                       :text ?content;
                       :serviceErrors -1 .
    graph inst:localSpacy {
        OPTIONAL {
            ?annotatedDocument :annotations ?annotation .
            ?annotation :annotationText ?annotationText .
        }
        OPTIONAL {
            ?annotatedDocument :features ?docFeature .
            ?docFeature ?errorPredicate ?errorFeature .
        }
    }
}

The following results are returned if the spaCy service successfully annotates the first document but is then stopped. We can simulate this by stopping the spaCy Docker container during query execution (Ctrl+C in the terminal where the container is running). The error message is returned as a document feature.

_images/text-mining-error-handling.png
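The error budget described for :serviceErrors can be pictured with a small simulation. The Python sketch below is not plugin code. It only imitates the documented semantics: the run fails once the number of errors exceeds the allowed maximum, -1 means unlimited, and each document yields either annotations or an error message. The names annotate_all and flaky_service are hypothetical, and the default of 0 is chosen to mirror the fail-on-first-error behavior described above.

```python
def annotate_all(documents, annotate, max_errors=0):
    """Annotate each document, tolerating up to max_errors failures (-1 = unlimited)."""
    results = []
    errors = 0
    for content in documents:
        try:
            results.append({"content": content, "annotations": annotate(content)})
        except Exception as error:
            errors += 1
            if max_errors != -1 and errors > max_errors:
                raise  # budget exhausted: the whole run fails
            # Within budget: record the error in place of the annotations.
            results.append({"content": content, "error": str(error)})
    return results


def flaky_service(content):
    """Stand-in for a remote service that fails on some inputs."""
    if "fail" in content:
        raise ConnectionError("service unavailable")
    return content.split()[:1]


docs = ["Bulgaria expects", "please fail", "Varna is"]
for row in annotate_all(docs, flaky_service, max_errors=-1):
    print(row)
```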

Manage text mining instances

Use the queries below to explore the instances of text mining clients you have in the repository with their configurations, as well as to remove them.

List all clients

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT * where {
    ?instance a :Service .
}

Get configuration for a client

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
SELECT * WHERE {
   inst:localSpacy ?p ?o .
}

Drop an instance

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
   inst:localSpacy :dropService "".
}

Monitor annotation progress

If you are annotating multiple documents in one transaction, you may want to get feedback on the progress. This is done by setting the log level of the text mining plugin to DEBUG in the conf/logback.xml file of the GraphDB distribution:

<logger name="com.ontotext.graphdb.plugins.textmining" level="DEBUG"/>

You will see a message for each document sent for annotation in the GraphDB main log file in the logs directory.

[DEBUG] 2021-05-19 08:39:40,893 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "Australia's Cardinal Pell sentenced to six years jail for sexually... MELBOURNE (Reuters) - Former ..." with length: 911

[DEBUG] 2021-05-19 08:39:41,851 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "Google engineer calls for accessibility in design Laura D'Aquila, a software engineer at Google and..." with length: 2455

[DEBUG] 2021-05-19 08:39:45,610 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "NBA Fines Russell Westbrook $25,000, Utah Jazz Permanently Bans Fan Following Verbal Altercation Op..." with length: 4932
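For long runs, the DEBUG messages can be summarized with a short script. The following minimal Python sketch assumes that the log lines keep exactly the layout shown above. The pattern is derived from these three sample lines only, and parse_progress is a hypothetical helper name.

```python
import re

# Pattern derived from the sample lines above; other log layouts will not match.
PROGRESS_LINE = re.compile(
    r'^\[DEBUG\] (?P<timestamp>\S+ \S+) \[(?P<context>[^\]]*)\] '
    r'Annotating document content starting with: "(?P<preview>.*)" '
    r'with length: (?P<length>\d+)$'
)


def parse_progress(lines):
    """Return one record per annotation message found in the given log lines."""
    events = []
    for line in lines:
        match = PROGRESS_LINE.match(line.strip())
        if match:
            events.append({
                "timestamp": match.group("timestamp"),
                "context": match.group("context"),
                "preview": match.group("preview"),
                "length": int(match.group("length")),
            })
    return events


sample = [
    '[DEBUG] 2021-05-19 08:39:41,851 [repositories/ff-news | c.o.g.p.t.c.ClientBase] '
    'Annotating document content starting with: "Google engineer calls for '
    "accessibility in design Laura D'Aquila, a software engineer at Google and...\" "
    'with length: 2455',
]
events = parse_progress(sample)
print(len(events), "documents sent,", sum(e["length"] for e in events), "characters")
```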