Text Mining Plugin

What the plugin does

The GraphDB text mining plugin allows you to consume the output of text mining APIs as SPARQL binding variables. Depending on the annotations returned by the concrete API, the plugin enables multiple use cases like:

  • Generate semantic annotations by linking fragments from texts to knowledge graph entities (entity linking)

  • Transform and filter the text annotations to a concrete RDF data model using SPARQL

  • Enrich the knowledge graph with additional information suggested by the information extraction or invalidate their input

  • Evaluate and control the quality of the text annotations by comparing different versions

  • Implement complex text mining use cases in combination with the Kafka GraphDB connector

The plugin readily supports the protocols of these services:

  • spaCy server

  • GATE Cloud

  • Ontotext's Tag API

In addition, any text mining service that returns its response as JSON can be used, provided you supply a JSLT transformation that remodels the service output into a form the plugin understands. See the examples below for querying the Google Cloud Natural Language API and the Refinitiv API using the generic client.

Usage examples

A typical use case is a piece of text (for example, news content) in which we want to recognize mentions of people, organizations, and locations. Ideally, we link them to entity IRIs that are already known in the knowledge graph, e.g., Wikidata or PermID IRIs, which opens up many possibilities for graph enrichment.

Let’s say we have the following text that mentions Dyson as the company “Dyson Ltd.”, the person “James Dyson”, and also only as “Dyson”.

“Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company.”

Let’s find out what annotations the different services will find in the text.

Note

Please keep in mind that some of the query results provided below may vary as they are dependent on the respective services.

spaCy server

The spaCy server is a containerized HTTP API that provides industrial-strength natural language processing whose named entity recognition (NER) component is used by the plugin.

Currently, the NER pipeline is the only spaCy component supported by the text mining plugin.

Create a spaCy client

  1. Run the spaCy server through its Docker image with the following commands:

    • docker pull neelkamath/spacy-server:2-en_core_web_sm-sense2vec
      
    • docker run --rm -p 8000:8000 neelkamath/spacy-server:2-en_core_web_sm-sense2vec
      
  2. In the Workbench SPARQL editor, execute the following query:

    PREFIX txtm: <http://www.ontotext.com/textmining#>
    PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
    INSERT DATA {
        txtm-inst:localSpacy txtm:connect txtm:Spacy;
                        txtm:service "http://localhost:8000" .
    }
    

    where http://localhost:8000 is the location of the spaCy server set up using the above Docker image.

Note that the sense2vec similarity feature is enabled by default. If your Docker image does not support it or you want to disable it when creating the client, set it to false in the SPARQL query:

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    txtm-inst:localSpacy txtm:connect txtm:Spacy;
                    txtm:service "http://localhost:8000";
                    txtm:sense2vec "false" .
}

Find spaCy entities through GraphDB

The simplest query will return all annotations with their types and offsets. Since spaCy also provides sentence grouping, for each annotation, we can get the text it is found in.

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?sentence ?annotationType ?annotationStart ?annotationEnd
WHERE {
    ?searchDocument a txtm-inst:localSpacy;
                       txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .
    graph txtm-inst:localSpacy {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText ;
                txtm:annotationKey ?annotationKey;
                txtm:annotationType ?annotationType ;
                txtm:annotationStart ?annotationStart ;
                txtm:annotationEnd ?annotationEnd ;
                optional {
            ?annotation txtm:hasSentence/txtm:sentenceText ?sentence.
        }
    }
}

We see that spaCy succeeds in assigning the correct type to each “Dyson” found in the text.

_images/text-mining-spacy.png
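
Outside the Workbench, the same query can be sent to the repository over the standard SPARQL protocol. The following is a minimal sketch, not the only way to do it: the endpoint URL (a local GraphDB on port 7200 with a repository named myrepo) and the function name are assumptions made for illustration, and the request is only prepared, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def sparql_select_request(endpoint, query):
    """Prepare (but do not send) a SPARQL protocol query request."""
    return Request(
        endpoint,
        # The query travels as a form-encoded "query" parameter.
        data=urlencode({"query": query}).encode("utf-8"),
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


# Hypothetical endpoint: adjust host, port, and repository name to your setup.
request = sparql_select_request(
    "http://localhost:7200/repositories/myrepo",
    "SELECT * WHERE { ?s ?p ?o } LIMIT 1",  # placeholder; use the query above
)
```

Passing the prepared request to urllib.request.urlopen would execute it and return the bindings as SPARQL JSON results.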

Each of the mentioned services attaches its own metadata to the annotations, which can be obtained through the txtm:features predicate. In spaCy’s case, we can reach the sense2vec similarity using the following query:

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?sentence ?annotationType ?annotationStart ?annotationEnd ?feature ?value ?featureItem ?featureValue
WHERE {
    ?searchDocument a txtm-inst:localSpacy;
                    txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .
    graph txtm-inst:localSpacy {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText ;
                txtm:annotationType ?annotationType ;
                txtm:annotationStart ?annotationStart ;
                txtm:annotationEnd ?annotationEnd ;
                optional {
            ?annotation txtm:hasSentence/txtm:sentenceText ?sentence.
        }
        optional {
            ?annotation txtm:features ?item .
            ?item ?feature ?value .
            optional {
                ?value ?featureItem ?featureValue .
            }
        }
    }
}

The sense2vec similarity feature provides us with the additional knowledge that Dyson is somehow related to “vacuums” and “Miele”.

_images/text-mining-spacy-sense2vec.png

GATE Cloud

GATE Cloud is a text-analytics-as-a-service platform that provides various pipelines. Its ANNIE named entity recognizer, which is used by the plugin, identifies basic entity types such as Person, Location, Organization, Money amounts, and Time and Date expressions.

Create a GATE client

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    txtm-inst:gateService txtm:connect txtm:Gate;
                     txtm:service "https://cloud-api.gate.ac.uk/process-document/annie-named-entity-recognizer?annotations=:Address&annotations=:Date&annotations=:Location&annotations=:Organization&annotations=:Person&annotations=:Money&annotations=:Percent&annotations=:Sentence" .
}

You can select the annotation types you are interested in through the annotations query parameters of the service URL.
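
As an illustration of how the service URL above is composed, the sketch below builds it from a list of annotation types. The helper name is hypothetical; the base URL and the "annotations=:<Type>" parameter form are copied from the client definition.

```python
from urllib.parse import urlencode

# Base endpoint taken from the client definition above.
GATE_ANNIE_URL = (
    "https://cloud-api.gate.ac.uk/process-document/"
    "annie-named-entity-recognizer"
)


def gate_service_url(annotation_types):
    """Return the ANNIE URL that requests only the given annotation types."""
    # Each type becomes its own "annotations=:<Type>" parameter;
    # safe=":" keeps the leading colon unescaped, as in the original URL.
    params = [("annotations", ":" + name) for name in annotation_types]
    return GATE_ANNIE_URL + "?" + urlencode(params, safe=":")


print(gate_service_url(["Person", "Organization", "Sentence"]))
```

The resulting string can be used as the value of txtm:service when creating the client.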

Find GATE entities through GraphDB

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
        ?searchDocument a txtm-inst:gateService;
                           txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' .

    graph txtm-inst:gateService {
        ?annotatedDocument txtm:annotations ?annotation .

        ?annotation txtm:annotationText ?annotationText ;
            txtm:annotationType ?annotationType ;
            txtm:annotationStart ?annotationStart ;
            txtm:annotationEnd ?annotationEnd ;
        optional { ?annotation txtm:features ?item . ?item ?feature ?value }
    }
}

In GATE, sentences are returned as annotations, so they appear alongside the entity annotations in the results.

_images/text-mining-gate.png

Tag

Ontotext’s Tag API provides the ability to semantically enrich content of your choice with annotations by discovering mentions of both known and novel concepts.

Based on data from DBpedia and Wikidata, and processed with smart machine learning algorithms, it recognizes mentions of entities such as Person, Organisation, and Location, various relationships between them, as well as general topics and key phrases mentioned. Visit the NOW demonstrator to explore such entities found in news.

Create a Tag client

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    txtm-inst:tagService txtm:connect txtm:Ces;
                    txtm:service "http://tag.ontotext.com/extractor-en/extract" .
}

Find Tag entities through GraphDB

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
    ?searchDocument a txtm-inst:tagService;
                       txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company.''' .
    graph txtm-inst:tagService {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText ;
                txtm:annotationType ?annotationType ;
                txtm:annotationStart ?annotationStart ;
                txtm:annotationEnd ?annotationEnd ;
                {
            ?annotation txtm:features ?item .
            ?item ?feature ?value
        }
    }
}

For some annotations, an exact match to one or more IRIs in the knowledge graph is found. It is accessible through the annotation features, along with other annotation metadata.

_images/text-mining-tag.png

Tag also succeeds in assigning the proper type “Person” for “Dyson”.

Here are some details about the features that Tag provides for each annotation:

  • txtm:inst: The id of the concept from the knowledge graph which was assigned to this annotation, or an id of a generated concept in case it is not trusted (see txtm:isTrusted below).

    For example, http://ontology.ontotext.com/resource/9cafep – you can find a short description and news that mention this entity in the NOW web application at http://now.ontotext.com/#/concept&uri=http://ontology.ontotext.com/resource/9cafep, using the IRI value as uri parameter.

  • txtm:class: The class of the concept from the knowledge graph which was assigned to this annotation.

  • txtm:isTrusted: Has value true when the entity is mapped to an existing entity in the database.

  • txtm:isGenerated: Has value true when the annotation has been generated by the pipeline itself, i.e., from NER taggers for which there is no suitable concept in the knowledge graph. Note that generated does not mean that the annotation is not trusted.

  • txtm:relevanceScore: A float number that represents the level of relevancy of the annotation to the target document.

  • txtm:confidence: A float number that represents the confidence score with which the annotation was produced.

Extract Tag entities as web annotation model

The Tag service provides a way to serve entities and their features as RDF. The model is based on the Web annotation data model. The following headers should be passed when creating the Tag client:

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    txtm-inst:tagInstJSONLD txtm:connect txtm:Ces;
                 txtm:service "http://tag.ontotext.com/extractor-en/extract";
                 txtm:header "Accept: application/vnd.ontotext.ces+json+ld";
                 txtm:header "Content-type: application/vnd.ontotext.ces+json+ld".
}

The common model applied to all services is not used in this case, because you get the Tag response as RDF exactly as it is formed by the service.

The following request type (Content-type) and response type (Accept) combinations are supported:

  • Content-type: text/plain - Accept: application/vnd.ontotext.ces+json (this is the default if nothing is specified)

  • Content-type: application/vnd.ontotext.ces+json+ld - Accept: application/vnd.ontotext.ces+json

  • Content-type: application/vnd.ontotext.ces+json+ld - Accept: application/vnd.ontotext.ces+json+ld

Not supported:

  • Content-type: text/plain - Accept: application/vnd.ontotext.ces+json+ld

  • Content-type: application/vnd.ontotext.ces+json

Note

This means that a JSON-LD response type requires a JSON-LD request and nothing else. The default text/plain will not work, so when creating the client, you need to pass the Content-type explicitly.

When the request type is JSON-LD, the response type can be JSON or JSON-LD.
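
The combinations listed above can be captured in a small lookup, which may help when validating a client definition before creating it. This is only a sketch of the rules stated in this section; the constant and function names are made up for the example.

```python
CES_JSON = "application/vnd.ontotext.ces+json"
CES_JSON_LD = "application/vnd.ontotext.ces+json+ld"

# (Content-type, Accept) pairs listed as supported in this section.
SUPPORTED = {
    ("text/plain", CES_JSON),   # the default when no header is specified
    (CES_JSON_LD, CES_JSON),
    (CES_JSON_LD, CES_JSON_LD),
}


def is_supported(content_type="text/plain", accept=CES_JSON):
    """Return True if the request/response type pair is listed as supported."""
    return (content_type, accept) in SUPPORTED


print(is_supported())                           # default pair
print(is_supported("text/plain", CES_JSON_LD))  # JSON-LD response, plain request
```

The first call reports the default pair as supported, while the second reflects the rule that a JSON-LD response requires a JSON-LD request.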

When using JSON-LD, the following document features are required. Note that they must be passed using the txtm:features predicate on the document subject (?searchDocument in the sample query below), in this order:

txtm:features (?id ?title ?type ?author ?source ?category ?date).

Here is a sample query:

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX resource: <http://ontology.ontotext.com/resource/>
PREFIX content: <http://data.ontotext.com/content/>
PREFIX onto: <http://www.ontotext.com/>
CONSTRUCT { ?subject ?predicate ?object }
WHERE {
        ?searchDocument a txtm-inst:tagInstJSONLD;
                           txtm:features (resource:guid-for-the-annotated-document "Dyson Ltd. hires 450 people globally" "Article" "The author"  <https://the_doc_source_uri> content:My_Category "2019-03-01T00:11:15Z");
                           txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company. ''' ;
    graph txtm-inst:tagInstJSONLD {
        ?subject ?predicate ?object
    }
}

_images/text-mining-tag-as-web-annotation.png

You can also use the txtm:rawInput predicate to provide your own raw JSON-LD document. The query above will look as follows, and will return the same results:

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX resource: <http://ontology.ontotext.com/resource/>
PREFIX content: <http://data.ontotext.com/content/>
PREFIX onto: <http://www.ontotext.com/>
CONSTRUCT { ?subject ?predicate ?object }
WHERE {
        ?searchDocument a txtm-inst:tagInstJSONLD;
                           txtm:rawInput '''{
     "@id": "resource:some-new-guid-for-the-annotated-document-resource",
     "@graph": [
       {
         "@id": "resource:some-new-guid-for-the-annotated-document-resource",
         "@type": "AnnotatedDocument",
         "document": {
           "@id": "http://ontology.ontotext.com/resource/guid-for-the-annotated-document",
           "@type": "Article",
           "author": "The author",
           "documentSource": "https://the_doc_source_uri",
           "category": "http://data.ontotext.com/content/My_Category",
           "publishDate": "2019-03-01T00:11:15Z",
           "title": "Dyson Ltd. hires 450 people globally",
           "docContent": "Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore. The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company. "
         }
       }
     ],
     "@context": [
       "http://www.w3.org/ns/anno.jsonld",
       {
         "ann": "http://data.ontotext.com/annotation/",
         "ontoa": "http://ontology.ontotext.com/annotation#",
         "ontocontent": "http://ontology.ontotext.com/content#",
         "onto": "http://ontology.ontotext.com/taxonomy/",
         "content": "http://data.ontotext.com/content/",
         "resource": "http://ontology.ontotext.com/resource/",
         "xsd": "http://www.w3.org/2001/XMLSchema#",
         "nif": "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#",
         "Article": "ontocontent:Article",
         "AnnotatedDocument": "ontocontent:AnnotatedDocument",
         "document": "ontocontent:document",
         "annotations": "ontocontent:annotations",
         "author": {
           "@id": "ontocontent:author",
           "@type": "xsd:string"
         },
         "documentSource": {
           "@id": "ontocontent:source",
           "@type": "@id"
         },
         "category": {
           "@id": "ontocontent:category",
           "@type": "@id"
         },
         "publishDate": {
           "@id": "ontocontent:publishDate",
           "@type": "xsd:dateTime"
         },
         "title": {
           "@id": "ontocontent:title",
           "@type": "xsd:string"
         },
         "docContent": "ontocontent:content",
         "tagType": {
           "@id": "ontoa:tagType",
           "@type": "@id"
         },
         "relevanceScore": {
           "@id": "ontoa:relevanceScore",
           "@type": "xsd:double"
         },
         "confidence": {
           "@id": "nif:confidence",
           "@type": "xsd:double"
         },
         "type": {
           "@id": "ontoa:type",
           "@type": "xsd:string"
         },
         "class": {
           "@id": "ontoa:class",
           "@type": "@id"
         },
         "status": {
           "@id": "ontoa:status",
           "@type": "xsd:string"
         },
         "isTrusted": {
           "@id": "ontoa:isTrusted",
           "@type": "xsd:boolean"
         },
         "isGenerated": {
           "@id": "ontoa:isGenerated",
           "@type": "xsd:boolean"
         },
         "annotationSetName": {
           "@id": "ontoa:annotationSetName",
           "@type": "xsd:string"
         },
         "annotationType": {
           "@id": "ontoa:type",
           "@type": "xsd:string"
         }
       }
     ]
   } ''' ;
    graph txtm-inst:tagInstJSONLD {
        ?subject ?predicate ?object
    }
}

The supported response formats are JSON and JSON-LD.

Extract annotations from another NER service

To register a service in the text mining plugin, the service must provide a REST interface with a POST endpoint. The response Content-Type must be application/json. The headers of the POST request are passed using the predicate http://www.ontotext.com/textmining#header. The request body is passed with the predicate http://www.ontotext.com/textmining#text.

The following cURL request:

curl -X POST --header "HEADER1: VALUE1" --header "HEADER2: VALUE2" -d 'body' 'https://endoint.com?queryParam1=param1'

corresponds to the following configuration:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    inst:myService :connect :Provider;
                      :service "https://endoint.com?queryParam1=param1";
                      :header "HEADER1: VALUE1";
                      :header "HEADER2: VALUE2";
                      :transformation '''
                      ...
                      '''.
}

and to the following query for consuming the annotations:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
    ?searchDocument a inst:myService;
                       :text '''body''' .
    graph inst:myService {
        ?annotatedDocument :annotations ?annotation .
        ?annotation :annotationText ?annotationText ;
                :annotationType ?annotationType ;
                :annotationStart ?annotationStart ;
                :annotationEnd ?annotationEnd ;
                {
            ?annotation :features ?item .
            ?item ?feature ?value
        }
    }
}
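
The correspondence between the cURL call and the client configuration can also be expressed as a small generator. The sketch below is an assumption-laden helper, not part of the plugin: the function name is hypothetical, and the values are inserted verbatim, so they must not contain characters that need escaping in SPARQL.

```python
def client_config(name, service_url, headers, transformation):
    """Build the INSERT DATA statement for a generic (:Provider) client."""
    lines = [
        "PREFIX : <http://www.ontotext.com/textmining#>",
        "PREFIX inst: <http://www.ontotext.com/textmining/instance#>",
        "INSERT DATA {",
        f"    inst:{name} :connect :Provider;",
        f'        :service "{service_url}";',
    ]
    # One :header triple per cURL --header option.
    for key, value in headers.items():
        lines.append(f'        :header "{key}: {value}";')
    lines.append(f"        :transformation '''{transformation}''' .")
    lines.append("}")
    return "\n".join(lines)


config = client_config(
    "myService",
    "https://endoint.com?queryParam1=param1",
    {"HEADER1": "VALUE1", "HEADER2": "VALUE2"},
    "...",  # the JSLT transformation goes here
)
print(config)
```

Each cURL --header option maps to one :header triple, and the URL, including its query parameters, maps to :service unchanged.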

To extract annotations using another named entity recognition provider, create a client for the service and supply a JSLT transformation. The transformation converts the JSON returned by the target service into the JSON model that the text mining plugin understands. The target JSON should look like this:

{
   "content":"",
   "sentences":[ ],
   "features":{ },
   "annotations":[
      {
         "text":"Google",
         "type":"Company",
         "startOffset":78,
         "endOffset":84,
         "confidence":0.0,
         "features":{  }
      }
   ]
}

where the only required part is:

{
   "annotations":[
      {
         "text":"Google",
         "type":"Company",
         "startOffset":78,
         "endOffset":84
      }
   ]
}
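
When developing a transformation, it can help to check its output against this minimal model before wiring it into the plugin. The following sketch checks only the required keys named above. The function name and the offset sanity check are assumptions added for illustration, and each annotation is assumed to be a JSON object.

```python
REQUIRED_KEYS = ("text", "type", "startOffset", "endOffset")


def check_plugin_json(doc):
    """Return a list of problems found in a transformed response."""
    annotations = doc.get("annotations")
    if not isinstance(annotations, list):
        return ["'annotations' must be a list"]
    problems = []
    for i, ann in enumerate(annotations):
        for key in REQUIRED_KEYS:
            if key not in ann:
                problems.append(f"annotation {i}: missing '{key}'")
        # Extra sanity check (an assumption, not a documented plugin rule).
        start, end = ann.get("startOffset"), ann.get("endOffset")
        if isinstance(start, int) and isinstance(end, int) and end < start:
            problems.append(f"annotation {i}: endOffset before startOffset")
    return problems


minimal = {"annotations": [
    {"text": "Google", "type": "Company", "startOffset": 78, "endOffset": 84}
]}
print(check_plugin_json(minimal))
```

An empty list means that the document contains everything listed as required.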

Google Cloud Natural Language API

Google Cloud Natural Language’s API associates information, such as salience and mentions, with annotations, where an annotation represents a phrase in the text that is a known entity, such as a person, an organization, or a location. It also requires a token to access the API.

Create a Google Cloud Natural Language API client

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
    txtm-inst:myGoogleService txtm:connect txtm:Provider;
                      txtm:service "https://language.googleapis.com/v1/documents:annotateText";
                      txtm:header "Authorization: Bearer <your API token>";
                      txtm:transformation '''
                      {"annotations" : flatten([for (.entities)
                                let type = .type
                                let metadata = .metadata
                                let salience = .salience
                                let mentions = [for (.mentions) {
                                    "type" : $type,
                                    "text" : .text.content,
                                    "startOffset" : .text.beginOffset,
                                    "endOffset" : .text.beginOffset + size(.text.content),
                                    "features" : {
                                          "salience" : $salience,
                                          "metadata" : $metadata
                                    }
                                  }]
                                  $mentions
                      ])}
                      '''.
}
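
To make the effect of the JSLT transformation above easier to follow, here is a rough Python equivalent. It is a sketch for illustration only: the plugin itself runs the JSLT, the function name is hypothetical, and the sample response is hand-written in the shape the transformation expects, with made-up values.

```python
def google_to_plugin(response):
    """Flatten each entity's mentions into the plugin's annotation list."""
    annotations = []
    for entity in response.get("entities", []):
        for mention in entity.get("mentions", []):
            content = mention["text"]["content"]
            begin = mention["text"]["beginOffset"]
            annotations.append({
                "type": entity.get("type"),
                "text": content,
                "startOffset": begin,
                # Only the start is read from the response; the end is derived.
                "endOffset": begin + len(content),
                "features": {
                    "salience": entity.get("salience"),
                    "metadata": entity.get("metadata"),
                },
            })
    return {"annotations": annotations}


# Hand-written sample in the shape of an entities response; values are made up.
sample = {
    "entities": [{
        "name": "Google",
        "type": "ORGANIZATION",
        "metadata": {},
        "salience": 0.5,
        "mentions": [{"text": {"content": "Google", "beginOffset": 72}}],
    }]
}
result = google_to_plugin(sample)
print(result["annotations"][0]["endOffset"])
```

As in the JSLT, every mention of an entity becomes its own annotation and inherits the type, salience, and metadata of the entity it belongs to.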

Extract entities from the Google Cloud Natural Language API

Once created, you can list annotations using a model similar to the other services. Note that you need to provide the input in the way the service expects it. No transformation is applied to the request content.

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value ?featureItem ?featureValue
WHERE {
    ?searchDocument a txtm-inst:myGoogleService;
                       txtm:text '''
        {
        "document": {
        "type": "PLAIN_TEXT",
        "content": "Net income was $9.4 million compared to the prior year of $2.7 million. Google is a big company. Revenue exceeded twelve billion dollars, with a loss of $1b"
        },
        "features": {"extractEntities": true, "extractSyntax": true},
        "encodingType": "UTF8"
        }
        ''' .
    graph txtm-inst:myGoogleService {
        ?annotatedDocument txtm:annotations ?annotation .
        ?annotation txtm:annotationText ?annotationText ;
                txtm:annotationType ?annotationType ;
                txtm:annotationStart ?annotationStart ;
                txtm:annotationEnd ?annotationEnd ;
                optional {
            ?annotation txtm:features ?item .
            ?item ?feature ?value .
            optional { ?value ?featureItem ?featureValue . }
        }
    }
}

The results will look like this:

_images/text-mining-google-nlp.png
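
Because the request content is passed to the service untouched, the value of txtm:text has to be valid input for the service and also a valid SPARQL string. The sketch below shows one way to produce both, assuming the request shape used in the query above. The function names are hypothetical, and the escaping covers only backslashes and single quotes inside a '''...''' literal.

```python
import json


def request_body(text):
    """Serialize the annotateText request that is passed via txtm:text."""
    return json.dumps({
        "document": {"type": "PLAIN_TEXT", "content": text},
        "features": {"extractEntities": True, "extractSyntax": True},
        "encodingType": "UTF8",
    })


def sparql_long_literal(value):
    """Wrap a string in '''...''' and escape it for use in a SPARQL query."""
    # Backslashes first, so the escapes added for quotes are not doubled.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return "'''" + escaped + "'''"


body = request_body('He said "Google is big".\nRevenue grew.')
literal = sparql_long_literal(body)
print(literal)
```

json.dumps escapes the double quotes and the line break for JSON, and the second step doubles the resulting backslashes so that they survive SPARQL parsing.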

Refinitiv API

Refinitiv’s PermIDs are open, permanent, and universal identifiers where underlying attributes capture the context of the identity they each represent.

Create a Refinitiv API client

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
        txtm-inst:refinitiv txtm:connect txtm:Provider;
                          txtm:service "https://api-eit.refinitiv.com/permid/calais";
                          txtm:header "X-AG-Access-Token: <your_access_token>";
                          txtm:header "Content-Type: text/raw";
                          txtm:header "x-calais-selectiveTags: company,person,industry,socialtags,topic";
                          txtm:header "outputformat: application/json";
                          txtm:transformation '''
                          {
                              "content" : string(.doc.info.document),
                              "rawSource" : string(.),
                              "language" : .doc.meta.language,
                              "features" : {for (.) .key : {for (.value) .key : .value }
                                   if (.value._typeGroup and .value._typeGroup != "entities" and .value._typeGroup != "relations"
                                   and .value._typeGroup != "language" and .value._typeGroup != "versions") },
                              "annotations" : flatten([for (.)
                               if (.value._typeGroup == "entities")
                                 let type = .value._type
                                 let text = .value.name
                                 let features = {for (.value) .key : .value
                                 if (.key != "_type" and .key != "name" and .key != "instances" and .key != "offset")}

                                 let instances = [for (.value.instances){
                                     "type" : $type,
                                     "text" : $text,
                                     "startOffset": .offset,
                                     "endOffset" : .offset + size($text),
                                     "features" : $features
                                   }]
                                   $instances
                                 else if (.value._typeGroup == "relations")
                                  let type = .value._type
                                  let features = {for (.value) .key : .value
                                  if (.key != "_type" and .key != "instances")}
                                    let instances = [for (.value.instances){
                                       "type" : $type,
                                       "text" : .exact,
                                       "startOffset": .offset,
                                       "endOffset" : .offset + size(.exact),
                                       "features" : $features
                                    }]
                                    $instances
                                 else
                                     []
                              ])
                          }
                        '''.

}

Extract Refinitiv PermID entities

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?searchDocument ?annotation ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value ?featureItem ?featureValue
WHERE {
       ?searchDocument a txtm-inst:refinitiv;
                        txtm:text '''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company.''' .
  graph txtm-inst:refinitiv {
     ?annotatedDocument txtm:annotations ?annotation .
     ?annotation txtm:annotationText ?annotationText ;
        txtm:annotationType ?annotationType ;
        txtm:annotationStart ?annotationStart ;
        txtm:annotationEnd ?annotationEnd ;
     optional {
        ?annotation txtm:features ?item . ?item ?feature ?value .
        optional { ?value ?featureItem ?featureValue . }
     }
  }
}

_images/text-mining-refinitiv.png

The tricky part of integrating an arbitrary NER provider is writing the JSLT transformation. Once you get used to the language, you can enrich your text documents with any entity provider of your choice, and extend your knowledge graph solely with the power of SPARQL and GraphDB.

Escaping special characters

In the following example:

PREFIX : <http://www.ontotext.com/textmining#>
PREFIX inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
SELECT ?annotationText ?annotationType ?annotationStart ?annotationEnd ?feature ?value
WHERE {
    ?searchDocument a inst:razor;
                       :text '''
{"text":"Prosecutors want NFL's Peterson arrested on alleged bond violation | Reuters
    Prosecutors want NFL's Peterson arrested on alleged bond violation
    By Eric Kelsey
    (Reuters) - Suspended Minnesota Vikings star Adrian Peterson faced new legal trouble on Thursday after Texas prosecutors in his child abuse case asked a court to order his arrest on a possible drug-related bond violation.
    Peterson, 29, who has been accused of injuring his 4-year-old son while disciplining him with the thin end of a tree branch, allegedly told a drug-testing administrator on Wednesday he had smoked marijuana before submitting to a urinalysis test, court papers said.
    \\"During this process the defendant admitted ... that he smoked a little weed,\\" according to the motion filed by Montgomery County District Attorney Brett Ligon.
    A court date has not been set on the possible bond violation. Peterson's next scheduled court date is Nov. 4.
    It is unclear when a judge would rule on the motion as prosecutors' request to have the current judge recused must be heard first.
    Peterson's attorney, Rusty Hardin, declined to comment until a judge is settled on in the case.
    The Vikings said in a statement they were aware of the allegation and \\"will await the results of that hearing before having further comment.\\"
    The National Football League did not respond to a request for comment.
    Peterson was arrested and posted $15,000 bond on Sept. 12 on a charge of injury to a child. He was later suspended indefinitely with pay by the Vikings until the matter is resolved.
    He has admitted using a switch, the thin end of a tree branch, to discipline his son, but said he was not trying to injure him.
    Peterson could be sentenced to up to two years in prison and fined $10,000 if convicted.
    The charge against Peterson came as the NFL faced public criticism for its handling of a spate of domestic violence cases among its players. A number of corporate sponsors rebuked America's most popular professional sports league, which has overhauled how it deals with player behavior and punishment.
    (Reporting by Eric Kelsey in Los Angeles; Editing by Peter Cooney )
"}
''' .
    graph inst:razor {
        ?annotatedDocument :annotations ?annotation .
        ?annotation :annotationText ?annotationText ;
                :annotationType ?annotationType ;
                :annotationStart ?annotationStart ;
                :annotationEnd ?annotationEnd .
        OPTIONAL {
            ?annotation :features ?item .
            ?item ?feature ?value
        }
    }
}

Quotation marks are escaped as follows:

The Vikings said in a statement they were aware of the allegation and \\"will await the results of that hearing before having further comment.\\"

Because the text enclosed in ''' marks is a string literal, SPARQL keeps it as is, including new lines and paragraphs. The only special characters that need to be escaped, with a double backslash, are quotation marks: \\". The escaped text then forms the values of the valid JSON that the plugin sends to the service.
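The escaping rule can also be applied programmatically before the text is pasted into a query. The following Python sketch is not part of the plugin: the function names are hypothetical, and it assumes that double quotation marks are the only characters that need escaping and that the text does not contain the ''' delimiter.

```python
def escape_quotes(text: str) -> str:
    """Prefix every double quotation mark with two backslashes."""
    return text.replace('"', '\\\\"')


def to_sparql_literal(text: str) -> str:
    """Wrap the escaped text in ''' marks so that new lines are preserved.

    Assumption: the text contains no ''' sequence, which would end the
    literal early; such input is rejected instead of being escaped.
    """
    if "'''" in text:
        raise ValueError("text must not contain the ''' delimiter")
    return "'''" + escape_quotes(text) + "'''"


if __name__ == "__main__":
    sentence = 'The Vikings said they "will await the results of that hearing."'
    print(to_sparql_literal(sentence))
```

The result can be placed after the text predicate of the query. Input that contains other special characters, such as backslashes, is outside the scope of this sketch.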

Compare annotations between services

The text mining plugin generates meaningful IRIs for the ?annotatedDocument and ?annotation variables. It also provides the txtm:annotationKey predicate, which binds the ?annotationKey variable to an IRI derived from the document text and the annotation offsets. Regardless of which service generated the annotation, the same piece of text therefore gets the same ?annotationKey IRI. This makes it possible to compare annotations that different services provide over the same piece of text.

The following query compares annotation types obtained from spaCy and Tag for annotations that have the same key and text, meaning that they refer to the same piece of text.

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT ?spacyDocument ?tagDocument ?spacyAnnotation ?tagAnnotation ?spacyType ?tagType ?annotationKey ?annotationText
WHERE {
    BIND ('''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' as ?text)
    ?searchDocument1 a txtm-inst:localSpacy;
                       txtm:text ?text.
    graph txtm-inst:localSpacy {
        ?spacyDocument txtm:annotations ?spacyAnnotation .

        ?spacyAnnotation txtm:annotationText ?annotationText ;
                         txtm:annotationKey ?annotationKey ;
                         txtm:annotationType ?spacyType .
    }

    ?searchDocument2 a txtm-inst:tagService;
                       txtm:text ?text .
    graph txtm-inst:tagService {
        ?tagDocument txtm:annotations ?tagAnnotation .
        ?tagAnnotation txtm:annotationText ?annotationText ;
                       txtm:annotationKey ?annotationKey ;
                       txtm:annotationType ?tagType .
    }
}

The query returns:

_images/text-mining-compare-annotations.png
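Because both graph patterns share the ?annotationKey and ?annotationText variables, the query effectively joins the two annotation sets on those values. The Python sketch below mimics that join on plain dictionaries. The function name, the dictionary layout, and the sample type labels are illustrative assumptions, not actual service output.

```python
from typing import Dict, List, Tuple

Annotation = Dict[str, str]  # expected keys: "key", "text", "type"


def compare_types(
    spacy: List[Annotation], tag: List[Annotation]
) -> List[Tuple[str, str, str, str]]:
    """Pair annotations from two services that share the same key and text."""
    # Index the second service's types by the shared (key, text) pair.
    tag_types: Dict[Tuple[str, str], List[str]] = {}
    for ann in tag:
        tag_types.setdefault((ann["key"], ann["text"]), []).append(ann["type"])

    rows = []
    for ann in spacy:
        # Annotations without a counterpart are dropped, as in an inner join.
        for tag_type in tag_types.get((ann["key"], ann["text"]), []):
            rows.append((ann["key"], ann["text"], ann["type"], tag_type))
    return rows


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    spacy_rows = [{"key": "k1", "text": "Dyson Ltd.", "type": "ORG"}]
    tag_rows = [{"key": "k1", "text": "Dyson Ltd.", "type": "Organization"}]
    for row in compare_types(spacy_rows, tag_rows):
        print(row)
```

Annotations found by only one of the services do not appear in the result, which matches the behavior of the query above, since neither graph pattern is optional.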

The IRIs generated by the text mining plugin have the following meaning:

  • ?annotatedDocument (?tagDocument or ?spacyDocument in the above query): <http://www.ontotext.com/textmining/document/<md5-content>>, where md5-content is the MD5 hash of the document content.

    For example: <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847>.

    Note that document IRIs will be the same for the same pieces of text, regardless of the service.

  • ?annotation: <http://www.ontotext.com/textmining/document/<md5-content>/annotation/<start>/<end>/<service-name>/<index>>

    • <start>/<end>: The start/end offsets of the annotation in the text.

    • <service-name>: The name of the service that provided the annotation.

    • <index>: A unique number of the annotation within the document, meaning that different annotations for the same piece of text will have different IRIs.

    For example:

    <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847/annotation/102/111/localSpacy/4>

  • ?annotationKey: <http://www.ontotext.com/textmining/document/<md5-content>/annotation/<start>/<end>>: The annotation key IRI marks only a piece of text in the document and can be used to find annotations over the same piece of text that are provided by different services.

    For example: <http://www.ontotext.com/textmining/document/ffa3feed18dacea1c195492cc1c06847/annotation/102/111>
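The IRI patterns above can be reproduced with a few lines of Python, which may help when matching plugin output against documents held elsewhere. This is a sketch with hypothetical function names. It assumes that the MD5 hash is computed over the UTF-8 bytes of the text and written as lowercase hexadecimal; the plugin may hash the content differently, so document_iri is not guaranteed to reproduce the plugin's IRIs.

```python
import hashlib

BASE = "http://www.ontotext.com/textmining/document/"


def document_iri(text: str) -> str:
    """Build a document IRI from the text (assumed: UTF-8 bytes, hex MD5)."""
    return BASE + hashlib.md5(text.encode("utf-8")).hexdigest()


def annotation_key_iri(doc_iri: str, start: int, end: int) -> str:
    """Append /annotation/<start>/<end> to a document IRI."""
    return f"{doc_iri}/annotation/{start}/{end}"


def annotation_iri(doc_iri: str, start: int, end: int, service: str, index: int) -> str:
    """Append <service-name>/<index> to the annotation key IRI."""
    return f"{annotation_key_iri(doc_iri, start, end)}/{service}/{index}"


def key_from_annotation_iri(iri: str) -> str:
    """Drop the trailing <service-name>/<index> segments of an annotation IRI."""
    return iri.rsplit("/", 2)[0]


if __name__ == "__main__":
    doc = BASE + "ffa3feed18dacea1c195492cc1c06847"
    print(key_from_annotation_iri(annotation_iri(doc, 102, 111, "localSpacy", 4)))
```

The last helper relies only on the IRI layout documented above, so it works on IRIs returned by the plugin regardless of how the hash was computed.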

Enrich documents with mentions of known entities

Using the Tag txtm:exactMatch feature and our own mentions predicate, we can generate the following triples and enrich our dataset with entities from DBpedia.

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
PREFIX my-kg: <http://my.knowledge.graph.com/textmining#>
CONSTRUCT {
    ?tagDocument my-kg:mentions ?value
}
WHERE {
    BIND ('''Dyson Ltd. plans to hire 450 people globally, with more than half the recruits in its headquarters in Singapore.
The company best known for its vacuum cleaners and hand dryers will add 250 engineers in the city-state. This comes short before the founder James Dyson announced he is moving back to the UK after moving residency to Singapore. Dyson, a prominent Brexit supporter who is worth US$29 billion, faced criticism from British lawmakers for relocating his company''' as ?text)
    ?searchDocument a txtm-inst:tagService;
                       txtm:text ?text .
    graph txtm-inst:tagService {
        ?tagDocument txtm:annotations ?tagAnnotation .
        ?tagAnnotation txtm:features ?item .
        ?item txtm:exactMatch ?value
    }
}

The query returns:

_images/text-mining-entity-linking.png

Of course, the power of RDF allows you to construct any graph you want based on the response from the named entity recognition service.

Error handling

Let’s say you have multiple documents with content that you want to send for annotation, for example documents from your own knowledge graph. For the example to work, insert the following documents in your repository:

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX my-kg: <http://my.knowledge.graph.com/textmining#>

INSERT DATA
      {
      GRAPH <http://my.knowledge.graph.com> {
        my-kg:doc1 my-kg:content "SOFIA, March 14 (Reuters) - Bulgaria expects Azeri state energy company SOCAR to start investing in the Balkan country's retail gas distribution network this year, Prime Minister said on Thursday".
        my-kg:doc2 my-kg:content "Bulgaria is looking to secure gas supplies for its planned gas hub at the Black Sea port of Varna and Borissov said he had discussed the possibility of additional Azeri gas shipments for the plan.".
        my-kg:doc3 my-kg:content "In the Sunny Beach resort, this one-bedroom apartment is 150m from the sea. It is in the Yassen complex, which has a communal pool and gardens. On the third floor, the 66sq m (718sq ft) apartment has a livingroom, with kitchen, that opens to a balcony overlooking the pool. There are also a bedroom and bathroom. The property is being sold with furniture. The service charge is €8 a square metre, making it about €528. Burgas Airport is about 12km away. Varna is 40km away.".
      }
}

You can send all of them for annotation with a single query. By default, if the service fails for one document, the whole query fails, and you lose the results for the documents that were annotated successfully. To prevent this, use the txtm:serviceErrors predicate, which defines the maximum number of errors allowed before the query fails; -1 means that an unlimited number of errors is allowed. With the following query, you get either an error or the annotations for each document.

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
PREFIX pubo: <http://ontology.ontotext.com/publishing#>
PREFIX my-kg: <http://my.knowledge.graph.com/textmining#>

SELECT ?content ?annotationText ?errorFeature ?errorMessage
WHERE {
    ?myDocument my-kg:content ?content.

    ?searchDocument a txtm-inst:localSpacy;
                       txtm:text ?content;
                       txtm:serviceErrors -1 .
    graph txtm-inst:localSpacy {
        OPTIONAL {
            ?annotatedDocument txtm:annotations ?annotation .
            ?annotation txtm:annotationText ?annotationText .
        }
        OPTIONAL {
            ?annotatedDocument txtm:features ?docFeature .
            ?docFeature ?errorFeature ?errorMessage .
        }
    }
}

The following results are returned if the spaCy service successfully annotates the first document but is then stopped. You can simulate this by stopping the spaCy Docker container during query execution (Ctrl+C in the terminal where the container is running). The error message is returned as a document feature.

_images/text-mining-error-handling.png
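The error limit described in this section can be pictured as a simple counter. The Python sketch below models that policy outside of GraphDB. The function and parameter names are hypothetical, the default of 0 only models the "fail on the first error" behavior described above, and the exact boundary (failing once the number of errors exceeds the limit) is an assumption about how the plugin counts.

```python
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, Optional[List[str]], Optional[str]]


def annotate_all(
    documents: List[str],
    annotate: Callable[[str], List[str]],
    max_errors: int = 0,
) -> List[Row]:
    """Annotate every document, tolerating up to max_errors failures.

    A max_errors value of -1 means that any number of errors is tolerated.
    Each row holds the document, its annotations or None, and an error or None.
    """
    rows: List[Row] = []
    errors = 0
    for doc in documents:
        try:
            rows.append((doc, annotate(doc), None))
        except Exception as exc:
            errors += 1
            if max_errors != -1 and errors > max_errors:
                raise  # limit exceeded: the whole run fails
            rows.append((doc, None, str(exc)))
    return rows


if __name__ == "__main__":
    def flaky(doc: str) -> List[str]:
        # Stand-in for a real service call; fails on one specific input.
        if doc == "doc2":
            raise RuntimeError("service unavailable")
        return [doc.upper()]

    for row in annotate_all(["doc1", "doc2", "doc3"], flaky, max_errors=-1):
        print(row)
```

With max_errors=-1, every document yields either its annotations or an error message, which mirrors the shape of the query results described above.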

Manage text mining instances

Use the queries below to explore the instances of text mining clients you have in the repository with their configurations, as well as to remove them.

List all clients

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT * WHERE {
    ?instance a txtm:Service .
}

Get configuration for a client

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
SELECT * WHERE {
   txtm-inst:localSpacy ?p ?o .
}

Drop an instance

PREFIX txtm: <http://www.ontotext.com/textmining#>
PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>
INSERT DATA {
   txtm-inst:localSpacy txtm:dropService "".
}
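If you manage several instances, the configuration and drop queries can be generated from the instance name. The Python sketch below only builds the query strings shown in this section and does not send them to GraphDB. The function names are hypothetical, and the name check is a deliberately conservative subset of valid SPARQL local names.

```python
import re

PREFIXES = (
    "PREFIX txtm: <http://www.ontotext.com/textmining#>\n"
    "PREFIX txtm-inst: <http://www.ontotext.com/textmining/instance#>\n"
)

# Conservative subset of SPARQL local names, so that the name cannot
# change the structure of the generated query.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def _checked(name: str) -> str:
    if not _NAME.fullmatch(name):
        raise ValueError(f"unsupported instance name: {name!r}")
    return name


def config_query(name: str) -> str:
    """Build the query that lists the configuration of one instance."""
    return PREFIXES + "SELECT * WHERE {\n   txtm-inst:" + _checked(name) + " ?p ?o .\n}"


def drop_update(name: str) -> str:
    """Build the update that drops one instance."""
    return PREFIXES + "INSERT DATA {\n   txtm-inst:" + _checked(name) + ' txtm:dropService "" .\n}'


if __name__ == "__main__":
    print(drop_update("localSpacy"))
```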

Monitor annotation progress

If you are annotating multiple documents in one transaction, you may want to get feedback on the progress. This is done by setting the log level of the text mining plugin to DEBUG in the conf/logback.xml file of the GraphDB distribution:

<logger name="com.ontotext.graphdb.plugins.textmining" level="DEBUG"/>

In the GraphDB main log file in the logs directory, you will see a message for each document sent for annotation:

[DEBUG] 2021-05-19 08:39:40,893 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "Australia's Cardinal Pell sentenced to six years jail for sexually... MELBOURNE (Reuters) - Former ..." with length: 911

[DEBUG] 2021-05-19 08:39:41,851 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "Google engineer calls for accessibility in design Laura D'Aquila, a software engineer at Google and..." with length: 2455

[DEBUG] 2021-05-19 08:39:45,610 [repositories/ff-news | c.o.g.p.t.c.ClientBase] Annotating document content starting with: "NBA Fines Russell Westbrook $25,000, Utah Jazz Permanently Bans Fan Following Verbal Altercation Op..." with length: 4932
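To turn these messages into a progress report, you can parse them with a short script. The Python sketch below is based solely on the three sample lines above, so the regular expression is an assumption about the log layout and may need adjusting for other logback patterns. The class and function names are hypothetical.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class AnnotationLogEntry(NamedTuple):
    timestamp: datetime
    repository: str
    snippet: str
    length: int


# Layout inferred from the sample DEBUG lines above.
LOG_PATTERN = re.compile(
    r'^\[DEBUG\] (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) '
    r'\[repositories/(?P<repo>\S+) \| [^\]]+\] '
    r'Annotating document content starting with: "(?P<snippet>.*)" '
    r'with length: (?P<length>\d+)$'
)


def parse_line(line: str) -> Optional[AnnotationLogEntry]:
    """Return the parsed entry, or None if the line is not an annotation message."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    return AnnotationLogEntry(
        timestamp=datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f"),
        repository=match["repo"],
        snippet=match["snippet"],
        length=int(match["length"]),
    )


if __name__ == "__main__":
    import sys

    # Count annotated documents in log lines read from standard input.
    entries = [e for e in map(parse_line, sys.stdin) if e is not None]
    print(f"{len(entries)} documents sent for annotation")
```

Piping the log file into the script prints how many documents have been sent for annotation so far. Comparing the timestamps of consecutive entries gives a rough per-document duration.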