System monitoring

GraphDB offers several options for system monitoring, described in detail below.

Workbench monitoring

In the respective tabs under Monitor ‣ Resources in the GraphDB Workbench, you can monitor the most important hardware information as well as other application-related metrics:

  • Resource monitoring: system CPU load, file descriptors, heap memory usage, off-heap memory usage, and disk storage.

  • Performance (per repository): queries, global page cache, entity pool, and transactions and connections.

  • Cluster health (in case a cluster exists).

[Image: Resource monitoring in the GraphDB Workbench]

Prometheus monitoring

The GraphDB REST API exposes several monitoring endpoints suitable for scraping by Prometheus. They return data in a Prometheus-compatible format when the request has an Accept header of type text/plain, which is the default for Prometheus scrapers.

GraphDB structures monitoring API

The /rest/monitor/structures endpoint enables you to monitor GraphDB structures – the global page cache and the entity pool. This provides a better understanding of whether the current GraphDB configuration is optimal for your specific use case (e.g., repository size, query complexity, etc.).

The current state of the global page cache and the entity pool is returned via the following metrics:

graphdb_cache_hit
  GraphDB’s global page cache hit count. Along with the global page cache miss count, this metric can be used to diagnose an undersized or oversized global page cache (see the example recording rule after this table).

    • In ideal conditions, the percentage of hits should be over 96%.

    • If it is below 96%, it might be a good idea to increase the global page cache size.

    • If it is over 99%, it might be worth experimenting with a smaller global page cache size.

graphdb_cache_miss
  GraphDB’s global page cache miss count.
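
If these metrics are scraped by Prometheus (see the Prometheus setup section below), the hit percentage can be tracked with a recording rule. The following is a minimal sketch only: the group and rule names are arbitrary, and it assumes the metrics are collected by a scrape job targeting /rest/monitor/structures, such as the graphdb_structures_monitor job shown later. Recording rules live in a separate rule file referenced via the rule_files setting in prometheus.yml.

groups:
  - name: graphdb_structures_rules
    rules:
      # Fraction of page cache accesses served from the cache (1.0 = all hits).
      - record: graphdb:page_cache_hit_ratio
        expr: graphdb_cache_hit / (graphdb_cache_hit + graphdb_cache_miss)

A ratio persistently below 0.96 suggests increasing the global page cache, while a ratio above 0.99 may mean a smaller cache is worth trying.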

Infrastructure statistics monitoring API

The /rest/monitor/infrastructure endpoint enables you to monitor GraphDB’s infrastructure and gain better visibility into hardware resource usage. It returns the most important hardware information and several application-related metrics (a sample alert based on them is shown after the table):

graphdb_open_file_descriptors
  Count of currently open file descriptors. This helps diagnose system slowdowns or slow storage if the number remains high for a long period of time.

graphdb_cpu_load
  Current CPU load for the entire system, in percent.

graphdb_heap_max_mem
  Maximum available heap memory for the GraphDB instance. Returns -1 if the maximum memory size is undefined.

graphdb_heap_init_mem
  Initial amount of heap memory (controlled by -Xms), in bytes.

graphdb_heap_committed_mem
  Currently committed heap memory, in bytes.

graphdb_heap_used_mem
  Currently used heap memory, in bytes. Along with the rest of the memory-related metrics, this can be used to detect memory issues.

graphdb_mem_garbage_collections_count
  Count of full garbage collections since the GraphDB instance was started. This metric is useful for detecting memory usage issues and system “freezes”.

graphdb_nonheap_init_mem
  Initial off-heap memory, in bytes.

graphdb_nonheap_max_mem
  Maximum direct memory. Returns -1 if undefined.

graphdb_nonheap_committed_mem
  Currently committed off-heap memory, in bytes.

graphdb_nonheap_used_mem
  Currently used off-heap memory, in bytes.

graphdb_data_dir_used
  Used storage space on the partition holding the data directory, in bytes. Together with the corresponding free storage metric, this is useful for detecting an imminent out-of-disk-space condition.

graphdb_data_dir_free
  Free storage space on the partition holding the data directory, in bytes.

graphdb_logs_dir_used
  Used storage space on the partition holding the logs directory, in bytes. Together with the corresponding free storage metric, this is useful for detecting an imminent out-of-disk-space condition.

graphdb_logs_dir_free
  Free storage space on the partition holding the logs directory, in bytes.

graphdb_work_dir_used
  Used storage space on the partition holding the work directory, in bytes. Together with the corresponding free storage metric, this is useful for detecting an imminent out-of-disk-space condition.

graphdb_work_dir_free
  Free storage space on the partition holding the work directory, in bytes.

graphdb_threads_count
  Count of currently used threads.
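
These metrics can feed straightforward Prometheus alerting rules. The sketch below fires when free space on the data directory partition drops below roughly 10 GiB; it assumes the endpoint is scraped (e.g., by the graphdb_hw_monitor job shown later), and the alert name, threshold, duration, and labels are placeholders chosen for illustration.

groups:
  - name: graphdb_infrastructure_alerts
    rules:
      - alert: GraphDBDataDirLowDiskSpace
        # graphdb_data_dir_free is reported in bytes; 10 * 1024 * 1024 * 1024 bytes = 10 GiB.
        expr: graphdb_data_dir_free < 10 * 1024 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "The GraphDB data directory partition is running low on free space."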

Cluster statistics monitoring API

Via the /rest/monitor/cluster endpoint, you can monitor GraphDB’s cluster statistics in order to more easily diagnose problems and cluster slowdowns. The endpoint returns several cluster-related metrics, and returns nothing if no cluster has been created. A sample alert based on these metrics is shown after the table.

graphdb_leader_elections_count
  Count of leader elections since cluster creation. A high number of leader elections might indicate an unstable cluster setup with nodes that are not operating properly.

graphdb_failure_recoveries_count
  Total count of failure recoveries in the cluster since cluster creation, including both failed and successful recoveries. A high number of recoveries indicates issues with cluster stability.

graphdb_failed_transactions_count
  Count of failed transactions in the cluster.

graphdb_nodes_in_cluster
  Total number of nodes in the cluster.

graphdb_nodes_in_sync
  Count of nodes that are currently in sync. If a number lower than the total node count is reported, some nodes are either out of sync, disconnected, or syncing.

graphdb_nodes_out_of_sync
  Count of nodes that are out of sync. If such nodes are present for a long period of time, this might indicate a failure in one or more nodes.

graphdb_nodes_disconnected
  Count of nodes that are disconnected. If such nodes are present for a long period of time, this might indicate a failure in one or more nodes.

graphdb_nodes_syncing
  Count of nodes that are currently syncing. If such nodes are present for a long period of time, this might indicate a failure in one or more nodes.
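
A simple way to act on these metrics is an alert that fires whenever not every node is in sync. This is a sketch only; it assumes the cluster endpoint is scraped (e.g., by the graphdb_cluster_monitor job shown later), and the alert name, duration, and labels are illustrative.

groups:
  - name: graphdb_cluster_alerts
    rules:
      - alert: GraphDBClusterNodesNotInSync
        # Fires if at least one node has been out of sync, disconnected, or syncing for 5 minutes.
        expr: graphdb_nodes_in_sync < graphdb_nodes_in_cluster
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Not all GraphDB cluster nodes are in sync."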

Query statistics monitoring API

Via the /rest/monitor/repository/{repositoryID} endpoint, you can monitor GraphDB’s query and transaction statistics in order to better understand slow queries, suboptimal queries, active transactions, and open connections. This information helps identify possible issues more easily.

The endpoint exists for each repository, and a scrape configuration must be created for each repository that you want to monitor. Normally, repositories are not created or deleted frequently, so the Prometheus scrape configurations should not be changed often either.

Important

In order for GraphDB to be able to return these metrics, the repository must be initialized.

The following metrics are exposed:

graphdb_slow_queries_count
  Count of slow queries executed on the repository. The counter is reset when the GraphDB instance is restarted. A high count of slow queries might indicate a setup issue, unoptimized queries, or insufficient hardware (see the example alert after this table).

graphdb_suboptimal_queries_count
  Count of queries that the GraphDB engine was not able to evaluate and that were sent for evaluation to the RDF4J engine. A very high number might indicate that the queries typically run against the repository are not optimal.

graphdb_active_transactions
  Count of currently active transactions.

graphdb_open_connections
  Count of currently open connections. If this number stays high for a long period of time, it might indicate that connections are not being closed after their work is done.

graphdb_entity_pool_reads
  GraphDB’s entity pool read count. Along with the entity pool write count, this metric can be used to diagnose an undersized or oversized entity pool.

graphdb_entity_pool_writes
  GraphDB’s entity pool write count.

graphdb_epool_size
  Current entity pool size, i.e., the entity count in the entity pool.
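
As with the other endpoints, these per-repository metrics can be turned into alerts. The sketch below assumes the repository is scraped by a job such as the graphdb_queries_monitor job from the next section, and that graphdb_slow_queries_count behaves as a monotonically increasing counter (it only resets on restart); the alert name and the threshold of 10 slow queries per 10 minutes are arbitrary.

groups:
  - name: graphdb_query_alerts
    rules:
      - alert: GraphDBSlowQueries
        # More than 10 slow queries recorded over the last 10 minutes.
        expr: increase(graphdb_slow_queries_count[10m]) > 10
        labels:
          severity: warning
        annotations:
          summary: "High rate of slow queries on a GraphDB repository."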

Prometheus setup

To scrape the mentioned endpoints with Prometheus, we need to add scrape configurations. Below is an example configuration for three of the endpoints, placed under the scrape_configs section of prometheus.yml, assuming we have a repository called “wines”.

- job_name: graphdb_queries_monitor
  metrics_path: /rest/monitor/repository/wines
  scrape_interval: 5s
  static_configs:
    - targets: [ 'my-graphdb-hostname:7200' ]
- job_name: graphdb_hw_monitor
  metrics_path: /rest/monitor/infrastructure
  scrape_interval: 5s
  static_configs:
    - targets: [ 'my-graphdb-hostname:7200' ]
- job_name: graphdb_structures_monitor
  metrics_path: /rest/monitor/structures
  scrape_interval: 5s
  static_configs:
    - targets: [ 'my-graphdb-hostname:7200' ]

Cluster monitoring

When configuring Prometheus to monitor a GraphDB cluster, the setup is similar, with a few differences.

In order to get the information for each cluster node, each node’s address must be included in the targets list.

The other difference is that another scraper must be configured to monitor the cluster status. This scraper can be configured in several ways:

  • Scrape only the external proxy (which will always point to the current cluster leader) if it exists in the current cluster configuration.

    The downside of this method is that if, for some reason, there is a connectivity problem between the external proxy and the nodes, it will not report any metrics.

  • Scrape the external proxy and all cluster nodes.

    This method will enable you to receive metrics from all cluster nodes, including the external proxy. This way, you can see the cluster status even if the external proxy has issues connecting to the nodes. The downside is that, most of the time, the cluster metrics will be duplicated for each scraped node.

  • Scrape all cluster nodes (if there is no external proxy).

    If there is no external proxy in the cluster setup, the only option is to monitor all nodes in order to determine the status of the entire cluster. If you choose only one node and it is down for some reason, you would not receive any cluster-related metrics.

The scraper configuration is similar to the previous ones, with the only difference being that the targets array might contain one or more cluster nodes (and/or external proxies). For example, if you have a cluster with two external proxies and five cluster nodes, the scraper might be configured to scrape only the two proxies like so:

- job_name: graphdb_cluster_monitor
  metrics_path: /rest/monitor/cluster
  scrape_interval: 5s
  static_configs:
    - targets: [ 'graphdb-proxy-0:7200', 'graphdb-proxy-1:7200' ]

As mentioned, you can also include some or all of the cluster nodes if you want.
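
For example, a configuration that scrapes both external proxies and all five cluster nodes could look like the following sketch (the node hostnames are placeholders):

- job_name: graphdb_cluster_monitor
  metrics_path: /rest/monitor/cluster
  scrape_interval: 5s
  static_configs:
    - targets: [ 'graphdb-proxy-0:7200', 'graphdb-proxy-1:7200',
                 'graphdb-node-0:7200', 'graphdb-node-1:7200', 'graphdb-node-2:7200',
                 'graphdb-node-3:7200', 'graphdb-node-4:7200' ]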

JMX console monitoring

The database exposes a number of metrics that help tune the memory parameters and performance. They can be found in the JMX console under the com.ontotext.metrics package. The global metrics shared between repositories are under the top-level package, while those specific to a repository are under com.ontotext.metrics.<repository-id>.

[Image: GraphDB metrics in the JMX console]

Page cache metrics

The global page cache provides metrics that help tune the amount of memory allocated to the page cache. It contains the following elements:

cache.flush
  Counter for the pages that are evicted out of the page cache and the amount of time it takes for them to be flushed to disk.

cache.hit
  Number of hits in the cache. This can be viewed as the number of pages that do not need to be read from disk but can be taken from the cache.

cache.load
  Counter for the pages that have to be read from disk. The smaller this number, the better.

cache.miss
  Number of cache misses. The smaller this number, the better. If the number of hits is smaller than the number of misses, it is probably a good idea to increase the page cache memory.

Entity pool metrics

You can monitor the number of reads and writes in the entity pool of each repository with the following parameters:

epool.read
  Counter for the number of reads in the entity pool.

epool.write
  Counter for the number of writes in the entity pool.