How to Install GraphDB in AWS

GraphDB can be deployed on Amazon Web Services by following the general installation instructions. You can find information regarding the costs of running a GraphDB instance on the AWS Services website.

This documentation will walk you through the process of setting up the necessary environment for deploying GraphDB on AWS.

Note

Ontotext maintains a Terraform module that automates this entire procedure. Learn more about how to use it at our GitHub repository.

Architecture

The GraphDB architecture diagram below shows the deployment architecture for GraphDB on EC2 instances in the AWS cloud. It illustrates the key components and their interactions to provide a high-level understanding of the system’s architecture and how it should be deployed.

_images/aws-architecture.png

Note

There are no third-party integration points on the default GraphDB deployment.

Prerequisites

There are several prerequisites for running a GraphDB instance on AWS:

  • Access to an AWS account (we recommend deploying with an Identity and Access Management (IAM) user rather than the root user account)

  • An active GraphDB license, required to use the Enterprise functionalities of the database

  • A shell script (user data script) used to initialize the EC2 instances

Note

The GraphDB Terraform module contains a Terraform template you can use when creating your shell script. If you use the Terraform template, you will need to replace the placeholder values of all variables with your actual values.

Technical requirements

The following AWS services are required to complete the GraphDB deployment on AWS:

  • Virtual Private Cloud (VPC): allows for the creation of a private network in AWS.

  • Elastic Compute Cloud (EC2): server instances used for hosting the database application (in a Kubernetes-based deployment, EC2 instances instead serve as Elastic Kubernetes Service managed nodes).

  • Network Load Balancer (NLB): load balances requests across the GraphDB cluster nodes.

  • Elastic Block Store (EBS): EBS volumes are used for storing the data.

  • AWS Identity and Access Management (IAM): provides user and access management for your GraphDB deployment.

  • AWS Systems Manager: various GraphDB configurations are saved in the Parameter Store.

  • Simple Storage Service (S3): S3 buckets are used for storing the backups.

Required skills

Note

Deploying GraphDB on AWS EC2 requires a combination of skills in AWS infrastructure management, database administration, and system troubleshooting. Acquiring these skills may involve hands-on experience, self-study, online resources, and formal training programs provided by AWS or other educational platforms.

The following skills and knowledge are typically required in order to successfully deploy GraphDB on AWS EC2:

AWS Fundamentals

Familiarity with Amazon Web Services (AWS) and understanding of its core concepts, such as EC2 instances, security groups, VPCs and IAM roles. Knowledge of how to navigate the AWS Management Console and interact with AWS services is essential.

EC2 Instance Management

Proficiency in creating and managing EC2 instances. This includes selecting the appropriate instance type, configuring security settings, managing storage (EBS volumes), and understanding EC2 instance lifecycle management.

Networking and Security

Understanding of networking concepts in AWS, including VPC (Virtual Private Cloud) configuration, subnets, routing tables, and security groups. Knowledge of how to set up inbound and outbound traffic rules to allow communication with GraphDB.

Linux Administration

Proficiency in Linux command-line interface (CLI) and basic administration tasks. This includes SSH access to EC2 instances, navigating the file system, managing permissions, installing packages, and configuring system settings.

Database Management

Knowledge of GraphDB and its deployment requirements. Understanding of how to configure GraphDB settings, including database storage, memory allocation, and repository creation.

Database Backup and Recovery

Familiarity with backup and recovery strategies for GraphDB on AWS. Knowledge of AWS services like Amazon S3 for data backups and restoration processes.

Monitoring and Troubleshooting

Proficiency in monitoring the health and performance of GraphDB instances on AWS. Understanding of logging, monitoring and troubleshooting techniques using AWS CloudWatch, EC2 instance logs, and GraphDB diagnostic tools.

High Availability and Scalability

Knowledge of implementing high availability and scalability for GraphDB on AWS. This may involve using features like EC2 Auto Scaling, load balancers, and multi-Availability Zone (AZ) deployments.

Infrastructure as Code (IaC)

Familiarity with Infrastructure as Code principles and tools like AWS CloudFormation or Terraform. This enables automating the provisioning and configuration of GraphDB infrastructure on AWS.

Security Best Practices

Understanding of security best practices for AWS deployments, including data encryption, access controls, identity and access management, and compliance considerations.

Setting up your Virtual Private Cloud (VPC)

  1. Go to the VPC Management Console and click on Create VPC

  2. Select the VPC and more option. This will allow you to configure and create all other networking components such as subnets, gateways, and more

  3. Enter the following VPC configurations:

    1. Name tag auto-generation: Enter a descriptive name by which to recognize your VPC

    2. Number of Availability Zones: 3

    3. NAT Gateways: 1 per AZ (this enables outbound Internet access from the private subnets)

  4. Check both the Enable DNS hostnames and Enable DNS resolution checkboxes

    _images/aws-vpc-03-vpc-set-up.png
  5. Click on Create VPC

Once you’ve completed this process, you will see various status messages as the system creates the subnets, NAT gateways, and other components of the VPC. This may take several minutes; you will know that it is finished when all the status messages have turned green and the View VPC button appears at the bottom.

Setting up your Route 53 private hosted zone

The GraphDB Raft implementation requires static addresses. This is achieved by creating a private hosted zone in Route 53 and registering the instances there.

  1. Go to the Route 53 dashboard and select Hosted Zones from the navigation menu on the left

  2. Click Create hosted zone

  3. Enter a domain name, such as graphdb.cluster

    _images/aws-route53-02-hosted-zone-configuration-top.png
  4. Choose Private hosted zone

  5. Under the VPC settings select your region and the VPC that you created

    _images/aws-route53-03-hosted-zone-configuration-bottom.png
  6. Click Create hosted zone

Hint

You may want to write down the Hosted Zone ID, as you will need it later.

Note

Later, you will also need to create “A” records for the instances.
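
For reference, an “A” record can also be created from the command line, which is essentially what the user data script does when an instance boots. The sketch below is only an example: the hosted zone ID, record name, and private IP address are placeholders that you would replace with your own values.

# Minimal sketch: register an instance in the private hosted zone.
# Z0123456789ABC, graphdb-node-1.graphdb.cluster, and 10.0.1.15 are
# placeholders for your hosted zone ID, record name, and instance private IP.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABC \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "graphdb-node-1.graphdb.cluster",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "10.0.1.15"}]
      }
    }]
  }'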

Creating an S3 bucket

Tip

This step is optional, but recommended.

GraphDB can store backups to S3 and, if needed, restore from them. To create an S3 bucket:

  1. Go to the S3 console and click on Create bucket

  2. Enter a name for the bucket that is globally unique among all S3 buckets

    _images/aws-s3bucket-02-bucket-configuration.png
  3. Select your region, scroll down and click Create bucket

Once you’ve completed this process, you should see a “Successfully created” message. We recommend blocking all public access so that the bucket is not accessible by anyone else.
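
If you prefer the command line, the same bucket can be created and locked down with the AWS CLI. This is only a sketch - the bucket name and region are placeholders:

# Create the backup bucket (the name must be globally unique).
aws s3api create-bucket --bucket graphdb-backup-bucket --region us-east-1
# For regions other than us-east-1, also pass:
#   --create-bucket-configuration LocationConstraint=<region>

# Block all public access to the bucket.
aws s3api put-public-access-block \
  --bucket graphdb-backup-bucket \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true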

Importing a certificate into Amazon Certificate Manager (ACM)

Tip

This step is optional.

While serving GraphDB requests over a secured and encrypted connection is not strictly required, it is highly recommended. This section goes over the process of importing a certificate into Amazon Certificate Manager (ACM), which can be used in the next section when creating the load balancer.

  1. Go to the Certificate Manager console and select Import certificate from the navigation menu on the left, or click on Import at the top

  2. Paste the PEM encoded certificate body

  3. Paste the PEM encoded private key

    Warning

    You need to remove the passphrase from the private key before pasting it

  4. You can optionally paste the certificate chain, then click on Next

  5. You can optionally add tags, then click on Next again

  6. Click on Import

Open the imported certificate and note its Amazon Resource Name (ARN) - you will need it when creating the load balancer listener.

Setting up the Load Balancer

  1. Go to the EC2 Dashboard and select Load Balancers from the navigation menu on the left

  2. In the top-right, make sure you are in the region for which you want to create a Load Balancer

  3. Click on Create load balancer

  4. Click on Network Load Balancer and enter the following configurations:

    1. Load balancer name: Enter a descriptive name by which to recognize your load balancer

    2. Choose the Internet-facing scheme (otherwise GraphDB will not be accessible externally)

    3. VPC: choose the VPC that was previously created

      _images/aws-loadbalancer-04-load-balancer-vpc.png
    4. Mappings: select all three availability zones and the public subnets in them

      _images/aws-loadbalancer-04-load-balancer-mappings.png
    5. Under Security groups, remove the Default security group

  5. Optionally, you can create a TLS listener and remove the TCP listener on port 80:

    1. Remove the TCP listener on port 80 (leaving it will result in unencrypted traffic)

    2. Click on Add listener and choose the following configurations:

      1. Protocol: TLS

      2. Port: 443

    3. Default action: select the GraphDB target group (if you have not created it yet, see step 6 below)

    4. Secure Listener settings:

      1. Leave the security policy set to the recommended one

      2. Select the certificate that you imported in ACM and leave the ALPN policy set to None

  6. Click on Create target group under Listener and routing

  7. Enter the following configurations:

    1. Target type: leave as instances

    2. Target group name: Enter a descriptive name by which to recognize your group

    3. Protocol: TCP

    4. Port: 7200

    5. Health check path: /rest/cluster/node/status

    6. Under Advanced health check settings, override the port to be 7201

    7. Go back to the load balancer creation page and select the target group that was created in the previous step

  8. Return to the Load Balancer page and refresh the list of Target Groups

  9. Select the one you just created, and click on Create load balancer
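
Once an instance is up later on, you can verify the health check endpoint configured above from a machine inside the VPC (for example through an SSM session). This is only a sketch - the hostname below is a placeholder for one of the Route 53 record names; a healthy node returns HTTP 200:

curl -s -o /dev/null -w "%{http_code}\n" \
  http://graphdb-node-1.graphdb.cluster:7201/rest/cluster/node/status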

Setting up instance role and profile permissions

You have to grant GraphDB EC2 instances certain permissions so that they can perform several tasks in AWS. This section describes what permissions they need, what they are used for, and how to create them. To do this, we will create an instance profile and then attach it to the instances.

_images/aws-permissions-01-json-tab.png

To create a policy:

  1. Go to the Identity and Access Management (IAM) dashboard and select Policies from the navigation menu on the left

  2. Click on Create policy and go to the JSON tab

  3. Replace the JSON script with the one for the permission you are creating

  4. Click on Next

  5. Enter a Policy name in the Policy details section of the Review and create screen

  6. Click on Create policy

Warning

You need to create a different policy for each of the JSON scripts listed below.
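
If you prefer to script this, each policy can also be created with the AWS CLI. The sketch below assumes you have saved one of the JSON documents to a local file; the policy and file names are placeholders:

aws iam create-policy \
  --policy-name graphdb-s3-backup \
  --policy-document file://s3-backup-policy.json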

Allow the EC2 instance to read, write, and list objects in S3 (Optional)

If you are planning to store GraphDB backups to S3, the EC2 instance needs to be able to read, write, and list objects in S3. To do this, paste the following JSON in the appropriate field when creating a policy, changing graphdb-backup-bucket to the name of your S3 bucket:

{
"Version":"2012-10-17",
"Statement":[
    {
        "Effect":"Allow",
        "Action":[
            "s3:ListBucket",
            "s3:*Object",
            "s3:GetAccelerateConfiguration",
            "s3:ListMultipartUploadParts",
            "s3:AbortMultipartUpload"
        ],
        "Resource":[
            "arn:aws:s3:::graphdb-backup-bucket",
            "arn:aws:s3:::graphdb-backup-bucket/*"
        ]
    }
]
}

Allow the listing of EC2 instances

The user data script that we configure later on needs permission to list the EC2 instances on which GraphDB runs. To grant the necessary permissions, paste the following JSON in the appropriate field when creating a policy:

{
"Version": "2012-10-17",
"Statement": [
    {
    "Effect": "Allow",
    "Action": [
        "ec2:DescribeInstances"
    ],
    "Resource": "*"
    }
]
}

Note

Some of these policies are needed only if the public Terraform module is used.
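
As an illustration of how the ec2:DescribeInstances permission is used, a user data script along the lines below can discover the ID of the instance it runs on via the instance metadata service (IMDSv2) and then read that instance’s details. This is only a sketch, not the exact script from the Terraform module:

# Obtain an IMDSv2 token and the instance's own ID.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Requires the ec2:DescribeInstances permission granted above.
aws ec2 describe-instances \
  --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].Tags'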

Allow the listing, creating, attaching, and tagging of EBS volumes

When an instance starts, the user data script needs to search for an available EBS volume to attach, or create a new one if none is available. Alternatively, you can create the volumes separately, attach them to the instances, and mount them at the appropriate location. To grant the necessary permissions, paste the following JSON in the appropriate field when creating a policy:

{
"Version": "2012-10-17",
"Statement": [
    {
    "Effect": "Allow",
    "Action": [
        "ec2:CreateVolume",
        "ec2:AttachVolume",
        "ec2:DescribeVolumes"
    ],
    "Resource": "*"
    }
]
}

Because the volumes need to be tagged when they are created, the EC2 instance also requires permissions for tagging the volumes. Repeat the same steps as above, but use the following JSON for the policy:

{
"Version": "2012-10-17",
"Statement": [
    {
    "Effect": "Allow",
    "Action": [
        "ec2:CreateTags"
    ],
    "Resource": [
        "arn:aws:ec2:*:*:volume/*",
        "arn:aws:ec2:*:*:snapshot/*"
    ],
    "Condition": {
        "StringEquals": {
        "ec2:CreateAction": [
            "CreateVolume",
            "CreateSnapshot"
        ]
        }
    }
    }
]
}
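
Put together, the find-or-create logic that these two policies enable looks roughly like the sketch below. The availability zone, tag name, volume size/type, and device name are placeholder values, and INSTANCE_ID is assumed to have been obtained from the metadata service as shown earlier:

AZ=eu-west-1a   # the instance's own availability zone

# Look for an unattached data volume in this AZ.
VOLUME_ID=$(aws ec2 describe-volumes \
  --filters "Name=tag:Name,Values=graphdb-data" \
            "Name=availability-zone,Values=$AZ" \
            "Name=status,Values=available" \
  --query 'Volumes[0].VolumeId' --output text)

# If none is available, create and tag a new one (size and type are examples).
if [ "$VOLUME_ID" = "None" ]; then
  VOLUME_ID=$(aws ec2 create-volume \
    --availability-zone "$AZ" --size 500 --volume-type gp3 \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=graphdb-data}]' \
    --query 'VolumeId' --output text)
fi

# Attach the volume to this instance (device name is an example).
aws ec2 attach-volume \
  --volume-id "$VOLUME_ID" --instance-id "$INSTANCE_ID" --device /dev/sdf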

Allow adding records to the Route 53 private hosted zone

The user data script will also have to be able to create “A” records in a Route 53 private hosted zone. To add the needed permissions, you will first need to find the ID of your hosted zone:

  1. Go to the Route 53 dashboard

  2. Select your zone

  3. Expand Hosted zone details

  4. Copy the Hosted zone ID

Once you have obtained your hosted zone ID, replace <zone_id> with the hosted zone ID in the JSON below, and paste it in the appropriate field when creating a policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "route53:GetHostedZone",
                "route53:ListResourceRecordSets",
                "route53:ListHostedZones",
                "route53:ChangeResourceRecordSets",
                "route53:ListResourceRecordSets",
                "route53:GetHostedZoneCount",
                "route53:ListHostedZonesByName"
            ],
            "Resource": "arn:aws:route53:::hostedzone/<zone_id>"
        }
    ]
}

Creating an IAM role and attaching policies

Once you’ve created your different policies, you can create a role and associate it with the newly created policies:

  1. Go to the IAM dashboard, and select Roles from the navigation menu on the left

  2. Click on Create role

  3. Select AWS Services ‣ EC2 and click “Next”

  4. Search for the policies that you created in the previous sections and select them

    _images/aws-irp-06-permission-policies.png
  5. Additionally, we recommend you add a couple of optional policies:

    • The AmazonSSMFullAccess policy allows you to gain access to the EC2 instances using AWS Systems Manager

    • The CloudWatchAgentServerPolicy policy is required if you are planning on scraping the GraphDB Prometheus endpoints and pushing the metrics to Amazon CloudWatch

  6. Once you’ve selected all of your policies, click on Next

  7. Add a name for the role and click on Create role
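
The console does all of this for you and normally creates a matching instance profile automatically. For completeness, here is a rough CLI equivalent; the role, profile, and policy names and the account ID are placeholders, and the attach-role-policy call is repeated for each policy you created:

# Create the role with a trust policy that lets EC2 assume it.
aws iam create-role \
  --role-name graphdb-ec2-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Attach each of the previously created policies.
aws iam attach-role-policy \
  --role-name graphdb-ec2-role \
  --policy-arn arn:aws:iam::123456789012:policy/graphdb-s3-backup

# Launch templates reference an instance profile, not the role itself.
aws iam create-instance-profile --instance-profile-name graphdb-ec2-profile
aws iam add-role-to-instance-profile \
  --instance-profile-name graphdb-ec2-profile \
  --role-name graphdb-ec2-role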

Launching GraphDB instances

Setting up the Launch template

Before you can launch your GraphDB instance, you will need to create a launch template and add a security group to it.

Note

  • Launch templates describe what configurations will be used for the machines

  • Security groups act like a firewall for the resources in AWS

  1. Go to the EC2 Dashboard and select Launch Templates from the navigation menu on the left.

  2. Click on Create launch template and fill in the following configurations

    1. Launch template name: Enter a descriptive name by which to recognize your launch template

    2. Auto-scaling Guidance: Set to On

    3. Application and OS images: Click on Quick Start and select Ubuntu 22.04

      Note

      An official GraphDB AMI will be made available in the future.

    4. Instance type: select a type such as r6i.2xlarge, which should be more than sufficient for a repository with one billion triples

    5. Storage (volumes) and Resource tags: not required and should be left as is

    6. Under Network settings, select Create security group

      1. Select your VPC

      2. Leave the default outbound rule set as shown unless you want to restrict it

      3. Add an inbound rule for port 7200 for the CIDR blocks that will be allowed to access GraphDB

      4. Add an inbound rule for port range 7200-7201 (7200 for the proxy, 7201 for GraphDB) and add the subnets of the load balancer

      5. Add an inbound rule for port ranges 7200-7201 and 7300-7301 and add the private subnets

      6. Add a Description

      7. Click on Create security group

    7. Return to the Launch template creation page, refresh the list of security groups, and select the newly created security group

    8. Under Advanced details, fill in the following configurations:

      1. IAM instance profile: select the role you created in the previous section

      2. At the very bottom of the form, configure the user data script that will be responsible for installing the necessary tools and GraphDB on the machines (a minimal outline of such a script is shown at the end of this section).

        Tip

        An example script is available on the Ontotext-AD github repo - just be sure to replace all Terraform template variables.

        Note

        If the user data script is already base64-encoded, check the box under the script field.

    9. Click on the orange Create launch template button

    10. On the Next steps screen that appears, click on View launch templates

  3. Select Auto Scaling Groups from the bottom of the navigation menu on the left

  4. Click on Create Auto Scaling group and fill in the following configurations

    1. Auto Scaling group name: Enter a descriptive name by which to recognize your auto scaling group

    2. Launch template: Select the launch template that was created previously and click the orange Next button

    3. VPC: Select the VPC that was created previously

    4. Availability Zones and subnets: Select all three private subnets and click Next

    5. Under Load Balancing, select Attach to an existing load balancer

    6. Select Choose from your load balancer target groups, then select the load balancer target group you created earlier

    7. (Optional) Under Health Checks, you can tune the Health check grace period, as well as turn on the Elastic Load Balancing health checks, which will allow the load balancer to trigger the recreation of an instance

    8. Click on Next

    9. Under Group Size, enter the number of nodes in your cluster (in this case, 3) for Desired, Minimum, and Maximum capacity

    10. Click on Next until you reach the Review page

    11. Click on Create Auto Scaling group at the bottom.

Eventually, the three new instances will be listed on the Instances screen available on the left sidebar.
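
The example script in the Ontotext-AD GitHub repository covers the complete setup. As a reference point only, the outline below sketches the responsibilities such a user data script typically has; every value in it is a placeholder, and steps 2-4 correspond to the CLI sketches shown earlier in this document:

#!/bin/bash
# Minimal outline of a GraphDB user data script - placeholder values only.
set -euo pipefail

DATA_DIR=/var/opt/graphdb   # example mount point for the EBS data volume

# 1. Install a Java runtime and basic tooling (GraphDB 10 requires Java 11+).
apt-get update -y && apt-get install -y openjdk-11-jdk unzip

# 2. Find or create the EBS data volume, attach it, and mount it at $DATA_DIR
#    (see the describe-volumes / create-volume / attach-volume sketch above).

# 3. Register this instance's private IP as an "A" record in the Route 53
#    private hosted zone (see the change-resource-record-sets sketch above).

# 4. Download and unpack the GraphDB distribution, point its data directory
#    at $DATA_DIR, install the license, and run it as a systemd service.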

Creating a cluster

In order to create the cluster, you will need to get the address of the EC2 instances.

  1. Go to the Route 53 dashboard and open your hosted zone

  2. Write down the names of all records of type “A”

Once you have the names of the “A” records, you can create a cluster by following the Creating and Managing a Cluster documentation from one of the instances.

Tip

The recommended way to gain access to the instances is to attach the AmazonSSMFullAccess policy, and then use the AWS CLI to connect to an instance by its ID:

aws ssm start-session --target i-04d62ace38b78d994

After doing this, you should be able to use standard utilities like sudo or su to switch to another user, for example ubuntu.
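
Once you are connected to one of the instances, the cluster is created with a single call to the cluster REST API, as described in the Creating and Managing a Cluster documentation. The sketch below assumes the three record names noted above, the default Raft RPC port 7300, and default admin credentials - verify the exact request format and ports against that documentation:

# In this deployment GraphDB itself listens on 7201 (7200 is the proxy).
curl -X POST http://localhost:7201/rest/cluster/config \
  -u admin:root \
  -H 'Content-Type: application/json' \
  -d '{
    "nodes": [
      "graphdb-node-1.graphdb.cluster:7300",
      "graphdb-node-2.graphdb.cluster:7300",
      "graphdb-node-3.graphdb.cluster:7300"
    ]
  }'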

Opening GraphDB instances

After all instances are running, you can open GraphDB through the load balancer:

  1. Access the Load balancer

  2. Copy the DNS name

  3. Paste it in the address bar of your browser and press Enter

Updating GraphDB configurations and versions

Because GraphDB and its configurations are baked into the AMI, you will need to recreate the EC2 instances when updating your GraphDB configuration or upgrading to a newer minor version. You can do this either by manually stopping each individual instance, or by scaling the cluster out and then scaling it back in. This section describes both methods in detail.

Note

Make sure your user data script mounts the storage back to the instances.

Stopping individual EC2 instances

The faster way to update to a newer minor version and its configuration is to stop each individual EC2 instance. The downside of this method is that it temporarily decreases the cluster’s high availability: if another node fails while an instance is being recreated, the cluster will be unable to process writes. The process is simple:

  1. Update the AMI or the user data script in the launch template

  2. Terminate the instances one by one, starting with the follower nodes and leaving the leader node for last

Note

To avoid compatibility issues, also refer to the Migrating GraphDB Configurations documentation.

Warning

When you terminate an instance, wait for the new one to be started. Then verify that it has successfully rejoined the cluster and that it is in sync before proceeding with the next one.
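
A single iteration of this rolling restart might look like the sketch below; the instance ID and hostname are placeholders, and the status endpoint is the one used for the load balancer health checks:

# Terminate one node; the Auto Scaling group will start a replacement.
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0

# After the replacement boots and rejoins, confirm it reports as healthy
# before moving on to the next node.
curl -s http://graphdb-node-1.graphdb.cluster:7201/rest/cluster/node/status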

Scaling the cluster out and then back in

You can also recreate the EC2 instances by scaling the cluster out and in. The advantage of this approach is that the HA will not be impacted. However, the cluster will need to replicate its state to the new nodes. This can take a significant amount of time, especially with bigger-sized repositories.

  1. Update the AMI or the user data script in the launch template

  2. Double the size of the cluster

    Note

    Change the minimum, maximum and desired size of the auto scaling group

  3. Once the new instances are started, join them to the cluster and wait until they are healthy and in sync with the cluster

    Note

    Make sure to join the nodes with a single API call to avoid replicating the cluster state multiple times

  4. Enable scale-in protection on the new nodes

  5. Remove the old nodes from the cluster

  6. Change the minimum, maximum and desired size of the auto scaling group to their original value

  7. Remove the scale-in protection
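
The Auto Scaling steps above can also be performed with the AWS CLI. This is only a sketch - the group name and instance IDs are placeholders:

# Scale out from 3 to 6 nodes.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name graphdb-asg \
  --min-size 6 --max-size 6 --desired-capacity 6

# After joining the new nodes to the cluster, protect them from scale-in.
aws autoscaling set-instance-protection \
  --auto-scaling-group-name graphdb-asg \
  --instance-ids i-0aaaaaaaaaaaaaaa1 i-0aaaaaaaaaaaaaaa2 i-0aaaaaaaaaaaaaaa3 \
  --protected-from-scale-in

# Remove the old nodes from the cluster, then scale back in.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name graphdb-asg \
  --min-size 3 --max-size 3 --desired-capacity 3

# Finally, remove the scale-in protection from the remaining nodes.
aws autoscaling set-instance-protection \
  --auto-scaling-group-name graphdb-asg \
  --instance-ids i-0aaaaaaaaaaaaaaa1 i-0aaaaaaaaaaaaaaa2 i-0aaaaaaaaaaaaaaa3 \
  --no-protected-from-scale-in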