Kafka Topic backup to S3 and restore to another Kafka cluster using Strimzi and OpenEBS

Imran Pochi
Published in Kubernauts
Aug 18, 2019

TL;DR

Deploy Kafka with the Strimzi operator (Helm chart) on storage backed by OpenEBS, back up Kafka topics to S3 with Kafka Connect and the Spredfast S3 connector, then restore the topics to a different Kafka cluster.

Introduction

An introduction would usually begin by explaining what Apache Kafka, OpenEBS and the other software components used here are, and by asking “should you be running Kafka on Kubernetes or not?”. I’m going to skip that part, as I believe that if you are reading this, you don’t need that introduction. You already possess your storage chops and know your networking-fu.

This article covers backing up Kafka topics to S3 from a Kafka cluster in the kafka-1 namespace and restoring them to another Kafka cluster in the kafka-2 namespace of the same Kubernetes cluster. You can also restore to a different Kubernetes cluster.

Prerequisites

  • A Kubernetes 1.8+ cluster
  • Helm for installing the Strimzi operator
  • Strimzi operator for installing Kafka
  • OpenEBS or any other cloud native storage (Rook, Portworx etc.) for persistent storage
  • An AWS account for creating the S3 bucket

Installing Strimzi Operator

Operators such as the Confluent operator and the Strimzi operator lessen the burden of deploying and maintaining a Kafka cluster to a certain extent.

We will use Helm to install the Strimzi operator. With just a couple of commands we will have it up and running in its own namespace.

# First add the Strimzi helm chart
$ helm repo add strimzi http://strimzi.io/charts/
# Install
$ helm install strimzi/strimzi-kafka-operator \
--name strimzi-cluster-operator \
--namespace strimzi \
--version 0.11.4

Installing Kafka cluster

We need to tell Strimzi which namespaces it should watch for Kafka clusters to manage.

First, let’s create a kafka-1 namespace and then redeploy the Strimzi operator so that it watches the kafka-1 namespace.

# create kafka-1 namespace
$ kubectl create namespace kafka-1
$ helm upgrade --reuse-values --set watchNamespaces="{kafka-1}" \
strimzi-cluster-operator strimzi/strimzi-kafka-operator

Let’s create the Kafka custom resource. Change the storage values, such as class and size, to suit your cluster. We have used OpenEBS for our purpose.

# Save the file as kafka-1.yaml
apiVersion: kafka.strimzi.io/v1alpha1
kind: Kafka
metadata:
  name: kafka-1
spec:
  kafka:
    version: 2.1.0
    replicas: 3
    listeners:
      plain: {}
      tls: {}
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      log.message.format.version: "2.1"
      delete.topic.enable: "true"
    storage:
      type: persistent-claim
      size: 5Gi
      deleteClaim: true
      class: openebs-jiva-default
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 2Gi
      deleteClaim: true
      class: openebs-jiva-default
  entityOperator:
    topicOperator: {}
    userOperator: {}

Apply the above YAML

$ kubectl apply -f kafka-1.yaml -n kafka-1

Creating Kafka Topic

To confirm that our Kafka cluster is up and running, let’s create a KafkaTopic and produce and consume some messages.

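# Save this file as kafka-topic-1.yaml
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaTopic
metadata:
  # Kubernetes resource names must be lowercase,
  # so the actual Kafka topic name is set through topicName below
  name: animals
  labels:
    strimzi.io/cluster: kafka-1
spec:
  topicName: Animals
  partitions: 3
  replicas: 3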

Apply the YAML to create KafkaTopic

$ kubectl apply -f kafka-topic-1.yaml -n kafka-1
# check if the topic got created successfully
$ kubectl get kafkatopic -n kafka-1

Produce and consume messages

In one terminal, run the command below and add some messages to the KafkaTopic created above.

$ kubectl run kafka-producer -n kafka-1 -ti --image=strimzi/kafka:latest-kafka-2.1.1 \
--rm=true --restart=Never -- bin/kafka-console-producer.sh \
--broker-list kafka-1-kafka-bootstrap:9092 --topic Animals
# You should get a message like
# If you don't see a command prompt, try pressing enter.
# Press Enter and start producing messages.

In another terminal, we shall consume those messages

$ kubectl run kafka-consumer -n kafka-1 -ti --image=strimzi/kafka:latest-kafka-2.1.1 \
--rm=true --restart=Never -- bin/kafka-console-consumer.sh \
--bootstrap-server kafka-1-kafka-bootstrap:9092 \
--topic Animals --from-beginning
# You should get a message like
# If you don't see a command prompt, try pressing enter.
# Any messages that you'd have produced would be visible now

Installing KafkaConnect

There are a couple of solutions for backing up a KafkaTopic, such as Kafka MirrorMaker and Kafka Connect. In this article we will use Kafka Connect to back up to S3.

Kafka Connect has to be extended with a connector plugin before it can write to S3. To do that, please follow this link to create a custom Docker image of KafkaConnect that includes the open source Kafka Connect S3 connector plugin from Spredfast. For this article I have already built such an image, which you can use.

NOTE: Before deploying KafkaConnect, we need to provide AWS credentials. Let’s create a Secret that holds the AWS access key ID and secret access key, which are later exposed to Kafka Connect as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Please replace the placeholders with your own base64-encoded values.

# Save this file as aws-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-creds
type: Opaque
data:
  awsAccessKey: <Base64 encoded value>
  awsSecretAccessKey: <Base64 encoded value>

Apply the YAML

$ kubectl apply -f aws-secrets.yaml -n kafka-1
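The base64 values for the Secret can be produced on the command line. Here is a minimal sketch, assuming a POSIX shell with the base64 utility installed. The helper name b64 is made up for this example. It uses printf rather than echo, because echo appends a newline that would end up inside the encoded credential.

```shell
# b64 is a hypothetical helper, not part of kubectl or the AWS CLI.
# It encodes its first argument without a trailing newline and strips
# the line wrapping that some base64 implementations add to long values.
b64() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Usage (assumes the credentials are exported in the usual AWS variables):
#   b64 "$AWS_ACCESS_KEY_ID"        # value for awsAccessKey
#   b64 "$AWS_SECRET_ACCESS_KEY"    # value for awsSecretAccessKey
```

If you would rather skip the manual encoding, kubectl create secret generic aws-creds with two --from-literal flags (awsAccessKey and awsSecretAccessKey) builds an equivalent Secret and encodes the values for you.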

Deploying KafkaConnect

# Save this yaml as kafka-connect-1.yaml
apiVersion: kafka.strimzi.io/v1alpha1
kind: KafkaConnect
metadata:
  name: kafka-connect-1
spec:
  image: imranpochi/strmzi-kafka-connect-with-s3-plugin
  version: 2.2.0
  replicas: 1
  bootstrapServers: kafka-1-kafka-bootstrap:9092
  externalConfiguration:
    env:
      - name: AWS_ACCESS_KEY_ID
        valueFrom:
          secretKeyRef:
            name: aws-creds
            key: awsAccessKey
      - name: AWS_SECRET_ACCESS_KEY
        valueFrom:
          secretKeyRef:
            name: aws-creds
            key: awsSecretAccessKey

Apply the YAML

$ kubectl apply -f kafka-connect-1.yaml -n kafka-1

Give it a minute, then let’s check that our Kafka setup is up and running. You should see output similar to this:

$ kubectl get pods -n kafka-1
kafka-connect-1-d57c9c78b-xfwhm 1/1 Running 0 6h
kafka-1-entity-operator-c57z9 3/3 Running 0 6h
kafka-1-kafka-0 2/2 Running 0 6h
kafka-1-kafka-1 2/2 Running 0 6h
kafka-1-kafka-2 2/2 Running 0 6h
kafka-1-zookeeper-0 2/2 Running 0 6h
kafka-1-zookeeper-1 2/2 Running 0 6h
kafka-1-zookeeper-2 2/2 Running 0 6h
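Instead of re-running kubectl get pods by hand, the check can be polled. This is a rough sketch, assuming a POSIX shell with awk. The helper names wait_for, all_running and pods_ready are invented for this example, and all_running only looks at the READY and STATUS columns of the listing shown above.

```shell
# wait_for TRIES DELAY CMD...: hypothetical helper that re-runs CMD
# until it succeeds, at most TRIES times, sleeping DELAY seconds between tries.
wait_for() {
  wf_tries=$1
  wf_delay=$2
  shift 2
  wf_n=1
  while [ "$wf_n" -le "$wf_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    wf_n=$((wf_n + 1))
    sleep "$wf_delay"
  done
  return 1
}

# all_running: reads `kubectl get pods --no-headers` output on stdin and
# succeeds only if there is at least one pod and every pod is Running
# with all of its containers ready (for example 2/2).
all_running() {
  awk 'BEGIN { bad = 0 }
       { split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) bad = 1 }
       END { exit (bad || NR == 0) }'
}

# pods_ready NAMESPACE: combines the listing with the check above.
pods_ready() {
  kubectl get pods -n "$1" --no-headers | all_running
}

# Usage: poll every 10 seconds, for up to 5 minutes
#   wait_for 30 10 pods_ready kafka-1 && echo "kafka-1 is up"
```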

Backup

In order to start taking a backup, we need to first configure our S3 connector. We can do that by calling the REST API endpoints of Kafka Connect.

Let’s do a port-forward first, so that we can make calls to localhost rather than ssh’ing into our Kubernetes cluster nodes.

$ kubectl port-forward deployment/kafka-connect-1 8083:8083 -n kafka-1

NOTE: Please create an S3 bucket before proceeding further.

If you have the AWS CLI installed and credentials configured, you can create the S3 bucket with the command below.

$ aws s3api create-bucket \
--create-bucket-configuration LocationConstraint=eu-west-1 \
--region eu-west-1 \
--bucket kafka-animals-backup-bucket

Create a configuration JSON file for the s3 connector.

# save the file as backup-config.json
{
  "name": "backup",
  "config": {
    "connector.class": "com.spredfast.kafka.connect.s3.sink.S3SinkConnector",
    "format.include.keys": "true",
    "topics": "Animals",
    "tasks.max": "1",
    "format": "binary",
    "s3.bucket": "kafka-animals-backup-bucket",
    "value.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "key.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "local.buffer.dir": "/tmp"
  }
}

Execute the curl request in another terminal

$ curl -X POST -H "Content-Type: application/json" -H "Accept: application/json" -d @backup-config.json localhost:8083/connectors

If everything goes well and the correct credentials were provided, the backup of the Animals KafkaTopic should start immediately.

To confirm that, you can log in to the AWS console and look for files related to the KafkaTopic in the bucket.
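The same confirmation can be done from the terminal. The sketch below assumes the port-forward to Kafka Connect is still running on localhost:8083 and that curl and the AWS CLI are installed. The function names and the CONNECT_URL variable are made up for this example, while the /connectors and /connectors/<name>/status endpoints belong to the standard Kafka Connect REST API.

```shell
# Base URL of the Kafka Connect REST API (assumption: the port-forward
# from this article is active). Override by exporting CONNECT_URL.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# connector_url NAME: prints the REST URL of a connector.
connector_url() {
  printf '%s/connectors/%s' "$CONNECT_URL" "$1"
}

# list_connectors: names of all connectors registered with Kafka Connect.
list_connectors() {
  curl -s "${CONNECT_URL}/connectors"
}

# connector_status NAME: state of the connector and its tasks as JSON.
connector_status() {
  curl -s "$(connector_url "$1")/status"
}

# list_backup_objects BUCKET: lists every object the connector has written.
list_backup_objects() {
  aws s3 ls "s3://$1" --recursive
}

# Usage:
#   list_connectors
#   connector_status backup
#   list_backup_objects kafka-animals-backup-bucket
```

A connector whose status reports a failed task usually points at wrong credentials or a missing bucket, so this is a quick first thing to look at when the bucket stays empty.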

Creating 2nd Kafka cluster

Although a restore can also be done in another Kubernetes cluster, by backing up the entire Kafka cluster along with ZooKeeper, we will focus on restoring the Kafka topic that we backed up to a different Kafka cluster in the same Kubernetes cluster.

To begin the restore, we need to repeat the earlier steps in a different namespace: installing Kafka and Kafka Connect, creating the Secret for the AWS credentials, creating the Kafka topic and, importantly, redeploying the Strimzi operator so that it also watches for Kafka resources in the new namespace.

To cut down on the amount of YAML we spit out in this article, I have created a GitHub repo with all the YAML configuration needed to deploy a new Kafka cluster in the kafka-2 namespace.

$ git clone ipochi/kafka-strimzi-s3-bkp-restore
$ cd kafka-strimzi-s3-bkp-restore
# Create kafka-2 namespace
$ kubectl create ns kafka-2
# Redeploy Strimzi operator to watch for kafka-2 namespace as well
$ helm upgrade --reuse-values \
--set watchNamespaces="{kafka-1,kafka-2}" \
strimzi-cluster-operator strimzi/strimzi-kafka-operator
# Create AWS secret in the kafka-2 namespace
$ kubectl apply -f aws-secrets.yaml -n kafka-2
# Apply all the yaml together to bring up second kafka cluster
# This will create Kafka , KafkaTopic, KafkaConnect
$ kubectl apply -f kafka-cluster-2.yaml -n kafka-2

In order to begin the restore process, we need to configure the Spredfast S3 connector, similar to the JSON config file for backup.

# Save the file as restore-config.json
{
  "name": "restore",
  "config": {
    "connector.class": "com.spredfast.kafka.connect.s3.source.S3SourceConnector",
    "tasks.max": "1",
    "topics": "Animals",
    "s3.bucket": "kafka-animals-backup-bucket",
    "key.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "value.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "format": "binary",
    "format.include.keys": "true"
  }
}
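The backup and restore configs differ only in their name, the connector class and the local.buffer.dir setting of the sink. If you want to repeat the exercise for other topics or buckets, both files can be generated from one place. This is only a sketch: the function name write_connector_config and its arguments are my own invention, and the settings are copied from the two JSON files in this article.

```shell
# write_connector_config MODE TOPIC BUCKET
# Hypothetical helper: MODE is "backup" or "restore". Writes
# backup-config.json or restore-config.json into the current directory.
write_connector_config() {
  cc_mode=$1
  cc_topic=$2
  cc_bucket=$3
  case "$cc_mode" in
    backup)
      cc_class="com.spredfast.kafka.connect.s3.sink.S3SinkConnector"
      # Only the sink config in this article sets a local buffer directory.
      cc_extra='"local.buffer.dir": "/tmp",'
      ;;
    restore)
      cc_class="com.spredfast.kafka.connect.s3.source.S3SourceConnector"
      cc_extra=''
      ;;
    *)
      echo "usage: write_connector_config backup|restore TOPIC BUCKET" >&2
      return 1
      ;;
  esac
  cat > "${cc_mode}-config.json" <<EOF
{
  "name": "${cc_mode}",
  "config": {
    "connector.class": "${cc_class}",
    "tasks.max": "1",
    "topics": "${cc_topic}",
    "s3.bucket": "${cc_bucket}",
    ${cc_extra}
    "key.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "value.converter": "com.spredfast.kafka.connect.s3.AlreadyBytesConverter",
    "format": "binary",
    "format.include.keys": "true"
  }
}
EOF
}

# Usage:
#   write_connector_config backup  Animals kafka-animals-backup-bucket
#   write_connector_config restore Animals kafka-animals-backup-bucket
```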

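First, we port-forward the Kafka Connect deployment in kafka-2. If the earlier port-forward to kafka-connect-1 is still running, stop it first, since both use local port 8083.

$ kubectl port-forward deployment/kafka-connect-2 8083:8083 -n kafka-2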

Next, in another terminal, issue a REST call to store the config and start the restore process.

$ curl -X POST -H "Content-Type: application/json" -H "Accept: application/json" -d @restore-config.json localhost:8083/connectors

The restore process will start shortly. To confirm, let’s spin up a kafka-consumer pod and check for all the messages that you produced earlier.

$ kubectl run kafka-consumer -n kafka-2 -ti \
--image=strimzi/kafka:latest-kafka-2.1.1 \
--rm=true --restart=Never -- bin/kafka-console-consumer.sh \
--bootstrap-server kafka-2-kafka-bootstrap:9092 \
--topic Animals --from-beginning

Wait a short while and all the messages will be visible.

An interesting exercise for checking latency and lag between backup and restore is to run the kafka-producer pod in the kafka-1 namespace and the kafka-consumer pod in the kafka-2 namespace at the same time.
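One way to put a number on that lag is to produce messages that carry their own send time and compare it with the time they show up in kafka-2. The sketch below is a rough measurement with one-second resolution. The helpers emit_timestamps and print_lag are invented for this example, and both are meant to run on your workstation so that the same clock is used on each side.

```shell
# emit_timestamps COUNT [DELAY]: prints COUNT lines of the form
# "<sequence number> <unix time in seconds>", DELAY seconds apart (default 1).
emit_timestamps() {
  et_count=$1
  et_delay=${2:-1}
  et_i=1
  while [ "$et_i" -le "$et_count" ]; do
    printf '%s %s\n' "$et_i" "$(date +%s)"
    et_i=$((et_i + 1))
    sleep "$et_delay"
  done
}

# print_lag: reads such lines on stdin and prints how many seconds ago
# each one was sent. Lines without a numeric timestamp are skipped.
print_lag() {
  while read -r pl_seq pl_sent; do
    case "$pl_sent" in
      ''|*[!0-9]*) continue ;;
    esac
    printf 'message %s lag %ss\n' "$pl_seq" "$(( $(date +%s) - $pl_sent ))"
  done
}

# Usage, with the consumer started first in one terminal:
#   kubectl run kafka-consumer -n kafka-2 -i \
#     --image=strimzi/kafka:latest-kafka-2.1.1 \
#     --rm=true --restart=Never -- bin/kafka-console-consumer.sh \
#     --bootstrap-server kafka-2-kafka-bootstrap:9092 --topic Animals | print_lag
# and the producer in another:
#   emit_timestamps 20 | kubectl run kafka-producer -n kafka-1 -i \
#     --image=strimzi/kafka:latest-kafka-2.1.1 \
#     --rm=true --restart=Never -- bin/kafka-console-producer.sh \
#     --broker-list kafka-1-kafka-bootstrap:9092 --topic Animals
```

The printed lag covers the whole path, from the producer through the S3 sink and the S3 source into the second cluster, so expect it to depend heavily on how often the sink connector flushes to S3.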

Conclusion

We’ve reached the end of another article in the backup and restore series that I am learning about. Apache Kafka is a beast, but there are wonderful open source solutions out there that take away the pain of managing it.

Setting up Kafka using Helm and the Strimzi operator was a breeze, and the S3 connector from Spredfast made backing up and restoring KafkaTopics effortless.

Gratitude and Sales pitch

I want to thank Kubernauts for the platform, where I can write about my explorations and work. I also want to thank the excellent communities of OpenEBS and Strimzi for their awesome open source products and for their help in answering questions, especially Jakub Scholz for Strimzi.

It’s not 2015 anymore, and the cloud native storage landscape has changed. Fantastic solutions exist for storage and OpenEBS is one of them. I am very happy to say that Kubernauts can provide commercial support for OpenEBS, so let’s engage if you have any questions.
