AKS Deployment Automation with Terraform and Multi-AKS Cluster Management with Rancher, AAD Integration and more

Arash
Published in Kubernauts
Mar 16, 2019 · 13 min read


Introduction

In one of our ongoing Kubernetes projects, we have to deploy 10+ k8s clusters for running business-critical apps, let these apps talk to each other, and allow access to them from external on-prem k8s clusters.

Due to some strategic decisions, the project owners decided to give Azure Kubernetes Service a try and manage all AKS clusters through a Rancher Management Server cluster, running either on an AKS-provisioned k8s cluster itself or on the Rancher Kubernetes Engine (RKE).

Running Rancher Management Server on RKE is our favourite option and highly recommended in the long term. For our daily k8s development with Rancher, we're running RKE on top of RancherOS on bare-metal servers with MetalLB, which has been our most stable and affordable solution so far.

But for now, let's not talk about political decisions. Instead, let's look at AKS deployment automation with Terraform, run Rancher Management Server on top of AKS to manage other AKS or RKE clusters, integrate the whole thing with AAD (Azure Active Directory), and make use of Azure Storage to manage Terraform state for our teams.

At the time of writing, there are at least five ways to deploy managed Kubernetes clusters through Azure Kubernetes Service (AKS): via the Azure Portal, with the CLI, with ARM templates, with Terraform scripts and additional modules, or via Rancher Management Server itself.

In this first post I'm going to walk through all of these options, with a detailed implementation for AKS using our favourite DevOps tool Terraform from the awesome folks at HashiCorp, and use Rancher to manage access for our users via Azure Active Directory (AAD). We'll do more exciting things with Rancher and TK8 in the next blog post, which will be about deploying RKE with TK8 and Terraform in a custom VNET with Kubenet on Azure.

Sources on Github

All sources are provided on Github:

$ git clone https://github.com/kubernauts/aks-terraform-rancher.git

Prerequisites

Azure CLI
Terraform

Deployment via Azure Portal

This is the easiest deployment method: in the Azure Portal, find the Kubernetes Service, click Add, select a subscription and an existing resource group (or create a new one), select the k8s version, and so on, then click Create. Your AKS cluster is deployed in about 10 minutes, awesome!

In the last step you can download the ARM template file template.json, which is used in the background for the AKS deployment through the Azure Portal, and create the same deployment via the CLI with the "template.json" and "parameters.json" files as follows.

Deployment with ARM Template

Create a resource group with az tool:

$ az group create --name aks-dev-test-rg --location westeurope

Deploy AKS in the resource group created above:

$ az group deployment create --name k8s-dev-test-deployment --resource-group aks-dev-test-rg --template-file template.json --parameters parameters.json
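The parameters file simply supplies values for the parameters declared in template.json. As a rough illustration only (the parameter names below are assumptions and depend on the template you downloaded from the portal), such a file could look like this:

cat > parameters.json <<'EOF'
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "resourceName": { "value": "k8s-dev-test" },
    "location": { "value": "westeurope" },
    "dnsPrefix": { "value": "k8s-dev-test-dns" },
    "kubernetesVersion": { "value": "1.12.6" },
    "agentCount": { "value": 3 },
    "agentVMSize": { "value": "Standard_DS2_v2" }
  }
}
EOF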

Deployment with CLI

To deploy with the CLI, you may use an existing Azure Active Directory service principal or create a new one with the az tool. AAD service principals are used to authenticate automated deployments. With the following command we create a new service principal and skip creating the default role assignment, which would otherwise allow the service principal to access resources under the current subscription.

$ az ad sp create-for-rbac --skip-assignment
{
  "appId": "xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz",        → --service-principal
  "displayName": "azure-cli-2019-02-23-11-xyz",
  "name": "http://azure-cli-2019-02-23-11-31-36",
  "password": "xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz",     → --client-secret
  "tenant": "xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz"
}

The following az aks create command creates a 3-node AKS cluster in an existing VNET. To find the id of a subnet in an existing VNET, use the following command:

$ az network vnet subnet list --resource-group dev-test-rg --vnet-name dev-test-vnet --query [].id --output tsv

Create the AKS cluster with CLI:

$ az aks create \
    --resource-group xyz-rg \
    --name k8s-dev-test \
    --service-principal xyz \
    --client-secret xyz \
    --node-count 3 \
    --generate-ssh-keys \
    --nodepool-name devpool \
    --node-vm-size Standard_DS2_v2 \
    --service-cidr 10.0.0.0/16 \
    --network-plugin azure \
    --vnet-subnet-id /subscriptions/xyz/resourceGroups/xyz-rg/providers/Microsoft.Network/virtualNetworks/xyz-vnet/subnets/xyz-subnet \
    --tags environment=dev-test \
    --enable-addons monitoring \
    --docker-bridge-address 172.17.0.1/16
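Once the deployment has finished, a quick sanity check (assuming kubectl is installed locally; the names match the az aks create command above):

$ az aks get-credentials --resource-group xyz-rg --name k8s-dev-test
$ kubectl get nodes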

AKS Deployment Automation with Terraform, the hard, but the right way

The deployment methods mentioned above are great for rapid deployments by a single user, but if you need full control of the AKS cluster state across your DevOps teams and want to apply the Infrastructure as Code (IaC) principle with a GitOps and DevSecOps culture in mind, you may consider this implementation. Kari Marttila explains in his nice post "Creating Azure Kubernetes Service (AKS) the Right Way" why Terraform is the better choice for humans!

The main goals for this implementation were:

  • Use Azure Storage Account to manage terraform state for teams
  • Use Azure Key Vault or HashiCorp Vault to retrieve secrets and keys for higher security
  • Use a custom terraform role and service principal for deployment (least privilege)
  • Use Azure Active Directory and deploy an RBAC-enabled AKS Cluster
  • Use Rancher Management Server to manage multiple AKS clusters and govern access to users through Azure Active Directory integration
  • Rancher Management Server shall run in HA mode on an AKS cluster itself
  • If Rancher Management Server is not used or becomes unavailable, DevOps teams shall still be able to access the clusters managed by Rancher
  • Use Terragrunt and Git for terraform code changes and extensions through different DevOps teams (not in the repo yet)

Create a Storage Account to manage terraform state for different clusters

Create a new resource group storage-account-rg in the westeurope region, and in this resource group create a storage account named "acemesa" with a container named tfstate:

$ source create-azure-storage-account.sh westeurope storage-account-rg acemesa tfstate

The output of the command provides the access_key of the storage account; please take a note of it, or head to the Azure Portal and copy the key1 value of the "acemesa" storage account. We'll store this access key in Azure Key Vault as "terraform-backend-key" in the next step, after creating the key vault in a new resource group.
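If you're curious what the script does, it roughly boils down to the following sketch (simplified; the SKU and option flags are assumptions, the script in the repo is the reference):

# sketch of create-azure-storage-account.sh (simplified)
LOCATION=$1          # e.g. westeurope
RESOURCE_GROUP=$2    # e.g. storage-account-rg
STORAGE_ACCOUNT=$3   # e.g. acemesa
CONTAINER=$4         # e.g. tfstate

az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
az storage account create --name "$STORAGE_ACCOUNT" --resource-group "$RESOURCE_GROUP" \
  --location "$LOCATION" --sku Standard_LRS --encryption-services blob
ACCESS_KEY=$(az storage account keys list --account-name "$STORAGE_ACCOUNT" \
  --resource-group "$RESOURCE_GROUP" --query [0].value -o tsv)
az storage container create --name "$CONTAINER" --account-name "$STORAGE_ACCOUNT" \
  --account-key "$ACCESS_KEY"
echo "access_key: $ACCESS_KEY"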

Create Azure Key Vault

Create a resource group named “key-vault-rg”:

$ az group create --name key-vault-rg --location westeurope

Create an azure key vault in this resource group:

$ az keyvault create --name "aceme-aks-key-vault" --resource-group "key-vault-rg" --location "westeurope"

Create a new secret named “terraform-backend-key” in the key vault with the value of the storage access key created above:

$ az keyvault secret set --vault-name "aceme-aks-key-vault" --name "terraform-backend-key" --value <the value of the access_key key1>

Verify that you can read the value of the created secret "terraform-backend-key":

$ az keyvault secret show --name terraform-backend-key --vault-name aceme-aks-key-vault --query value -o tsv

Export the environment variable ARM_ACCESS_KEY to be able to initialise Terraform with the storage account backend:

$ export ARM_ACCESS_KEY=$(az keyvault secret show --name terraform-backend-key --vault-name aceme-aks-key-vault --query value -o tsv)

Verify that the access key has been exported properly:

$ echo $ARM_ACCESS_KEY

Initialise terraform for AKS deployment

Initialise Terraform with the storage account as backend to store "aceme-management.tfstate" in the container "tfstate" created in the first step above:

$ terraform init -backend-config="storage_account_name=acemesa" -backend-config="container_name=tfstate" -backend-config="key=aceme-management.tfstate"

With this we make sure that all team members use the same Terraform state file stored in the Azure storage account. To learn more about it, please head to the Azure Storage and Terraform backend documentation.
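Note that such a partial backend configuration only works if the Terraform code declares an (empty) azurerm backend block. The code in the repo should already contain one; if you write your own, a minimal version looks like this (the file name backend.tf is just a convention):

cat > backend.tf <<'EOF'
# Partial azurerm backend block; the storage account, container and
# state key are passed via -backend-config flags at "terraform init" time.
terraform {
  backend "azurerm" {}
}
EOF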

Create a custom terraform service principal with least privilege to perform the AKS deployment

Execute the createTerraformServicePrincipal.sh script provided by Richard Cheney to create the Terraform service principal and the provider.tf file (the script is included in the git repo as well):

The script will interactively:

  • Create the service principal (or reset its credentials if it already exists)
  • Prompt you to choose either a populated or an empty azurerm provider block for provider.tf
  • Export the environment variables if you selected an empty block (and display the commands)
  • Display the az login command to log in as the service principal

$ ./createTerraformServicePrincipal.sh

The output will be similar to this:

{
  "appId": "xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz",
  "displayName": "terraform-xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz",
  "name": "http://terraform-xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz",
  "password": "xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz",
  "tenant": "xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz"
}

Create a file named e.g. export_tf_vars and set TF_VAR_client_id to the value of "appId" and TF_VAR_client_secret to the value of "password" from the service principal output above. Your export_tf_vars file should contain the following two lines for now; we'll extend it later, after creating the server and client applications in the next steps.

export TF_VAR_client_id=xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz
export TF_VAR_client_secret=xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz

For security reasons, make sure to store the client id and secret in azure key vault:

az keyvault secret set --vault-name "aceme-aks-key-vault" --name "TF-VAR-client-id" --value xyz
az keyvault secret set --vault-name "aceme-aks-key-vault" --name "TF-VAR-client-secret" --value xyz

N.B.: we'll use these values from the commands above in the export_tf_vars file later!

Azure Active Directory Authorization

To secure an AKS cluster with Azure Active Directory and RBAC, we used this nice implementation by Julien Corioland.

In short, in order to enable Azure Active Directory authorization with Kubernetes, you need to create two applications:

  • A server application, that will work with Azure Active Directory
  • A client application, that will work with the server application

Multiple AKS clusters can use the same server application, but it’s recommended to have one client application per cluster.

Open the server application creation script create-azure-ad-server-app.sh and update the environment variables with the values you want to use:

export RBAC_AZURE_TENANT_ID="REPLACE_WITH_YOUR_TENANT_ID"
export RBAC_SERVER_APP_NAME="AKSAADServer2"
export RBAC_SERVER_APP_URL="http://aksaadserver2"
# this doesn't work on macOS (it works on Linux):
# export RBAC_SERVER_APP_SECRET="$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 32 | head -n 1)"
# on macOS use this instead:
export RBAC_SERVER_APP_SECRET="$(LC_CTYPE=C tr -dc A-Za-z0-9_\!\@\#\$\%\^\&\*\(\)-+= < /dev/urandom | head -c 32 | xargs)"

Execute the script:

$ ./create-azure-ad-server-app.sh

Once created, you need to ask an Azure AD administrator to go to the Azure Portal and click the Grant permissions button for this server app (Active Directory → App registrations (preview) → All applications → AKSAADServer2).

Click on the AKSAADServer2 application → API permissions → Grant admin consent.
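If the administrator prefers the CLI over the portal, the consent can also be granted with az (an alternative to the click-path above, not part of the original scripts; requires AAD admin rights):

$ az ad app permission admin-consent --id $RBAC_SERVER_APP_ID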

Copy the following environment variables to the client application creation script:

export RBAC_SERVER_APP_ID=xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz
export RBAC_SERVER_APP_OAUTH2PERMISSIONS_ID=xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz
export RBAC_SERVER_APP_SECRET=xyzxyzxyzxyzxyzxyzxyzxyzxyzxyz

And execute / source the script:

$ source create-azure-ad-client-app.sh

For security reasons you may want to store all values in azure key vault:

az keyvault secret set --vault-name "aceme-aks-key-vault" --name "TF-VAR-rbac-server-app-id" --value xyz
az keyvault secret set --vault-name "aceme-aks-key-vault" --name "TF-VAR-rbac-server-app-secret" --value xyz
az keyvault secret set --vault-name "aceme-aks-key-vault" --name "TF-VAR-rbac-client-app-id" --value xyz
az keyvault secret set --vault-name "aceme-aks-key-vault" --name "TF-VAR-tenant-id" --value xyz

Your export_tf_vars looks like this at the end (this file is provided in the git repo):

export TF_VAR_client_id=$(az keyvault secret show --name TF-VAR-client-id --vault-name aceme-aks-key-vault --query value -o tsv)
export TF_VAR_client_secret=$(az keyvault secret show --name TF-VAR-client-secret --vault-name aceme-aks-key-vault --query value -o tsv)
export TF_VAR_rbac_server_app_id=$(az keyvault secret show --name TF-VAR-rbac-server-app-id --vault-name aceme-aks-key-vault --query value -o tsv)
export TF_VAR_rbac_server_app_secret=$(az keyvault secret show --name TF-VAR-rbac-server-app-secret --vault-name aceme-aks-key-vault --query value -o tsv)
export TF_VAR_rbac_client_app_id=$(az keyvault secret show --name TF-VAR-rbac-client-app-id --vault-name aceme-aks-key-vault --query value -o tsv)
export TF_VAR_tenant_id=$(az keyvault secret show --name TF-VAR-tenant-id --vault-name aceme-aks-key-vault --query value -o tsv)

Deploy AKS

Now you can harvest the fruits of your hard work by creating a plan for your first management cluster and applying it to create your first AKS cluster:

$ export ARM_ACCESS_KEY=$(az keyvault secret show --name terraform-backend-key --vault-name aceme-aks-key-vault --query value -o tsv)
$ source export_tf_vars
$ terraform plan -out rancher-management-plan
$ terraform apply rancher-management-plan
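Once the apply has finished, you can grab a kubeconfig straight from the Terraform output (assuming the code exposes a kube_config output, as it is used further below for the Rancher import) and take a first look at the cluster:

$ terraform output kube_config > ~/.kube/aceme-management.config
$ export KUBECONFIG=~/.kube/aceme-management.config
$ kubectl get nodes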

Configure RBAC

After the cluster is deployed, we need to create Role/RoleBinding and ClusterRole/ClusterRoleBinding objects using the Kubernetes API to give access to our Azure Active Directory user and groups.

In order to do that, we need to connect to the cluster. You can get an administrator Kubernetes configuration file using the Azure CLI:

$ az aks get-credentials -n CLUSTER_NAME -g RESOURCE_GROUP_NAME --admin
$ az aks get-credentials -n k8s-pre-prod -g kafka-pre-prod-rg --admin
$ k get nodes

The repository contains a simple ClusterRoleBinding object definition file cluster-admin-rolebinding.yaml that grants the cluster-admin role to the Azure Active Directory user ak@cloudssky.com:

$ kubectl apply -f cluster-admin-rolebinding.yaml
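If you just want to see what such a binding looks like, here is a minimal sketch applied inline via a heredoc (the binding name is hypothetical; the user is the one from the file above):

cat <<'EOF' | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: aad-user-cluster-admin-binding   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: ak@cloudssky.com
EOF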

You can also create RoleBinding/ClusterRoleBinding for Azure Active Directory group, as described here.

Connect to the cluster using RBAC and Azure AD

Once all your RBAC objects are defined in Kubernetes, you can get a Kubernetes configuration file that is not admin-enabled using the az aks get-credentials command without the --admin flag.

$ az aks get-credentials -n CLUSTER_NAME -g RESOURCE_GROUP_NAME

When you then use kubectl, you are asked to go through the Azure device login authentication first:

$ kubectl get nodes

To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXXX to authenticate.

Deploy the next AKS dev-test cluster

Create a new resource group:

$ az group create --name aceme-dev-test-rg --location westeurope

Initialise Terraform with a new Terraform state file aceme-dev-test.tfstate:

$ terraform init -backend-config="storage_account_name=acemesa" -backend-config="container_name=tfstate" -backend-config="key=aceme-dev-test.tfstate"

Run the new plan and save it:

$ terraform plan -var resource_group_name=aceme-dev-test-rg -var aks_name=aceme-kafka-dev-test -out aceme-kafka-dev-test-plan

Deploy the dev-test cluster:

$ terraform apply aceme-kafka-dev-test-plan

Rancher HA Deployment on AKS

Please refer to this deployment guide to install Rancher Management Server on your AKS Cluster.
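In short, the guide boils down to installing cert-manager and then the Rancher Helm chart. The following is only a rough sketch (Helm 3 syntax; the chart repo and hostname are assumptions on my side, with the hostname matching the Rancher URL used later in this post; the linked guide remains the authoritative reference):

$ helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
$ kubectl create namespace cattle-system
$ helm install rancher rancher-stable/rancher \
    --namespace cattle-system \
    --set hostname=aceme-rancher-ingress.westeurope.cloudapp.azure.com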

Rancher AAD Integration:

Please refer to this documentation for AAD integration:

https://rancher.com/docs/rancher/v2.x/en/admin-settings/authentication/azure-ad/

Get the Application Secret from key vault:

$ az keyvault secret show --name rancher-aad-secret --vault-name aceme-aks-key-vault --query value -o tsv
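The rancher-aad-secret above is the application key you created for the Rancher AAD app registration; if you haven't stored it in the vault yet, you can do so the same way as before, for example:

$ az keyvault secret set --vault-name "aceme-aks-key-vault" --name "rancher-aad-secret" --value <the application key of the Rancher AAD app>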

Provide the following variables from your azure account in Rancher AAD integration interface:

Tenant ID: xyz
Application ID: xyz
Endpoint: https://login.microsoftonline.com/
Graph Endpoint: https://graph.windows.net/
Token Endpoint: https://login.microsoftonline.com/xyz/oauth2/token
Auth Endpoint: https://login.microsoftonline.com/xyz/oauth2/authorize

Import AKS Clusters into Rancher Management Server

In Rancher, click Add Cluster and select IMPORT:

Provide a cluster name, e.g. aceme-kafka-pre-prod:

Click Create; Rancher then provides the commands needed to import the AKS cluster. You can find the cluster user with:

$ terraform output kube_config | grep clusterUser
$ kubectl create clusterrolebinding cluster-admin-binding --clusterrole cluster-admin --user <provide the user from the command above>
$ curl --insecure -sfL https://aceme-rancher-ingress.westeurope.cloudapp.azure.com/v3/import/xyz.yaml | kubectl apply -f -

Your Rancher Cluster Management Server should look similar to this:

Destroy the cluster

To destroy a cluster, run terraform destroy and provide the right resource group name and AKS cluster name:

$ terraform destroy -var resource_group_name=aceme-kafka-pre-prod-rg -var aks_name=kafka-pre-prod

Gotchas and TroubleShooting

Problem:
In Rancher, when you open an imported AKS cluster, you'll be faced with a gotcha like this:

This is a known problem related to AKS, since kubectl get componentstatus reports a wrong status for the controller manager and the scheduler:

$ kubectl get componentstatus
NAME                 STATUS      MESSAGE                                                                                     ERROR
controller-manager   Unhealthy   Get http://127.0.0.1:10252/healthz: dial tcp 127.0.0.1:10252: connect: connection refused
scheduler            Unhealthy   Get http://127.0.0.1:10251/healthz: dial tcp 127.0.0.1:10251: connect: connection refused

Solution:
Ignore it for now; it's a known issue and doesn't hurt much.

Delete the cattle-system namespace

Problem:
In a few cases, after importing an AKS cluster and removing it again through the Rancher interface, the clean-up procedure doesn't work as desired and the cattle-system namespace stays stuck in the Terminating state.

Solution:
Run kubectl edit namespace cattle-system and remove the finalizer called controller.cattle.io/namespace-auth, then save. Kubernetes won’t delete an object that has a finalizer on it.

Reference:
https://github.com/rancher/rancher/issues/14715

$ cat delete-cattle-system-ns
NAMESPACE=cattle-system
kubectl proxy &
kubectl get namespace $NAMESPACE -o json | jq '.spec = {"finalizers":[]}' > temp.json
curl -k -H "Content-Type: application/json" -X PUT --data-binary @temp.json 127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize
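If you prefer a non-interactive way to drop the controller.cattle.io/namespace-auth finalizer instead of kubectl edit, here is a small sketch (note that this removes all metadata finalizers on the namespace, so use it with care):

$ kubectl patch namespace cattle-system --type=json -p='[{"op":"remove","path":"/metadata/finalizers"}]'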

Observations

From time to time the AKS clusters imported into Rancher are shown in a pending state for a few minutes, and I couldn't find anything suspicious in the logs. Well, as long as the k8s clusters are reachable and our workloads keep running, I think it doesn't hurt much.

Conclusion

AKS is a free service and still very young; Microsoft strives to attain at least 99.5% availability for the Kubernetes API server through its SLA. But the nice thing is that we have the freedom of choice: we can run k8s clusters with RKE on Azure as well, do real IaC with Terraform, and extend beyond ARM. Stay tuned.

Related links and references:

How Terraform works, an introduction

Terraform — The definitive guide for Azure enthusiasts

https://thorsten-hans.com/terraform-the-definitive-guide-for-azure-enthusiasts

Terraform Azure Kubernetes Service cluster script by HashiCorp
https://github.com/hashicorp/vault-guides/tree/master/identity/vault-agent-k8s-demo/terraform-azure

Protect the access key via key vault

https://docs.microsoft.com/en-us/azure/terraform/terraform-backend

Secure an Azure Kubernetes cluster with Azure Active Directory and RBAC

https://github.com/jcorioland/aks-rbac-azure-ad

Azure Kubernetes Service (AKS) with Terraform

https://github.com/anubhavmishra/terraform-azurerm-aks

How to: Use Terraform to deploy Azure Kubernetes Service in Custom VNET with Kubenet

https://blog.jcorioland.io/archives/2019/03/13/azure-aks-custom-vnet-kubenet-terraform.html

Using Terraform to extend beyond ARM
https://azurecitadel.com/automation/terraform/lab8/

Terraform and multi tenant environment
https://azurecitadel.com/automation/terraform/lab5/

Creating Azure Kubernetes Service (AKS) the Right Way

https://medium.com/@kari.marttila/creating-azure-kubernetes-service-aks-the-right-way-9b18c665a6fa

Deploy Rancher on Azure for Kubernetes Management

http://www.buchatech.com/2019/03/deploy-rancher-on-azure-for-kubernetes-management/
