Backing up an AKS cluster with Velero
Let's walk through the backup process of a Kubernetes cluster and its persistent volumes using Velero (previously known as Ark).
Docker and Kubernetes are the modern-day runtime environment at almost every company that aims to run cutting-edge technology. Kubernetes gives you many conveniences out of the box, like rolling deployments, high availability, restarting failed containers (aka self-healing), well-managed secrets, and the list goes on...
Kubernetes also provides something called persistent volumes, which means storing container data outside the container. So if the container dies, a new container is brought up and linked to the persistent volume without losing data. Isn't that great? Well, Kubernetes does that for you. If Kubernetes is doing so much, why do you need something like Velero? Of course, there's more to it. What if the whole node dies? Or one of the developers accidentally deletes the nodes (happens more often than you think. :P guilty)? Then you lose the persistent volumes as well, and if the whole cluster dies, nobody knows what state the pods and their data were in just before it went down.
Well, that's where Velero comes in. Velero gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. Here's what Velero can do for you.
- Take scheduled backups of your cluster and restore it in case of loss.
- Migrate cluster resources to other clusters.
- Replicate your production cluster to development and testing clusters.
How Velero works:
Not going full Rambo here, just a quick overview. Kubernetes stores cluster state (basically the k8s API objects) in etcd, a key-value store that runs as part of the cluster's control plane.
Velero reads these objects through the Kubernetes API and keeps a backup of them in a remote location of your choice. The only condition is that the remote location is an object store. Common examples are AWS S3 buckets and Azure Blob Storage. Now even if your cluster dies, its state is preserved in the remote location and can be restored, again with the help of Velero.
Common Use Case:
Say you have an Elasticsearch cluster running within your Kubernetes cluster. This cluster has 100s of GBs of data stored on persistent volumes. Velero will store the k8s API objects in a blob storage container, and the persistent volumes will be stored as snapshots (say we are using Azure).
A simple command can fetch and restore these files from the blob storage container, which has pointers to these snapshots, and voilà! Your cluster is up and running.
There are two steps for using Velero: first, install the Velero binary on your host machine and then run the Velero command to install it on your cluster.
Installing it on a Mac is pretty straightforward.
brew install velero
To check if the installation was successful, try running velero version.
Installing Velero on AKS cluster
- a blob storage container where Velero can store the backups
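If the storage account and container don't exist yet, the Azure CLI can create them. This is only a minimal sketch: the function name create_backup_storage, the Standard_GRS SKU, and the location argument are my assumptions, not something Velero requires, so adjust them to your setup.

```shell
# Sketch: create the resource group, storage account and blob container for Velero.
# Arguments: resource group, storage account name, container name, Azure location.
create_backup_storage() {
  rg="$1"; account="$2"; container="$3"; location="$4"

  # Resource group that will hold the backups (separate from the cluster's own)
  az group create --name "$rg" --location "$location" || return 1

  # Storage account; Standard_GRS is an assumption, pick the redundancy you need
  az storage account create \
    --name "$account" \
    --resource-group "$rg" \
    --location "$location" \
    --sku Standard_GRS \
    --kind StorageV2 \
    --https-only true || return 1

  # Private blob container that Velero will write backups into
  az storage container create \
    --name "$container" \
    --account-name "$account" \
    --public-access off
}
```

With the names used in this post, the call would look like create_backup_storage backup velerobackup aks-backup westeurope, where westeurope is just a placeholder region.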
# Name of the storage account that holds the storage container
export AZURE_STORAGE_ACCOUNT_ID="velerobackup"
# Name of the storage account's resource group
# I have a very good reason to call it RG and not Resource Group
# you will learn it soon
export RG="backup"
# Name of your backup container
export BLOB_CONTAINER="aks-backup"
# This resource group is different from the one you set above
# Explanation after these commands
AZURE_RESOURCE_GROUP=$(az aks show --query nodeResourceGroup --name <AKS-CLUSTER-NAME> --resource-group <AKS-CLUSTERS-RESOURCE-GROUP> --output tsv)
# Azure subscription ID you are working in
AZURE_SUBSCRIPTION_ID=$(az account list --query '[?isDefault].id' -o tsv)
# Azure tenant ID
AZURE_TENANT_ID=$(az account list --query '[?isDefault].tenantId' -o tsv)
Did you notice there are two different resource groups in the above commands? Well, if you did, congrats!!! You just saved at least an hour.
The reason behind this is that when you spin up an AKS cluster in a resource group of your choice, Azure creates another resource group behind the scenes. This is the "node resource group" (usually named MC_<resource-group>_<cluster-name>_<region>), and it holds the infrastructure resources that belong to the cluster, such as the VMs and disks. This is weird, but it is by design and we have to deal with it. The AKS documentation covers the reasoning in its FAQ.
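To see the second resource group for yourself, something like the following can help. It is a sketch only: show_node_resource_group is a name I made up, and it assumes your persistent volumes are dynamically provisioned Azure disks, which by default are created in the node resource group.

```shell
# Sketch: print the node resource group of an AKS cluster and the managed disks in it.
# Arguments: cluster name, resource group the cluster was created in.
show_node_resource_group() {
  node_rg=$(az aks show \
    --name "$1" \
    --resource-group "$2" \
    --query nodeResourceGroup \
    --output tsv) || return 1

  echo "Node resource group: $node_rg"

  # Dynamically provisioned disks live here by default, not in the cluster's own group
  az disk list --resource-group "$node_rg" --query '[].name' --output tsv
}
```

This is why the credentials file gets the node resource group (Velero needs to find the disks there), while the backup and snapshot locations point at RG.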
Next, you will need a service principal that will allow the AKS cluster to read and write files to this storage account.
# Get the service principal's password
AZURE_CLIENT_SECRET=$(az ad sp create-for-rbac -n $AZURE_STORAGE_ACCOUNT_ID --role contributor --query password --output tsv)
# Get the service principal's ID
AZURE_CLIENT_ID=$(az ad sp show --id http://$AZURE_STORAGE_ACCOUNT_ID --query appId --output tsv)
By now you have all the variables required to install Velero on the AKS cluster. Next, you'll dump all the required values from above into a file called credentials-velero. You can name it whatever you want, just make sure to update the commands accordingly.
# A heredoc writes one KEY=value pair per line in any shell,
# without relying on echo interpreting \n
cat << EOF > ./credentials-velero
AZURE_SUBSCRIPTION_ID=${AZURE_SUBSCRIPTION_ID}
AZURE_TENANT_ID=${AZURE_TENANT_ID}
AZURE_CLIENT_ID=${AZURE_CLIENT_ID}
AZURE_CLIENT_SECRET=${AZURE_CLIENT_SECRET}
AZURE_RESOURCE_GROUP=${AZURE_RESOURCE_GROUP}
EOF
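An empty variable here is easy to miss and only shows up later as a confusing authentication error, so a quick sanity check of the file can save some time. The helper below is a hypothetical sketch (check_credentials_file is my own name, not part of Velero); it only verifies that each of the five keys is present with a non-empty value.

```shell
# Sketch: fail early if the credentials file is missing a key or has an empty value.
# Argument: path to the credentials file (defaults to ./credentials-velero)
check_credentials_file() {
  file="${1:-./credentials-velero}"
  status=0
  for key in AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_CLIENT_ID \
             AZURE_CLIENT_SECRET AZURE_RESOURCE_GROUP; do
    # A valid line starts with KEY= followed by at least one non-space character
    if ! grep -Eq "^${key}=[^[:space:]]+" "$file" 2>/dev/null; then
      echo "missing or empty: $key" >&2
      status=1
    fi
  done
  return $status
}
```

Run check_credentials_file ./credentials-velero before installing; it returns a non-zero status and names the offending keys if anything is missing.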
Prep work is done. Now comes the crucial part: installing Velero on the AKS cluster.
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.0.1 \
  --bucket $BLOB_CONTAINER \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=$RG,storageAccount=$AZURE_STORAGE_ACCOUNT_ID,subscriptionId=$AZURE_SUBSCRIPTION_ID \
  --snapshot-location-config resourceGroup=$RG,subscriptionId=$AZURE_SUBSCRIPTION_ID
In the above command, you are passing the credentials-velero file as the value of the --secret-file argument. You are also specifying where to store the cluster state (the k8s API objects) and where to put snapshots of persistent volumes.
After running this command you will see a long list of resources created in your AKS cluster, most of which you don't need to understand. Everything is created in a new namespace called velero.
What you should look for is if the last line says
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.
If you see this you are good.
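If you want a bit more reassurance than the last log line, you can also check the deployment and the backup location. A small sketch, with verify_velero_install as a made-up helper name and the two-minute timeout as an arbitrary choice:

```shell
# Sketch: confirm the Velero deployment is running and the backup location is registered.
verify_velero_install() {
  # Block until the deployment reports ready, or give up after two minutes
  kubectl rollout status deployment/velero -n velero --timeout=120s || return 1

  # The pod(s) created by 'velero install'
  kubectl get pods -n velero

  # Lists the backup storage location configured via --bucket
  velero backup-location get
}
```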
Now one crucial step is tagging the resources you want to back up. Whaaat? Doesn't Velero back up everything? Well, it can, but in this walkthrough the backup is scoped with a label selector, so you have to explicitly mark the pods whose persistent volumes you want backed up.
You will have to put a label on the pods to tell Velero to back up their persistent volumes as snapshots. The label can be anything that makes sense to you. We will use backup=true for our purpose.
kubectl label pods <NAME-OF-POD> backup=true
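Labeling pods one by one gets old quickly when a workload has several replicas. Assuming your pods already share some label (app=elasticsearch below is just an example value), a sketch like this labels them all in one go; label_pods_for_backup is a hypothetical helper name.

```shell
# Sketch: label every pod matching an existing selector, then list what is now marked.
# Argument: a label selector that already identifies your pods, e.g. app=elasticsearch
label_pods_for_backup() {
  # --overwrite keeps the command from failing if the label is already set
  kubectl label pods --selector "$1" backup=true --overwrite || return 1

  # Everything listed here will be picked up by --selector backup=true
  kubectl get pods --selector backup=true
}
```

Keep in mind that a label added with kubectl label lives on the pod object only. When a Deployment or StatefulSet recreates the pod, the label is gone, so for anything long-lived it is safer to add backup=true to the pod template in the manifest.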
Finally, you can take a backup. Here's the command.
velero backup create <NAME> --selector backup=true
This will back up every pod that has the label backup=true in its metadata, along with snapshots of the persistent volumes those pods use. Visit your storage container and snapshots to verify that your backup was successful.
PS: It takes a while for snapshot backups to show up in snapshots listing in Azure. In my case, it was 5 minutes.
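While you wait for the snapshots to show up in the portal, you can check on the backup from the terminal. The sketch below assumes the snapshots go to the resource group passed in --snapshot-location-config (RG in this post); inspect_backup is a name I picked for illustration.

```shell
# Sketch: show a backup's status and logs, then list snapshots in the backup resource group.
# Arguments: backup name, resource group used in --snapshot-location-config
inspect_backup() {
  # Phase, included resources, and per-volume snapshot details
  velero backup describe "$1" --details || return 1

  # Full backup log, handy when the phase is PartiallyFailed or Failed
  velero backup logs "$1"

  # Disk snapshots created for the persistent volumes
  az snapshot list --resource-group "$2" --query '[].name' --output tsv
}
```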
To list your backup in the terminal run
velero get backups
You should see one backup.
To restore this backup, you simply have to run
velero restore create --from-backup <NAME>
This will restore from a specific backup.
To test this out, try deleting a deployment after backing up the cluster, and then restore it.
Usually, you would want to run backups on a regular schedule. Velero provides a utility for that as well.
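Here is roughly how such a drill could look, as a sketch rather than a recipe. Note one assumption: a backup taken with --selector backup=true only contains objects carrying that label, so the Deployment itself would not be in it unless you labeled it too. The drill below therefore backs up the whole namespace instead. restore_drill and its arguments are hypothetical names.

```shell
# Sketch: back up a namespace, delete a deployment, restore, and check it came back.
# Arguments: backup name, deployment name, namespace (defaults to "default")
restore_drill() {
  backup="$1"; deployment="$2"; namespace="${3:-default}"

  # Namespace-wide backup so the Deployment object is included; --wait blocks until done
  velero backup create "$backup" --include-namespaces "$namespace" --wait || return 1

  # Simulate the accident
  kubectl delete deployment "$deployment" -n "$namespace" || return 1

  # Bring everything in the backup back
  velero restore create --from-backup "$backup" --wait || return 1

  # The deployment should be listed again
  kubectl get deployment "$deployment" -n "$namespace"
}
```

Try this on a test cluster first, since the delete step is very real.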
velero create schedule daily --selector backup=true --schedule="@every 24h"
This will create a scheduled backup that runs every 24 hours.
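If you care about when the backup runs and how long it is kept, the schedule also accepts a cron expression and a TTL. The values below (01:00 every day, kept for seven days) are assumptions for illustration, and create_daily_schedule is just a wrapper name I chose.

```shell
# Sketch: daily backup at 01:00, keeping each backup for 7 days.
create_daily_schedule() {
  # Cron format: minute hour day-of-month month day-of-week
  velero schedule create daily \
    --schedule="0 1 * * *" \
    --selector backup=true \
    --ttl 168h0m0s || return 1

  # Confirm the schedule was registered
  velero schedule get
}
```

Without --ttl, Velero keeps backups for its default retention period, so setting it explicitly is mostly about controlling storage costs.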
I hope you find this helpful. Thanks.