Backing up AKS cluster with Velero
Let's walk through the backup process of a Kubernetes cluster and its persistent volumes using Velero (previously known as Ark).
Docker and Kubernetes are the modern-day runtime environments at almost every company that aims to adopt cutting-edge technology. Kubernetes gives you many conveniences out of the box, like rolling deployments, high availability, restarting failed containers (aka self-healing), well-managed secrets, and the list goes on...
Kubernetes also provides persistent volumes, which store container data outside the container. If the container dies, a new container is brought up and linked to the persistent volume without losing data. Isn't that great? If Kubernetes is doing so much, why do you need something like Velero? Of course, there's more to it. What if the whole node dies? Or one of the developers accidentally deletes the nodes (happens more often than you think. :P guilty)? Then you can lose the persistent volumes as well, and if the whole cluster dies, nobody knows what state the pods and their data were in just before it died.
This is where Velero comes in. Velero gives you tools to back up and restore your Kubernetes cluster resources and persistent volumes. Here's what Velero can do for you.
- Take scheduled backups of your cluster and restore it in case of loss.
- Migrate cluster resources to other clusters.
- Replicate your production cluster to development and testing clusters.
How Velero works:
Not going full Rambo here, just a quick overview. Kubernetes stores cluster state (basically the k8s API objects) in etcd, the key-value store behind the API server.
Velero reads these objects through the Kubernetes API and keeps a backup of them in a remote location of your choice. The only condition is that the remote location must be an object store. Common examples are AWS S3 buckets and Azure Blob Storage. Now even if your cluster dies, its state is preserved in the remote location and can be restored, again with the help of Velero.
Common Use Case:
Say you have an Elasticsearch cluster running within your Kubernetes cluster. This cluster has hundreds of GBs of data stored on persistent volumes. Velero will store the k8s API objects in a blob storage container, and the persistent volumes will be stored as snapshots (say we are using Azure).
A simple command can fetch and restore these files from the blob storage container, which holds pointers to the snapshots, and voilà! Your cluster is up and running.
There are two steps for using Velero: first, install the Velero binary on your host machine and then run the Velero command to install it on your cluster.
Installing it on a Mac is pretty straightforward.
brew install velero
To check that the installation was successful, ask the Velero client to print its version.
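One way to run this check, sketched under the assumption that the `velero` binary is on your `PATH`. The helper name `check_velero_client` is made up for illustration; `--client-only` skips contacting a cluster, which is handy because Velero is not installed on AKS yet.

```shell
# Hypothetical helper: confirm the Velero CLI is installed locally.
check_velero_client() {
  if command -v velero >/dev/null 2>&1; then
    # --client-only prints the client version without querying a cluster
    velero version --client-only
  else
    echo "velero not found on PATH" >&2
    return 1
  fi
}
```

Calling `check_velero_client` should print a client version line if the install worked.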
Installing Velero on AKS cluster
- a blob storage container where Velero can store the backups
# Name of the storage account that holds the storage container
export AZURE_STORAGE_ACCOUNT_ID="velerobackup"

# Name of the storage account's resource group
# I have a very good reason to call it RG and not Resource Group
# you will learn it soon
export RG="backup"

# Name of your backup container
export BLOB_CONTAINER="aks-backup"

# This resource group is different from the one you set above
# Explanation after these commands
AZURE_RESOURCE_GROUP=$(az aks show --query nodeResourceGroup --name <AKS-CLUSTER-NAME> --resource-group <AKS-CLUSTERS-RESOURCE-GROUP> --output tsv)

# Azure subscription ID you are working in
AZURE_SUBSCRIPTION_ID=$(az account list --query '[?isDefault].id' -o tsv)

# Azure tenant ID
AZURE_TENANT_ID=$(az account list --query '[?isDefault].tenantId' -o tsv)
Did you notice there are two different resource groups in the above commands? Well, if you did, congrats!!! You just saved at least an hour.
The reason is that when you spin up an AKS cluster in a resource group of your choice, Azure creates another resource group behind the scenes. This is the “cluster resource group”, and it holds the cluster's underlying resources and their lifecycle. This is weird, but it is by design and we have to deal with it.
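Because a failed `az` call leaves its variable silently empty, it may be worth checking the values before continuing. A small sketch; the helper `require_vars` is hypothetical and not part of Velero or the Azure CLI.

```shell
# Hypothetical helper: fail if any of the named variables is unset or empty.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the variables above, that would be `require_vars AZURE_STORAGE_ACCOUNT_ID RG BLOB_CONTAINER AZURE_RESOURCE_GROUP AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID`.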
Next, you will need a service principal that will allow the AKS cluster to read and write files to this storage account.
# Get the service principal's password
AZURE_CLIENT_SECRET=$(az ad sp create-for-rbac -n $AZURE_STORAGE_ACCOUNT_ID --role contributor --query password --output tsv)

# Get the service principal's ID
AZURE_CLIENT_ID=$(az ad sp show --id http://$AZURE_STORAGE_ACCOUNT_ID --query appId --output tsv)
By now you have all the variables required to install Velero on the AKS cluster. Next, you'll dump all the required values from above into a file called credentials-velero. You can name it whatever you want; just make sure to update the commands accordingly.
cat > ./credentials-velero <<EOF
AZURE_SUBSCRIPTION_ID=${AZURE_SUBSCRIPTION_ID}
AZURE_TENANT_ID=${AZURE_TENANT_ID}
AZURE_CLIENT_ID=${AZURE_CLIENT_ID}
AZURE_CLIENT_SECRET=${AZURE_CLIENT_SECRET}
AZURE_RESOURCE_GROUP=${AZURE_RESOURCE_GROUP}
EOF
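Before handing the file to Velero, a quick sanity check can catch an empty value. This is only a sketch: `check_credentials_file` is a hypothetical helper, and it only verifies that each of the five keys is present with a non-empty value, not that the values are valid.

```shell
# Hypothetical helper: verify each expected key has a non-empty value.
check_credentials_file() {
  file="$1"
  for key in AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_CLIENT_ID \
             AZURE_CLIENT_SECRET AZURE_RESOURCE_GROUP; do
    # "KEY=." requires at least one character after the equals sign
    if ! grep -q "^${key}=." "$file"; then
      echo "missing or empty: $key" >&2
      return 1
    fi
  done
}
```

Run it as `check_credentials_file ./credentials-velero`; a non-zero exit status means a value is missing.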
Prep work is done. Now comes the crucial part: installing Velero on the AKS cluster.
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.0.1 \
  --bucket $BLOB_CONTAINER \
  --secret-file ./credentials-velero \
  --backup-location-config resourceGroup=$RG,storageAccount=$AZURE_STORAGE_ACCOUNT_ID,subscriptionId=$AZURE_SUBSCRIPTION_ID \
  --snapshot-location-config resourceGroup=$RG,subscriptionId=$AZURE_SUBSCRIPTION_ID
In the above command, you are passing the credentials-velero file as the value of the --secret-file argument. You are also specifying where to store the cluster state (the k8s API objects) and where to put snapshots of persistent volumes.
After running this command you will see a long list of resources created in your AKS cluster, most of which you don't need to understand. Everything is created in a new namespace called velero.
What you should look for is whether the last line says
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.
If you see this you are good.
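For a slightly stronger check than the install message, something like the following may help. It is a sketch that assumes `kubectl` points at the AKS cluster; `check_velero_install` is a hypothetical helper and the 120-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: confirm the Velero deployment and backup location.
check_velero_install() {
  # Wait until the velero deployment in the velero namespace is rolled out
  kubectl rollout status deployment/velero -n velero --timeout=120s || return 1
  # List the backup storage location configured by velero install
  velero backup-location get
}
```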
Now one crucial step is labeling the resources that need to be backed up. Whattttt??? Doesn't Velero back up everything? Well, it does, but not persistent volumes. You have to explicitly specify which pods' persistent volumes to back up.
You will have to put a label on the pods to tell Velero to back up their persistent volumes as snapshots. A label can be anything that makes sense to you. We will use backup=true for our purpose.
kubectl label pods <NAME-OF-POD> backup=true
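When many pods need the label, selecting them by an existing label avoids naming each one. A sketch, assuming the pods already share a label such as `app=elasticsearch` (that label, the `default` namespace, and the helper name `label_pods_for_backup` are all illustrative assumptions):

```shell
# Hypothetical helper: label every pod matching a selector, then list them.
label_pods_for_backup() {
  selector="$1"               # e.g. app=elasticsearch (assumed label)
  namespace="${2:-default}"   # assumed namespace
  kubectl label pods -l "$selector" backup=true -n "$namespace" || return 1
  # Show which pods now carry the backup label
  kubectl get pods -l backup=true -n "$namespace"
}
```

For example, `label_pods_for_backup app=elasticsearch` labels the matching pods in the `default` namespace.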
Finally, you can back up. Here's the command.
velero backup create <NAME> --selector backup=true
This backs up every resource labeled backup=true, which here means the labeled pods along with snapshots of their persistent volumes. Visit your storage container and snapshots to verify that your backup was successful.
PS: It takes a while for snapshot backups to show up in snapshots listing in Azure. In my case, it was 5 minutes.
To list your backup in the terminal run
velero get backups
You should see one backup.
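The listing only shows a status, so it can be useful to look inside the backup as well. A minimal sketch; `inspect_backup` is a hypothetical helper that wraps two Velero subcommands:

```shell
# Hypothetical helper: show the contents and logs of one backup.
inspect_backup() {
  name="$1"
  # --details lists the backed-up resources and volume snapshots
  velero backup describe "$name" --details || return 1
  # Logs help explain a PartiallyFailed or Failed status
  velero backup logs "$name"
}
```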
To restore this backup you will simply have to run
velero restore create --from-backup <NAME>
This will restore from a specific backup.
To test this out, try deleting a deployment after backing up the cluster, then restore it.
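That drill could be scripted roughly as follows. This is a sketch for a non-production cluster only, and it assumes the deployment itself was captured in the backup (for example because it also carries the backup=true label, or the backup was taken without a selector). `restore_drill` and its arguments are hypothetical.

```shell
# Hypothetical helper: delete a deployment, restore it, and check the result.
restore_drill() {
  deployment="$1"
  backup="$2"
  namespace="${3:-default}"   # assumed namespace
  kubectl delete deployment "$deployment" -n "$namespace" || return 1
  # --wait blocks until the restore finishes
  velero restore create --from-backup "$backup" --wait || return 1
  # The deployment should be listed again after the restore
  kubectl get deployment "$deployment" -n "$namespace"
}
```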
Usually, you would want to run backups on a regular schedule. Velero provides a utility for that as well.
velero create schedule daily --selector backup=true --schedule="@every 24h"
This will create a scheduled backup that runs every 24 hours.
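The --schedule flag also accepts standard cron expressions, and --ttl controls how long each backup is kept. A sketch of a variant that runs at 01:00 each day and keeps backups for a week; the schedule name, the time, the retention period, and the helper `create_nightly_schedule` are all assumptions chosen for illustration.

```shell
# Hypothetical helper: create a cron-based schedule with a retention period.
create_nightly_schedule() {
  name="${1:-nightly}"
  velero schedule create "$name" \
    --schedule "0 1 * * *" \
    --selector backup=true \
    --ttl 168h0m0s || return 1
  # List schedules to confirm it was registered
  velero schedule get
}
```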
I hope you find this helpful. Thanks.