How to deploy Azure Data Factory and Data Pipelines using PowerShell
In this blog, we will walk through an automation script for deploying Azure Data Factory (ADF), a data pipeline, and its entities (i.e. linked services and data sets) to copy data from an Azure virtual machine to an Azure Storage account.
In this article, we will perform the following steps:
- Create Azure Data Factory
- Create Azure VM Linked service
- Create Azure Storage Linked service
- Create source File Share data set
- Create target Storage data set
- Create a pipeline with a copy activity to move the data from the file to the storage account
Prerequisites:
- Azure subscription: If you don't have an Azure subscription, create a free account before you begin.
- Azure virtual machine: We use a csv file on an Azure VM as the source data store to copy data from. If you don't have a VM set up, see the instructions to create an Azure virtual machine.
- Azure storage account: We use Table storage as the target data store. If you don't have an Azure storage account, see the instructions to create a storage account.
Azure PowerShell is a set of cmdlets (lightweight commands used in the Windows PowerShell environment) for managing Azure resources directly from the PowerShell command line. It simplifies and automates creating, testing, and deploying Azure cloud platform services using PowerShell.
Before starting, install the latest Azure PowerShell module and run the following command to sign in to your Azure account. It will prompt you to enter your username and password.
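A minimal sketch of the sign-in step, assuming the Az PowerShell module from the PowerShell Gallery:

```powershell
# Install the Az module if it is not already present (current-user scope)
Install-Module -Name Az -Scope CurrentUser -Repository PSGallery -Force

# Sign in interactively; this prompts for your Azure username and password
Connect-AzAccount
```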
Create Azure Data Factory:
Azure Data Factory is a cloud-based data integration service that allows you to create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.
The Data Factory service allows us to create pipelines that help to move and transform data; a pipeline can have one or more activities that perform the move/transform steps.
Let's start creating the data factory:
- To create an Azure resource group, run the following commands
- To create the Azure Data Factory (V2), run the Set-AzDataFactoryV2 cmdlet
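The two steps above can be sketched as follows; the resource group name, location, and factory name are example values, so substitute your own:

```powershell
# Example names (replace with your own values)
$resourceGroupName = "DemoADF-RG"
$location          = "EastUS"
$dataFactoryName   = "DemoADF"

# Create the resource group that will hold the data factory
New-AzResourceGroup -Name $resourceGroupName -Location $location

# Create the (V2) data factory in that resource group
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroupName `
                    -Location $location `
                    -Name $dataFactoryName
```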
Finally, we can see the data factory created in the Azure portal.
In the next steps, we will create the data pipeline and its dependencies for copying data from the Azure virtual machine to Azure Table storage.
Before creating a pipeline, we need to create the data factory entities. To copy file data from an Azure VM directory path to an Azure Storage table, we have to create two linked services: Azure Storage and Azure VM File system. Then we create two data sets: an Azure Storage output data set (which refers to the Azure Storage linked service) and an Azure VM file share source data set (which refers to the Azure VM linked service). The Azure Storage and Azure VM File system linked services contain the connection strings that Data Factory uses at run time to connect to your Azure Storage account and Azure VM respectively.
In addition, to run copy activities between an on-premises data store (here, the VM's file system) and a cloud data store, we have to install an integration runtime. The self-hosted integration runtime needs to be installed on an on-premises machine or an Azure virtual machine. I have installed it on the virtual machine; please follow the Microsoft documentation to install the IR via the Azure Data Factory UI.
Linked services are much like connection strings: they define the connection information that Data Factory needs to connect to external resources. ADF manages activities across a variety of data sources such as Azure Storage, Azure SQL, Azure Data Lake, SSIS, on-premises systems, and more. To communicate with these sources, ADF requires a connection definition known as a linked service.
Create Azure VM Linked Service:
The Azure VM linked service links the VM directory file path to our data factory. It is used as the input (source) store. This linked service has the Azure VM connection information that the Data Factory service uses at run time to connect to it.
- Define a JSON file named 'FileCopy_VM_LinkedService' with the connection information properties
- Run the following Set-AzDataFactoryV2LinkedService cmdlet to create the linked service for the Azure VM
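A sketch of the steps above, assuming a file server (SMB share) linked service; the host path, credentials, and self-hosted integration runtime name are placeholders to replace with your own:

```json
{
    "name": "FileCopy_VM_LinkedService",
    "properties": {
        "type": "FileServer",
        "typeProperties": {
            "host": "\\\\<vm-name-or-ip>\\<shared-folder>",
            "userId": "<vm-user>",
            "password": { "type": "SecureString", "value": "<vm-password>" }
        },
        "connectVia": {
            "referenceName": "<SelfHostedIntegrationRuntimeName>",
            "type": "IntegrationRuntimeReference"
        }
    }
}
```

Save the definition to a file and pass it to the cmdlet:

```powershell
# Create the Azure VM file system linked service from the JSON definition
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "FileCopy_VM_LinkedService" `
    -DefinitionFile ".\FileCopy_VM_LinkedService.json"
```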
Create Azure Storage linked service:
The Azure Storage linked service links the storage account to our data factory 'DemoADF'. It is used as the output (sink) store. The linked service has the storage account connection key information that the Data Factory service uses at run time to connect to it.
- Define a JSON file named 'StorageLinkedService' with the connection string properties. Replace the account name and account key with your Azure Storage account values.
- Run the Set-AzDataFactoryV2LinkedService cmdlet to create the linked service for the Azure Storage account
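A sketch of the storage linked service definition and deployment; the account name and key are placeholders:

```json
{
    "name": "StorageLinkedService",
    "properties": {
        "type": "AzureStorage",
        "typeProperties": {
            "connectionString": {
                "type": "SecureString",
                "value": "DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey>"
            }
        }
    }
}
```

```powershell
# Create the Azure Storage linked service from the JSON definition
Set-AzDataFactoryV2LinkedService -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "StorageLinkedService" `
    -DefinitionFile ".\StorageLinkedService.json"
```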
We can now see the two linked services created in the data factory.
A data set is a named view of data that points to or references the data we want to use in our activities as input and output. Data sets identify data within different data stores, such as tables, files, folders, and documents. In other words, a data set is not the actual data; it is just a set of JSON instructions that defines where and how the data is stored.
In our scenario, we need to create two data sets: one pointing to the Azure Storage linked service (output) and another pointing to the Azure VM linked service (input).
Create Azure VM FileShare dataset:
- Define a source data set JSON named 'Source_VM_Dataset'. This data set specifies the directory path of the .txt file on the Azure VM from which the source data is pulled.
- Run the following Set-AzDataFactoryV2Dataset cmdlet to create the source data set
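A sketch of the source data set, assuming a delimited-text file on the VM share; the folder path and file name are placeholders:

```json
{
    "name": "Source_VM_Dataset",
    "properties": {
        "type": "FileShare",
        "linkedServiceName": {
            "referenceName": "FileCopy_VM_LinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "folderPath": "<folder-path>",
            "fileName": "<source-file>.txt",
            "format": { "type": "TextFormat", "columnDelimiter": "," }
        }
    }
}
```

```powershell
# Create the source data set from the JSON definition
Set-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "Source_VM_Dataset" `
    -DefinitionFile ".\Source_VM_Dataset.json"
```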
Create Azure Storage Table data set:
- Define an output data set JSON named 'Output_Storage_DataSet'. This data set specifies the Azure Table storage table in the storage account to which the data is to be copied.
- Run the following Set-AzDataFactoryV2Dataset cmdlet to create the output data set
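A sketch of the output data set; the table name is a placeholder:

```json
{
    "name": "Output_Storage_DataSet",
    "properties": {
        "type": "AzureTable",
        "linkedServiceName": {
            "referenceName": "StorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "tableName": "<target-table-name>"
        }
    }
}
```

```powershell
# Create the output data set from the JSON definition
Set-AzDataFactoryV2Dataset -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "Output_Storage_DataSet" `
    -DefinitionFile ".\Output_Storage_DataSet.json"
```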
We can now see the source and target data sets created in the data factory.
A pipeline is a logical grouping of activities that together perform a task. The pipeline allows you to manage the activities as a set instead of each one individually; i.e. a pipeline operates on the data to transform it and run analysis on top of it.
- Define a JSON file named 'Demo_Copy_pipeline' that consists of a copy activity using the source and target data sets to copy the data from the VM to the storage account
- Run the following Set-AzDataFactoryV2Pipeline cmdlet to create the pipeline with the copy activity
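A sketch of the pipeline definition with a single copy activity wiring the file system source to the table sink; the activity name is an example:

```json
{
    "name": "Demo_Copy_pipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyFromVMToTable",
                "type": "Copy",
                "inputs":  [ { "referenceName": "Source_VM_Dataset", "type": "DatasetReference" } ],
                "outputs": [ { "referenceName": "Output_Storage_DataSet", "type": "DatasetReference" } ],
                "typeProperties": {
                    "source": { "type": "FileSystemSource" },
                    "sink":   { "type": "AzureTableSink" }
                }
            }
        ]
    }
}
```

```powershell
# Create the pipeline from the JSON definition
Set-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -Name "Demo_Copy_pipeline" `
    -DefinitionFile ".\Demo_Copy_pipeline.json"
```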
We can now see the pipeline created in the data factory.
To invoke/run the pipeline we can add schedule triggers or event triggers; to demonstrate here, we will invoke the pipeline using PowerShell cmdlets.
Invoking the pipeline generates a run ID for each specific run, and we can see the successfully executed pipeline run in the ADF Monitor tab and the final output in the Azure Storage account.
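A sketch of invoking the pipeline and checking its run status from PowerShell:

```powershell
# Trigger a pipeline run; the cmdlet returns the run ID
$runId = Invoke-AzDataFactoryV2Pipeline -ResourceGroupName $resourceGroupName `
             -DataFactoryName $dataFactoryName `
             -PipelineName "Demo_Copy_pipeline"

# Check the run status for that run ID (also visible in the ADF Monitor tab)
Get-AzDataFactoryV2PipelineRun -ResourceGroupName $resourceGroupName `
    -DataFactoryName $dataFactoryName `
    -PipelineRunId $runId
```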
We can also verify the data in the final destination target table in the Azure Table storage account.
In this blog, we have seen Azure Data Factory concepts and how to create data pipelines, linked services, and data sets, and invoke them using PowerShell. To read more about Azure PowerShell and Data Factory, refer to the documentation below.