Azure End-to-End Data Engineering Project: Incremental data pipeline [Part 1]
Using Azure SQL DB, Azure Data Factory, Databricks, Delta Lake, Power BI
In this project you will learn the following concepts:
- Data modeling — star schema (Fact & Dimensions modeling)
- Slowly changing dimensions handling & Change Data Capture (CDC)
- Data Design Pattern: Medallion Architecture
- Azure Services for Data Engineering
Pre-requisites:
To get started, log in to your Azure account.
Step 1: Create a Resource Group
Starting with a resource group in Azure is not just a best practice but a foundational step toward effective cloud resource management. Resource groups enhance organization, improve security through access control, facilitate cost tracking, enable consistent deployments, and allow for environment isolation.
Step 2: Create a Storage Account (Datalake)
An Azure storage account contains all your Azure Storage data objects: blobs, files, queues, and tables. The storage account provides a unique namespace for your Azure Storage data accessible from anywhere in the world over HTTP or HTTPS.
Step 3: Create a Data Factory
Data Factory provides a data integration and transformation layer and you can use it to create ETL and ELT pipelines.
Step 4: Create an Azure SQL Database
Azure SQL allows you to create and manage your SQL Server resources from a single view, ranging from fully managed PaaS databases to IaaS virtual machines with direct OS and database engine access.
Continue with the remaining steps to create a managed Azure SQL Database.
Step 5: Create Containers in the Data Lake
As we are following the Medallion Data Design pattern, create three containers:
- Bronze for the raw data
- Silver for the transformed data
- Gold for the aggregated data
Step 6: Create a Table Schema in the Database
Navigate to the database you created and click on Query Editor; you will be forwarded to a login screen where you need to enter your admin credentials.
Below is an overview of the source data:
The next step is to open a new query and create a table with a schema that matches the raw source data.
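For illustration, the table definition might look like the sketch below; the table and column names here are hypothetical, so adjust them to the columns in your own source file.

```sql
-- Hypothetical schema: adjust names and types to match the source CSV.
CREATE TABLE dbo.source_sales_data (
    Sale_ID     INT          NOT NULL,
    Branch_Name VARCHAR(100),
    Model_Name  VARCHAR(100),
    Units_Sold  INT,
    Revenue     BIGINT,
    Date_ID     VARCHAR(20)  -- used later as the watermark column for incremental loads
);
```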
Step 7: Ingest raw data with Data Factory
Launch Data Factory, navigate to the Author tab, and create a pipeline called ‘data_source_prep’. In this pipeline we will copy data from GitHub to the Azure SQL database using Data Factory.
Then navigate to the Manage tab and click on Linked Services to create an HTTP connection to GitHub (where the source data lives); the same approach works for any other website from which you read raw data.
The second linked service is the Azure SQL database, which is where we will write the ingested data.
When you test the connection, you might get a connection error because the database firewall blocks access by default. To fix it, navigate to the Networking settings of the SQL server and enable the “Allow Azure services and resources to access this server” exception, which you will find at the bottom of the page.
Dynamic ETL: parametrized dataset
After setting up the Linked Services, it is time to configure the data pipeline and create a dynamic dataset with a file-name parameter.
To do that, go to the Author tab and configure the Copy Data activity: in the source section add a new dataset, choose the HTTP data store, and, since the data is a CSV, select ‘Delimited Text’ as the format.
Then we need to create the parameter (the file name) for the dataset.
Then navigate to the connection tab and click on Add dynamic content to edit the relative URL and add the parameter that was just created.
This gives us a dynamic data source. The next step is to configure the sink, the destination in Azure SQL where the data will be loaded.
After configuring the sink, click on Debug to run the pipeline and then click on Publish All to save the work.
Then verify that the Copy Data pipeline worked correctly by querying the ingested table in the database.
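As a quick sanity check (using the hypothetical table name from Step 6), you can count the ingested rows and inspect a sample in the Query Editor:

```sql
-- Quick sanity check after the copy: row count and a sample of the ingested data.
SELECT COUNT(*) AS row_count FROM dbo.source_sales_data;

SELECT TOP (10) * FROM dbo.source_sales_data;
```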
Step 8: Incremental Data Loading
In this step, we need to load new data incrementally and automatically. To do that, we will create two pipelines, one for the initial load and one for the incremental load, along with two parameters that store the current load date and the last load date.
Create a Watermark table to store the last load date identifier.
Now create a stored procedure that updates this value in the watermark table after every load.
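A minimal sketch of both objects, assuming the hypothetical names water_table, last_load, and UpdateWatermarkTable (whatever names you pick, keep them consistent with the pipeline):

```sql
-- Watermark table: a single row holding the last Date_ID that was loaded.
CREATE TABLE dbo.water_table (
    last_load VARCHAR(20)
);

-- Seed it with a value older than any Date_ID in the source,
-- so the first (initial) run picks up everything.
INSERT INTO dbo.water_table (last_load) VALUES ('DT00000');
GO  -- run the procedure definition in its own batch

-- Stored procedure: advances the watermark at the end of each pipeline run.
CREATE PROCEDURE dbo.UpdateWatermarkTable
    @lastload VARCHAR(20)
AS
BEGIN
    UPDATE dbo.water_table
    SET last_load = @lastload;
END
```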
The next step is to add two Lookup activities: one that captures the last_load date and one that captures the current_load date.
In the Lookup configuration, select the dataset in the Settings tab and create a parameter for the table_name (the same dataset will be used by both Lookups).
Let’s continue with the configuration of the Lookup activity ‘last_load’ to get the last load value via a query.
Then, configure the settings of the Lookup “current_load” as shown below.
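Assuming the watermark and source tables sketched earlier, and with “First row only” disabled so the result lands in the activity’s value array, the two Lookup queries can be as simple as:

```sql
-- Query for the 'last_load' Lookup: read the stored watermark.
SELECT last_load FROM dbo.water_table;

-- Query for the 'current_load' Lookup: the newest Date_ID currently in the source table.
SELECT MAX(Date_ID) AS max_date FROM dbo.source_sales_data;
```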
Then deactivate the other activities and click on Debug to check the output of the Lookups. Afterward, connect the two Lookups to the Copy Data activity on success by dragging from the green check output.
Next, configure the Copy Data activity, copy_incremental_data: fill in the source tab, then add dynamic content to pass in the outputs of the two Lookup activities.
The output of each Lookup activity is a JSON object with a value array, so to get the max date we reference output.value[0].max_date, and likewise output.value[0].last_load for the last load value, and substitute them into the parameters.
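A sketch of what the resulting source query can look like, again using the hypothetical table and column names; the @{...} parts are Data Factory dynamic-content expressions that are resolved before the query is sent to SQL:

```sql
-- Source query for copy_incremental_data: only rows newer than the watermark,
-- up to and including the current maximum date.
SELECT *
FROM dbo.source_sales_data
WHERE Date_ID > '@{activity('last_load').output.value[0].last_load}'
  AND Date_ID <= '@{activity('current_load').output.value[0].max_date}'
```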
The next step is to add the sink information; at this stage, we want to save the data to the Datalake (Bronze layer).
The sink data store is Azure Data Lake Storage Gen2. Choose Parquet as the format to save the data. Call the dataset ds_bronze, create a linked service, and link it to the Datalake we created at the beginning.
Then the last step in the sink is to set the properties, as follows:
Then click on Debug to test the pipeline.
At this stage, we add a Stored Procedure activity to the pipeline that updates the watermark table, replacing last_load with the current_load value (max_date) once the incremental data has been copied.
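In the activity settings, select the procedure created earlier and map its parameter to the current_load output; the call it ends up issuing is roughly equivalent to the following (hypothetical names as before):

```sql
-- The @lastload parameter is mapped to the current_load Lookup output in the activity,
-- so the watermark becomes the max date of the data that was just copied.
EXEC dbo.UpdateWatermarkTable
     @lastload = '@{activity('current_load').output.value[0].max_date}';
```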
After configuring the Stored Procedure activity, debug the whole pipeline, which now loads new data incrementally and dynamically.
Finally, make sure to publish the pipelines to save the work once all the steps have been validated.
That was it for the first part of the project: ingesting the data from the source (GitHub) with Data Factory and creating a data pipeline that dynamically loads incremental data and saves it to the Bronze layer.
Next, we will cover how to use Databricks to transform the data and create the star schema that models the data used for analysis.
Thank you for reading, and see you in Part 2 of this end-to-end DE project.