Azure End-to-End Data Engineering Project: Incremental data pipeline [Part 1]

Rihab Feki
7 min read · Jan 3, 2025


Using Azure SQL DB, Azure Data Factory, Databricks, Delta Lake, Power BI

Project architecture

In this project you will learn the following concepts:

  • Data modeling — star schema (Fact & Dimensions modeling)
  • Slowly changing dimensions handling & Change Data Capture (CDC)
  • Data Design Pattern: Medallion Architecture
  • Azure Services for Data Engineering

Pre-requisites:

  • Create a Free Azure Account here.
  • Source raw data here.

To get started, log in to your Azure account.

Step 1: Create a Resource Group

Starting with a resource group in Azure is not just a best practice but a foundational step toward effective cloud resource management. Resource groups enhance organization, improve security through access control, facilitate cost tracking, enable consistent deployments, and allow for environment isolation.

Step 2: Create a Storage Account (Datalake)

An Azure storage account contains all your Azure Storage data objects: blobs, files, queues, and tables. The storage account provides a unique namespace for your Azure Storage data accessible from anywhere in the world over HTTP or HTTPS.

Step 3: Create a Data Factory

Data Factory provides a data integration and transformation layer and you can use it to create ETL and ELT pipelines.

Step 4: Create an Azure SQL Database

Azure SQL allows you to create and manage your SQL Server resources from a single view, ranging from fully managed PaaS databases to IaaS virtual machines with direct OS and database engine access.

Create a Server

Continue with the steps to create a managed Azure SQL Database.

In the Networking tab, choose a public endpoint.

Step 5: Create Containers in the Data Lake

As we are following the Medallion Data Design pattern, create three containers:

  • Bronze for the raw data
  • Silver for the transformed data
  • Gold for the aggregated data

Step 6: Create a Table Schema in the Database

Navigate to the created database and click on Query Editor. You will be forwarded to the login interface, where you need to specify your admin credentials.

Log in to the database

Below is an overview of the source data:

Raw data structure

The next step is to create a table with the appropriate schema for the source raw data by creating a new query.

Create Table query
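The exact DDL depends on the columns in the raw CSV; below is a minimal sketch of such a CREATE TABLE query, with illustrative (assumed) table and column names, including a Date_ID column holding identifiers that we will reuse later for incremental loading:

```sql
-- Sketch only: table and column names are assumptions; adapt them to the CSV header.
CREATE TABLE dbo.source_cars_data
(
    Branch_ID   VARCHAR(200),
    Dealer_ID   VARCHAR(200),
    Model_ID    VARCHAR(200),
    Revenue     BIGINT,
    Units_Sold  BIGINT,
    Date_ID     VARCHAR(200)  -- date identifier, used later as the incremental watermark
);
```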

Step 7: Ingest raw data with Data Factory

Launch Data Factory, navigate to the Author tab, and create a pipeline called ‘data_source_prep’. In this pipeline, we will copy data from GitHub to Azure SQL DB using Data Factory.

Then navigate to the Manage tab and click on Linked Services to create an HTTP connection to GitHub (where the source data lives); this could be any other website from which you read raw data.

HTTP linked service to connect to GitHub
Make sure to test the connection before clicking Create

The second linked service will be the Azure SQL database, which we will use to write the ingested data.

When you test the connection, you might get a connection error because of the firewall protecting access to the database. To fix that error, navigate to the Networking settings of the SQL server and enable “Allow Azure services and resources to access this server”, which you should find at the end of the page.

Dynamic ETL: parametrized dataset

After setting up the Linked Services, it is time to configure the data pipeline to create a dynamic dataset with a parameter of the file name.

To do that, go to the Author tab and start configuring the Copy Data activity: in the Source section add a new dataset, select the HTTP data store and, since the data is a CSV, select ‘Delimited Text’ as the format.

Then we need to create the parameter for the dataset (file name).

Click on ‘Open this dataset’ for more advanced configurations to add the parameter
Added the parameter ‘file_name’

Then navigate to the connection tab and click on Add dynamic content to edit the relative URL and add the parameter that was just created.
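For reference, the dynamic content of the relative URL simply concatenates the fixed path to the raw files with the dataset parameter; the repository path below is a placeholder, only dataset().file_name comes from the step above:

```
@concat('<github-user>/<repo>/main/', dataset().file_name)
```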

This way we created the dynamic data source. The next step is to configure the sink, which is the destination where we will load the data in Azure SQL.

Sink config

After configuring the sink, click on Debug to run the pipeline and then click on Publish All to save the work.

Then test that the Copy Data pipeline worked correctly by querying the data in the database.
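A quick sanity check in the Query Editor could look like this (using the table sketched in Step 6):

```sql
-- Row count and a sample of the ingested data
SELECT COUNT(*) AS row_count FROM dbo.source_cars_data;
SELECT TOP 10 * FROM dbo.source_cars_data;
```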

Step 8: Incremental data Loading

In this step, we need to load new data incrementally and automatically. To do that, we will create two pipelines, one for the initial load and one for the incremental load, and two parameters to store the current load date and the last load date.

Create a Watermark table to store the last load date identifier.

Any record with a date identifier above that DT0000 value will be inserted, so the initial run loads all the data.
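A minimal sketch of that watermark table (table and column names are assumptions), seeded with an initial value below every real date identifier:

```sql
-- Holds a single row with the last loaded date identifier
CREATE TABLE dbo.water_table
(
    last_load VARCHAR(200)
);

-- Seed with an identifier lower than any Date_ID so the initial run picks up everything
INSERT INTO dbo.water_table VALUES ('DT0000');
```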

Now create a stored procedure to update this value in the watermark table after each load.
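A sketch of such a procedure, assuming the watermark table above; it simply overwrites last_load with the value passed in:

```sql
CREATE PROCEDURE dbo.UpdateWatermarkTable
    @lastload VARCHAR(200)
AS
BEGIN
    -- Overwrite the stored watermark with the latest loaded date identifier
    UPDATE dbo.water_table
    SET last_load = @lastload;
END;
```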

The next step is to add two Lookup activities: one that captures the last_load date and another that captures the current_load date.

In the configuration of the lookup, select the dataset in the settings tab and then create a parameter for the table_name (which will be used in both Lookups).

Let’s continue with the configuration of the Lookup activity ‘last_load’ to get the last load value via a query.

Then, configure the settings of the Lookup “current_load” as shown below.
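For illustration, and reusing the table names assumed earlier, the two Lookup queries could look like this:

```sql
-- 'last_load' Lookup: read the stored watermark
SELECT * FROM dbo.water_table;

-- 'current_load' Lookup: latest date identifier currently in the source table
SELECT MAX(Date_ID) AS max_date FROM dbo.source_cars_data;
```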

Then deactivate the other Activities and click on Debug to check the output of that activity. Afterward, connect the two lookups to the copy data activity on success by dragging the green check button.

Next, configure the Copy Data activity, named copy_incremental_data. Add the needed info in the Source tab, then use dynamic content to pass in the parameters built from the Lookup outputs.

The output of each Lookup activity is a JSON array under value. To get the max_date we therefore write output.value[0].max_date, and likewise output.value[0].last_load, and substitute these into the corresponding parameters.
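Putting that together, the source query of the Copy Data activity can be built with ADF string interpolation over the two Lookup outputs (activity, table, and column names follow the sketches above):

```sql
SELECT *
FROM dbo.source_cars_data
WHERE Date_ID >  '@{activity('last_load').output.value[0].last_load}'
  AND Date_ID <= '@{activity('current_load').output.value[0].max_date}'
```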

The next step is to add the sink information and at this level, we want to save the data to the Datalake (Bronze layer).

The data store would be Azure Data Lake Storage Gen2. Choose the Parquet format to save the data. Call the dataset ds_bronze, create a linked service, and link it to the Data Lake we created at the beginning.

Then the last step in the sink is to set the properties, as follows:

Then click on Debug to test the pipeline.

At this stage, we can move to the next step by adding a Stored Procedure activity to the pipeline that updates the last_load in the watermark_table with the current_load (max_date) once the data increment has been copied.
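In that Stored Procedure activity, select the procedure created earlier and set its parameter with dynamic content, for example (names as in the sketches above):

```
@activity('current_load').output.value[0].max_date
```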

After configuring the Stored Procedure activity, we debug the whole pipeline, which now dynamically loads incremental data.

Finally, make sure to publish the pipelines to save the work after validating all the steps.

That was it for the first part of the project: ingesting the data from the source (GitHub) into Azure SQL DB with Data Factory and building a pipeline that dynamically loads incremental data and saves it to the Bronze layer of the data lake.

Next, we will cover how to use Databricks to transform the data and create the star schema to model the data for analysis.

Thank you for reading, and see you in Part 2 of this end-to-end DE project.
