Data Warehousing (DWH) best practices

Rihab Feki
Dec 13, 2024


How to model data in a data warehouse


If you are new to DWH concepts and data modeling, this article, inspired by Project A's data modeling best practices (Part I and Part II), will give you in-depth knowledge to get you started in the data engineering field.

Methodology:

The first thing you need to learn as a data engineer is how to derive analytical entities from business questions. For each business question, you will evaluate two things:

  • Is the data I have enough to answer the business question?
  • How can I extend the data models to answer the business question?

Oftentimes, you will be asked by a stakeholder to create a dashboard. You will have the option to do it quick & dirty or correct & scalable.

The quick and dirty path (without data modeling)

Option 1: Reporting tool directly on data sources

Option 2: Move data into a database, then analyze in the reporting tool

Option 3: Create Ad hoc views and tables for specific charts/dashboards

This approach has its caveats:

  • A single source of truth is hard to maintain, e.g., business logic and metric definitions diverge across dashboards and teams, and there is little transparency into data lineage.
  • It is hard to scale with business complexity, e.g., it is difficult to update business logic across dashboards, and things get messy when data needs to be joined from different data sources.
  • Data stays siloed in the reporting tool.

Data modeling: Kimball-style dimensional data modeling

This approach consists of creating a unified analytical model & activating data across the whole company (a minimal sketch of the two schema styles follows the list below):

  • Model data as a star or snowflake schema
  • Normalize data where it helps (a snowflake schema keeps dimensions close to 3NF, while a star schema denormalizes them for simpler, faster queries)
  • Enforce referential integrity (fact tables reference dimension tables via foreign keys)
  • Use consistent naming
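
To make the star vs. snowflake distinction concrete, here is a minimal sketch of the same product dimension in both styles (all table and column names are hypothetical):

```sql
-- Star style: the category is flattened into the dimension row,
-- so queries need only one join.
CREATE TABLE dim_product (
    product_key      INTEGER PRIMARY KEY,
    product_name     VARCHAR(255),
    product_category VARCHAR(255)
);

-- Snowflake style: the category is normalized into its own table
-- and referenced via a foreign key (referential integrity).
CREATE TABLE dim_product_category (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(255)
);

CREATE TABLE dim_product_snowflake (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(255),
    category_key INTEGER REFERENCES dim_product_category (category_key)
);
```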

Let’s get into a practical example. Your boss asks: I want to know the revenue!

Your task:

  • Identify the business processes/entities that create revenue
  • Anticipate the business questions that users will have related to revenue
  • Align on well-defined labels like “customer” or “revenue”
  • Find a general representation (dimensional model = fact tables + dimension tables) that answers all of these business questions at once; a sketch of such a model follows this list
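
A minimal sketch of such a revenue model, again with hypothetical names; the dim_product table from the earlier sketch serves as the product dimension:

```sql
-- Dimension tables describe the context of each revenue event.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name VARCHAR(255),
    country       VARCHAR(255)
);

CREATE TABLE dim_date (
    date_key       DATE PRIMARY KEY,
    calendar_year  INTEGER,
    calendar_month INTEGER,
    season         VARCHAR(31)
);

-- The fact table holds one row per revenue-creating event (e.g. an
-- order) and references every dimension via a foreign key.
CREATE TABLE fact_order (
    order_key      INTEGER PRIMARY KEY,
    customer_key   INTEGER REFERENCES dim_customer (customer_key),
    product_key    INTEGER REFERENCES dim_product (product_key),
    date_key       DATE REFERENCES dim_date (date_key),
    payment_method VARCHAR(63),
    revenue        NUMERIC(12, 2)  -- the measure to aggregate
);
```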

Rules of Thumb:

  • If an entity is fundamental, it is better to pre-compute it in the database than to compute it in the reporting tool, where the computation happens in memory.
  • The mental model for translating business questions into a dimensional model (a query sketch follows this list):
    - each table has a fixed set of fields
    - Simple aggregations on a single field (AVG, COUNT, SUM, etc.)
    - Filtering (filtering by date, season, location, etc)
    - Grouping (grouping by product, payment method, etc)
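
Applied to the hypothetical revenue model sketched above, all three operations combine into a single query:

```sql
-- Aggregate (SUM, COUNT), filter (by season), and group (by category)
-- in one pass over the star schema.
SELECT
    p.product_category,
    SUM(f.revenue) AS total_revenue,
    COUNT(*)       AS order_count
FROM fact_order AS f
JOIN dim_product AS p ON p.product_key = f.product_key
JOIN dim_date    AS d ON d.date_key = f.date_key
WHERE d.season = 'winter'
GROUP BY p.product_category;
```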

Other considerations in your data modeling methodology are:

  • Time perspectives: most business processes unfold over time as events, e.g., placing orders, shipping, etc.
  • Changes of entities over time: the latest/current state is not enough, e.g., order status (placed, paid, shipped) or customer funnel stage (new, marketing qualified, etc.); a common pattern for tracking this is sketched below.
  • Changes of metrics over time: e.g., the cancellation of an order.
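
One common pattern for keeping entity history is a Type 2 slowly changing dimension: instead of overwriting a row, each change closes the current version and opens a new one. A sketch, as a Type 2 variant of the hypothetical dim_customer from above:

```sql
-- One row per version of a customer, not per customer.
CREATE TABLE dim_customer_scd2 (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key, one per version
    customer_id  INTEGER,              -- business key, stable across versions
    funnel_stage VARCHAR(63),          -- e.g. 'new', 'marketing qualified'
    valid_from   DATE,
    valid_to     DATE,                 -- NULL while this version is current
    is_current   BOOLEAN
);

-- State of each customer as of a given date:
SELECT customer_id, funnel_stage
FROM dim_customer_scd2
WHERE valid_from <= DATE '2024-12-01'
  AND (valid_to IS NULL OR valid_to > DATE '2024-12-01');
```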

The previous phase belongs to requirements gathering, which comes before the actual modeling starts. Now we will go through the data modeling itself.

Dimensional modeling = relations between entities

Oftentimes, data in the DWH is modeled differently than in the data sources; this is typically because we are moving from an OLTP to an OLAP setup. When designing a data warehouse, the data engineer remodels the data, creating entities based on analytical needs. The entities need to be driven by what you want to report on, e.g., if the analytics team wants to segment customers by product category, then the product category should become a separate entity.
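
A sketch of that remodeling step, assuming a raw source_products table with a free-text category column and a target entity like the dim_product_category sketched earlier:

```sql
-- Derive the category entity from the flat source table so the
-- analytics team can segment by it; ROW_NUMBER() assigns a surrogate key.
CREATE TABLE dim_product_category AS
SELECT
    ROW_NUMBER() OVER (ORDER BY category) AS category_key,
    category                              AS category_name
FROM (SELECT DISTINCT category FROM source_products) AS distinct_categories;
```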

Data pipelines as code:

Using data pipelines as code has become the mainstream in data engineering, especially with tools like dbt, an open-source transformation workflow tool. A rule of thumb is that data pipelines should be reproducible, idempotent, and immutable: running a pipeline twice on the same input produces the same result.
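
As an illustration, a dbt model is just a SQL file with templated references to its inputs; the sketch below assumes a hypothetical staging model called stg_orders:

```sql
-- models/fact_order.sql (hypothetical dbt model)
-- dbt rebuilds this table from its inputs on every run, which keeps the
-- pipeline reproducible and idempotent: same input, same output.
select
    order_id,
    customer_id,
    order_date,
    amount as revenue
from {{ ref('stg_orders') }}
where status <> 'canceled'
```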

Leave data in the database: SQL vs Pandas

If you have to choose between doing data analysis in Pandas or in SQL, use SQL as the data processing language. Databases generally compute aggregations faster and are less memory-intensive than pulling the raw data into Pandas on a single machine.
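
For example, an aggregation you might otherwise pull into Pandas (df.groupby('payment_method')['revenue'].sum()) can stay in the database, so only the small result set ever leaves it:

```sql
-- The database scans the fact table; the client receives a few rows.
SELECT
    payment_method,
    SUM(revenue) AS total_revenue
FROM fact_order
GROUP BY payment_method
ORDER BY total_revenue DESC;
```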

I hope this article was useful. Stay tuned for more articles on data engineering.
