Design WORM models with DBT

Christopher Lagali
6 min read · Nov 6, 2023


Write Once, Read Many (WORM) models are a way of saving a state of the data so that it can be reused multiple times in an environment where the source dataset constantly changes.

Data teams often struggle to perform ETL/ELT transformations and then justify the resultant numbers in an environment where the source dataset keeps changing. Data that refreshes on a daily schedule is still manageable, but it's a challenge to keep up with a source that changes throughout the day. The problem worsens when the source table runs on a truncate-and-load approach and lacks a date/timestamp column (which happens more often than you can imagine).

WORM models can help in such scenarios: a snapshot of the source data is used to refresh the downstream systems. Debugging also becomes a lot easier, as one can refer to the snapshot to justify the numbers.

Note: DBT (Data Build Tool) does have a snapshot feature; however, it relies on a timestamp column in the source to determine whether a refresh is necessary.

Photo by Yes and Studio on Unsplash

Sold! What do I have to do?

There are two steps to bring this model to fruition:
a. Build the DBT model with a variable.
b. Use an Airflow DAG to dictate how, when, and how often these models run.

On the DBT side

There are a few housekeeping steps to transform the DBT model into a more dynamic one.

  1. Setting up a variable

These variables are usually specified in the project YML file with a default value, which gives DBT an initial idea of the data type to expect. Additionally, DBT uses this default value during compilation.

Project Variable
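A minimal sketch of how that declaration might look inside the project YML file (the variable name refresh_date comes from the run commands later in the article, while the default value here is just an assumption):

# dbt_project.yml (excerpt)
vars:
  refresh_date: '2023-11-04'  # default used during compilation; overridden from the Airflow DAG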

This value will be supplied from the Airflow DAG, which will be addressed in the DAG setup section.

For this illustration, DBT will use this variable not only to create a snapshot but also to fetch data from that same snapshot (once the snapshot has been created).

Feel free to check out project variables in DBT via this link.

2. Specifying source

DBT needs to know the source tables involved and any other optional information that could give it more context.

I have kept all the source tables in a separate folder, aptly named source. Thus, all the materialization and tagging info for such tables is kept under that folder name inside the project YML file.

Tags are used here to group all the necessary tables together so that they can be invoked via the run command. This logical grouping can be easily performed in the project YML file.

Thus, a tag name tells DBT which tables we wish to execute.

For this illustration, the complete_refresh tag has been applied to the tables housed within the source folder. Thus, DBT will re-create/refresh all the tables within the source folder when the complete_refresh tag is passed during the dbt run.

Feel free to check out tagging in DBT via this link.
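A minimal sketch of how this grouping might be declared in the project YML file (the project name and the source folder appear in the article's run commands; the materialization shown is an assumption):

# dbt_project.yml (excerpt)
models:
  dbt_worm_based_model_project:
    source:
      +tags: complete_refresh      # everything in the source folder is picked up by --select tag:complete_refresh
      +materialized: incremental   # assumed; the article describes the models as incremental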

3. Model

The model files here are simple incremental models, free of any complex Jinja templating.

TPCH Lineage graph

There are 2 model files involved here:

  • Snapshot table: CUSTOMER_SNAPSHOT (the customer snapshot table)
  • Final table: TPCH_WORM_MODEL (the target table that uses the customer snapshot table)

Notice that the snapshot table is incremental in nature and is refreshed by a date field named snapshot_date. It is this field that makes the CUSTOMER_SNAPSHOT table a WORM table.

For a particular day, the workflow will take a snapshot of the ever-changing TPCH Customer model and reuse the contents of this table across multiple refresh cycles of the downstream target tables.
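As a rough illustration, the CUSTOMER_SNAPSHOT model could look something like the sketch below; the source() names and the dedup filter are assumptions, while the incremental materialization, the refresh_date variable and the snapshot_date field come from the article:

-- models/source/SNOWFLAKE_SAMPLE_DATA/TPCH_SF1/CUSTOMER_SNAPSHOT.sql (path inferred from the --exclude selector used later)
{{ config(materialized='incremental') }}

select
    c.*,                                                -- whatever columns the ever-changing source exposes
    '{{ var("refresh_date") }}'::date as snapshot_date  -- the date stamp that makes this a WORM table
from {{ source('TPCH_SF1', 'CUSTOMER') }} as c

{% if is_incremental() %}
-- append rows only for a date that has not been snapshotted yet
where '{{ var("refresh_date") }}'::date not in (select snapshot_date from {{ this }})
{% endif %}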

The downstream target table, named TPCH_WORM_MODEL (the naming needs some work), references the snapshot table for metrics from a specific date.
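A matching sketch of the downstream TPCH_WORM_MODEL; the metric and column names are assumptions, the essential part being that it reads only from the snapshot model and filters on the refresh_date variable:

-- TPCH_WORM_MODEL.sql (hypothetical content)
{{ config(materialized='incremental', unique_key='snapshot_date') }}

select
    snapshot_date,
    count(*)            as customer_count,  -- assumed metric
    current_timestamp() as refreshed_at     -- changes on every read run while the counts stay stable
from {{ ref('CUSTOMER_SNAPSHOT') }}
where snapshot_date = '{{ var("refresh_date") }}'::date
group by snapshot_date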

Note: for this to be a WORM, the DAG must execute the snapshot model only once (per day, or during the snapshot phase). That will be addressed in the DAG creation section.

Now for the scheduling part

For this model to be a WORM, Airflow must take on the additional responsibility of determining when to execute the snapshot model and when to run the subsequent read runs.

For a better illustration, I have created 2 branches using the ShortCircuitOperator (link).

DAG graph with 2 branches of execution

The ShortCircuitOperator switches between the snapshot branch and the read branch based on the Airflow variable value we pass into the DAG.
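A rough sketch of that branching, assuming Airflow 2.x: the task names (dbt_run_snapshot_mode, dbt_run_read_mode), the DT date value, the complete_refresh tag and the --exclude selector follow the article, while the DAG id, schedule, paths and the run_mode Airflow Variable are assumptions.

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

DBT_CONNECTION_PATH = "/path/to/profiles"  # assumption
DBT_PROJECT_PATH = "/path/to/dbt/project"  # assumption
DT = Variable.get("refresh_date", default_var="2023-11-04")  # snapshot date passed down to dbt


def _run_mode_is(expected: str) -> bool:
    # "snapshot" proceeds down the write branch, "read" down the read branch (variable name assumed)
    return Variable.get("run_mode", default_var="read") == expected


snapshot_cmd = (
    f"dbt run --profiles-dir {DBT_CONNECTION_PATH} --project-dir {DBT_PROJECT_PATH} "
    f"--select tag:complete_refresh --vars '{{refresh_date: {DT} }}'"
)
read_cmd = (
    f"dbt run --profiles-dir {DBT_CONNECTION_PATH} --project-dir {DBT_PROJECT_PATH} "
    f"--select tag:complete_refresh "
    f"--exclude dbt_worm_based_model_project.source.SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER_SNAPSHOT "
    f"--vars '{{refresh_date: {DT} }}'"
)

with DAG(
    dag_id="dbt_worm_model",  # assumption
    start_date=datetime(2023, 11, 1),
    schedule_interval=None,  # triggered on whatever cadence the use case demands
    catchup=False,
) as dag:
    # Each branch proceeds only when its mode matches; the other branch gets short-circuited.
    is_snapshot_run = ShortCircuitOperator(
        task_id="is_snapshot_run",
        python_callable=_run_mode_is,
        op_args=["snapshot"],
    )
    is_read_run = ShortCircuitOperator(
        task_id="is_read_run",
        python_callable=_run_mode_is,
        op_args=["read"],
    )

    dbt_run_snapshot_mode = BashOperator(task_id="dbt_run_snapshot_mode", bash_command=snapshot_cmd)
    dbt_run_read_mode = BashOperator(task_id="dbt_run_read_mode", bash_command=read_cmd)

    is_snapshot_run >> dbt_run_snapshot_mode
    is_read_run >> dbt_run_read_mode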

Snapshot Action: a normal run command that refreshes both the customer_snapshot table and the tpch_worm_model table for the date value passed via the DT variable.

Executing the snapshot branch of the DAG

The underlying code can be seen in the dbt_run_snapshot_mode task (git code link).

The bash command that Airflow needs to invoke looks like this:

"""
dbt run
--profiles-dir {DBT_CONNECTION_PATH}
--project-dir {DBT_PROJECT_PATH}
--select tag:complete_refresh --vars '{{refresh_date: { DT } }}'
"""

This command results in the following log once it is executed:

Both the snapshot table and the target table are refreshed

Read Action: a run command that excludes the customer_snapshot table from being refreshed and moves on to the tpch_worm_model table.

At this point, the tpch_worm_model model uses the data that already exists within the customer_snapshot table (for the specific date).

Executing the Read branch of the DAG

The underlying code can be seen in the dbt_run_read_mode task (git code link).

"""
dbt run
--profiles-dir {DBT_CONNECTION_PATH}
--project-dir {DBT_PROJECT_PATH}
--select tag:complete_refresh
--exclude dbt_worm_based_model_project.source.SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER_SNAPSHOT
--vars '{{refresh_date: { DT } }}'
"""

This command results in the following log once it is executed:

Only the final table gets refreshed

Notice that:

a. The read task leverages DBT's --exclude flag to avoid refreshing the CUSTOMER_SNAPSHOT table.

b. Of the 2 models, only the target table, i.e. TPCH_WORM_MODEL, gets executed.

These 2 tasks together emulate the working of a WORM model.

Note: Snapshot action must always be performed once before the subsequent read actions. The read actions performed throughout the rest of the day will use the data from the snapshot table.

Based on the use case, the snapshot action can be performed once per day or once per month, and the reads can be done any number of times after that.

Illustration

As an illustration, I performed the snapshot and read runs for the past 2 days; the results can be viewed in Snowflake.

For a load that runs on a daily cadence, the customer snapshot table should look like this:

and the tpch worm table should look like this:

Notice the change in the timestamp of the tpch worm model table for the 4th of Nov after the second run, while the count of records remained the same.
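For anyone who wants to reproduce this check, queries along these lines could be run in Snowflake (record_count, customer_count and refreshed_at are the assumed names from the model sketches above):

-- one row per snapshot date captured in the WORM table
select snapshot_date, count(*) as record_count
from CUSTOMER_SNAPSHOT
group by snapshot_date
order by snapshot_date;

-- the target table: refreshed_at moves on every read run while the counts do not
select snapshot_date, customer_count, refreshed_at
from TPCH_WORM_MODEL
order by snapshot_date;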

This is the power of a Write Once, Read Many (WORM) data table: it provides consistent results in an environment where the data keeps changing.




Written by Christopher Lagali

An enthusiastic Data Engineer who is on a mission to unravel the possibilities of pipeline building with AWS and who believes in knowledge sharing.
