Design WORM models with DBT

Christopher Lagali
6 min read · Nov 6, 2023


Write Once, Read Many (WORM) models are a way of saving a state of the data so that it can be reused multiple times in an environment where the source dataset constantly changes.

Data teams often struggle to perform ETL/ELT transformations and then justify the resultant numbers in an environment where the source dataset keeps changing. Data that refreshes on a daily schedule is still manageable, but it's a challenge to keep up with a source that changes throughout the day. The problem worsens when the source table runs on a truncate-and-load approach and lacks a date/timestamp column (which happens more often than you can imagine).

WORM models can help in such scenarios: a snapshot of the source data is used to refresh the downstream systems. Debugging also becomes a lot easier, as one can refer to the snapshot to justify the numbers.

Note: DBT (Data Build Tool) does have a snapshot feature; however, it relies on a timestamp column in the source to determine whether a refresh is necessary.

Photo by Yes and Studio on Unsplash

Sold! What do I have to do?

There are two steps to bring this model to fruition:
a. Build the DBT model with a variable.
b. Use an Airflow DAG to dictate how, when, and how often these models run.

On the DBT side

There are a few housekeeping steps to transform the DBT model into a more dynamic one.

  1. Setting up a variable

These variables are usually specified in the project YML file with a default value, which gives DBT an initial idea of the data type to expect. Additionally, DBT uses this default value during compilation.

Project Variable
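A minimal sketch of how that declaration might look inside the project YML file (the variable name refresh_date comes from the run commands later in the article, while the default value here is just an assumption):

# dbt_project.yml (excerpt)
vars:
  refresh_date: '2023-11-04'  # default used during compilation; overridden from the Airflow DAG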

This value will be supplied from the Airflow DAG, which will be addressed in the DAG setup section.

For this illustration, DBT will use this variable not only to create a snapshot but also to fetch data from that same snapshot (once the snapshot has been created).

Feel free to check out project variables in DBT via this link.

2. Specifying source

DBT needs to know the source tables involved and any other optional information that could give it more context.

I have kept all the source tables in a separate folder, aptly named source. Thus, all the materialization and tagging info for such tables is kept under that folder name inside the project YML file.

Tags are used here to group all the necessary tables together so that they can be invoked via the run command. This logical grouping can be easily performed in the project YML file.

Thus, a tag name tells DBT which tables we wish to execute.

For this illustration, the complete_refresh tag has been applied to the tables housed within the source folder. Thus, DBT will re-create/refresh all the tables within the source folder when the complete_refresh tag is passed during the dbt run.

Feel free to check out tagging in DBT via this link.
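A minimal sketch of how this grouping might be declared in the project YML file (the project name and the source folder appear in the article's run commands; the materialization shown is an assumption):

# dbt_project.yml (excerpt)
models:
  dbt_worm_based_model_project:
    source:
      +tags: complete_refresh      # everything in the source folder is picked up by --select tag:complete_refresh
      +materialized: incremental   # assumed; the article describes the models as incremental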

3. Model

The model files here are simple incremental models, free of any complex Jinja templating.

TPCH Lineage graph

There are 2 model files involved here:

  • Snapshot table: CUSTOMER_SNAPSHOT (the customer snapshot table)
  • Final table: TPCH_WORM_MODEL (the target table that uses the customer snapshot table)

Notice that the snapshot table is incremental in nature and is refreshed by a date field named snapshot_date. It is this field that makes the CUSTOMER_SNAPSHOT table a WORM table.

For a particular day, the workflow will take a snapshot of the ever-changing TPCH Customer model and reuse the contents of this table across multiple refresh cycles of the downstream target tables.
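As a rough illustration, the CUSTOMER_SNAPSHOT model could look something like the sketch below; the source() names and the dedup filter are assumptions, while the incremental materialization, the refresh_date variable and the snapshot_date field come from the article:

-- models/source/SNOWFLAKE_SAMPLE_DATA/TPCH_SF1/CUSTOMER_SNAPSHOT.sql (path inferred from the --exclude selector used later)
{{ config(materialized='incremental') }}

select
    c.*,                                                -- whatever columns the ever-changing source exposes
    '{{ var("refresh_date") }}'::date as snapshot_date  -- the date stamp that makes this a WORM table
from {{ source('TPCH_SF1', 'CUSTOMER') }} as c

{% if is_incremental() %}
-- append rows only for a date that has not been snapshotted yet
where '{{ var("refresh_date") }}'::date not in (select snapshot_date from {{ this }})
{% endif %}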

The downstream target table, named TPCH_WORM_MODEL (the naming needs some work), references the snapshot table for metrics from a specific date.
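A matching sketch of the downstream TPCH_WORM_MODEL; the metric and column names are assumptions, the essential part being that it reads only from the snapshot model and filters on the refresh_date variable:

-- TPCH_WORM_MODEL.sql (hypothetical content)
{{ config(materialized='incremental', unique_key='snapshot_date') }}

select
    snapshot_date,
    count(*)            as customer_count,  -- assumed metric
    current_timestamp() as refreshed_at     -- changes on every read run while the counts stay stable
from {{ ref('CUSTOMER_SNAPSHOT') }}
where snapshot_date = '{{ var("refresh_date") }}'::date
group by snapshot_date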

Note: for this to be a WORM, the DAG must execute the snapshot model only once (per day, or during the snapshot phase). That will be addressed in the DAG creation section.

Now for the scheduling part

For this model to be a WORM, Airflow must take on the additional responsibility of determining when to execute the snapshot model and when to run the subsequent read runs.

For a better illustration, I have created 2 branches using the ShortCircuitOperator (link).

DAG graph with 2 branches of execution

The ShortCircuitOperator switches between the snapshot branch and the read branch based on the Airflow variable value we pass into the DAG.
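A rough sketch of that branching, assuming Airflow 2.x: the task names (dbt_run_snapshot_mode, dbt_run_read_mode), the DT date value, the complete_refresh tag and the --exclude selector follow the article, while the DAG id, schedule, paths and the run_mode Airflow Variable are assumptions.

from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.bash import BashOperator
from airflow.operators.python import ShortCircuitOperator

DBT_CONNECTION_PATH = "/path/to/profiles"  # assumption
DBT_PROJECT_PATH = "/path/to/dbt/project"  # assumption
DT = Variable.get("refresh_date", default_var="2023-11-04")  # snapshot date passed down to dbt


def _run_mode_is(expected: str) -> bool:
    # "snapshot" proceeds down the write branch, "read" down the read branch (variable name assumed)
    return Variable.get("run_mode", default_var="read") == expected


snapshot_cmd = (
    f"dbt run --profiles-dir {DBT_CONNECTION_PATH} --project-dir {DBT_PROJECT_PATH} "
    f"--select tag:complete_refresh --vars '{{refresh_date: {DT} }}'"
)
read_cmd = (
    f"dbt run --profiles-dir {DBT_CONNECTION_PATH} --project-dir {DBT_PROJECT_PATH} "
    f"--select tag:complete_refresh "
    f"--exclude dbt_worm_based_model_project.source.SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER_SNAPSHOT "
    f"--vars '{{refresh_date: {DT} }}'"
)

with DAG(
    dag_id="dbt_worm_model",  # assumption
    start_date=datetime(2023, 11, 1),
    schedule_interval=None,  # triggered on whatever cadence the use case demands
    catchup=False,
) as dag:
    # Each branch proceeds only when its mode matches; the other branch gets short-circuited.
    is_snapshot_run = ShortCircuitOperator(
        task_id="is_snapshot_run",
        python_callable=_run_mode_is,
        op_args=["snapshot"],
    )
    is_read_run = ShortCircuitOperator(
        task_id="is_read_run",
        python_callable=_run_mode_is,
        op_args=["read"],
    )

    dbt_run_snapshot_mode = BashOperator(task_id="dbt_run_snapshot_mode", bash_command=snapshot_cmd)
    dbt_run_read_mode = BashOperator(task_id="dbt_run_read_mode", bash_command=read_cmd)

    is_snapshot_run >> dbt_run_snapshot_mode
    is_read_run >> dbt_run_read_mode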

Snapshot Action: a normal run command that refreshes both the customer_snapshot table and the tpch_worm_model table for the date value passed via the DT variable.

Executing the snapshot branch of the DAG

The underlying code can be seen in the dbt_run_snapshot_mode task (git code link).

The bash command that Airflow needs to invoke looks like this:

"""
dbt run
--profiles-dir {DBT_CONNECTION_PATH}
--project-dir {DBT_PROJECT_PATH}
--select tag:complete_refresh --vars '{{refresh_date: { DT } }}'
"""

This command results in the following log once it is executed:

Both the snapshot table and the target table are refreshed

Read Action: a run command that excludes the customer_snapshot table from being refreshed and moves on to the tpch_worm_model table.

At this point, the tpch_worm_model model uses the data that already exists within the customer_snapshot table (for the specific date).

Executing the Read branch of the DAG

The underlying code can be seen in the dbt_run_read_mode task (git code link).

"""
dbt run
--profiles-dir {DBT_CONNECTION_PATH}
--project-dir {DBT_PROJECT_PATH}
--select tag:complete_refresh
--exclude dbt_worm_based_model_project.source.SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER_SNAPSHOT
--vars '{{refresh_date: { DT } }}'
"""

This command results in the following log once it is executed:

Only the final table gets refreshed

Notice that:

a. The read task leverages DBT's --exclude flag to avoid refreshing the CUSTOMER_SNAPSHOT table.

b. Of the 2 models, only the target table, i.e. TPCH_WORM_MODEL, gets executed.

These 2 tasks together emulate the working of a WORM model.

Note: Snapshot action must always be performed once before the subsequent read actions. The read actions performed throughout the rest of the day will use the data from the snapshot table.

Based on the use case, the snapshot action can be performed once per day or once per month, and the reads can be done any number of times after that.

Illustration

As an illustration, I performed the snapshot and read runs for the past 2 days; the results can be viewed in Snowflake.

For a load that runs on a daily cadence, the customer snapshot table should look like this:

and the tpch worm table should look like this:

Notice the change in the timestamp of the tpch worm model table for the 4th of Nov after the second run, while the count of records remained the same.
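For anyone who wants to reproduce this check, queries along these lines could be run in Snowflake (record_count, customer_count and refreshed_at are the assumed names from the model sketches above):

-- one row per snapshot date captured in the WORM table
select snapshot_date, count(*) as record_count
from CUSTOMER_SNAPSHOT
group by snapshot_date
order by snapshot_date;

-- the target table: refreshed_at moves on every read run while the counts do not
select snapshot_date, customer_count, refreshed_at
from TPCH_WORM_MODEL
order by snapshot_date;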

This is the power of a Write Once, Read Many (WORM) data table: it provides consistent results in an environment where the data keeps changing.




Written by Christopher Lagali

An enthusiastic Data Engineer who is on a mission to unravel the possibilities of pipeline building with AWS and who believes in knowledge sharing.
