Orchestrate DBT Core jobs with Airflow
Introduction
DBT has become an integral part of a Data Engineer’s tech stack in recent years.
The ability to drop SQL scripts into files that then get executed in dependency order is a genius move! Couple that capability with Airflow and your Data Engineers are the happiest lads in town.
I wish to bring this good news and cheer to teams who have not yet experienced this Nirvana!
Setting things up locally!
DBT needs a few components in place, which can be daunting for first-time users. Thanks to the setup wizard provided by the good folks at dbt Labs, this can be achieved in a couple of steps.
To get started locally you will need to:
a. Set up DBT Core on your local machine (link)
b. Create your first DBT project (link)
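For reference, the local setup boils down to a handful of commands. A minimal sketch, assuming pip and the Snowflake adapter (swap in the adapter for your own warehouse); the project name tpch_model is the one used throughout this walkthrough:
# install DBT Core along with the Snowflake adapter
pip install dbt-core dbt-snowflake
# scaffold a new project; the wizard prompts for connection details
dbt init tpch_model
# confirm the project and the connection are healthy
dbt debug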
While these steps will get you up and running on your local machine, it takes a few extra steps to get things running inside a remote Airflow container.
Organization is the key!
To get things rolling, it is good practice to keep all your DBT projects in a separate folder under the dags folder. Debugging critical issues becomes quicker when you know where to look.
Setting up a connection between your project and the database is the next step; DBT will start to fire queries as soon as we perform the first run.
You will have already performed this step locally by creating a profiles.yml file in the .dbt folder (hidden away in your local user’s home folder).
The location of this file is crucial when you are dealing with a container. I have therefore placed this file in a separate folder named connection_profiles within the dags/dbt folder.
Fret not! The location of this YAML file will then be passed as an argument from within the DAG to the DBT engine.
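To make that concrete, the layout inside the container looks roughly like this (the project name and paths are the ones assumed throughout this walkthrough):
dags/
└── dbt/
    ├── connection_profiles/
    │   └── profiles.yml
    └── tpch_model/
The profiles.yml itself is the standard DBT profile, just relocated. A simplified sketch for a Snowflake target (placeholders to be replaced with your own values):
tpch_model:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <username>
      password: <password>
      role: <role>
      warehouse: <warehouse>
      database: <database>
      schema: <schema>
      threads: 4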
Compiling the project
Feel free to glance at the GitHub code to follow along.
To understand the project structure and the effects of compiling the project, we must first inspect the project folder.
Notice the target folder, which houses the compiled and run folders!
These folders (though initially empty) are a good place to search when the error on the command line interface/terminal doesn’t make sense.
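For the demo project the structure looks roughly like this (a simplified sketch; the files under target appear once we compile and run the project below):
tpch_model/
├── dbt_project.yml
├── models/
│   └── tpch_model.sql
└── target/
    ├── compiled/
    │   └── tpch_model/
    │       └── models/
    │           └── tpch_model.sql
    └── run/
        └── tpch_model/
            └── models/
                └── tpch_model.sql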
For this demonstration I have a SQL model that uses the TPCH sample data to calculate the amount sold by each clerk (identified by clerk id) for a given period.
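The real model lives in the GitHub repo; here is a simplified sketch of what such a model looks like, assuming a source named tpch pointing at Snowflake’s sample database and a start_date variable for the period:
-- models/tpch_model.sql (simplified sketch)
select
    o_clerk                         as clerk_id,
    to_char(o_orderdate, 'YYYY-MM') as period,
    sum(o_totalprice)               as amount_sold
from {{ source('tpch', 'orders') }}
where o_orderdate >= '{{ var("start_date", "1995-01-01") }}'
group by 1, 2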
Let’s run dbt compile to see the effects of this step:
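Assuming the command is issued from the dags/dbt folder, it carries the same two location flags we will later pass from the DAG:
dbt compile --profiles-dir ./connection_profiles --project-dir ./tpch_model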
Done signifies a successful compilation; the result should be visible within the compiled directory:
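Continuing the sketch from above, the compiled copy lands under target/compiled/tpch_model/models/tpch_model.sql with the Jinja resolved (assuming the source points at Snowflake’s sample database):
select
    o_clerk                         as clerk_id,
    to_char(o_orderdate, 'YYYY-MM') as period,
    sum(o_totalprice)               as amount_sold
from snowflake_sample_data.tpch_sf1.orders
where o_orderdate >= '1995-01-01'
group by 1, 2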
Notice that the references to the tables and variables have been replaced with actual table names and values.
This is a good place to check out the resultant SQL and also to verify that DBT is able to communicate with your database.
Note: It is also helpful to copy this query and execute it in your SQL workbench to validate the data.
Executing the Project
Performing a dbt run is the actual step where data movement occurs from the source to the target tables. The DBT Core engine uses the SQL housed within the compiled folder to create a temporary table within Snowflake.
Glancing at the run folder:
The insert statement will then move the data from the temp table that DBT creates to the preferred target table.
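The exact SQL in the run folder is generated by DBT’s materialization macros, but for this model it boils down to something along these lines (a heavily simplified sketch; database, schema and column names are illustrative):
create or replace temporary table analytics.dbt_demo.tpch_model__dbt_tmp as (
    -- the compiled select from target/compiled goes here
    select ...
);
insert into analytics.dbt_demo.tpch_model (clerk_id, period, amount_sold)
select clerk_id, period, amount_sold
from analytics.dbt_demo.tpch_model__dbt_tmp;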
Setting up the actual DAG
After a few runs on the local machine, it is natural to want a scheduler that can automate this process for us.
There are two ways of establishing a connection between Airflow and DBT:
1. Bash Operator
2. Astronomer’s Cosmos
Curious about Cosmos and its advantages? There is an upcoming article to explore and address that approach too (coming soon).
For the purposes of this walkthrough, and for simplicity, we will be using the Bash operator.
DAG File
Airflow scans a directed acyclic graph (DAG) authored in Python code to form a workflow.
We will thus have 3 tasks:
1. Dummy Start task
2. DBT module task
3. Dummy End task
Since the Bash operator is a black box for Airflow, we will need to provide the following:
- project directory
- location of the profiles.yml file
- model name we wish to execute
The resultant bash command will look as follows:
# backslashes keep the command on a single line by the time it reaches bash
dbt_run_command = """
dbt run \
--profiles-dir {DBT_CONNECTION_PATH} \
--project-dir {DBT_PROJECT_PATH} \
--select tpch_model.tpch_model
""".format(
    DBT_CONNECTION_PATH=DBT_CONNECTION_PATH,
    DBT_PROJECT_PATH=DBT_PROJECT_PATH,
)
Essentially, the command first gives the DBT engine the context of the project location and the connection details, followed by the --select flag and the name of the model.
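Putting it all together, a minimal DAG sketch could look like the following (the dag id, schedule and container paths are illustrative; EmptyOperator is the newer name for the dummy operator in Airflow 2.x):
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

# paths inside the Airflow container; adjust to your deployment
DBT_PROJECT_PATH = "/opt/airflow/dags/dbt/tpch_model"
DBT_CONNECTION_PATH = "/opt/airflow/dags/dbt/connection_profiles"

with DAG(
    dag_id="dbt_tpch_model",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # dummy start task
    start = EmptyOperator(task_id="start")

    # DBT module task: shells out to the dbt CLI with the command assembled above
    dbt_run = BashOperator(
        task_id="dbt_run_tpch_model",
        bash_command="""
        dbt run \
        --profiles-dir {DBT_CONNECTION_PATH} \
        --project-dir {DBT_PROJECT_PATH} \
        --select tpch_model.tpch_model
        """.format(
            DBT_CONNECTION_PATH=DBT_CONNECTION_PATH,
            DBT_PROJECT_PATH=DBT_PROJECT_PATH,
        ),
    )

    # dummy end task
    end = EmptyOperator(task_id="end")

    start >> dbt_run >> end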
Do feel free to try out other commands as you play along (list of commands).
Run it….Run it….Run it
Notice that the steps to schedule a dbt run on Airflow are minimal once your SQL models are already tried and tested.
Running it is also a piece of cake!
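Once the DAG file lands in the dags folder, unpause it in the Airflow UI or kick off a manual run from the terminal (using the illustrative dag id from the sketch above):
airflow dags trigger dbt_tpch_model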
The cool aspect of this integration is that DBT passes all the logs it generates on to the Airflow UI, which makes our lives much easier!
It’s easy to forget about Snowflake whilst setting up DBT and the Airflow DAG code. However, the database is the final validator of the models being used.
Executing a simple SQL against the target table should do the trick.
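Something along these lines will do (database, schema and column names follow the illustrative model above):
select clerk_id, period, amount_sold
from analytics.dbt_demo.tpch_model
order by amount_sold desc
limit 10;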
Voilà! Airflow can now use DBT to organize and refresh your models on a desired schedule.
For folks who love the “run it and leave it” methodology, this approach is a no-brainer as it is easy to implement and maintain.
Others who wish to explore and leverage DBT’s advanced features with Airflow can follow me on the journey I am about to embark on.