Getting Airflow up and running on an EC2 instance — Part 1

Christopher Lagali
Dec 25, 2019



Lately, many organizations have turned to Airflow to orchestrate their data pipelines, since it emphasizes programming over the traditional drag-and-drop approach. For this reason and many more, Airflow has proven to be more flexible and robust with regard to the kinds of pipelines one can build and maintain.

Airflow is often installed and run on either Docker or Kubernetes, with the latter being more popular. This blog is for those who wish to install and learn Airflow on an EC2 instance before wrestling with Kubernetes.

Note: We are installing Airflow on an EMR master node, which is an m5.xlarge instance. However, it can also be done on a t2.micro with enough EBS storage space.

Step 1: Installing prerequisites

Let’s start prepping the environment by installing certain packages that are essential for Airflow to initialize and run.

The following will be installed:

  • Git
  • Boto3
  • Development Tools
sudo yum install -y git
sudo pip install boto3
sudo yum groupinstall "Development Tools"

While git and boto3 are obvious choices that let us run Python scripts that interact with our EC2 instances, the Development Tools group is equally important: it installs gcc and the related build tools needed to compile the C/C++ extensions that some Python packages ship with.
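For instance, a quick sanity check that boto3 can reach AWS from this instance might look like the following sketch. This is illustrative only: it assumes the instance has an IAM role or configured credentials, and that us-east-1 is the region you care about.

# boto3_check.py: illustrative sanity check, not part of the Airflow setup itself.
import boto3

# Assumes an IAM role or credentials are available; adjust the region as needed.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Print the id and state of every instance visible to this account.
for reservation in ec2.describe_instances()["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])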

Once done, we will now proceed to install Apache Airflow.

Step 2: Install Airflow

sudo SLUGIFY_USES_TEXT_UNIDECODE=yes pip install -U apache-airflow

As you will see in the install output, Flask and other dependencies are installed too.

Note: Airflow’s webserver is built on Flask, a Python web framework that serves the application pages and captures responses from users.

By default, the basic Airflow package only installs the bare essentials necessary to get started. Sub-packages like hive, postgres, and celery that enhance Airflow’s capabilities need to be specified for installation, like so:

sudo pip install apache-airflow[crypto,hive,postgres]

Note: If in doubt, you can also install all available sub-packages like so:

sudo pip install apache-airflow[all]

Refer to the Airflow documentation for more details on the sub-packages it distributes.

Additionally, EC2 needs the following packages for Airflow:

sudo -H pip install six==1.10.0
sudo pip install --upgrade six
sudo pip install markupsafe
sudo pip install --upgrade MarkupSafe

We will now add these binaries to our PATH variable, for which we will need superuser privileges.

sudo su

Once we are the root user, go ahead and copy-paste the following commands:

echo 'export PATH=/usr/local/bin:$PATH' >> /root/.bash_profile
source /root/.bash_profile

The commands executed above have added our installation folder to the PATH variable, making the binaries accessible throughout the system.

Step 3: Initialize Airflow

Airflow requires a database to be initialized before you can run tasks. If you’re just experimenting and learning Airflow, you can stick with the default SQLite option. If you don’t want to use SQLite, then take a look at Initializing a Database Backend to set up a different database.

After configuration, you’ll need to initialize the database before you can run tasks:

airflow initdb

This step creates an airflow folder in your user’s home directory with all the essential config files.

Note: By default, Airflow uses port 8080, which might currently be in use by other web servers.

To avoid this conflict, we will direct Airflow to use port 8081 by modifying the airflow.cfg file.

Location: /home/username/airflow/

Substitute 8081 for 8080 wherever the port appears; in a typical airflow.cfg this means base_url and web_server_port under [webserver], and endpoint_url under [cli].

Once done, rerun the command:

airflow initdb

If the command completes without errors, then Airflow has been configured successfully.

Step 4: Increase Swap Space

Explanation: By default, swap space on these instances is 0 GB. When we run multiple Airflow processes, swap space is necessary for them to run successfully.

One way to check if your instance has swap space:

free -h

Since my system has no swap space, I will walk through the steps to allocate some disk space for swapping.

Allocate Swap space:

sudo fallocate -l 6G /swapfile
sudo chmod 600 /swapfile

Here I will be allocating 6 GB for swapping.

Let’s verify if our swapfile is created as expected.

ls -lh /swapfile

Let us now set the swap file as our swap space.

sudo mkswap /swapfile
sudo swapon /swapfile

Once done, we should now be able to see the swap space, which is around 6 GB.

Our recent changes have enabled the swap file for the current session. However, if we reboot, the server will not retain the swap settings automatically. We can change this by adding the swap file to our /etc/fstab file.

Back up the /etc/fstab file in case anything goes wrong:

sudo cp /etc/fstab /etc/fstab.bak
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

We have now prepped our environment for a smooth start of our Airflow Server.

Step 5: Start Airflow Server

When we start Airflow, it crawls the dags folder (which we will need to create) to load all the user-defined DAGs.

Let’s start by creating this folder; by default, Airflow looks for it at ~/airflow/dags.

Our updated folder structure now has a dags directory alongside airflow.cfg and the airflow.db SQLite database created by initdb.

To test our setup, we will whip up a test DAG that prints hello world.

Hello World DAG
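A minimal sketch of what this DAG can look like is shown below, assuming Airflow 1.10-style imports. The file name and DAG id are illustrative; hello_task is the task whose log we will check later, preceded by a dummy operator as described further down.

# hello_world.py: a dummy operator followed by a Python operator that prints Hello world.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator


def print_hello():
    print("Hello world")


default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 12, 1),  # a date in the past; we will trigger the DAG manually
}

with DAG("hello_world", default_args=default_args, schedule_interval=None) as dag:
    dummy_task = DummyOperator(task_id="dummy_task")
    hello_task = PythonOperator(task_id="hello_task", python_callable=print_hello)
    dummy_task >> hello_task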

Do not worry about the date as we will be triggering this DAG from the console.

Once created, copy this file to our dags folder.

Run

airflow webserver

Open another session and run

airflow scheduler

Note: The actual work of crawling the folder and updating DAGs is done by the scheduler, while the webserver handles the UI. Thus, starting the scheduler is essential.

Voila!!! We are up and running!

Locate your dag and turn it on.

Click on the play button to trigger it.

Click on the dag name to see the task list

Airflow provides a Tree view and a Graph view of your dag which is pretty sleek!

We have specified a dummy operator, after which the Python operator is executed to print Hello world to the console.

Green boxes signify that the tasks have been executed successfully.

Another way to ensure success is to check the log for the hello_task.

Logs can be accessed by clicking on the task:

Select View Log

That says it all! It works!

Configurations to Consider

Before we go on a victory lap here are some settings that we must consider for future use.

Certain out-of-the-box configurations of Airflow may not suit your more advanced needs and must be tweaked in the airflow.cfg file.

Let’s take a look:

  • Executor: SequentialExecutor

This setting will allow us to run one task at a time in a sequence.

However, this might not work out in the long run when there is a need for parallelism.

  • Backend: SQLite

SequentialExecutor goes well with SQLite since it allows only one connection at a time.

For parallelism, we might want to consider using the LocalExecutor or the CeleryExecutor, which can leverage Postgres as the backend engine.

In my next story I will walk through the process of installing postgres and configuring airflow to communicate with it.

Written by Christopher Lagali

An enthusiastic data engineer who is on a mission to unravel the possibilities of pipeline building with AWS and who believes in knowledge sharing.
