05.03.2018 – Blog

Apache Airflow – why should everyone working in the data domain be interested in it?

At some point in your career, you have probably seen a data platform where Windows Task Scheduler, crontab, an ETL tool or a cloud service starts data transfer or transformation scripts independently of other tools, based purely on wall-clock time. Your scripts presumably work just fine most of the time, but adding new sources, or even understanding the full extent of your data pipelines, can be surprisingly hard. Understanding dependencies and controlling data pipelines is equally troublesome. You can only dream of data lineage when you don't even have decent history data or success rates for your current pipeline executions (unless you count email notifications as such). Keep reading if any of this sounds even remotely familiar, as Apache Airflow might solve your obstacles as efficiently as ice hockey player Patrik Laine does on the power play from the left faceoff circle.

The issue described above is by no means new, and there have been numerous solutions on the market meant to solve it. Leaving aside Enterprise-level solutions, there are a few good open-source competitors at the moment created to manage Hadoop jobs: Oozie by Apache, Azkaban by LinkedIn and Luigi by Spotify. Tools that can handle more than pure Hadoop jobs haven't really existed until now, and Apache Airflow attempts to fill that gap.

What the heck is a DAG?

Before I start explaining how Airflow works and where it originated from, the reader should understand what a DAG is. DAG is an abbreviation of "directed acyclic graph", which according to Wikipedia means "a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently-directed sequence of edges that eventually loops back to v again." Sounds like nonsense at first reading, right? In this case, a picture is actually worth a thousand words.

Figure 1 – A simple DAG. Each of the boxes represents an independent DAG task, which could be anything from a bash command to an SQL or Hive query.

Basically, a DAG is a graph structure in which you define dependencies between tasks, and those tasks can branch out so that each branch is executed regardless of the outcome of the other branches. To make things concrete, this means you can chain the jobs that today live in cron so that a job starts, or chooses not to start, based only on the outcome of its own upstream tasks rather than the clock.
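To make the idea concrete, here is a minimal sketch in plain Python (no Airflow involved) of how a DAG of dependencies resolves into an execution order; the task names are invented for illustration:

```python
# A tiny DAG described as "task -> set of upstream tasks it depends on".
# Task names are invented for illustration.
deps = {
    "extract": set(),
    "clean": {"extract"},
    "load_dw": {"clean"},
    "report_a": {"load_dw"},  # two independent branches:
    "report_b": {"load_dw"},  # either one can run regardless of the other
}

def execution_order(deps):
    """Topological sort: repeatedly pick tasks whose upstreams are all done."""
    done, order = set(), []
    while len(done) < len(deps):
        ready = [t for t, up in deps.items() if t not in done and up <= done]
        if not ready:
            raise ValueError("cycle detected - not a valid DAG")
        for t in sorted(ready):  # sorted only to make the output deterministic
            order.append(t)
            done.add(t)
    return order

print(execution_order(deps))
# -> ['extract', 'clean', 'load_dw', 'report_a', 'report_b']
```

The "acyclic" part is what makes this resolvable: if two tasks depended on each other, no valid order would exist, which is exactly the `ValueError` branch above.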

When Maxime Beauchemin had his stroke of genius

You can find several blog posts from recent years where people praise a piece of software called Airflow, created at Airbnb by Maxime Beauchemin. Before I introduce Airflow, I want to briefly mention two great blog posts written by Maxime. "The Rise of the Data Engineer" and its follow-up "The Downfall of the Data Engineer" paint a really good picture of what it means to work with data nowadays. I also recommend bookmarking the data engineering toolset list found on Maxime's GitHub profile.

As one might guess, Airflow was developed internally at Airbnb to solve the difficulties described at the beginning of this blog post. There wasn't any software on the market suitable for handling the growing number of data pipelines generated at Airbnb, so they built it themselves. In September 2017, Airflow was moved under the Apache umbrella of open-source software, with Maxime still at the helm.

Airflow attempts to solve the issue of managing, maintaining and handling data pipelines by expressing the pipelines as DAGs. DAGs are written in Python, so any required feature can be added simply by importing the right library and adding the desired tasks inside the DAG. You can have any number of tasks inside a DAG, and dependencies between tasks are described in the code in a really simple manner.

This, for example, is a DAG in its simplest form: a workflow where I use sqlcmd and a few premade SQL backup and restore procedures to copy an existing production database into the test environment. At the bottom of the DAG, you can see how easily you can define dependencies between tasks.

It may seem challenging at first, but even with limited experience you can get started really quickly just by looking at the examples provided by the Airflow developers.

What’s under the hood?
Airflow is made of a simple Gunicorn-based web server, a scalable scheduler (which is the heart of Airflow), workers that execute the tasks, and a metadata database that knows the status of each DAG and its execution history. From the web UI you start and control DAG runs and see the current status of all your DAGs.
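As an illustration, the pieces listed above map to a handful of settings in `airflow.cfg`; the paths and connection string below are placeholders, not real endpoints:

```ini
[core]
# Where the scheduler looks for DAG definition files
dags_folder = /home/airflow/dags
# LocalExecutor runs tasks as subprocesses on one machine;
# CeleryExecutor distributes them to separate worker machines
executor = LocalExecutor
# The metadata database that stores DAG and task execution history
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow

[webserver]
# Port for the Gunicorn-based web UI
web_server_port = 8080
```

Swapping `LocalExecutor` for `CeleryExecutor` is how the worker layer scales out without touching any DAG code.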

Figure 2 – Example stage loads from our own Agile Data Engine demo environment, automatically drawn by the Landing Times tab.

The web UI has several built-in views, the simplest of which is a basic traffic-light view showing the current status of each DAG. On top of this, the web UI offers ready-made Gantt, dependency and landing-time charts in which the execution history of each DAG is constantly updated.

Figure 3 – Example stage loads from our own Agile Data Engine demo environment, automatically drawn by the Task Duration tab.

Figure 4 – Example stage loads from our own Agile Data Engine demo environment, automatically drawn by the Graph View.

Sounds too good to be true for open-source software? Well, as you can see from the images above, it's as true as the fact that Lauri Markkanen is currently averaging more made three-pointers per game than any rookie. Ever.

Say goodbye to line drawing and say hello to automatically generated Python code

The fact that DAGs are based on Python code is the neatest feature of Airflow. You can automate code generation, store it in version control and see what's really happening in a data pipeline just by looking at the code itself. The metadata database is only used for saving execution history, so you don't even need to worry about performance problems caused by multiple DAGs running simultaneously. Python also makes it possible to write your own libraries or executors and add them without having to wait for the fixed release cycles set by commercial software vendors.
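As a sketch of what automating code generation can mean in practice, here is plain Python that renders one task definition per source table from a template. The table names and the template are invented for illustration; in real use the list would typically come from metadata, such as a database table:

```python
# Render one Airflow task definition per source table from a template.
# Table names and the template are invented for illustration.
SOURCE_TABLES = ["customer", "invoice", "invoice_line"]

TASK_TEMPLATE = '''\
load_{table} = BashOperator(
    task_id="load_{table}",
    bash_command="python load_stage.py --table {table}",
    dag=dag,
)
'''

def render_tasks(tables):
    """Generate the task definitions that go inside a DAG file."""
    return "\n".join(TASK_TEMPLATE.format(table=t) for t in tables)

generated = render_tasks(SOURCE_TABLES)
print(generated)
```

Adding a new source then becomes a one-row change in the metadata rather than hand-drawn lines in a scheduling tool, and the generated file diffs cleanly in version control.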

DAGs also solve many headaches that are currently handled by creating separate verification jobs or by building several separate tasks. What I mean by this is that you can have a source staging load that triggers four separate tasks, which eventually trigger even more tasks. In Airflow you can build a DAG where one failing task does not invalidate the whole data pipeline; instead, only the tasks downstream of the failing task are invalidated. Building and maintaining something like this in cron would be a nightmare.
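The behaviour described above can be sketched without Airflow at all: when a task fails, only the tasks transitively downstream of it are invalidated, while unrelated branches still run. A minimal simulation with invented task names:

```python
# "task -> tasks directly downstream of it"
downstream = {
    "stage_load": ["load_a", "load_b", "load_c", "load_d"],
    "load_a": ["publish_a"],
    "load_b": ["publish_b"],
    "load_c": [],
    "load_d": [],
    "publish_a": [],
    "publish_b": [],
}

def invalidated_by(failed, downstream):
    """Collect every task transitively downstream of the failed one."""
    out, stack = set(), [failed]
    while stack:
        for child in downstream[stack.pop()]:
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

# If load_a fails, only publish_a is invalidated;
# load_b..load_d and publish_b still run normally.
print(sorted(invalidated_by("load_a", downstream)))  # -> ['publish_a']
```

In cron each job would need its own guard logic to detect the failure of every job it depends on; in a DAG the structure itself carries that information.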

Is this stuff ready for production?

Shortly: yes, if you are willing to get your hands dirty.

I've been using Airflow with my co-workers for slightly over a year now, and we've moved it into production because we believe in the product. After some tough times at the beginning, we have set up our environment to be rock solid, and we find new capabilities arriving from the Airflow developers quarterly. Most importantly, all the connection hooks we need are already in place.

At this moment, we can produce most of our ETL data pipelines as code and control various source and destination systems under one view. Support for the Windows operating system is still under development, but thanks to Python we were able to build our own way of connecting to Windows servers using Kerberos authentication and AD accounts.

Of course, you can tell from the documentation and the lack of bells and whistles that Airflow is still open-source software under development. That being said, the features already included are so convincing that they make up for what is missing, and the missing parts are actually laid out really well on the Airflow roadmap. Airflow does not support multi-tenant architecture, nor can you control the visibility of DAGs in any way. If you want a highly available architecture, you need to design and build it yourself. The Airflow scheduler also needs improvement, as it's currently really hard to understand how it functions.

I suggest every reader try out Airflow in their own VirtualBox installation or in an EC2 instance. Installing Airflow is a really simple process, and within minutes you can have a working demo environment. If after all this you still have time, I recommend looking into Maxime's second project, Apache Superset: an open-source visualization product with the idea of bringing enterprise-level features into the open-source world. I, at least, am excited about the software and plan to follow how Superset evolves.
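For reference, getting a demo environment up at the time of writing (the Airflow 1.x era) is roughly the following; package and command names may differ in later releases:

```shell
# Install Airflow from PyPI (the package was renamed apache-airflow in 2017)
pip install apache-airflow

# Initialize the metadata database (SQLite by default, fine for a demo)
airflow initdb

# Start the web UI and the scheduler, each in its own terminal
airflow webserver -p 8080
airflow scheduler
```

With the default SQLite database the tasks run sequentially, which is plenty for exploring the example DAGs shipped with the installation.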

Mika Heino works in the Data team at Solita. For those of you who read this far: next time you see mister Ilkka Suomaa, can you ask him when his blog post about the week of a sales rep is coming?