My colleague covered the basic concepts around Apache Airflow and summed experiences of its use in data warehousing solutions in his Apache Airflow – why everyone working on data domain should be interested of it? article. In this article, I will cover our recent experiences of using Airflow. I will also provide some practical examples of its benefits and of the challenges we have encountered on our journey towards better workflow management and sustainable maintenance practices.
Where did we start?
We were in somewhat challenging situation in terms of daily maintenance when we began to adopt Airflow in our project. Data warehouse loads and other analytical workflows were carried out using several ETL and data discovery tools, located in both, Windows and Linux servers.
In addition, these were also orchestrated and scheduled using several different tools, such as SQL Server Agent, Crontab and even Windows Scheduler. Many of the workflows were built using master-type job structures, with the dependencies between individual tasks hidden deep in complicated ETL structures.
To simplify (and slightly oversimplify):
- We didn’t have a consistent view of the results of the latest workflows.
- Logs required in debugging were located on several servers and in multiple applications.
- Basis for any kind of automation was minimal to non-existent.
- Error logging was very generic, with no easy way of knowing what was the root cause.
- Restarting a single task run was difficult, without visibility into the jobs logic.
- Dependencies between the tasks were hidden and the master workflows had created additional and unnecessary dependencies.
- To have a complete monitoring environment, we would have needed to build a separate environment for data collection and visualisation.
One of the challenges we knew we were about to face was that we were implementing Airflow in an Enterprise Data Warehouse (EDW) environment, where the orchestration and the dependency management needs are slightly different than in most of the examples we encountered. In these examples, the DAGs were usually point solutions containing all the logic for one specific business problem. However, in an EDW solution, the dependencies between the DAGs become increasingly complex with the addition of new source systems thus creating its own set of challenges in creating optimal DAGs.
Once the Airflow environment was up and running, we immediately began to generate DAGs automatically. DAGs can also be scripted manually, but you will soon realise that the same bits of code are repeated over and over again, and that generation using metadata is the only sensible way to proceed.
Solita’s Agile Data Designer and the data warehouse modelling method we used automatically generated a good metadata repository for us to rely on. This enabled us to focus on the automatic generation of DAGs, without the need to spend time on collecting metadata. To generate DAGs automatically, we scripted a very simple DAG generator with Python that was utilizing the metadata repository. After having reached our desired level of generation, which was very minimal at first, we began to populate extracts and loads from the source systems to the data warehouse.
As the solution preceding Airflow combined a lot tasks and logic in to a single workflow, we initially wanted to take a opposite approach and create only minimum dependencies within the DAGs. This meant that a single DAG would only have one stage load and an indefinite number of data warehouse loads related to the stage table. The generator we had created enabled us to create the necessary DAGs very quickly, and couple days later we had dozens of them. But this was when problems began to emerge.
The ‘source system to the data warehouse’ DAGs were always based on the same simple pattern and were very easy to generate and manage. However, things became complicated when we began to load data from the EDW to the publish area, to facts and dimensions, for example. In case of a fact-table coming from a single source system the entire logic could be within a single DAG. However, as soon as there would be more facts and dimensions having dependencies on several sources, we could end up in a situation where the entire EDW would be loaded using a single DAG!
For this reason, we decided to separate the source-to-dw DAGs and the publish area DAGs from one another. This also provides the benefit that the data warehouse and publish area can be loaded using different cycles, if necessary.
The initial plan was to use Airflow’s sensors to break down and handle the dependencies. The ExternalTaskSensor allows us to poke the status of a task in another DAG and start a task in the dependent DAG once the sensor indicates that the required task has been successful. We started to generate the publish area DAGs utilizing the sensor, only to realise that we had missed one crucial point when using sensors: resource management.
As the number of DAGs and sensors within them increased, we found ourselves in a situation where the sensors were waiting for tasks in DAGs that could never be completed, as the sensors had reserved all the resources used to run DAGs. Combined with the fact that we had already generated nearly 100 small DAGs at this point causing Airflow’s scheduler to cry for help and stall. We had to take a few steps back and rethink this.
As mentioned in the beginning, one of the main reasons to start using Airflow was to get rid of big master jobs that combined and usually hid a lot the workflow logic within them. However eventually we had to compromise. We decided to collect all the stage-to-dw loads that start at the same time in to a one single DAG. In practice this meant that there would be a one DAG per source system. Fortunately, with Airflow, this is a lesser problem as Airflow offers excellent visibility into everything that is happening within a DAG, for example, errors are very easy to detect and report forward, in our case to Slack.
This approach significantly reduced the overall number of DAGs and as a result, the Airflow scheduler now works much better. We continue to keep publish area loads in their specific DAGs, but we have also reduced the number of publish DAGs by combining loads of the single data mart into a single DAG.
We have not yet reached a situation where we are able to start publish-area DAGs immediately after its dependent stage and dw tasks have been completed and we have scheduled them separately. One solution is to create a custom operator to check whether the necessary tasks have been completed and whether the DAG can be started.
We have also brought Jenkins alongside Airflow. It makes it easier to manage the automatic generation of DAGs, release them between environments and automatically manage versioning.
When comparing with the list introduced at the beginning of this article, this is how the situation has changed:
- All data warehouse and analytics workflows are scheduled and managed in the one place, with easy access to a comprehensive view of the status.
- Execution logs are centralized to single location.
- With Airflow’s Configuration as a Code approach, we have been able to automate the generation of workflows to a very high degree.
- If a DAG fails, we receive a comprehensive error message in Slack immediately.
- Metadata enables us to regenerate individual uploads, if necessary.
- We have clear view the dependencies directly within the DAGs.
- Airflow contains out-of-the box charts that make it easy to examine run times and the ordering of the executions. Using Airflow’s own database and Grafana, for example, it’s easy to create more visualisations for more comprehensive monitoring.
Our implementation of Apache has absolutely paid off. The increased automation and visibility enables us to shift our focus from building data pipelines and debugging towards aspects that are genuinely beneficial for our customers’ business operations, as well as being more meaningful for us.