Airflow with Google BigQuery and Slack

Sophia Yang, Ph.D.
5 min read · Dec 10, 2018

Airflow is an open-source tool for creating, scheduling, and monitoring data processing pipelines. Airflow can be installed via conda install -c conda-forge airflow or pip install apache-airflow. Before running Airflow, we need to initialize its metadata database with airflow initdb.

Then we can start the Airflow webserver, which is a Python Flask app that provides the Airflow UI. The default port of the webserver is 8080: airflow webserver -p 8080. When we open 0.0.0.0:8080 in the browser, we can see a number of example DAGs that ship with Airflow.
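
Note that the webserver only serves the UI; for scheduled DAGs to actually run, an Airflow scheduler process needs to be running as well. A typical local setup looks something like this:

airflow initdb                 # initialize the metadata database (once)
airflow webserver -p 8080      # serve the UI at 0.0.0.0:8080
airflow scheduler              # in a separate terminal: triggers DAG runs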

DAGs (directed acyclic graphs) represent workflows of multiple tasks with dependencies among them. The Graph View of a DAG usually shows the logic of the workflow clearly.
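
As a quick illustration, here is a toy DAG (all names here are hypothetical) with two tasks, where the second task depends on the first:

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
import datetime

# A toy two-task DAG that runs once a day
dag = DAG('toy_dag',
          start_date=datetime.datetime(2018, 12, 1),
          schedule_interval='@daily')

extract = DummyOperator(task_id='extract', dag=dag)
aggregate = DummyOperator(task_id='aggregate', dag=dag)

extract >> aggregate  # aggregate runs only after extract succeeds

In the Graph View, this renders as two nodes with an arrow pointing from extract to aggregate.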

In the example below, I’m interested in working with a BigQuery public dataset, the San Francisco 311 service requests data, and counting how many requests were reported to each agency on a daily basis. Of course, we could run this query on BigQuery manually. But with Airflow, the daily count data is saved automatically and we never need to worry about it. Here I am saving the daily output to a csv file. There is a lot more we could add onto the daily Airflow schedule; this post just demonstrates a simple daily aggregation task.
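
The query itself might look something like the following. The table and column names here are my assumptions about the public dataset’s schema, so check them in the BigQuery console before relying on them:

SELECT agency_name,
       COUNT(*) AS n_requests
FROM `bigquery-public-data.san_francisco_311.311_service_requests`
WHERE DATE(created_date) = '2018-12-09'  -- one day of data
GROUP BY agency_name
ORDER BY n_requests DESC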

To start, let’s import some useful Python libraries.

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import datetime
from datetime import timedelta
import os, errno
from…
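
The post is cut off at this point, so the remaining imports and the DAG definition itself are missing. Below is a minimal, self-contained sketch of how the rest of this daily aggregation DAG might look. Everything beyond the goal stated above is my assumption: the DAG and task names, the GCP project id, the output path, the use of pandas-gbq to run the query, and the Slack notification at the end (which the title suggests the full post covers).

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators.slack_operator import SlackAPIPostOperator
import datetime
from datetime import timedelta
import os, errno
import pandas as pd  # assumption: pandas + pandas-gbq run the query

default_args = {
    'owner': 'airflow',
    'start_date': datetime.datetime(2018, 12, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# Hypothetical DAG name; runs once per day
dag = DAG('sf_311_daily_counts',
          default_args=default_args,
          schedule_interval='@daily')

def count_requests(ds, **kwargs):
    """Count 311 requests per agency on the execution date and save a csv.

    Table and column names are assumptions about the public dataset's schema.
    """
    query = """
        SELECT agency_name, COUNT(*) AS n_requests
        FROM `bigquery-public-data.san_francisco_311.311_service_requests`
        WHERE DATE(created_date) = '{}'
        GROUP BY agency_name
    """.format(ds)
    df = pd.read_gbq(query,
                     project_id='my-gcp-project',  # hypothetical project id
                     dialect='standard')
    # Create the output directory if it does not exist yet
    output_dir = os.path.expanduser('~/airflow/sf311')  # hypothetical path
    try:
        os.makedirs(output_dir)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
    df.to_csv(os.path.join(output_dir, '{}.csv'.format(ds)), index=False)

count_task = PythonOperator(task_id='count_requests',
                            python_callable=count_requests,
                            provide_context=True,
                            dag=dag)

# Post a message to Slack once the daily csv has been written
slack_task = SlackAPIPostOperator(task_id='notify_slack',
                                  token='my-slack-api-token',  # hypothetical token
                                  channel='#data-pipeline',    # hypothetical channel
                                  text='SF 311 daily counts for {{ ds }} saved.',
                                  dag=dag)

count_task >> slack_task  # notify only after the counts are saved

Once a file like this is placed in the dags folder (~/airflow/dags by default), the scheduler picks it up and launches one DAG run per day, each producing one csv of per-agency counts.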
