A Directed Acyclic Graph (DAG) is the collection of all the individual tasks that we want to run, organized so that their order of execution is defined. In other words, a DAG is a data pipeline in Airflow. In a DAG:
- There is no loop
- Edges are directed
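To make the two properties concrete, here is a minimal sketch in plain Python (it does not use Airflow) that checks whether a set of directed edges contains a loop. The task names and the is_acyclic helper are hypothetical, chosen only for illustration; the check relies on the standard library's graphlib module, which needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter, CycleError


def is_acyclic(edges):
    """Return True if the directed (upstream, downstream) edges contain no loop."""
    sorter = TopologicalSorter()
    for upstream, downstream in edges:
        # add(node, *predecessors): downstream depends on upstream
        sorter.add(downstream, upstream)
    try:
        sorter.prepare()  # raises CycleError when the graph has a cycle
    except CycleError:
        return False
    return True


# Hypothetical pipeline: extract -> transform -> load
valid_pipeline = [("extract", "transform"), ("transform", "load")]
# Adding load -> extract closes a loop, so this is no longer a DAG
looping_pipeline = valid_pipeline + [("load", "extract")]

print(is_acyclic(valid_pipeline))    # True
print(is_acyclic(looping_pipeline))  # False
```

A graph like looping_pipeline could never finish, because every task would be waiting for another one; this is why a pipeline has to be acyclic.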
Key Terminologies:
- Operator: An operator is a template that defines a single unit of work. In Airflow, each node of the DAG is a task created from an operator.
- Dependencies: The specified relationships between your operators are known as dependencies. In Airflow, the directed edges of the DAG are the dependencies.
- Tasks: Tasks are units of work in Airflow. A task is created by instantiating an operator or a sensor (a special kind of operator that waits for a condition). Hooks are not tasks; they are interfaces to external systems that operators use.
- Task Instances: A task instance is a run of a task at a point in time. Task instances are the runnable entities, and each one belongs to a DagRun.
A DAG file is a Python file that specifies the structure as well as the code of the DAG.
Steps To Create an Airflow DAG
- Importing the right modules for your DAG
- Creating default arguments for the DAG
- Creating a DAG Object
- Creating tasks
- Setting up dependencies for the DAG
Now, let’s discuss these steps one by one in detail and create a simple DAG.
Step 1: Importing the right modules for your DAG
To create a DAG, we first import every module that the DAG file will use. The first and most important import is the DAG class from the airflow package, which we use to instantiate the DAG object. Next, we import the date and time modules. After that, we import the operators that we will use in the DAG file. Here, we will import only the DummyOperator.
# To initiate the DAG Object
from airflow import DAG
# Importing datetime and timedelta modules for scheduling the DAGs
from datetime import timedelta, datetime
# Importing operators
from airflow.operators.dummy_operator import DummyOperator
Step 2: Create default arguments for the DAG
The default arguments are a dictionary that we pass to the DAG object. Airflow applies the values in this dictionary to every operator in the DAG, so we can set common parameters once instead of repeating them for each task.
Let’s create a dictionary named default_args:
# Initiating the default_args
default_args = {
        'owner' : 'airflow',
        'start_date' : datetime(2022, 11, 12)
}
- owner is the name of the owner of the DAG.
- start_date is the date from which the DAG starts getting scheduled.
We can add more such parameters to the dictionary, as per our requirements.
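As a sketch of what a fuller dictionary might look like, the example below adds a few commonly used operator arguments (depends_on_past, retries, and retry_delay). The specific values are assumptions for illustration, not recommendations.

```python
from datetime import datetime, timedelta

# Illustrative values only; adjust them to your own pipeline
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2022, 11, 12),
    # Do not make a task wait for its own previous run to succeed
    'depends_on_past': False,
    # Retry a failed task once...
    'retries': 1,
    # ...after waiting five minutes
    'retry_delay': timedelta(minutes=5),
}
```

Because these are defaults, an individual task can still override any of them by passing the same argument directly to its operator.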
Step 3: Creating DAG Object
After the default_args, we have to create a DAG object by passing a unique identifier called “dag_id”. Here, we name it DAG-1.
So, let’s create a DAG object.
# Creating DAG Object
dag = DAG(dag_id='DAG-1',
        default_args=default_args,
        schedule_interval='@once', 
        catchup=False
    )
Here,
- dag_id is the unique identifier for the DAG.
- schedule_interval defines how frequently the DAG is triggered. It accepts presets such as @once, @hourly, @daily, @weekly, @monthly, or @yearly, as well as cron expressions and timedelta objects. None means that we do not want to schedule the DAG and will trigger it manually.
- catchup: If we want the DAG to run only for the most recent interval instead of backfilling past ones, we have to set catchup to False. By default, catchup is True, which means that Airflow will create runs for all past intervals between the start_date and the current interval.
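The effect of catchup is easiest to see with a recurring schedule. The sketch below is a simplified model in plain Python, not Airflow's scheduler: it assumes a daily schedule (rather than the '@once' used in this DAG), a hypothetical current time, and a helper named completed_intervals invented for this illustration.

```python
from datetime import datetime, timedelta


def completed_intervals(start_date, now, interval):
    """List the start of every schedule interval that has fully elapsed."""
    starts = []
    current = start_date
    while current + interval <= now:
        starts.append(current)
        current += interval
    return starts


start_date = datetime(2022, 11, 12)
now = datetime(2022, 11, 15, 9, 30)  # hypothetical "current time"
daily = timedelta(days=1)

past = completed_intervals(start_date, now, daily)

# catchup=True: one run for every elapsed interval since start_date
runs_with_catchup = past
# catchup=False: only the most recent elapsed interval is run
runs_without_catchup = past[-1:]

print(len(runs_with_catchup))     # 3
print(len(runs_without_catchup))  # 1
```

In this model, three daily intervals (starting on November 12, 13, and 14) have elapsed, so catchup=True would schedule three runs, while catchup=False would schedule only the one for November 14.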
Step 4: Create tasks
A task is an instance of an operator. It has a unique identifier called task_id. There are various operators, and we can create tasks with any of them, but here we will use the DummyOperator to create two simple tasks:
# Creating first task
start = DummyOperator(task_id = 'start', dag = dag)
If you go to the graph view in the UI, you can see that the task “start” has been created.
 
# Creating second task
end = DummyOperator(task_id = 'end', dag = dag)
Now, the two tasks, start and end, have been created.
 
Step 5: Setting up dependencies for the DAG.
Dependencies are the relationships between the operators, that is, the order in which the tasks in a DAG will be executed. We can set the order of execution with the bitwise shift operators: the right shift operator (>>) sets a downstream task, and the left shift operator (<<) sets an upstream task.
- a >> b means that first, a will run, and then b will run. It can also be written as a.set_downstream(b).
- a << b means that first, b will run which will be followed by a. It can also be written as a.set_upstream(b).
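To show how >> and << can map onto set_downstream and set_upstream, here is a toy model in plain Python. ToyTask is a hypothetical stand-in written for this illustration; it is not Airflow's operator class and only mimics the behaviour described above.

```python
class ToyTask:
    """Hypothetical stand-in for a task; not the real Airflow operator."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()    # task_ids that must run before this task
        self.downstream = set()  # task_ids that run after this task

    def set_downstream(self, other):
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):
        # a >> b: a runs first, then b
        self.set_downstream(other)
        return other  # returning the right-hand task allows a >> b >> c

    def __lshift__(self, other):
        # a << b: b runs first, then a
        self.set_upstream(other)
        return other


a, b, c, d = ToyTask("a"), ToyTask("b"), ToyTask("c"), ToyTask("d")
a >> b >> c   # a, then b, then c
d << c        # c, then d

print(a.downstream)  # {'b'}
print(d.upstream)    # {'c'}
```

Both operators record the same kind of edge; they differ only in which side of the expression runs first.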
Now, let’s set up the order of execution between the start and end tasks. Suppose that we want start to run first and end to run after it.
# Setting up dependencies
start >> end
# We can also write it as start.set_downstream(end)
Now, start and end look like this after setting up the dependencies:
 
Putting all our code together:
# Step 1: Importing Modules
# To initiate the DAG Object
from airflow import DAG
# Importing datetime and timedelta modules for scheduling the DAGs
from datetime import timedelta, datetime
# Importing operators 
from airflow.operators.dummy_operator import DummyOperator
# Step 2: Initiating the default_args
default_args = {
        'owner' : 'airflow',
        'start_date' : datetime(2022, 11, 12),
}
# Step 3: Creating DAG Object
dag = DAG(dag_id='DAG-1',
        default_args=default_args,
        schedule_interval='@once', 
        catchup=False
    )
# Step 4: Creating tasks
# Creating first task
start = DummyOperator(task_id = 'start', dag = dag)
# Creating second task
end = DummyOperator(task_id = 'end', dag = dag)
# Step 5: Setting up dependencies
start >> end
Now, we have successfully created our first DAG. We can move on to the webserver to see it in the UI.
 
Now, you can click on the DAG and explore its different views in the Airflow UI.


 
                                    







