Sunday, November 17, 2024

Introduction to Dask in Python

Dask is a Python library for parallel computing. Its main features are:

  1. Dynamic task scheduling, optimized for interactive computational workloads
  2. "Big data" collections, such as parallel arrays and dataframes, that extend common interfaces like NumPy and pandas to larger-than-memory or distributed environments
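
As a rough illustration of the two features listed above, the sketch below builds a tiny task graph with dask.delayed and then uses a Dask array through a NumPy-like interface. The function names double and add and the toy values are assumptions made up for this example.

```python
import dask
import dask.array as da
import numpy as np


@dask.delayed
def double(x):
    # Hypothetical task: each call becomes one node in the task graph.
    return 2 * x


@dask.delayed
def add(a, b):
    # Hypothetical task that depends on the two results above.
    return a + b


# Building the graph runs nothing yet; `total` is a lazy Delayed object.
total = add(double(1), double(2))

# The scheduler executes the three tasks only when compute() is called.
result = total.compute()

# A Dask array mirrors the NumPy interface but is split into chunks.
x = da.from_array(np.arange(10), chunks=5)
chunked_sum = x.sum().compute()

print(result, chunked_sum)
```

The same pattern applies to larger workloads: the graph is described first, and the scheduler decides how to run it.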

Why Dask?

Most big data analytics work is done with pandas and NumPy. These packages support a wide variety of computations, but they do not scale when the dataset does not fit in memory. This is where Dask comes in. When a dataset does not fit in memory, Dask splits it into smaller partitions that can be kept on disk and processed a few at a time, so the data only needs to fit on disk. Dask also lets us scale out to a cluster or scale down to a single machine, depending on the size of the dataset.

Installation

To install Dask together with all of its optional dependencies, run the following command in the terminal:

python -m pip install "dask[complete]" 

Let's see an example comparing Dask and pandas on the same CSV dataset.

1. Pandas Performance: Read the dataset using pd.read_csv()

import pandas as pd

# %time is an IPython/Jupyter magic that reports how long the statement takes
%time temp = pd.read_csv('dataset.csv', encoding='ISO-8859-1')


Output:

CPU times: user 619 ms, sys: 73.6 ms, total: 692 ms

Wall time: 705 ms

2. Dask Performance: Read the dataset using dask.dataframe.read_csv

import dask.dataframe as dd

# %time is an IPython/Jupyter magic that reports how long the statement takes
%time df = dd.read_csv("dataset.csv", encoding='ISO-8859-1')


Output:

CPU times: user 21.7 ms, sys: 938 µs, total: 22.7 ms

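Wall time: 23.2 ms

Note that these two timings are not directly comparable. dd.read_csv is lazy: it only inspects a sample of the file and builds a task graph, and the data is actually read when a result is requested, for example with df.compute(). pd.read_csv, by contrast, loads the whole file into memory right away.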

A natural question is how large datasets were handled with pandas before Dask. There are a few common tricks for managing large datasets in pandas:

  1. Use the chunksize parameter of read_csv to read the file in pieces
  2. Read only the needed columns from the CSV file (for example, with the usecols parameter of read_csv)
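
Both tricks can be sketched with plain pandas as follows. The in-memory CSV text and its column names are assumptions standing in for a large file on disk.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value,note\n" + "".join(f"{i},{i * 2},x\n" for i in range(10))

# Trick 1: read the file in chunks of 4 rows and aggregate incrementally,
# so only one chunk is held in memory at a time.
running_total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    running_total += int(chunk["value"].sum())
    n_chunks += 1

# Trick 2: load only the columns that are actually needed.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "value"])

print(n_chunks, running_total, list(subset.columns))
```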

These techniques are sufficient in most cases when reading large datasets with pandas. When they are not, Dask becomes the better choice.

Limitations of Dask

Dask has certain limitations:

  1. Dask cannot parallelize within individual tasks.
  2. As a distributed-computing framework, Dask enables remote execution of arbitrary code, so Dask workers should be hosted only within a trusted network.
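
The first limitation can be illustrated with a small sketch: a function wrapped as a single task runs as ordinary Python, so any parallelism has to come from the caller splitting the work into several tasks. The function sum_of_squares and the piece size of 250 are assumptions made up for this example.

```python
import dask


def sum_of_squares(numbers):
    # Runs as ordinary Python inside one task; Dask does not split this loop.
    return sum(n * n for n in numbers)


data = list(range(1_000))

# One task: the whole computation is a single unit of work.
single = dask.delayed(sum_of_squares)(data)

# Several tasks: the caller splits the input so the scheduler
# has independent pieces it can run in parallel.
pieces = [data[i:i + 250] for i in range(0, len(data), 250)]
partials = [dask.delayed(sum_of_squares)(piece) for piece in pieces]
combined = dask.delayed(sum)(partials)

single_result, combined_result = dask.compute(single, combined)

print(single_result, combined_result)
```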