In this article, we will see how to massively reduce the execution time of large computations by running code in parallel with Python's Joblib module.
Introduction to the Joblib Module
The Joblib module in Python is used to execute tasks in parallel using pipelines rather than sequentially, one after another. It lets the user exploit the full potential of their device by utilizing all available CPU cores to make processing as fast as possible. Joblib can also cache the result of a computation and reuse it on subsequent calls, which can cut execution time dramatically. In addition, we can run multiple jobs in parallel at the same time, although the number of jobs that can run concurrently is limited by the number of free CPU cores available at that moment.
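The caching behavior described above is provided by joblib's Memory class. Below is a minimal sketch; the cache directory name `./joblib_cache` and the function name `slow_square` are our own choices for illustration.

```python
from joblib import Memory

# Cache results on disk; the directory name is arbitrary
memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def slow_square(n):
    # Pretend this is an expensive computation
    return n * n

print(slow_square(4))  # computed and written to the cache
print(slow_square(4))  # served from the cache on the second call
```

The second call returns the stored result instead of recomputing it, which is where the speedup comes from for genuinely expensive functions.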
The Joblib module can also dump and load results, datasets, models, etc., much like the Pickle module; we simply pass the path along with the file name to load or dump an object. Joblib additionally provides compression for large datasets so that they are easier to store and load. Available compression methods in the Joblib module include Zlib and LZ4; while dumping the dataset, we pass the compression type as a parameter of Joblib's dump method. The files are stored with the extension of the compression used: .zlib for Zlib compression and .lz4 for LZ4 compression.
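A minimal sketch of dumping and loading with Zlib compression follows; the file name `data.pkl.zlib` and the sample data are illustrative, not from the original article.

```python
from joblib import dump, load

data = list(range(1000))

# compress takes (method, level); level 3 is a common middle ground
dump(data, "data.pkl.zlib", compress=("zlib", 3))

restored = load("data.pkl.zlib")
print(restored[:3])  # [0, 1, 2]
```

The same call works with `compress=("lz4", 3)` provided the optional lz4 package is installed.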
Prerequisite
The user must be familiar with Python; knowledge of multiprocessing concepts is a bonus.
Required Modules
For this tutorial, we will need the joblib module alongside the time and math modules. Run the command below to install joblib.
pip install joblib
The time and math modules come pre-installed with Python, so there is no need to install them separately.
Stepwise Implementation:
First, we will import the required names from the joblib module, along with the time and math modules.
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# Normal (sequential) execution
r = [math.factorial(int(math.sqrt(i ** 3))) for i in range(100, 1000)]

t2 = time.time()
print(t2 - t1)
Here we import Parallel and delayed from the joblib module, then first measure how long the operation takes to execute normally. In this example we compute the square root of the cube of each number from 100 to 999 and then take its factorial; you may try any other operation, but make it as complex as possible so the difference in results is clearer.
Output:
Now we will reduce this time as much as possible using the Parallel and delayed functions of the joblib module.
Using 2 cores
Using the Parallel function, we will execute this code on 2 cores, delaying the factorial function.
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 2 cores
r1 = Parallel(n_jobs=2)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Here the Parallel function takes an argument n_jobs, to which we pass the number of cores we want to use, i.e., how many pipelines will execute the code in parallel. We then wrap math.factorial with delayed so that it runs in every pipeline, and pass it the main operation. Remember to always wrap the outermost function in delayed: in this example, if we wrapped math.sqrt instead, we would not see an improvement, because the result of sqrt is still needed for further computation.
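An alternative way to keep the whole per-item computation inside each worker is to move it into one helper function and delay that helper instead. A minimal sketch (the helper name `sqrt_factorial` and the shortened range are our own, chosen to keep the example fast):

```python
import math
from joblib import Parallel, delayed

def sqrt_factorial(i):
    # The entire per-item computation runs inside each worker
    return math.factorial(int(math.sqrt(i ** 3)))

results = Parallel(n_jobs=2)(delayed(sqrt_factorial)(i) for i in range(100, 110))
print(len(results))  # 10
```

This is equivalent to delaying math.factorial directly, but makes it explicit that the full chain of calls, not just the innermost step, is what each pipeline executes.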
Output:
Using 3 Cores
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 3 cores
r1 = Parallel(n_jobs=3)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output:
Using 4 Cores
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 4 cores
r1 = Parallel(n_jobs=4)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output:
Using 5 Cores
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 5 cores
r1 = Parallel(n_jobs=5)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output:
Using 6 Cores
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 6 cores
r1 = Parallel(n_jobs=6)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output:
Using 7 Cores
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 7 cores
r1 = Parallel(n_jobs=7)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output:
Using 8 Cores
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# 8 cores
r1 = Parallel(n_jobs=8)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output:
Using all cores
If you don't know how many cores your device has but want to use as many as possible, pass -1 as the value of the n_jobs parameter.
Python3
import time
import math
from joblib import Parallel, delayed

t1 = time.time()

# Use all available cores
r1 = Parallel(n_jobs=-1)(delayed(math.factorial)(int(math.sqrt(i ** 3))) for i in range(100, 1000))

t2 = time.time()
print(t2 - t1)
Output: