Introduction
Machine learning models require large datasets to reach high accuracy, and training a model on a large dataset takes a correspondingly large amount of time. To avoid retraining the model again and again, we can use the joblib library: train the model once, save it with joblib, and then reuse the same trained model whenever we need it.
This post will look at using Python’s joblib package to save and load machine learning models. For this project, Google Colab is used.
Joblib is a Python library for running computationally intensive tasks in parallel. It provides a set of functions for performing operations in parallel on large data sets and for caching the results of computationally expensive functions. Joblib is especially useful for machine learning models because it allows you to save the state of your computation and resume your work later or on a different machine.
Learning Objectives
- Understanding the importance of the Joblib library and why saving our machine learning models is useful.
- How to use the joblib library to save and load a trained machine learning model.
- Understanding the different functions that are used to save and load models, including functions like “dump” and “load.”
Why Should You Use Joblib?
Using joblib has a number of benefits compared to other techniques for storing and loading machine learning models. Joblib is optimized for objects that carry large NumPy arrays, such as fitted scikit-learn models, so it can typically save and load them faster and in less space than plain pickling, especially when compression is enabled. It also provides a simpler interface than managing pickle files by hand. Last but not least, joblib makes it easy to save numerous iterations of the same model, making it simpler to compare them and identify the most accurate one.
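As a concrete illustration of the last two points, joblib.dump accepts a compress argument, and versioned filenames let you keep several saved models side by side. The sketch below is illustrative: the filenames, the compression level, and the choice of varying the C parameter are all arbitrary.
# a minimal sketch: save two versions of a model with compression
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)
# train two variants and save each under a versioned filename;
# compress=3 trades a little CPU time for a smaller file on disk
for version, c in enumerate([0.1, 1.0], start=1):
    model = LogisticRegression(C=c, max_iter=200).fit(X, y)
    joblib.dump(model, f'model_v{version}.joblib', compress=3)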
Joblib also enables multiprocessing across the cores of a single machine, which lets programmers parallelize jobs with very little extra code, and it can hand work off to distributed backends (such as Dask) to make use of resources like clusters. This makes it simple to accelerate repeated, independent computations such as model training and evaluation runs.
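Here is a minimal single-machine sketch using joblib's Parallel and delayed helpers; the square function is just a stand-in for any expensive, independent task.
from joblib import Parallel, delayed
def square(x):
    # stands in for any expensive, independent computation
    return x * x
# n_jobs=-1 uses all available CPU cores on this machine
results = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]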
Import Joblib
Import joblib using the following code:
# importing the joblib library
import joblib
If the above code gives an error, you don’t have joblib installed in your environment.
Install joblib using the following command:
!pip install joblib
Make a Machine Learning Model
We will build a logistic regression model for this purpose, using the iris dataset available in sklearn.datasets.
The Iris dataset is a well-known dataset in the field of machine learning and statistics. It contains 150 observations of iris flowers and the measurements of their sepals and petals. The dataset includes 50 observations for each of three species of iris flowers (Iris setosa, Iris virginica, and Iris versicolor). The measurements included in the dataset are sepal length, sepal width, petal length, and petal width. The Iris dataset is commonly used as a benchmark for classification algorithms as it is small, well-understood, and multi-class.
Logistic regression is a statistical method used for binary classification problems. It models the relationship between a dependent variable and one or more independent variables, estimating the probability of an event occurring based on the values of the independent variables. The output of logistic regression is a probability between 0 and 1, which can then be thresholded to make a binary decision about the class of the event. Although the classic formulation is binary, scikit-learn’s LogisticRegression also handles multi-class datasets such as Iris (for example, via multinomial or one-vs-rest schemes), which is why we can use it here. Logistic regression is widely used in various fields, including medicine, marketing, and finance, due to its simplicity, interpretability, and ability to handle various data types and distributions. Despite its simplicity, logistic regression is a powerful tool for solving many classification problems and is often a good starting point before moving to more complex machine learning models.
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
# load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# fit a logistic regression model (max_iter raised so the lbfgs solver converges cleanly)
reg = linear_model.LogisticRegression(max_iter=200)
reg.fit(X_train, y_train)
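As described above, logistic regression produces class probabilities rather than labels directly. In scikit-learn, you can inspect these probabilities with predict_proba before they are converted into a class decision:
# per-class probabilities for the first five test samples;
# each row sums to 1 across the three iris classes
probabilities = reg.predict_proba(X_test[:5])
print(probabilities)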
Saving the Model Using Joblib
Save the trained machine learning model using the dump function of the joblib library:
# save the model to a file
joblib.dump(reg, 'regression_model.joblib')
# the first parameter is the model object and the second is the filename
# under which we want to save it
# the model named 'reg' will now be saved as 'regression_model.joblib' in the
# current working directory
The image below shows the current working directory before saving the model with the joblib library.
Below is a screenshot taken after saving the model with the joblib dump method.
You can clearly see that after running joblib.dump(reg, 'regression_model.joblib'), a new file named 'regression_model.joblib' has appeared in the current directory.
Loading the Saved Model Using Joblib
Load regression_model.joblib so that we can use it to make predictions:
# load the saved model
reg = joblib.load('regression_model.joblib')
Make Predictions Using the Loaded Model
Make predictions on the test dataset using the loaded model:
# use the loaded model to make predictions
predictions = reg.predict(X_test)
predictions
Output: an array of predicted class labels (0, 1, or 2) for the test samples.
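To sanity-check that the reloaded model behaves like the original, you can score its predictions against the held-out labels; the sketch below uses scikit-learn's accuracy_score.
from sklearn.metrics import accuracy_score
# compare the loaded model's predictions with the true test labels
print(accuracy_score(y_test, predictions))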
The joblib library is very useful when we want to deploy machine learning models in applications and websites.
Joblib can be useful in development in several ways:
1. Debugging and Profiling: When building a large application with many functions, it can be hard to tell which sections of code take the longest to run. Joblib’s verbose logging for parallel and cached calls makes it easier to see where time is being spent, so you can locate and speed up the slowest parts of your application.
2. Reproducibility: Repeating the same calculations can be time-consuming when working with huge datasets. Joblib offers a way to cache the results of expensive computations so they can be reused without running the code again (see the sketch after this list). This saves time and helps guarantee the reproducibility of your results.
3. Testing: Writing tests is crucial when creating a complex program, since they ensure that the code performs as intended. Joblib can run independent function calls concurrently, so test workloads structured this way finish sooner and you learn about the state of your code more quickly. This can speed up your development process and let you write and execute more tests in less time.
4. Experimentation: Running several iterations of the code simultaneously can be useful when creating a new algorithm or testing out various strategies. Joblib offers a straightforward method for running various iterations of your code concurrently so you can rapidly compare their outcomes and determine which strategy works best.
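As a sketch of the caching mentioned in point 2, joblib's Memory class memoizes expensive function calls to disk; the cache directory name and the toy function below are arbitrary choices.
from joblib import Memory
# cache results under ./joblib_cache (the directory name is arbitrary)
memory = Memory('./joblib_cache', verbose=0)
@memory.cache
def expensive_computation(n):
    # stands in for any slow, deterministic function
    return sum(i * i for i in range(n))
print(expensive_computation(10_000_000))  # computed and written to the cache
print(expensive_computation(10_000_000))  # returned instantly from the cache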
Conclusion
In conclusion, joblib can be helpful in development by offering tools for debugging and profiling, ensuring reproducibility, accelerating testing, and enabling experimentation. With the aid of these features, you can create larger and more intricate applications with greater productivity and efficiency.
The key takeaways of this article are as follows:
- It gives developers working with Python-based machine learning frameworks like scikit-learn and TensorFlow an effective way to save and load trained models instantly, without redoing the time-consuming and expensive training process from scratch each time the model is needed.
- It also allows developers to take advantage of parallelization techniques such as multiprocessing across multiple cores on a single machine (or, with distributed backends, multiple machines), making higher performance achievable at lower cost.
- So if you’re looking for an easy way to optimize your model creation and storage processes in Python, look no further than joblib!