
CatBoost in Machine Learning

We often encounter datasets that contain categorical features, and to fit these datasets into a boosting model we usually apply encoding techniques such as One-Hot Encoding or Label Encoding. However, One-Hot Encoding creates a sparse matrix, which can sometimes lead to overfitting. CatBoost addresses this problem by handling categorical features automatically, with no manual encoding required.

What is CatBoost 

CatBoost, short for Categorical Boosting, is an open-source boosting library developed by Yandex. It is designed for regression and classification problems with a very large number of independent features.

CatBoost is a variant of gradient boosting that can handle both categorical and numerical features. It does not require feature-encoding techniques such as One-Hot Encoding or Label Encoding to convert categorical features into numerical ones. It also uses an algorithm called symmetric weighted quantile sketch (SWQS) to handle missing values in the dataset, which helps reduce overfitting and improve the overall performance of the model.
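
Because categorical columns can be passed to CatBoost as-is, the usual encoding step disappears entirely. Below is a minimal sketch of what that looks like; the tiny DataFrame and its column names are made up purely for illustration.

Python3

from catboost import CatBoostClassifier
import pandas as pd

# Hypothetical toy dataset with a raw string column -- no encoding needed
df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Pune"],
    "income": [50, 60, 55, 65, 58, 62],
    "bought": [0, 1, 0, 1, 1, 0],
})

model = CatBoostClassifier(iterations=10, verbose=False)
# cat_features tells CatBoost which columns to treat as categorical
model.fit(df[["city", "income"]], df["bought"], cat_features=["city"])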

Features of CatBoost 

  • Built-in handling of categorical features – CatBoost can consume categorical features directly, without any feature encoding.
  • Built-in handling of missing values – Unlike many other models, CatBoost can handle missing values in the dataset out of the box.
  • Automatic feature scaling – CatBoost internally brings all columns onto a common scale, whereas with other models we often need to scale the columns ourselves.
  • Built-in cross-validation – CatBoost ships a cv utility for cross-validating a set of training parameters (see the sketch after this list).
  • Regularization – CatBoost supports both L1 and L2 regularization methods to reduce overfitting.
  • It can be used from both Python and R.
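
As a quick illustration of the built-in cross-validation mentioned above, here is a small sketch using catboost's cv utility together with scikit-learn's bundled copy of the Iris data (the same dataset used later in this article); the parameter values are arbitrary.

Python3

from catboost import Pool, cv
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation over a fixed set of training parameters
cv_results = cv(
    pool=Pool(X, y),
    params={"loss_function": "MultiClass",
            "iterations": 50, "verbose": False},
    fold_count=5,
)
print(cv_results.head())  # per-iteration train/test metric summary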

CatBoost Comparison Results with Other Boosting Algorithms

Each cell below reports the loss on the named benchmark dataset (lower is better) together with its standard deviation; the percentage in parentheses is the relative difference from tuned CatBoost on that dataset.

| Dataset  | Default CatBoost | Tuned CatBoost | Default LightGBM | Tuned LightGBM | Default XGBoost | Tuned XGBoost |
|----------|------------------|----------------|------------------|----------------|-----------------|---------------|
| Adult    | 0.272978 (±0.0004) (+1.20%) | 0.269741 (±0.0001) | 0.287165 (±0.0000) (+6.46%) | 0.276018 (±0.0003) (+2.33%) | 0.280087 (±0.0000) (+3.84%) | 0.275423 (±0.0002) (+2.11%) |
| Amazon   | 0.138114 (±0.0004) (+0.29%) | 0.137720 (±0.0005) | 0.167159 (±0.0000) (+21.38%) | 0.163600 (±0.0002) (+18.79%) | 0.165365 (±0.0000) (+20.07%) | 0.163271 (±0.0001) (+18.55%) |
| Appet    | 0.071382 (±0.0002) (-0.18%) | 0.071511 (±0.0001) | 0.074823 (±0.0000) (+4.63%) | 0.071795 (±0.0001) (+0.40%) | 0.074659 (±0.0000) (+4.40%) | 0.071760 (±0.0000) (+0.35%) |
| Click    | 0.391116 (±0.0001) (+0.05%) | 0.390902 (±0.0001) | 0.397491 (±0.0000) (+1.69%) | 0.396328 (±0.0001) (+1.39%) | 0.397638 (±0.0000) (+1.72%) | 0.396242 (±0.0000) (+1.37%) |
| Internet | 0.220206 (±0.0005) (+5.49%) | 0.208748 (±0.0011) | 0.236269 (±0.0000) (+13.18%) | 0.223154 (±0.0005) (+6.90%) | 0.234678 (±0.0000) (+12.42%) | 0.225323 (±0.0002) (+7.94%) |
| Kdd98    | 0.194794 (±0.0001) (+0.06%) | 0.194668 (±0.0001) | 0.198369 (±0.0000) (+1.90%) | 0.195759 (±0.0001) (+0.56%) | 0.197949 (±0.0000) (+1.69%) | 0.195677 (±0.0000) (+0.52%) |
| Kddchurn | 0.231935 (±0.0004) (+0.28%) | 0.231289 (±0.0002) | 0.235649 (±0.0000) (+1.88%) | 0.232049 (±0.0001) (+0.33%) | 0.233693 (±0.0000) (+1.04%) | 0.233123 (±0.0001) (+0.79%) |
| Kick     | 0.284912 (±0.0003) (+0.04%) | 0.284793 (±0.0002) | 0.298774 (±0.0000) (+4.91%) | 0.295660 (±0.0000) (+3.82%) | 0.298161 (±0.0000) (+4.69%) | 0.294647 (±0.0000) (+3.46%) |
| Upsel    | 0.166742 (±0.0002) (+0.37%) | 0.166128 (±0.0002) | 0.171071 (±0.0000) (+2.98%) | 0.166818 (±0.0000) (+0.42%) | 0.168732 (±0.0000) (+1.57%) | 0.166322 (±0.0001) (+0.12%) |

CatBoost Installation

CatBoost is an open-source library that does not come pre-installed with Python, so before using CatBoost we must install it on our local system.

To install CatBoost in Python:

pip install catboost

To install CatBoost in R (the package is not distributed on CRAN, so it is installed from a binary on CatBoost's GitHub releases page; the URL below is a template, substitute the release version and platform you need):

install.packages('devtools')
devtools::install_url('https://github.com/catboost/catboost/releases/download/v<VERSION>/catboost-R-<PLATFORM>-<VERSION>.tgz')
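
Either way, you can confirm that the Python package is importable and see which version was installed (catboost exposes a __version__ attribute):

Python3

import catboost
print(catboost.__version__)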

Python Implementation of CatBoost 

We will use Python to apply CatBoost to a machine learning problem. The dataset for the project can be found here. In this problem, we are given a dataset containing 3 species of flowers along with measurements of these flowers (sepal length, sepal width, petal length, and petal width), and we have to classify each flower into one of these species.
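
If you do not have the Iris.csv file locally, a copy of the same dataset ships with scikit-learn, so you can build an equivalent DataFrame yourself. A small sketch (the renaming simply mirrors the CSV's Species column; note there is no Id column in this version):

Python3

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
data = iris.frame.rename(columns={"target": "Species"})
# Map the integer class labels back to species names
data["Species"] = data["Species"].map(dict(enumerate(iris.target_names)))
print(data.shape)  # (150, 5) -- no Id column here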

Importing libraries For CatBoost

After installing CatBoost on our local system, we will import it along with the other Python libraries needed for this project.

Python3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings("ignore")


Reading and Describing the Dataset

After importing the libraries, we will load our dataset using the pandas read_csv method:

Python3
# Reading the dataset from the csv file
data = pd.read_csv("Iris.csv")
  
# Printing the shape of the dataset
print(data.shape)


Output:

(150, 6)

Our dataset has 150 rows and 6 columns. Let’s explore the dataset content using the head() method as follows:

Python3
data.head()


Output:

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

Dropping ID Column and Separating Target Variable from The Dataset

The first column is the Id column, which has no relevance to the flowers, so we will drop it using the drop() function. The Species column is our target feature and tells us which species each flower belongs to. We will separate it from the features using pandas iloc slicing.

Python3
data = data.drop('Id', axis=1)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print("Shape of X is %s and shape of y is %s" % (X.shape, y.shape))


Output:

Shape of X is (150, 4) and shape of y is (150,)

Unique Values in Our Dependent Variable 

Since this is a classification task, we may want to determine the number of unique categories in our dependent variable.

Python3
total_classes = y.nunique()
print("Number of unique species in the dataset:", total_classes)


Output:

Number of unique species in the dataset: 3

There are 3 unique classes in our dependent variable. We may also want to see the count of each class to check whether the dataset is balanced.

Python3
distribution = y.value_counts()
print(distribution)


Output:

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64

Digging deeper into the output above, we can see that our dataset contains 3 classes across which the flowers are distributed; since we have 150 samples in total, all three species have an equal number of samples, so there is no class imbalance.

Splitting The Dataset 

Now we will split the dataset for training and validation purposes; the validation set will be 25% of the total dataset. To divide the dataset into training and testing sets, we will use the train_test_split method from sklearn's model_selection module.

Python3
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.25, random_state=28)
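
Because the classes here are perfectly balanced, a plain random split works fine. On an imbalanced dataset you may want a stratified split instead; the same call accepts one extra argument for that:

Python3

# stratify=y keeps the class proportions equal in both splits
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.25, random_state=28, stratify=y)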


Applying CatBoost to the Data

Python3
# Define the hyperparameters for the CatBoost algorithm
params = {'learning_rate': 0.1, 'depth': 6,
          'l2_leaf_reg': 3, 'iterations': 100}

# Initialize the CatBoostClassifier object with the defined
# hyperparameters and fit it on the training set
model = CatBoostClassifier(**params)
model.fit(X_train, Y_train)
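
By default, fit() logs one line per boosting iteration. If you also hand it the validation set, CatBoost will track the validation metric during training and can stop early once it stops improving. A sketch using fit's optional arguments (the round and logging values here are arbitrary):

Python3

model = CatBoostClassifier(**params)
model.fit(
    X_train, Y_train,
    eval_set=(X_val, Y_val),    # monitor validation loss while training
    early_stopping_rounds=20,   # stop if no improvement for 20 rounds
    verbose=25,                 # print progress every 25 iterations
)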


Accuracy of the CatBoost Model

Python3
# Predict the target variable on the validation set and evaluate
# the performance. CatBoostClassifier.predict returns a 2-D array
# of shape (n_samples, 1), so flatten it before comparing; matching
# it directly against the 1-D Y_val would broadcast to an (n, n)
# boolean matrix and report a meaningless "accuracy" of about 1/3.
y_pred = model.predict(X_val).flatten()
accuracy = (y_pred == np.array(Y_val)).mean()
print("Validation Accuracy:", accuracy)


With the predictions flattened, the printed accuracy should be close to 1.0, since Iris is an easily separable dataset (the exact value depends on the random split). Without the .flatten() call, the comparison broadcasts the (38, 1) prediction array against the (38,) target array, producing a 38×38 boolean matrix whose mean is roughly 1/3; that is why a naive version of this snippet reports a validation accuracy of only 0.33518005540166207.
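
Equivalently, you can let scikit-learn compute the metric, which sidesteps the shape issue entirely; a short sketch:

Python3

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_val).ravel()
print("Validation Accuracy:", accuracy_score(Y_val, y_pred))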
