
CatBoost in Machine Learning

We often encounter datasets that contain categorical features, and to fit these datasets into a boosting model we usually apply encoding techniques such as One-Hot Encoding or Label Encoding. However, One-Hot Encoding creates a sparse matrix, which can sometimes lead to overfitting. CatBoost addresses this problem by handling categorical features automatically, with no manual encoding required.

What is CatBoost 

CatBoost, short for Categorical Boosting, is an open-source boosting library developed by Yandex. It is designed for regression and classification problems with a very large number of independent features.

CatBoost is a variant of gradient boosting that can handle both categorical and numerical features. It does not require feature-encoding techniques such as One-Hot Encoding or Label Encoding to convert categorical features into numerical ones. It also uses an algorithm called symmetric weighted quantile sketch (SWQS) to handle missing values in the dataset, which helps reduce overfitting and improve the overall performance of the model.
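
Because categorical columns can be passed to CatBoost as-is, the usual encoding step disappears entirely. Below is a minimal sketch of what that looks like; the tiny DataFrame and its column names are made up purely for illustration.

Python3

from catboost import CatBoostClassifier
import pandas as pd

# Hypothetical toy dataset with a raw string column -- no encoding needed
df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Pune"],
    "income": [50, 60, 55, 65, 58, 62],
    "bought": [0, 1, 0, 1, 1, 0],
})

model = CatBoostClassifier(iterations=10, verbose=False)
# cat_features tells CatBoost which columns to treat as categorical
model.fit(df[["city", "income"]], df["bought"], cat_features=["city"])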

Features of CatBoost 

  • Built-in handling of categorical features – CatBoost can consume categorical features directly, without any feature encoding.
  • Built-in handling of missing values – Unlike many other models, CatBoost can handle missing values in the dataset out of the box.
  • Automatic feature scaling – CatBoost internally brings all columns onto a common scale, whereas with other models we often need to scale the columns ourselves.
  • Built-in cross-validation – CatBoost ships a cv utility for cross-validating a set of training parameters (see the sketch after this list).
  • Regularization – CatBoost supports both L1 and L2 regularization methods to reduce overfitting.
  • It can be used from both Python and R.
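
As a quick illustration of the built-in cross-validation mentioned above, here is a small sketch using catboost's cv utility together with scikit-learn's bundled copy of the Iris data (the same dataset used later in this article); the parameter values are arbitrary.

Python3

from catboost import Pool, cv
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation over a fixed set of training parameters
cv_results = cv(
    pool=Pool(X, y),
    params={"loss_function": "MultiClass",
            "iterations": 50, "verbose": False},
    fold_count=5,
)
print(cv_results.head())  # per-iteration train/test metric summary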

CatBoost Comparison Results with Other Boosting Algorithms

Each cell below reports the loss on the named benchmark dataset (lower is better) together with its standard deviation; the percentage in parentheses is the relative difference from tuned CatBoost on that dataset.

| Dataset  | Default CatBoost | Tuned CatBoost | Default LightGBM | Tuned LightGBM | Default XGBoost | Tuned XGBoost |
|----------|------------------|----------------|------------------|----------------|-----------------|---------------|
| Adult    | 0.272978 (±0.0004) (+1.20%) | 0.269741 (±0.0001) | 0.287165 (±0.0000) (+6.46%) | 0.276018 (±0.0003) (+2.33%) | 0.280087 (±0.0000) (+3.84%) | 0.275423 (±0.0002) (+2.11%) |
| Amazon   | 0.138114 (±0.0004) (+0.29%) | 0.137720 (±0.0005) | 0.167159 (±0.0000) (+21.38%) | 0.163600 (±0.0002) (+18.79%) | 0.165365 (±0.0000) (+20.07%) | 0.163271 (±0.0001) (+18.55%) |
| Appet    | 0.071382 (±0.0002) (-0.18%) | 0.071511 (±0.0001) | 0.074823 (±0.0000) (+4.63%) | 0.071795 (±0.0001) (+0.40%) | 0.074659 (±0.0000) (+4.40%) | 0.071760 (±0.0000) (+0.35%) |
| Click    | 0.391116 (±0.0001) (+0.05%) | 0.390902 (±0.0001) | 0.397491 (±0.0000) (+1.69%) | 0.396328 (±0.0001) (+1.39%) | 0.397638 (±0.0000) (+1.72%) | 0.396242 (±0.0000) (+1.37%) |
| Internet | 0.220206 (±0.0005) (+5.49%) | 0.208748 (±0.0011) | 0.236269 (±0.0000) (+13.18%) | 0.223154 (±0.0005) (+6.90%) | 0.234678 (±0.0000) (+12.42%) | 0.225323 (±0.0002) (+7.94%) |
| Kdd98    | 0.194794 (±0.0001) (+0.06%) | 0.194668 (±0.0001) | 0.198369 (±0.0000) (+1.90%) | 0.195759 (±0.0001) (+0.56%) | 0.197949 (±0.0000) (+1.69%) | 0.195677 (±0.0000) (+0.52%) |
| Kddchurn | 0.231935 (±0.0004) (+0.28%) | 0.231289 (±0.0002) | 0.235649 (±0.0000) (+1.88%) | 0.232049 (±0.0001) (+0.33%) | 0.233693 (±0.0000) (+1.04%) | 0.233123 (±0.0001) (+0.79%) |
| Kick     | 0.284912 (±0.0003) (+0.04%) | 0.284793 (±0.0002) | 0.298774 (±0.0000) (+4.91%) | 0.295660 (±0.0000) (+3.82%) | 0.298161 (±0.0000) (+4.69%) | 0.294647 (±0.0000) (+3.46%) |
| Upsel    | 0.166742 (±0.0002) (+0.37%) | 0.166128 (±0.0002) | 0.171071 (±0.0000) (+2.98%) | 0.166818 (±0.0000) (+0.42%) | 0.168732 (±0.0000) (+1.57%) | 0.166322 (±0.0001) (+0.12%) |

CatBoost Installation

CatBoost is an open-source library that does not come pre-installed with Python, so before using CatBoost we must install it on our local system.

To install CatBoost in Python:

pip install catboost

To install CatBoost in R (the package is not distributed on CRAN, so it is installed from a binary on CatBoost's GitHub releases page; the URL below is a template, substitute the release version and platform you need):

install.packages('devtools')
devtools::install_url('https://github.com/catboost/catboost/releases/download/v<VERSION>/catboost-R-<PLATFORM>-<VERSION>.tgz')
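
Either way, you can confirm that the Python package is importable and see which version was installed (catboost exposes a __version__ attribute):

Python3

import catboost
print(catboost.__version__)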

Python Implementation of CatBoost 

We will use Python to apply CatBoost to a machine learning problem. The dataset for the project can be found here. In this problem, we are given a dataset containing 3 species of flowers along with measurements of these flowers (sepal length, sepal width, petal length, and petal width), and we have to classify each flower into one of these species.
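
If you do not have the Iris.csv file locally, a copy of the same dataset ships with scikit-learn, so you can build an equivalent DataFrame yourself. A small sketch (the renaming simply mirrors the CSV's Species column; note there is no Id column in this version):

Python3

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
data = iris.frame.rename(columns={"target": "Species"})
# Map the integer class labels back to species names
data["Species"] = data["Species"].map(dict(enumerate(iris.target_names)))
print(data.shape)  # (150, 5) -- no Id column here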

Importing libraries For CatBoost

After installing CatBoost on our local system, we will import it along with the other Python libraries needed for this project.

Python3
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings("ignore")


Reading and Describing the Dataset

After importing the libraries, we will load our dataset using the pandas read_csv method:

Python3
# Reading the dataset from the csv file
data = pd.read_csv("Iris.csv")
  
# Printing the shape of the dataset
print(data.shape)


Output:

(150, 6)

Our dataset has 150 rows and 6 columns. Let’s explore the dataset content using the head() method as follows:

Python3
data.head()


Output:

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

Dropping ID Column and Separating Target Variable from The Dataset

The first column is the Id column, which has no relevance to the flowers, so we will drop it using the drop() function. The Species column is our target feature and tells us which species each flower belongs to. We will separate it from the features using pandas iloc slicing.

Python3
data = data.drop('Id', axis=1)
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
print("Shape of X is %s and shape of y is %s" % (X.shape, y.shape))


Output:

Shape of X is (150, 4) and shape of y is (150,)

Unique Values in Our Dependent Variable 

Since this is a classification task, we may want to determine the number of unique categories in our dependent variable.

Python3
total_classes = y.nunique()
print("Number of unique species in the dataset:", total_classes)


Output:

Number of unique species in the dataset: 3

There are 3 unique classes in our dependent variable. We may also want to see the count of each class to check whether the dataset is balanced.

Python3
distribution = y.value_counts()
print(distribution)


Output:

Iris-virginica     50
Iris-setosa        50
Iris-versicolor    50
Name: Species, dtype: int64

Digging deeper into the output above, we can see that our dataset contains 3 classes across which the flowers are distributed; since we have 150 samples in total, all three species have an equal number of samples, so there is no class imbalance.

Splitting The Dataset 

Now we will split the dataset for training and validation purposes; the validation set will be 25% of the total dataset. To divide the dataset into training and testing sets, we will use the train_test_split method from sklearn's model_selection module.

Python3
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.25, random_state=28)
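
Because the classes here are perfectly balanced, a plain random split works fine. On an imbalanced dataset you may want a stratified split instead; the same call accepts one extra argument for that:

Python3

# stratify=y keeps the class proportions equal in both splits
X_train, X_val, Y_train, Y_val = train_test_split(
    X, y, test_size=0.25, random_state=28, stratify=y)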


Applying CatBoost to the Data

Python3
# Define the hyperparameters for the CatBoost algorithm
params = {'learning_rate': 0.1, 'depth': 6,
          'l2_leaf_reg': 3, 'iterations': 100}

# Initialize the CatBoostClassifier object with the defined
# hyperparameters and fit it on the training set
model = CatBoostClassifier(**params)
model.fit(X_train, Y_train)
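
By default, fit() logs one line per boosting iteration. If you also hand it the validation set, CatBoost will track the validation metric during training and can stop early once it stops improving. A sketch using fit's optional arguments (the round and logging values here are arbitrary):

Python3

model = CatBoostClassifier(**params)
model.fit(
    X_train, Y_train,
    eval_set=(X_val, Y_val),    # monitor validation loss while training
    early_stopping_rounds=20,   # stop if no improvement for 20 rounds
    verbose=25,                 # print progress every 25 iterations
)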


Accuracy of the CatBoost Model

Python3
# Predict the target variable on the validation set and evaluate
# the performance. CatBoostClassifier.predict returns a 2-D array
# of shape (n_samples, 1), so flatten it before comparing; matching
# it directly against the 1-D Y_val would broadcast to an (n, n)
# boolean matrix and report a meaningless "accuracy" of about 1/3.
y_pred = model.predict(X_val).flatten()
accuracy = (y_pred == np.array(Y_val)).mean()
print("Validation Accuracy:", accuracy)


With the predictions flattened, the printed accuracy should be close to 1.0, since Iris is an easily separable dataset (the exact value depends on the random split). Without the .flatten() call, the comparison broadcasts the (38, 1) prediction array against the (38,) target array, producing a 38×38 boolean matrix whose mean is roughly 1/3; that is why a naive version of this snippet reports a validation accuracy of only 0.33518005540166207.
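
Equivalently, you can let scikit-learn compute the metric, which sidesteps the shape issue entirely; a short sketch:

Python3

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_val).ravel()
print("Validation Accuracy:", accuracy_score(Y_val, y_pred))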
