Decision trees (DTs) are one of the most popular algorithms in machine learning: they are easy to visualize, highly interpretable, super flexible, and can be applied to both classification and regression problems. DTs predict the value of a target variable by learning simple decision rules inferred from the data features.
In my post “The Complete Guide to Decision Trees,” I describe Decision Trees in detail: their real-life applications, different DT types and algorithms, and their pros and cons. Now it’s time to get pragmatic. How do you build a DT? And how do you apply it to real data? DTs are nothing but algorithms (or sequences of steps), which makes them perfect for programming languages. Let’s see how.
The Problem
The World Bank assigns the world’s economies into four income groups:
- High
- Upper-middle
- Lower-middle
- Low
This assignment is based on Gross National Income (GNI) per capita calculated using the Atlas method (measured in current US dollars), and the categories are defined as of July 1, 2018. Using data pre-processing techniques, I’ve created a dataset that also includes other variables by country, such as population, surface area, purchasing power, and GDP. You can download the dataset at this link.
The goal of this Classification Tree is to predict the income group of a country based on the variables included in the dataset.
The Steps
You can cut down the complexity of building DTs by dealing with simpler sub-steps: each individual sub-routine in a DT will connect to other ones to increase complexity, and this construction will let you reach more robust models that are easier to maintain and improve. Now, let’s build a Classification Tree (a special type of DT) in Python.
Load data and describe the dataset
Loading a data file is the easy part. The hard (and most time-consuming) part is usually data preparation: setting the right data formats, dealing with missing values and outliers, eliminating duplicates, etc.
Before loading the data, we’ll import the necessary libraries:
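# pandas reads .xlsx files through an Excel engine (such as openpyxl) that must be installed
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# metrics used later in the evaluation section
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report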
Now we load the dataset:
df_c = pd.read_excel("macrodata_class.xlsx")
Take a look at the data:
df_c.head()
We can see that there are some missing values. Let’s check the types of variables:
df_c.info()
From a total of 215 entries, we have missing values in almost all of our columns. Also, we can see that except for the variables “country” and “class” (our target variable), which are defined as “objects”, the rest are numeric variables (float64).
We can count how many missing values are in each variable:
print(df_c.isnull().sum())
Let’s drop those missing values before building the DT:
df_c.dropna(inplace=True)
If we describe our dataset again, we see that the rows with missing values were removed:
df_c.info()
You can find additional techniques for exploratory data analysis on this link.
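Dropping rows is the simplest way to handle missing values, but it can discard a lot of data. The sketch below uses a small invented DataFrame (not the World Bank dataset) to show the count-then-drop pattern used above, next to median imputation, which is one possible alternative that this post does not use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "population": [1.0, np.nan, 3.0, 4.0],
    "gdp": [10.0, 20.0, np.nan, 40.0],
    "class": ["Low income", "High income", "Low income", "High income"],
})

# Count missing values per column, as done with df_c above.
missing_per_column = toy.isnull().sum()

# Option 1 (used in this post): drop every row with at least one missing value.
dropped = toy.dropna()

# Option 2 (alternative, not used in this post): fill numeric gaps with the column median.
numeric_cols = toy.select_dtypes(include="number").columns
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(missing_per_column)
print(f"Rows kept after dropna: {len(dropped)} of {len(toy)}")
print(f"Rows kept after median imputation: {len(imputed)} of {len(toy)}")
```

Which option fits depends on how many rows are affected and on whether the gaps are random; the toy frame only illustrates the mechanics.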
Select features and target variable
You need to divide your columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables). First, we define the features:
X_c = df_c.iloc[:,1:8].copy()
And then the target variable:
y_c = df_c.iloc[:,8].copy()
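Positional slicing with iloc works as long as the column order never changes. As a hedged alternative, the sketch below selects features and target by name on a small invented DataFrame. The column names "country" and "class" come from the dataset description above; the other columns and all values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same layout idea: identifier first, target last.
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [1.0, 2.0, 3.0],
    "gdp": [10.0, 20.0, 30.0],
    "class": ["Low income", "High income", "Low income"],
})

id_col = "country"    # identifier, not a feature
target_col = "class"  # the variable we want to predict

# Name-based selection: everything except the identifier and the target.
X_toy = toy.drop(columns=[id_col, target_col]).copy()
y_toy = toy[target_col].copy()

# Equivalent positional selection for this toy frame (columns 1 and 2).
X_pos = toy.iloc[:, 1:3].copy()

print(list(X_toy.columns))
print(X_toy.equals(X_pos))
```

Both approaches return the same feature matrix here; the name-based version keeps working if columns are reordered or new ones are added.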
Split the dataset
To understand model performance, dividing the dataset into a training set and a test set is a good strategy. By splitting the dataset into two separate sets, we can train using one set and test using another.
- Training set: this data is used to build your model. E.g. using the CART algorithm to create a Decision Tree.
- Testing set: this data is used to see how the model performs on unseen data, as it would in a real-world situation. This data should be left completely unseen until you would like to test your model to evaluate performance.
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(X_c, y_c, test_size=0.3, random_state=432, stratify=y_c)
A couple of things here: we split our dataset into 70% train and 30% test, and we used stratified sampling so that each income group appears in the train and test sets in the same proportion as in the full target variable.
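To see what the stratify argument does, here is a small sketch on an invented, deliberately imbalanced target (80 rows of one class, 20 of another). The data and variable names are assumptions for illustration only; the call to train_test_split mirrors the one above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80% "Low income", 20% "High income".
y = pd.Series(["Low income"] * 80 + ["High income"] * 20)
X = pd.DataFrame({"feature": range(len(y))})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=432, stratify=y
)

# With stratification, both subsets keep the 80/20 class balance.
print(y.value_counts(normalize=True))
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```

Without stratify, a small test set could end up with very few (or no) examples of a rare class, which makes per-class metrics unreliable.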
Build DT model and fine-tune
Building a DT is as simple as this:
clf = DecisionTreeClassifier(criterion='gini', min_samples_leaf=10)
In this case, we only defined the splitting criterion (we chose the Gini index instead of entropy) and a single hyperparameter (the minimum number of samples per leaf). A hyperparameter is a parameter whose value is set before the learning process begins and can’t be learned directly from the data. Hyperparameters define the model architecture, so the process of searching for the ideal architecture (the one that maximizes model performance) is referred to as hyperparameter tuning.
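The two splitting criteria mentioned above both measure how mixed the classes in a node are. The helper functions below are a minimal sketch written for this explanation (they are not scikit-learn functions): Gini impurity is 1 minus the sum of squared class proportions, and entropy is the sum of p * log2(1/p) over the class proportions p.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over class proportions."""
    n = len(labels)
    return sum((count / n) * math.log2(n / count) for count in Counter(labels).values())

pure = ["High"] * 4                     # a node with a single class
mixed = ["High", "High", "Low", "Low"]  # a node split evenly between two classes

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

A pure node scores 0 under both measures, and an evenly mixed two-class node scores the two-class maximum (0.5 for Gini, 1 bit for entropy). At each split, the tree picks the rule that reduces the chosen measure the most.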
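You can take a look at the rest of the hyperparameters you can tune by inspecting the model (recent scikit-learn versions only print the non-default values when you call the model directly, so get_params() is more reliable):
clf.get_params()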
Models can have many hyperparameters and there are different strategies for finding the best combination of parameters. You can take a look at some of them on this link.
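One common strategy is an exhaustive grid search with cross-validation. The sketch below runs on synthetic data generated with make_classification, not on the World Bank dataset, and the grid values are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem with 7 features (a stand-in for real data).
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Candidate values to try; every combination is evaluated with 5-fold CV.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy of best model:", search.best_score_)
print("Test accuracy of best model:", search.score(X_test, y_test))
```

The search only ever sees the training set; the test set is used once at the end, which keeps the final score an honest estimate.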
Train DT model
Fitting your model to the training data is the training part of the modeling process. Once trained, the model can make predictions through its predict method. First, we fit it:
model_c = clf.fit(X_c_train, y_c_train)
Test DT model
A test dataset is independent of the training dataset. It is data your model has not seen, so the predictions on it show how well the model generalizes:
y_c_pred = model_c.predict(X_c_test)
Visualize
One of the biggest strengths of Decision Trees is their interpretability. Visualizing DTs is not only a powerful way to understand your model, but also to communicate how your model works:
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(model_c, feature_names=list(X_c), class_names=sorted(y_c.unique()), filled=True)
graphviz.Source(dot_data)
The variable “purchasing.power.cap” seems to be very important to define the class or target variable: high income countries are located on the right, upper middle income in the middle, and low/lower middle income on the left.
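If Graphviz is not available, scikit-learn can also print the learned rules as plain text with export_text. The sketch below uses the built-in iris dataset as a stand-in, since the World Bank file is not bundled with this post; the tree depth is an arbitrary choice to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the classic iris dataset that ships with scikit-learn.
iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_tree.fit(iris.data, iris.target)

# Each indented line is a decision rule; leaves show the predicted class.
rules = export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```

The text output carries the same information as the diagram (split feature, threshold, predicted class), which makes it handy for logs or quick checks.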
Evaluate Performance
Evaluating your machine learning algorithm is an essential part of any project: how can you measure its success and when do you know that it shouldn’t be improved any more? Different machine learning algorithms have varying evaluation metrics. Let’s mention some of the main ones for Classification Trees:
Accuracy score
Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
print('Accuracy is:', accuracy_score(y_c_test, y_c_pred))
Our accuracy is 71.9%. Not bad if we consider that we can improve our model by generating new features or adjusting hyperparameters. But this is a global metric, so let’s get into more details with other measures.
Confusion matrix
The confusion matrix is one of the most intuitive tools for judging the correctness and accuracy of a model. It is used for classification problems where the output can belong to two or more classes. To understand it, we first need to define some terms:
- True Positive (TP): the model correctly predicted a Positive case as Positive. E.g. an illness diagnosed as present is truly present.
- False Positive (FP): the model incorrectly predicted a Negative case as Positive. E.g. an illness diagnosed as present is absent (Type I error).
- False Negative (FN): the model incorrectly predicted a Positive case as Negative. E.g. an illness diagnosed as absent is present (Type II error).
- True Negative (TN): the model correctly predicted a Negative case as Negative. E.g. an illness diagnosed as absent is truly absent.
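For a two-class problem, the four counts above can be read straight off the matrix. The labels below are invented for illustration (1 = illness present, 0 = illness absent).

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = illness present (Positive), 0 = illness absent (Negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so flattening the 2x2 matrix yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=3, FP=1, FN=1, TN=3
```

With more than two classes, each class takes its turn as the Positive class, as in the matrix for our model: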
cmatrix = confusion_matrix(y_c_test,y_c_pred, labels=y_c_test.unique())
pd.DataFrame(cmatrix, index=y_c_test.unique(), columns=y_c_test.unique())
In the case of a multi-class confusion matrix like ours, the matrix has one row and one column per class (in our example, 4 x 4). Our DT correctly predicted 17 out of the 19 “High income” instances, 6 out of the 8 “Low income” cases, 8 out of the 13 “Lower middle income” instances, and 10 out of the 17 “Upper middle income” cases.
For a full explanation of a multi-class confusion matrix, check this article.
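The per-class counts quoted above are enough to recover the global accuracy and the recall of each class. This is a quick arithmetic check that uses only the numbers reported in this post:

```python
# Correct predictions and actual instances per class, as reported above.
correct = {
    "High income": 17,
    "Low income": 6,
    "Lower middle income": 8,
    "Upper middle income": 10,
}
actual = {
    "High income": 19,
    "Low income": 8,
    "Lower middle income": 13,
    "Upper middle income": 17,
}

# Global accuracy: all correct predictions over all test instances (41 / 57).
accuracy = sum(correct.values()) / sum(actual.values())
print(f"Accuracy: {accuracy:.3f}")  # Accuracy: 0.719

# Per-class recall: correct predictions over actual instances of that class.
recall = {name: correct[name] / actual[name] for name in correct}
for name, value in recall.items():
    print(f"{name}: {value:.3f}")
```

The result matches the 71.9% accuracy reported earlier, and it also shows that the model finds “High income” countries far more reliably than the two middle groups.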
Classification report
The classification report shows the main classification metrics on a per-class basis. This gives a deeper intuition of the classifier’s behavior than global accuracy, which can mask functional weaknesses in one class of a multi-class problem. The classification report includes metrics such as:
- Precision (TP/(TP+FP)): is the ratio of correctly predicted positive observations to the total predicted positive observations. For each class, it is defined as the ratio of true positives to the sum of true and false positives. Said another way, how “precise” is the classifier when predicting positive instances?
- Recall (TP/(TP+FN)): is the ability of a classifier to find all positive instances. For each class, it is defined as the ratio of true positives to the sum of true positives and false negatives. Said another way, “for all instances that were actually positive, what percent was classified correctly?”
- F1-Score (2*((Precision*Recall)/(Precision+Recall))): is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores tend to be lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.
- Support: is the number of actual occurrences of the class in the specified dataset. Imbalanced support in the training data may indicate structural weaknesses in the reported scores of the classifier and could indicate the need for stratified sampling or rebalancing. Support doesn’t change between models but instead diagnoses the evaluation process.
report = classification_report(y_c_test, y_c_pred)
print(report)
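To make the formulas above concrete, the sketch below derives precision, recall, F1 and support by hand from a confusion matrix and checks them against scikit-learn. The three-class labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Invented three-class example.
labels = ["High", "Low", "Mid"]
y_true = ["High", "High", "High", "Low", "Low", "Mid", "Mid", "Mid"]
y_pred = ["High", "High", "Mid", "Low", "Mid", "Mid", "Mid", "High"]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)

tp = np.diag(cm)          # correct predictions per class
fp = cm.sum(axis=0) - tp  # predicted as the class, but actually another one
fn = cm.sum(axis=1) - tp  # actually the class, but predicted as another one

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)  # actual occurrences of each class

# Reference values from scikit-learn for comparison.
ref_p, ref_r, ref_f1, ref_support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)

for name, p, r, f, s in zip(labels, precision, recall, f1, support):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

The hand-computed values agree with the library’s, which is a useful sanity check when reading a classification report for the first time.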
Feature importance
Another key metric consists of assigning scores to the input features of a predictive model, indicating the relative importance of each feature when making a prediction. Feature importance provides insights into the data and the model, and it is the basis for dimensionality reduction and feature selection, which can improve the performance of a predictive model. The more an attribute is used to make key decisions with the DT, the higher its relative importance.
for importance, name in sorted(zip(clf.feature_importances_, X_c_train.columns), reverse=True):
    print(name, importance)
The variable “purchasing.power.cap” has extreme importance in relation to all other variables (being the main feature of the model), which makes total sense if we think about the target variable.
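A pandas Series makes the same ranking easier to read and reuse. The sketch below uses the built-in iris dataset as a stand-in for the World Bank data; min_samples_leaf=10 echoes the setting used earlier, and everything else is an illustrative assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data loaded as a DataFrame so the columns carry feature names.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

toy_tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(
    toy_tree.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
print("Sum of importances:", importances.sum())
```

The importances of a fitted tree are normalized to sum to 1, so each value can be read as a share of the total impurity reduction. Features never used in a split get a score of 0.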
Wrap Up
Although we covered several steps during our modeling, each of those concepts is a discipline in its own right: exploratory data analysis, feature engineering, and hyperparameter tuning are all extensive and complex aspects of any machine learning project. You should consider going deeper into those subjects.
Also, Decision Trees are the basis of more powerful algorithms called ensemble methods, which combine several DTs to produce better predictive performance than a single DT. The main principle is that a group of weak learners comes together to form a strong learner. Depending on the method, ensembles reduce the model’s variance, its bias, or both. Now that you’ve seen how a Decision Tree works, I suggest you move forward with ensemble methods like Bagging or Boosting.
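As a starting point, the sketch below compares a single tree with two tree-based ensembles from scikit-learn: a random forest (a bagging-style method) and gradient boosting. It runs on synthetic data, and the settings are arbitrary, so the scores say nothing about the World Bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem with 7 features (a stand-in for real data).
X, y = make_classification(
    n_samples=300, n_features=7, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

models = {
    "Single tree": DecisionTreeClassifier(min_samples_leaf=10, random_state=0),
    "Random forest (bagging-style)": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here instead of a single split so that the comparison is less sensitive to which rows happen to land in the test set.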