Prerequisites: Decision Tree, DecisionTreeClassifier, sklearn, numpy, pandas
Decision Tree is one of the most powerful and popular algorithms. The decision-tree algorithm falls under the category of supervised learning algorithms and works for both continuous and categorical output variables.
In this article, we are going to implement a decision tree algorithm on the Balance Scale Weight & Distance Database from the UCI Machine Learning Repository.
Dataset Description:
- Title: Balance Scale Weight & Distance Database
- Number of Instances: 625 (49 balanced, 288 left, 288 right)
- Number of Attributes: 4 (numeric) + class name = 5
- Attribute Information:
  - Class Name (target variable), 3 values:
    - L [balance scale tips to the left]
    - B [balance scale is balanced]
    - R [balance scale tips to the right]
  - Left-Weight: 1, 2, 3, 4, 5
  - Left-Distance: 1, 2, 3, 4, 5
  - Right-Weight: 1, 2, 3, 4, 5
  - Right-Distance: 1, 2, 3, 4, 5
Missing Attribute Values: None
Class Distribution:
- 46.08 percent are L
- 7.84 percent are B
- 46.08 percent are R
You can find more details of the dataset in the UCI Machine Learning Repository.
Used Python Packages:
- sklearn:
  - In Python, sklearn (scikit-learn) is a machine learning package that includes many ML algorithms.
  - Here, we are using some of its modules: train_test_split, DecisionTreeClassifier, and accuracy_score.
- NumPy:
  - A numeric Python module that provides fast math functions for calculations.
  - It is used to read data into NumPy arrays and for data manipulation.
- Pandas:
  - Used to read and write different files.
  - Data manipulation can be done easily with DataFrames.
Installation of the packages :
In Python, sklearn is the package that contains all the modules required to implement machine learning algorithms. You can install the sklearn package with the commands given below.
Using pip:
pip install -U scikit-learn
Before using the above command make sure you have scipy and numpy packages installed.
If you don't have pip, you can install it by downloading the get-pip.py script and running:
python get-pip.py
Using conda:
conda install scikit-learn
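After either command, a quick sanity check is to import the packages and print the installed scikit-learn version:

# Verify that scikit-learn and its dependencies are importable
import numpy
import scipy
import sklearn

print("scikit-learn version:", sklearn.__version__)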
Assumptions we make while using a decision tree:
- At the beginning, we consider the whole training set as the root.
- For information gain, attributes are assumed to be categorical; for the Gini index, attributes are assumed to be continuous.
- Records are distributed recursively on the basis of attribute values.
- Statistical measures (such as information gain or the Gini index) are used to decide which attribute to place at the root or at an internal node.
Pseudocode :
- Find the best attribute and place it at the root node of the tree.
- Now, split the training set into subsets such that each subset contains records with the same value for the chosen attribute.
- Repeat steps 1 and 2 on each subset until leaf nodes are found in all branches of the tree (a minimal sketch of this recursion follows this list).
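To make these steps concrete, here is a minimal sketch of the recursive splitting idea for categorical attributes, using information gain to pick the attribute at each node. The helper names (entropy, info_gain, build_tree) are illustrative and not part of sklearn:

import numpy as np

def entropy(y):
    # H = -sum(p_i * log2(p_i)) over the class proportions at this node
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain(X, y, attr):
    # Parent entropy minus the weighted entropy of the subsets
    # produced by splitting on attribute `attr`
    gain = entropy(y)
    for value in np.unique(X[:, attr]):
        mask = X[:, attr] == value
        gain -= mask.mean() * entropy(y[mask])
    return gain

def build_tree(X, y, attrs):
    # Leaf: the node is pure, or no attributes are left to split on
    if len(np.unique(y)) == 1 or not attrs:
        values, counts = np.unique(y, return_counts=True)
        return values[counts.argmax()]   # majority-class leaf
    # Step 1: best attribute for this node
    best = max(attrs, key=lambda a: info_gain(X, y, a))
    # Step 2: one subset per value of the chosen attribute
    branches = {}
    for value in np.unique(X[:, best]):
        mask = X[:, best] == value
        # Step 3: recurse on each subset
        branches[value] = build_tree(X[mask], y[mask],
                                     [a for a in attrs if a != best])
    return (best, branches)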
While implementing the decision tree, we will go through the following two phases:
- Building Phase
- Preprocess the dataset.
- Split the dataset into train and test using the Python sklearn package.
- Train the classifier.
- Operational Phase
- Make predictions.
- Calculate the accuracy.
Data Import:
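The raw file is read with pandas. A minimal sketch, assuming the balance-scale.data file has been downloaded from the UCI repository to the same relative path used in the full program below:

import pandas as pd

# Column 0 is the class (L, B or R); columns 1-4 are the numeric attributes
balance_data = pd.read_csv('databases/balance-scale/balance-scale.data',
                           sep=',', header=None)

print("Dataset Length: ", len(balance_data))
print("Dataset Shape: ", balance_data.shape)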
Data Slicing:
from sklearn.model_selection import train_test_split

# Separate the target variable (column 0) from the four features (columns 1-4)
X = balance_data.values[:, 1:5]
Y = balance_data.values[:, 0]

# Hold out 30% of the 625 records for testing (188 test rows);
# random_state = 100 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=100)
Terms used in code:
The Gini index and information gain are the two criteria used to select which of the n attributes of the dataset should be placed at the root node or at an internal node.
Gini index:
- Gini = 1 - sum(p_i^2), where p_i is the proportion of records of class i at the node.
- It measures how often a randomly chosen record would be misclassified if labeled according to the node's class distribution; lower is better.
Entropy:
- Entropy = -sum(p_i * log2(p_i)) over the classes at the node.
- It measures the impurity (randomness) of a node: zero for a pure node, maximal when all classes are equally likely.
Information Gain:
- Gain(S, A) = Entropy(S) - sum over each value v of attribute A of (|S_v| / |S|) * Entropy(S_v).
- The attribute with the highest information gain is chosen for the split (a small NumPy sketch of these measures follows this list).
Accuracy score:
- The fraction of test records whose predicted class matches the true class.
Confusion Matrix:
- A table whose rows are the true classes and whose columns are the predicted classes; it shows exactly which classes the classifier confuses with each other.
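As a quick illustration of how the two impurity measures behave, here is a small sketch that computes both for a toy label array with NumPy (the labels are made up for the example):

import numpy as np

labels = np.array(['L', 'L', 'R', 'R', 'B'])
_, counts = np.unique(labels, return_counts=True)
p = counts / counts.sum()        # class proportions: [0.2, 0.4, 0.4]

gini = 1.0 - (p ** 2).sum()      # Gini = 1 - sum(p_i^2)
ent = -(p * np.log2(p)).sum()    # Entropy = -sum(p_i * log2(p_i))

print(gini)   # 0.64  -> impure node
print(ent)    # ~1.52 -> high uncertainty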
Below is the Python code for the decision tree.
# Run this program on your local Python
# interpreter, provided you have installed
# the required libraries.

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing the dataset
def importdata():
    balance_data = pd.read_csv(
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data

# Function to split the dataset
def splitdataset(balance_data):

    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with the Gini index
def train_using_gini(X_train, X_test, y_train):

    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
            random_state=100, max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy
def train_using_entropy(X_train, X_test, y_train):

    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
            criterion="entropy", random_state=100,
            max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):

    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):

    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))

    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)

    print("Report : ",
          classification_report(y_test, y_pred))

# Driver code
def main():

    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")

    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")

    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling main function
if __name__ == "__main__":
    main()
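Once the classifiers are trained, the learned rules can be inspected directly. A short sketch using sklearn.tree.export_text (available in scikit-learn 0.21 and later), assuming clf_gini is the classifier returned by train_using_gini above; the feature names follow the column order of the UCI file:

from sklearn.tree import export_text

# Print the decision rules learned by the Gini-based tree
rules = export_text(clf_gini,
                    feature_names=['left-weight', 'left-distance',
                                   'right-weight', 'right-distance'])
print(rules)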
Data Information:
Dataset Length:  625
Dataset Shape:  (625, 5)
Dataset:     0  1  2  3  4
0  B  1  1  1  1
1  R  1  1  1  2
2  R  1  1  1  3
3  R  1  1  1  4
4  R  1  1  1  5
Results Using Gini Index:
Predicted values:
['R' 'L' 'R' 'R' 'R' 'L' 'R' 'L' 'L' 'L'
 'R' 'L' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'R'
 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'L' 'R' 'L'
 'L' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'L' 'L'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L'
 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'L'
 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'R' 'L'
 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L'
 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L'
 'R' 'L' 'R' 'R' 'R' 'R' 'R' 'R' 'R' 'L'
 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'R' 'R' 'R'
 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'L' 'L' 'R'
 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R'
 'L' 'R' 'L' 'R' 'L' 'L' 'R' 'L' 'L' 'R'
 'L' 'R' 'L' 'R' 'R' 'R' 'L' 'R' 'R' 'R'
 'R' 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'L' 'R'
 'R' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'R' 'R'
 'L' 'R' 'R' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix:
[[ 0  6  7]
 [ 0 67 18]
 [ 0 19 71]]
Accuracy :  73.4042553191
Report :
             precision    recall  f1-score   support

          B       0.00      0.00      0.00        13
          L       0.73      0.79      0.76        85
          R       0.74      0.79      0.76        90

avg / total       0.68      0.73      0.71       188

Results Using Entropy:
Predicted values:
['R' 'L' 'R' 'L' 'R' 'L' 'R' 'L' 'R' 'R'
 'R' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'L' 'R'
 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'L' 'R' 'L'
 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R' 'L'
 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R' 'L'
 'L' 'R' 'L' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
 'R' 'R' 'L' 'L' 'R' 'L' 'L' 'R' 'L' 'L'
 'L' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'R' 'L'
 'R' 'L' 'L' 'L' 'L' 'R' 'R' 'L' 'R' 'L'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'L' 'L' 'L'
 'R' 'L' 'L' 'R' 'R' 'R' 'R' 'R' 'R' 'L'
 'R' 'L' 'R' 'R' 'L' 'R' 'R' 'L' 'R' 'R'
 'L' 'R' 'R' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
 'R' 'R' 'R' 'L' 'R' 'R' 'R' 'L' 'L' 'R'
 'L' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'L' 'R'
 'L' 'R' 'R' 'R' 'R' 'R' 'L' 'R' 'R' 'R'
 'R' 'R' 'R' 'L' 'R' 'L' 'R' 'R' 'L' 'R'
 'L' 'R' 'L' 'R' 'L' 'L' 'L' 'L' 'L' 'R'
 'R' 'R' 'L' 'L' 'L' 'R' 'R' 'R']
Confusion Matrix:
[[ 0  6  7]
 [ 0 63 22]
 [ 0 20 70]]
Accuracy :  70.7446808511
Report :
             precision    recall  f1-score   support

          B       0.00      0.00      0.00        13
          L       0.71      0.74      0.72        85
          R       0.71      0.78      0.74        90

avg / total       0.66      0.71      0.68       188
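Note that in both confusion matrices the first column is all zeros: neither tree ever predicts the 'B' class, which is why its precision, recall, and f1-score are all 0.00. This is consistent with the class distribution (only 7.84 percent of the instances are balanced) and with the shallow max_depth = 3 trees used here.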