Prerequisites: Understanding Logistic Regression, Logistic Regression using Python
In this article, we discuss how to predict the placement status of a student from various student attributes using the logistic regression algorithm.
Placements hold great importance for students and educational institutions. A placement helps a student build a strong foundation for the professional career ahead, and a good placement record gives a college or university a competitive edge in the education market.
This study focuses on a system that predicts if a student would be placed or not based on the student’s qualifications, historical data, and experience. This predictor uses a machine-learning algorithm to give the result.
The algorithm used is logistic regression, a supervised classification algorithm. In a classification problem, the target variable (or output), y, can take only discrete values for a given set of features (or inputs), X. The dataset contains, among other attributes, the secondary school percentage, higher secondary school percentage, degree percentage, degree type, and work experience of each student. After predicting the result, the model’s accuracy is also calculated on a held-out part of the same dataset. The dataset used here is in .csv format.
Below is the step-by-step approach:
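To make the idea concrete, here is a minimal sketch of how a logistic regression model turns features into a class label: it computes a weighted sum of the inputs, squashes it through the sigmoid function to get a probability, and applies a threshold. The weights, bias, feature values, and the helper name `predict_label` are hypothetical and chosen only for illustration; they are not fitted on the placement dataset.

```python
import math


def sigmoid(z):
    """Map a real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample under a linear model."""
    # Linear score: w1*x1 + w2*x2 + ... + bias
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    # Label 1 if the probability reaches the threshold, otherwise 0
    return p, int(p >= threshold)


# Hypothetical weights for two percentage-style features (not learned from data)
weights = [0.08, 0.05]
bias = -8.0

prob, label = predict_label([70.0, 65.0], weights, bias)
print(prob, label)
```

In practice the weights and bias are learned from the training data; scikit-learn does this when `fit` is called on a `LogisticRegression` object.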
Step 1: Import the required modules.
Python
# import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Read the dataset that we are going to analyze and inspect it.
Python
# reading the file
dataset = pd.read_csv('Placement_Data_Full_Class.csv')
dataset
Output:
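Step 3: Now we will drop the columns that are not needed. The serial number carries no predictive information, and the salary is an outcome of placement rather than a predictor of it.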
Python
# dropping the serial no and salary col
dataset = dataset.drop('sl_no', axis=1)
dataset = dataset.drop('salary', axis=1)
Step 4: Before moving forward, we need to pre-process and transform our data. For that, we will use the astype() method on the text columns and change their datatype to category.
Python
# categorising col for further labelling
dataset["gender"] = dataset["gender"].astype('category')
dataset["ssc_b"] = dataset["ssc_b"].astype('category')
dataset["hsc_b"] = dataset["hsc_b"].astype('category')
dataset["degree_t"] = dataset["degree_t"].astype('category')
dataset["workex"] = dataset["workex"].astype('category')
dataset["specialisation"] = dataset["specialisation"].astype('category')
dataset["status"] = dataset["status"].astype('category')
dataset["hsc_s"] = dataset["hsc_s"].astype('category')
dataset.dtypes
Output:
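As an aside, the repeated astype() calls can be written more compactly. The sketch below is one possible way, shown on a small toy DataFrame with made-up values (only the column names mirror the article): passing a dictionary to `DataFrame.astype` converts several columns in one call.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names follow the article
df = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "workex": ["No", "Yes", "No"],
    "ssc_p": [67.0, 79.3, 65.0],
})

# Columns to convert; in the article this list would hold all eight text columns
categorical_cols = ["gender", "workex"]

# astype accepts a {column: dtype} mapping and leaves other columns untouched
df = df.astype({col: "category" for col in categorical_cols})

print(df.dtypes)
```

The behaviour is the same as converting the columns one by one; the list simply keeps the column names in a single place.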
Step 5: Now we will use the cat.codes accessor on these columns to replace their text values with integer codes.
Python
# labelling the columns
dataset["gender"] = dataset["gender"].cat.codes
dataset["ssc_b"] = dataset["ssc_b"].cat.codes
dataset["hsc_b"] = dataset["hsc_b"].cat.codes
dataset["degree_t"] = dataset["degree_t"].cat.codes
dataset["workex"] = dataset["workex"].cat.codes
dataset["specialisation"] = dataset["specialisation"].cat.codes
dataset["status"] = dataset["status"].cat.codes
dataset["hsc_s"] = dataset["hsc_s"].cat.codes

# display dataset
dataset
Output:
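It helps to know which integer stands for which original value. The short sketch below uses a toy Series whose values are assumed to resemble the status column; it shows that pandas sorts the categories and numbers them from 0, and how the mapping can be recovered from `cat.categories` before the text is overwritten.

```python
import pandas as pd

# Toy column; the values are assumed for illustration
status = pd.Series(["Placed", "Not Placed", "Placed"]).astype("category")

# Integer code for each row
codes = status.cat.codes

# Categories are sorted, so position i in cat.categories corresponds to code i
mapping = dict(enumerate(status.cat.categories))

print(list(codes))
print(mapping)
```

Saving such a mapping for each column makes it possible to translate the model's 0/1 predictions back into readable labels later.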
Step 6: Now we split the dataset into features and labels using the iloc indexer. All columns except the last form the features X, and the last column (status) is the label Y:
Python
# selecting the features and labels
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, -1].values

# display dependent variables
Y
Output:
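The positional slicing can be illustrated on a tiny example. The sketch below uses a toy DataFrame with hypothetical values; it also shows a name-based alternative, which may be more robust if the column order ever changes.

```python
import pandas as pd

# Toy frame with hypothetical, already-encoded values
df = pd.DataFrame({
    "ssc_p": [67.0, 79.3],
    "workex": [0, 1],
    "status": [1, 0],
})

# Positional selection: every column but the last, then only the last
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

# Name-based alternative that does not rely on "status" being the last column
X_by_name = df.drop(columns="status").values
y_by_name = df["status"].values

print(X.shape, y)
```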
Step 7: Now we will split the dataset into training and test data. The test data is held back and used later to check the accuracy of the model.
Python
# dividing the data into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# display dataset
dataset.head()
Output:
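Because the split is random, the accuracy reported later can change from run to run. One common way to make the split repeatable is sketched below on synthetic data: `random_state` fixes the shuffle, and `stratify` keeps the class proportions equal in both parts. The seed value 42 and the toy arrays are arbitrary choices for illustration, not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 samples, 2 features, balanced binary labels
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# random_state makes the shuffle reproducible; stratify preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```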
Step 8: Now we need to train our model. For this, we import the LogisticRegression class from the sklearn module, create a classifier, and fit it on the training data. Then we check the accuracy of the model on the test data.
Python
# creating a classifier using sklearn
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, solver='lbfgs',
                         max_iter=1000).fit(X_train, Y_train)

# printing the acc
clf.score(X_test, Y_test)
Output:
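The features here mix percentages with small integer codes, so they live on quite different scales. A possible variation, sketched below on synthetic stand-in data rather than the placement dataset, is to standardize the features before fitting by chaining `StandardScaler` and `LogisticRegression` in a pipeline. This is an optional alternative and not the approach used in the step above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 12 features, binary target
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, then applied before the classifier
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Mean accuracy on the held-out test data
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Scaling often lets the solver converge in fewer iterations, which may reduce the need for a large `max_iter` value.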
Step 9: Once we have trained the model, we can try it on a hand-made sample. The 12 values must follow the same column order as the features in X:
Python
# predicting for random value
clf.predict([[0, 87, 0, 95, 0, 2, 78, 2, 0, 0, 1, 0]])
Output:
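Besides the hard 0/1 label, a fitted logistic regression model can also report how confident it is. The sketch below, which uses synthetic data in place of the placement dataset, calls `predict_proba` to get the estimated probability of each class for a single sample.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 12 features, matching the article's feature count
X, y = make_classification(n_samples=200, n_features=12, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Keep the 2-D shape (1, 12): scikit-learn expects one row per sample
sample = X[:1]

label = clf.predict(sample)        # predicted class for the sample
proba = clf.predict_proba(sample)  # one column per class, ordered as in clf.classes_

print(label, proba)
```

A probability close to 0.5 signals a borderline case, which the plain label alone would hide.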
Step 10: To gain a more nuanced understanding of our model’s performance, we build a confusion matrix. For a binary classifier, a confusion matrix is a table with two rows and two columns that reports the number of true negatives, false positives, false negatives, and true positives.
The confusion_matrix function takes two arguments: the actual labels of the test set, Y_test, and the predicted labels. The classifier’s predictions are stored in Y_pred as follows:
Python
# creating a Y_pred for test data
Y_pred = clf.predict(X_test)

# display predicted values
Y_pred
Output:
Step 11: Finally, now that we have Y_pred, we can generate the confusion matrix and the accuracy score:
Python
# evaluation of the classifier
from sklearn.metrics import confusion_matrix, accuracy_score

# display confusion matrix
print(confusion_matrix(Y_test, Y_pred))

# display accuracy
print(accuracy_score(Y_test, Y_pred))
Output:
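To read the matrix, it helps to know its layout: in scikit-learn, rows correspond to the actual classes and columns to the predicted classes, so a binary matrix is arranged as [[TN, FP], [FN, TP]]. The sketch below uses small hand-written label lists (hypothetical, not taken from the placement data) to unpack the four counts and derive accuracy, precision, and recall from them.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = placed, 0 = not placed
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

cm = confusion_matrix(y_true, y_pred)

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found

print(cm)
print(accuracy, precision, recall)
```

Here the accuracy computed from the four counts matches what `accuracy_score` returns, while precision and recall show how the errors are split between false positives and false negatives.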