Spaceship Titanic Project using Machine Learning in Python
If you are a machine learning enthusiast, you have probably worked on the classic Titanic project, in which the goal is to predict whether a passenger survives.
In this article, we will solve a similar problem: the Spaceship Titanic, a slightly modified version of the Titanic challenge. A spaceship carrying people from different planets is on a voyage, and for some reason a number of passengers have been transported to another dimension. Our task is to predict who gets transported and who remains on the spaceship.
Importing Libraries and Dataset
Python libraries make it easy for us to handle the data and perform typical and complex tasks with a single line of code.
- Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
- Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
- Matplotlib/Seaborn – This library is used to draw visualizations.
- Sklearn – This module contains multiple libraries that have pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
- XGBoost – This contains the eXtreme Gradient Boosting machine learning algorithm which is one of the algorithms that helps us to achieve high accuracy on predictions.
Python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn import metrics
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')
Now let’s load the dataset into a pandas DataFrame and print its first five rows.
Python3
df = pd.read_csv('spaceship_titanic.csv')
df.head()
Output:
The columns in the dataset have the following meanings:
- HomePlanet – The home planet of the passenger.
- CryoSleep – Indicates whether the passenger chose to be put into suspended animation for the whole voyage; such passengers remain confined to their cabins.
- VIP – Indicates whether the passenger has opted for the VIP service.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck – Onboard amenities on which passengers can choose to spend.
- Transported – The target column. It indicates whether the passenger was transported to another dimension.
Python3
df.shape
Output:
(8693, 14)
Let’s check which column of the dataset contains which type of data.
Python3
df.info()
Output:
From the above information about each column we can observe that there are null values in almost every column.
Python3
df.describe()
Output:
Data Cleaning
Data obtained from primary sources is termed raw data and requires a lot of preprocessing before we can derive any conclusions from it or do any modeling. Those preprocessing steps are known as data cleaning and include outlier removal, null value imputation, and removing discrepancies of any sort in the data.
Python3
df.isnull().sum().plot.bar()
plt.show()
Output:
A naive way to perform imputation is to simply replace nulls with the mean for continuous data and the mode for categorical data. Here, however, we will first explore the relationships between the independent features and then use them to impute the null values more intelligently.
Python3
col = df.loc[:, 'RoomService':'VRDeck'].columns
df.groupby('VIP')[col].mean()
Output:
As expected, the expenditure of VIP passengers is somewhat higher than that of non-VIP passengers.
Python3
df.groupby('CryoSleep')[col].mean()
Output:
Passengers in CryoSleep are confined to their cabins and suspended in animation during the whole voyage, so they cannot spend on the services available onboard. Hence we can simply put 0 wherever CryoSleep is True.
Python3
temp = df['CryoSleep'] == True
df.loc[temp, col] = 0.0
Using the relationship between VIP status and expenditure on the different amenities, let’s impute the null values in those columns.
Python3
for c in col:
    for val in [True, False]:
        temp = df['VIP'] == val
        # Mean spend on this amenity for passengers with the same VIP status
        k = df.loc[temp, c].mean()
        df.loc[temp, c] = df.loc[temp, c].fillna(k)
Now let’s explore the relationship between the VIP and HomePlanet features.
Python3
sb.countplot(data=df, x='VIP', hue='HomePlanet')
plt.show()
Output:
Here we can observe a clear relationship: non-VIP passengers are most likely to come from Earth, while VIP passengers are most likely to come from Europa.
Python3
col = 'HomePlanet'
temp = df['VIP'] == False
df.loc[temp, col] = df.loc[temp, col].fillna('Earth')

temp = df['VIP'] == True
df.loc[temp, col] = df.loc[temp, col].fillna('Europa')
We will simply impute the null values in Age with the mean, but before that we will check for outliers.
Python3
sb.boxplot(df['Age'], orient='h')
plt.show()
Output:
We will calculate the mean by excluding outliers and then impute the nulls by that value.
Python3
# The cutoff of 61 is read off the boxplot above
temp = df[df['Age'] < 61]['Age'].mean()
df['Age'] = df['Age'].fillna(temp)
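If you prefer not to hardcode the cutoff read off the boxplot, a small sketch (an alternative on our part, not part of the original workflow) that derives the upper whisker from the usual 1.5×IQR rule would be:
Python3
q1, q3 = df['Age'].quantile([0.25, 0.75])
iqr = q3 - q1
upper_cutoff = q3 + 1.5 * iqr  # upper whisker used by the boxplot (assumed cutoff)

temp = df[df['Age'] < upper_cutoff]['Age'].mean()
df['Age'] = df['Age'].fillna(temp)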
Now let’s explore the relation between CryoSleep and Transported.
Python3
sb.countplot(data=df, x='Transported', hue='CryoSleep')
plt.show()
Output:
Here we can observe that passengers in CryoSleep have a higher chance of getting transported. However, we cannot use the relationship between the target column and an independent feature for imputation, otherwise we would introduce data leakage.
Python3
df.isnull().sum().plot.bar()
plt.show()
Output:
There are still some null values in the data. We filled as many as possible using the relationships between the features; now let’s fill the remaining ones using the naive method discussed earlier.
Python3
for col in df.columns:
    # Skip columns that have no null values
    if df[col].isnull().sum() == 0:
        continue

    # If the column is categorical, fill with the mode
    if df[col].dtype == object or df[col].dtype == bool:
        df[col] = df[col].fillna(df[col].mode()[0])

    # Otherwise fill with the mean
    else:
        df[col] = df[col].fillna(df[col].mean())

df.isnull().sum().sum()
Output:
0
Finally, we have gotten rid of all the null values in the dataset.
Feature Engineering
There are times when multiple pieces of information are packed into a single feature, or when we have to derive new features from existing ones. We will also try to add some extra features to our dataset so that we can derive interesting insights from the data. If the derived features are meaningful, they can significantly increase the model’s accuracy.
Python3
df.head()
Output:
Here we can see that PassengerId and Cabin contain multiple pieces of information clubbed together. For example, PassengerId follows the pattern RoomNo_PassengerNo.
Python3
new = df[ "PassengerId" ]. str .split( "_" , n = 1 , expand = True ) df[ "RoomNo" ] = new[ 0 ].astype( int ) df[ "PassengerNo" ] = new[ 1 ].astype( int ) df.drop([ 'PassengerId' , 'Name' ], axis = 1 , inplace = True ) |
Now, for each passenger, we will replace PassengerNo with the total number of passengers sharing the same room.
Python3
data = df['RoomNo']
for i in range(df.shape[0]):
    temp = data == data[i]
    # Number of passengers sharing this passenger's room
    df.loc[i, 'PassengerNo'] = temp.sum()
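A note on the design choice: the loop above compares every row against every other row, which is quadratic in the number of passengers. A vectorized sketch that produces the same counts with groupby (an alternative we suggest, equivalent as long as RoomNo has no nulls, which holds after the imputation above) is:
Python3
# Count how many passengers share each RoomNo and broadcast the count to every row
df['PassengerNo'] = df.groupby('RoomNo')['RoomNo'].transform('count')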
RoomNo itself has no relevance to whether a passenger gets transported, so we’ll remove it.
Python3
df.drop(['RoomNo'], axis=1, inplace=True)

sb.countplot(data=df, x='PassengerNo', hue='VIP')
plt.show()
Output:
Here it is clear that VIP people sharing a room is not that common.
Python3
new = df[ "Cabin" ]. str .split( "/" , n = 2 , expand = True ) data[ "F1" ] = new[ 0 ] df[ "F2" ] = new[ 1 ].astype( int ) df[ "F3" ] = new[ 2 ] df.drop([ 'Cabin' ], axis = 1 , inplace = True ) |
Now let’s combine all the expenses into a single column and name it LeasureBill.
Python3
df['LeasureBill'] = df['RoomService'] + df['FoodCourt'] \
    + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
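Equivalently, the same total can be computed by summing across the five expense columns; this is just a compact alternative sketch, not a change to the pipeline:
Python3
exp_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
df['LeasureBill'] = df[exp_cols].sum(axis=1)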
Exploratory Data Analysis
EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. Although we explored the relationships between the independent features to a great extent in the data cleaning part, a few things are still left.
Python3
x = df['Transported'].value_counts()
plt.pie(x.values, labels=x.index, autopct='%1.1f%%')
plt.show()
Output:
The data is balanced between the two classes, which is good news for model training.
Python3
df.groupby('VIP').mean()['LeasureBill'].plot.bar()
plt.show()
Output:
A high LeasureBill is normal for VIP passengers.
Python3
for col in df.columns:
    # Encode categorical columns
    if df[col].dtype == object:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])

    # Convert boolean columns to binary
    if df[col].dtype == 'bool':
        df[col] = df[col].astype(int)

df.head()
Output:
Now let’s check the data for the presence of any highly correlated features.
Python3
plt.figure(figsize=(10, 10))
sb.heatmap(df.corr() > 0.8, annot=True, cbar=False)
plt.show()
Output:
From the above heatmap, we can see that there are no highly correlated features, so we are good to go with model development.
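If you also want to list any highly correlated pairs programmatically rather than reading them off the heatmap, a short sketch using the same 0.8 threshold (the variable names are illustrative) could be:
Python3
corr = df.corr().abs()
high_pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if corr.iloc[i, j] > 0.8
]
print(high_pairs)  # expected to be an empty list for this dataset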
Model Training
Now we will separate the features and the target variable and split the data into training and validation sets. Using the validation data we will select the model that performs best.
Python3
features = df.drop(['Transported'], axis=1)
target = df.Transported

X_train, X_val, Y_train, Y_val = train_test_split(
    features, target, test_size=0.1, random_state=22)

X_train.shape, X_val.shape
Output:
((7823, 15), (870, 15))
Now, let’s normalize the data to obtain stable and fast training.
Python3
# Fit the scaler on the training data only and reuse it for the validation data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
Now let’s train some standard machine learning models and compare them to see which fits our data best.
Python3
from sklearn.metrics import roc_auc_score as ras

models = [LogisticRegression(), XGBClassifier(),
          SVC(kernel='rbf', probability=True)]

for i in range(len(models)):
    models[i].fit(X_train, Y_train)
    print(f'{models[i]} : ')

    # Note: the printed "Accuracy" values are ROC AUC scores
    train_preds = models[i].predict_proba(X_train)[:, 1]
    print('Training Accuracy : ', ras(Y_train, train_preds))

    val_preds = models[i].predict_proba(X_val)[:, 1]
    print('Validation Accuracy : ', ras(Y_val, val_preds))
    print()
Output:
LogisticRegression() :
Training Accuracy : 0.8690381072928551
Validation Accuracy : 0.8572836732098188
XGBClassifier() :
Training Accuracy : 0.9076025527327106
Validation Accuracy : 0.8802491838724721
SVC(probability=True) :
Training Accuracy : 0.8886869084652786
Validation Accuracy : 0.8619207614363845
Model Evaluation
From the above scores (ROC AUC values, printed under the "Accuracy" labels), we can say that the XGBClassifier performs best among the three models we trained. Let’s also plot the confusion matrix for the validation data using the XGBClassifier model.
Python3
y_pred = models[1].predict(X_val)

cm = metrics.confusion_matrix(Y_val, y_pred)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
Output:
From the confusion matrix we can conclude one thing: the model has some difficulty classifying negative examples as negative.
Python3
print(metrics.classification_report(Y_val, models[1].predict(X_val)))
Output:
precision recall f1-score support
0 0.82 0.79 0.81 458
1 0.78 0.80 0.79 412
accuracy 0.80 870
macro avg 0.80 0.80 0.80 870
weighted avg 0.80 0.80 0.80 870
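If you want to reuse the trained model outside this notebook, a minimal sketch using joblib (the file names here are only illustrative) that saves and reloads the fitted scaler and the XGBClassifier is shown below. New data must go through the same preprocessing steps as above before being scaled and scored.
Python3
import joblib

# Persist the fitted scaler and the best model (models[1] is the XGBClassifier)
joblib.dump(scaler, 'scaler.joblib')
joblib.dump(models[1], 'xgb_spaceship_titanic.joblib')

# Later: reload them and score data; X_val was already scaled above,
# unseen raw rows would first need loaded_scaler.transform(...)
loaded_scaler = joblib.load('scaler.joblib')
loaded_model = joblib.load('xgb_spaceship_titanic.joblib')
preds = loaded_model.predict(X_val)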