This article was published as a part of the Data Science Blogathon
Overview
In this article, we will analyze the Pokemon dataset and predict whether a Pokemon is legendary based on the features provided. We will discuss everything from scratch, going from the CSV all the way to model building, with a line-by-line explanation of the code. Let’s get started.
Image source: Pokejungle
Takeaways
- Understanding how to analyze the dataset before moving on to the model building phase.
- Getting insights from the data.
- Visualizing the dataset.
- Building the model.
- Saving the model.
About the dataset
This dataset has 721 rows, i.e. it contains the features of 721 unique Pokemon; for further details, visit this link.
Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
Reading the dataset
pokemon_data = pd.read_csv('Pokemon Data.csv')
Now, let’s see what our dataset has in it!
poke = pd.DataFrame(pokemon_data)
poke.head()
Output:
Checking for null values
poke.isnull().sum()
Output:
Number                0
Name                  0
Type_1                0
Type_2              371
Total                 0
HP                    0
Attack                0
Defense               0
Sp_Atk                0
Sp_Def                0
Speed                 0
Generation            0
isLegendary           0
Color                 0
hasGender             0
Pr_Male              77
Egg_Group_1           0
Egg_Group_2         530
hasMegaEvolution      0
Height_m              0
Weight_kg             0
Catch_Rate            0
Body_Style            0
dtype: int64
We have seen the null values in numbers; now let’s visualize them using a heatmap.
plt.figure(figsize=(10,7))
sns.heatmap(poke.isnull(), cbar=False)
Output:
Here it’s visible that Type_2, Pr_Male, and Egg_Group_2 have a relatively high number of null values.
We have visualized the null values using the heatmap, but that kind of visualization doesn’t give us the count of the null values, so we also use a displot.
plt.figure(figsize=(20,20))
sns.displot(
    data=poke.isna().melt(value_name="missing"),
    y="variable",
    hue="missing",
    multiple="fill",
    aspect=2
)
Output:
Let’s know the dimensions of our dataset.
poke.shape
Output:
(721, 23)
From the shape, it is clear that the dataset is small, which means we can drop the columns with many null values, since imputing them could make the dataset a little biased.
We have seen that Type_2, Egg_Group_2, and Pr_Male have null values.
poke['Pr_Male'].value_counts()
Output:
0.500    458
0.875    101
0.000     23
0.250     22
0.750     19
1.000     19
0.125      2
Name: Pr_Male, dtype: int64
Since the Type_2 and Egg_Group_2 columns have so many null values, we will remove those columns. One could impute them with other methods instead, but for simplicity, we won’t do that here. We only fill the Pr_Male column, since it has just 77 missing values.
poke['Pr_Male'].fillna(0.500, inplace=True)  # 0.500 is the most frequent value
poke['Pr_Male'].isnull().sum()
Output:
0 # as we can see that there are no null values now.
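If you preferred to keep Type_2 and Egg_Group_2 instead of dropping them (the next step), one common alternative is to treat "missing" as its own category. A minimal sketch, not used in the rest of this article:

poke_alt = poke.copy()
poke_alt['Type_2'] = poke_alt['Type_2'].fillna('None')        # e.g. mono-type Pokemon have no second type
poke_alt['Egg_Group_2'] = poke_alt['Egg_Group_2'].fillna('None')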
Dropping unnecessary columns
new_poke = poke.drop(['Type_2', 'Egg_Group_2'], axis=1)
Now let’s understand the type of each column and its values.
new_poke.describe()
Note: with min-max scaling, x -> (x - min) / (max - min); e.g. for a feature ranging from 20 to 20000, x = 300 becomes (300 - 20) / 19980 ≈ 0.014, a very small value.
Output:
plt.figure(figsize=(10,10))
sns.heatmap(new_poke.corr(), annot=True, cmap='viridis', linewidths=.5)
Output:
- The above is a correlation heatmap that tells you how strongly each feature is correlated with the others; a high correlation between two features means one of them adds little extra information for the model when predicting.
- Usually, you choose the correlation threshold yourself and remove one feature from each highly correlated pair.
- From the above table, it is also clear that different features have very different ranges of values, which can create complexity for a model; these are usually toned down with the StandardScaler() class (a tree-based model like the random forest we use later is largely insensitive to feature scale, though). A short sketch of both ideas follows below.
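As a minimal sketch of both ideas (the 0.8 threshold and the list of stat columns here are illustrative choices, not from the original article):

from sklearn.preprocessing import StandardScaler

# Flag feature pairs whose absolute correlation exceeds an illustrative threshold
corr = new_poke.corr().abs()
high_pairs = [(a, b) for a in corr.columns for b in corr.columns
              if a < b and corr.loc[a, b] > 0.8]
print(high_pairs)

# Scale the numeric stat columns to zero mean and unit variance
num_cols = ['Total', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed']
scaled = new_poke.copy()
scaled[num_cols] = StandardScaler().fit_transform(scaled[num_cols])

Since we train a random forest below, which is insensitive to monotonic feature scaling, this article can safely skip the scaling step.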
new_poke['Type_1'].value_counts()
Output:
Water       105
Normal       93
Grass        66
Bug          63
Psychic      47
Fire         47
Rock         41
Electric     36
Ground       30
Poison       28
Dark         28
Fighting     25
Dragon       24
Ice          23
Ghost        23
Steel        22
Fairy        17
Flying        3
Name: Type_1, dtype: int64
Value counts of all the generations
new_poke['Generation'].value_counts()
Output:
5    156
1    151
3    135
4    107
2    100
6     72
Name: Generation, dtype: int64
Visualizing the categorical values
Here, for visualizing the categorical data, I’m using seaborn’s catplot() function. One could use a line plot, scatter plot, or box plot separately, but catplot brings a unified interface over all of these plots, hence I preferred it to the separate version of each plot.
Here, for counting the Pokemon in each of the 6 generations, I’m using the count kind in catplot to get the number of Pokemon per generation.
sns.catplot(x="Generation",kind="count",palette="ch:.25", data=poke)
Output:
Inference: In the above graph, generation 5 has the largest number of Pokemon.
Here we are using the default kind of catplot, i.e. the scatter plot, to plot Generation vs Defense, where we can figure out the relationship between generation and the defence power of each Pokemon.
sns.catplot(x="Generation", y="Defense", data=poke)
Output:
Inference: Here, we can see that only two Pokemon in generation 2 have the highest defence capability. Still, we can’t conclude that generation 2 has the strongest defence, as those points are outliers. From the graph, it is evident that generations 4 and 6 have the highest defence capabilities overall.
Here we are using the box plot because it helps us understand the variation in a large dataset better; it also shows the outliers more clearly.
sns.catplot(x="Generation", y="Attack",kind="boxen", data=poke)
Output:
- Here in the above boxplot, we can see that there are a lot of outliers in generation 4 and generation 1 when it comes to attacking capabilities.
- Also, generation 4 has the highest median values of their attacking capabilities than all the other generations.
Now we are using the bar kind via catplot, which lets us compare the attacking capabilities of the different generations split by the hasGender flag (i.e. whether a Pokemon has a gender at all). For example, in generation 1, the Pokemon that have a gender show higher attacking power than the genderless ones, yet that generation also has lower attacking power than the other generations.
sns.catplot(x="Generation", y="Attack",kind='bar',hue='hasGender', data=poke)
Output:
From the above graph, we can conclude that:
- In generation 1, the Pokemon that have a gender show more attacking power than the genderless ones, which contradicts the other generations.
- Generation 6 has the highest attacking power, while generation 1 has the lowest.
new_poke['Color'].value_counts()
Output:
Blue      134
Brown     110
Green      79
Red        75
Grey       69
Purple     65
Yellow     64
White      52
Pink       41
Black      32
Name: Color, dtype: int64

new_poke['Egg_Group_1'].value_counts()
Output:
Field           169
Monster          74
Water_1          74
Undiscovered     73
Bug              66
Mineral          46
Flying           44
Amorphous        41
Human-Like       37
Fairy            30
Grass            27
Water_2          15
Water_3          14
Dragon           10
Ditto             1
Name: Egg_Group_1, dtype: int64
Let’s also consider the number of values in our target column
new_poke['isLegendary'].value_counts()
Output:
False    675
True      46
Name: isLegendary, dtype: int64
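Note that the target is heavily imbalanced: only 46 of the 721 Pokemon (roughly 6.4%) are legendary, which matters later when we split the data and interpret accuracy. Since isLegendary is boolean, its mean gives the positive fraction directly:

new_poke['isLegendary'].mean()  # ~0.064, i.e. about 6.4% legendary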
Feature Engineering
Creating new categories or merging existing ones, so the data is easier to work with afterwards.
This may seem odd to some, but you will see why I did it this way.
poke_type1 = new_poke.replace(['Water', 'Ice'], 'Water')
poke_type1 = poke_type1.replace(['Grass', 'Bug'], 'Grass')
poke_type1 = poke_type1.replace(['Ground', 'Rock'], 'Rock')
poke_type1 = poke_type1.replace(['Psychic', 'Dark', 'Ghost', 'Fairy'], 'Dark')
poke_type1 = poke_type1.replace(['Electric', 'Steel'], 'Electric')
poke_type1['Type_1'].value_counts()
Output:
Grass       129
Water       128
Dark        115
Normal       93
Rock         71
Electric     58
Fire         47
Poison       28
Fighting     25
Dragon       24
Flying        3
Name: Type_1, dtype: int64
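One caveat worth flagging: calling replace on the whole DataFrame rewrites matching labels in every column, so values like 'Bug' and 'Fairy' that also occur in Egg_Group_1 get merged there as well (harmless here only because Egg_Group_1 is dropped before modeling). A column-scoped sketch of the same merge, as an alternative:

poke_type1 = new_poke.copy()
poke_type1['Type_1'] = poke_type1['Type_1'].replace({
    'Ice': 'Water', 'Bug': 'Grass', 'Ground': 'Rock',
    'Psychic': 'Dark', 'Ghost': 'Dark', 'Fairy': 'Dark',
    'Steel': 'Electric'
})  # only Type_1 is touched; the other columns keep their original labels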
ref1 = dict(poke_type1['Body_Style'].value_counts())
poke_type1['Body_Style_new'] = poke_type1['Body_Style'].map(ref1)
You may be wondering what I did: I took the value counts of each body type and replaced the body type with its count (a simple frequency encoding); see below.
poke_type1['Body_Style_new'].head()
Output:
0    135
1    135
2    135
3    158
4    158
Name: Body_Style_new, dtype: int64
Let’s compare this with the original Body_Style column.
poke_type1['Body_Style'].head()
Output:
0         quadruped
1         quadruped
2         quadruped
3    bipedal_tailed
4    bipedal_tailed
Name: Body_Style, dtype: object
Encoding data – features like Type_1 and Color
types_poke = pd.get_dummies(poke_type1['Type_1'])
color_poke = pd.get_dummies(poke_type1['Color'])
X = pd.concat([poke_type1, types_poke], axis=1)
X = pd.concat([X, color_poke], axis=1)
X.head()
Output:
Now that we have built some features and encoded the categorical data, what’s left is to remove the redundant features.
X.columns
Output:
Index(['Number', 'Name', 'Type_1', 'Total', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation', 'isLegendary', 'Color', 'hasGender', 'Pr_Male', 'Egg_Group_1', 'hasMegaEvolution', 'Height_m', 'Weight_kg', 'Catch_Rate', 'Body_Style', 'Body_Style_new', 'Dark', 'Dragon', 'Electric', 'Fighting', 'Fire', 'Flying', 'Grass', 'Normal', 'Poison', 'Rock', 'Water', 'Black', 'Blue', 'Brown', 'Green', 'Grey', 'Pink', 'Purple', 'Red', 'White', 'Yellow'], dtype='object')
X_ = X.drop(['Number', 'Name', 'Type_1', 'Color', 'Egg_Group_1'], axis=1)
X_.shape
Output:
(721, 38)
For comparison, the original frame X (before dropping) still has all 43 columns:
X.shape
Lastly, we define our target variable and set it into a variable called y
y = X_['isLegendary']
X_final = X_.drop(['isLegendary', 'Body_Style'], axis=1)
X_final.columns
Output:
Index(['Total', 'HP', 'Attack', 'Defense', 'Sp_Atk', 'Sp_Def', 'Speed', 'Generation', 'hasGender', 'Pr_Male', 'hasMegaEvolution', 'Height_m', 'Weight_kg', 'Catch_Rate', 'Body_Style_new', 'Dark', 'Dragon', 'Electric', 'Fighting', 'Fire', 'Flying', 'Grass', 'Normal', 'Poison', 'Rock', 'Water', 'Black', 'Blue', 'Brown', 'Green', 'Grey', 'Pink', 'Purple', 'Red', 'White', 'Yellow'], dtype='object')
X_final.head()
Output:
Creating and training our model
Splitting the dataset into training and testing dataset
Xtrain, Xtest, ytrain, ytest = train_test_split(X_final, y, test_size=0.2)
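Because only about 6% of the rows are legendary, a plain random split can leave very few positives in the test set. A hedged variant with stratification (random_state is added only for reproducibility; the article’s split above does not use either option):

Xtrain, Xtest, ytrain, ytest = train_test_split(
    X_final, y, test_size=0.2, stratify=y, random_state=42
)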
Using random forest classifier for training our model
random_model = RandomForestClassifier(n_estimators=500, random_state = 42)
Fitting the model
model_final = random_model.fit(Xtrain, ytrain)
y_pred = model_final.predict(Xtest)
Checking the training accuracy
random_model_accuracy = round(model_final.score(Xtrain, ytrain)*100, 2)
print(round(random_model_accuracy, 2), '%')
Output:
100.0 %
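A 100% training score mainly tells us the forest has memorized the training set, so it is not a good estimate of real performance. As a quick sanity check, one could run 5-fold cross-validation on the full data (a sketch, not part of the original article):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(random_model, X_final, y, cv=5)
print(scores.mean())  # average held-out accuracy across the five folds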
Checking the test accuracy
random_model_accuracy1 = round(random_model.score(Xtest, ytest)*100, 2)
print(round(random_model_accuracy1, 2), '%')
Output:
99.31 %
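Since the classes are imbalanced, a model that always predicts "not legendary" would already score about 94% accuracy (675 of 721), so the per-class view from classification_report (already imported above) is more informative than raw accuracy:

print(classification_report(ytest, y_pred))  # precision, recall, and F1 for each class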
Saving the model to disk
import pickle
filename = 'pokemon_model.pickle'
pickle.dump(model_final, open(filename, 'wb'))
Load the model from the disk
filename = 'pokemon_model.pickle'
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(Xtest, ytest)
result*100
Output:
99.3103448275862
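As a quick usage check of the reloaded model, we can predict a single row (reusing one from the test set, since hand-building a new 36-column feature vector would be tedious):

print(loaded_model.predict(Xtest.iloc[[0]]))        # predicted class for one Pokemon
print(loaded_model.predict_proba(Xtest.iloc[[0]]))  # class probabilities [not legendary, legendary]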
Conclusion
Here I conclude the legendary Pokemon prediction with 99% accuracy. This might be an overfit model; having said that, the dataset was not so complex that it would lead to such a situation. All suggestions and improvements are always welcome.
Here’s the repo link to this article.
Here you can access my other articles, which are published on Analytics Vidhya as a part of the Blogathon (link)
If you have any queries, you can connect with me on LinkedIn; refer to this link.
About me
Greetings to everyone. I’m currently working at TCS, and previously I worked as a Data Science Associate Analyst at Zorba Consulting India. Along with full-time work, I have an immense interest in the same field, i.e. Data Science, along with other subsets of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning; feel free to collaborate with me on any project in the domains mentioned above (LinkedIn).
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.