We’re all familiar with terms like first world, third world, and developing world when it comes to describing countries in relation to the rest of the world. “First-world” refers to countries that are richer, healthier, and more educated, while impoverished nations fall under the label of third-world. In addition, we occasionally hear “second-world” to describe countries that find themselves in the middle; for example, countries like Turkey, Colombia, and Thailand could arguably be defined as second-world.
But where do these labels come from? How do we define a country as developed or developing? First, second, or third-world? And why should there be only two or three categories of countries? How would countries be categorized if we used first, second, third, fourth, or fifth-world as our labels?
Using data from the World Bank’s Development Indicators database and the KMeans clustering algorithm, I’m going to set out to answer these questions. I plan on deriving categories for the world’s countries based on attributes that measure wealth, education, health, and more.
Table of Contents
- Data wrangling, munging, and cleaning.
- Clustering and analysis
- Results
- Developed vs. developing
- Three-worlds clustering
- Five-worlds clustering
Data Wrangling, Munging, and Cleaning.
Acquiring the data was the easy part. All I had to do was download the dataset of World Bank Development Indicators from its Kaggle page. Extracting and cleaning the relevant data was a whole other story.
The biggest issues at this stage of the project were:
- Deciding which of the thousands of metrics to use for my analysis
- Data arrangement: all features were stored in a single column, and each feature had a value or NaN for each year represented in the data.
- The huge amount of NaN values.
To give you a better sense of what I’m dealing with, here is a dataset I created that represents a much simpler version of the actual data.
My first step was to transform this data into a table where the indicators are columns, the countries are indices, and the cells hold the most recent statistic.
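A minimal sketch of that reshaping step with pandas is shown here. The column names (`CountryName`, `IndicatorName`, `Year`, `Value`) and the toy values are assumptions for illustration, not the article's actual data or code.

```python
import pandas as pd

# Toy long-format data; column names and values are illustrative assumptions.
long_df = pd.DataFrame({
    "CountryName": ["A", "A", "A", "B", "B", "B"],
    "IndicatorName": ["GDP", "GDP", "Life", "GDP", "Life", "Life"],
    "Year": [2013, 2014, 2014, 2014, 2012, 2013],
    "Value": [1.0, 2.0, 70.0, 5.0, 60.0, None],
})


def latest_values(df):
    """Return a country-by-indicator table of the most recent observed values."""
    # Drop missing observations so "most recent" always means a real number.
    observed = df.dropna(subset=["Value"])
    # Keep the latest observed year for each (country, indicator) pair.
    latest = (
        observed.sort_values("Year")
        .groupby(["CountryName", "IndicatorName"], as_index=False)
        .last()
    )
    # Countries become the index, indicators become the columns.
    return latest.pivot(index="CountryName", columns="IndicatorName", values="Value")


wide = latest_values(long_df)
print(wide)
```

Dropping missing observations before picking the latest year means a country with a NaN in its final year still gets its last real measurement.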
Once I completed that first step, next came narrowing down a dataset with thousands of features. I dropped indicators and countries that had more than seven null values. This presented a dilemma: I needed to rid my data of features with high numbers of null values, but in doing so I could be getting rid of informative features. In addition, this meant that a number of countries would not be included in the analysis, which I expected going into this project, since very small countries tend not to have the best and most up-to-date data.
This drastically reduced the presence of null values in the data, leaving the vast majority of countries and indicators free of them. For the few nulls left, I used imputation techniques to estimate their values.
After an exhausting amount of data cleaning and munging, I was left with a dataset of 159 countries and 33 indicators. Before moving on to the clustering, I needed to do some feature engineering to remove irrelevant and redundant features. This produced an analysis-ready dataset of 20 features, which are listed below.
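The null-handling step could be sketched roughly as follows. The order of operations (indicators first, then countries) and the use of median imputation are assumptions; the article only says that imputation techniques were used. The helper name `drop_sparse_and_impute` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def drop_sparse_and_impute(wide, max_nulls=7):
    """Drop sparse columns and rows, then fill the remaining gaps."""
    # Drop indicators (columns) with more than `max_nulls` missing values.
    kept = wide.loc[:, wide.isna().sum(axis=0) <= max_nulls]
    # Then drop countries (rows) with more than `max_nulls` missing values.
    kept = kept.loc[kept.isna().sum(axis=1) <= max_nulls]
    # Median imputation is an assumption, not the article's stated technique.
    imputer = SimpleImputer(strategy="median")
    filled = imputer.fit_transform(kept)
    return pd.DataFrame(filled, index=kept.index, columns=kept.columns)


# Tiny illustrative table with a threshold of one null instead of seven.
demo = pd.DataFrame(
    {
        "x": [1.0, 2.0, np.nan, 4.0],
        "y": [np.nan, np.nan, np.nan, 1.0],
        "z": [10.0, 20.0, 30.0, 40.0],
    },
    index=list("abcd"),
)
clean = drop_sparse_and_impute(demo, max_nulls=1)
print(clean)
```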
- Renewable electricity output (% of total electricity output)
- CO2 emissions (metric tons per capita)
- Commercial bank branches (per 100,000 adults)
- Depth of credit information index (0=low to 8=high)
- Strength of legal rights index (0=weak to 12=strong)
- Mobile cellular subscriptions (per 100 people)
- Internet users (per 100 people)
- GDP per capita (current US$)
- Proportion of seats held by women in national parliaments (%)
- Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)
- Health expenditure per capita (current US$)
- Labor force, female (% of total labor force)
- Unemployment, total (% of total labor force)
- Net migration
- Mortality rate, infant (per 1,000 live births)
- Life expectancy at birth, total (years)
- Survival to age 65, female (% of cohort)
- Population, ages 0-14 (% of total)
- Age dependency ratio, young (% of working-age population)
- Urban population (% of total)
Clustering and Analysis
Because the indicators are measured on such disparate scales, scaling was a requirement before applying clustering to the data; I used Sklearn’s StandardScaler to complete this task. Another issue I had to take care of before clustering was the high level of multicollinearity in the data, which can distort the functioning of a KMeans algorithm. I expected this problem going into the project: it makes sense that infant mortality rate strongly correlates with internet usage rate and GDP per capita. My solution was to use Principal Component Analysis.
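To illustrate why scaling followed by PCA helps with correlated indicators, here is a small sketch on synthetic data. The data is an invented stand-in (one hidden factor driving three columns on very different scales, plus one unrelated column), not the World Bank dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one latent factor drives three correlated columns
# that live on very different scales; the fourth column is pure noise.
latent = rng.normal(size=200)
X = np.column_stack([
    50000 * latent + rng.normal(scale=5000, size=200),
    -30 * latent + rng.normal(scale=5, size=200),
    25 * latent + rng.normal(scale=5, size=200),
    rng.normal(size=200),
])

# Standardize each column, then project onto two principal components.
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = reducer.fit_transform(X)

# Share of the total variance captured by each component.
explained = reducer.named_steps["pca"].explained_variance_ratio_
print(X_reduced.shape, explained)
```

The resulting components are uncorrelated with each other, so the redundancy among the original columns no longer gets counted multiple times in the distance calculations KMeans relies on.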
My first round of clustering produced exceptional scores that surpassed my expectations. Clustering the scaled data without PCA using three clusters produced a silhouette score of 0.31. Implementing PCA with 4 and then 2 components bumped that score up to 0.43 and 0.46, respectively.
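A comparison like this one can be sketched as below. The helper `silhouette_for` is hypothetical, the blob data is synthetic, and the scores it prints will not match the article's numbers. The sketch scores each clustering in the space it was fit in; whether the article did the same is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_for(X, n_clusters=3, n_components=None, seed=0):
    """Scale X, optionally reduce it with PCA, cluster it, and score it."""
    X_used = StandardScaler().fit_transform(X)
    if n_components is not None:
        X_used = PCA(n_components=n_components).fit_transform(X_used)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(X_used)
    # The silhouette is computed in the same space the clusters were fit in.
    return silhouette_score(X_used, labels)


# Synthetic stand-in for the scaled indicator table.
X, _ = make_blobs(n_samples=150, centers=3, n_features=8, random_state=0)

scores = {
    "no PCA": silhouette_for(X),
    "PCA-4": silhouette_for(X, n_components=4),
    "PCA-2": silhouette_for(X, n_components=2),
}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```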
I was more than elated to generate such significant scores, but I knew it was possible to go even higher, and in order to do that I had to drop some more features.
Here’s the technique I devised to identify the worst-performing features in my clustering analysis:
- Generated all combinations of the features, with three features per combination.
- Used each combination to create a subset of the data.
- Fit a KMeans algorithm with three clusters on each data subset.
- Calculated the silhouette score for each data subset and its labels.
- Put the feature combinations and their corresponding scores into a dictionary.
- Pulled the 50 worst-scoring feature combinations and extracted just the features.
- Counted the number of times each feature showed up in those 50 combinations.
- Removed from my original dataset the six features that appeared most often in the 50 worst combinations.
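The steps above can be sketched roughly as follows. This is an illustrative reconstruction rather than the article's code: the function name `worst_features`, the scaling of each subset, and the synthetic demo data are all assumptions. The demo uses smaller cutoffs (3 worst combinations, 2 dropped features) because it only has five features.

```python
from collections import Counter
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def worst_features(df, combo_size=3, n_clusters=3, n_worst=50, n_drop=6, seed=0):
    """Rank features by how often they appear in the worst-scoring combinations."""
    scaled = pd.DataFrame(
        StandardScaler().fit_transform(df), columns=df.columns, index=df.index
    )
    scores = {}
    for combo in combinations(df.columns, combo_size):
        subset = scaled[list(combo)]
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = model.fit_predict(subset)
        scores[combo] = silhouette_score(subset, labels)
    # Lowest silhouette scores first.
    worst_combos = sorted(scores, key=scores.get)[:n_worst]
    counts = Counter(feature for combo in worst_combos for feature in combo)
    return [feature for feature, _ in counts.most_common(n_drop)]


# Synthetic demo: three features that separate three groups, two pure-noise ones.
rng = np.random.default_rng(0)
group = np.repeat([0, 1, 2], 50)
informative = 10 * group[:, None] + rng.normal(size=(150, 3))
noise = rng.normal(size=(150, 2))
demo = pd.DataFrame(
    np.hstack([informative, noise]),
    columns=["info_1", "info_2", "info_3", "noise_1", "noise_2"],
)

dropped = worst_features(demo, n_worst=3, n_drop=2)
print(dropped)
```

With 20 features there are 1,140 three-feature combinations, so the full search stays cheap enough to run exhaustively.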
The six dropped features were:
- Proportion of seats held by women in national parliaments (%)
- Strength of legal rights index (0=weak to 12=strong)
- Mobile cellular subscriptions (per 100 people)
- Unemployment, total (% of total labor force)
- Net migration
- Commercial bank branches (per 100,000 adults)
This process ended up paying significant dividends, with my clustering performance increasing across the board. A scaled dataset fit on two PCA components and a three-cluster KMeans algorithm produced a silhouette score of 0.58!
After applying clustering to derive 2, 3, and 5 cluster labels, I was now ready for the fun part!
Results
The best part of this project is visualizing a map of the world with each country colored by its cluster. Below are three interactive maps of the world displaying countries with their color-encoded labels; countries not used in the analysis are colored white. Hover over a country to see a pop-up bar containing its label and four of its development indicators.
https://plot.ly/~GeorgeMcIntire/469
Our first map splits countries into “developed” and “developing” world categories (red: developed, blue: developing). The results produced some surprises for me. There are almost twice as many developed countries as there are developing ones, and almost all of Latin America and the Middle East fall under developed.
https://plot.ly/~GeorgeMcIntire/471
In this map, the developing countries from the previous map have almost all been recategorized as third world, whereas the developed countries have been split into the first and second worlds. The second world is by far the most common label in this map, claiming 45% of the countries analyzed and at least one country on every continent. 30% of the countries are third world and 25% are first world. The second-world label is also by far the most diverse one, with countries such as Peru, Poland, and Thailand. Europe is home to almost three-quarters of the first-world countries, while Africa claims almost 90% of the third world.
There are a number of prominent geographical divisions in this map. Europe is almost perfectly split into eastern and western halves by the clusters. North and sub-Saharan Africa are divided by clusters as well.
With a map like this, it’s easy to say it just confirms a lot of what we know, but there are some things that I found surprising. I didn’t expect Chile, Uruguay, Greece, and Slovakia to be classified as first world instead of second world, nor did I expect Poland and Saudi Arabia to be classified as second instead of first.
https://plot.ly/~GeorgeMcIntire/473
Now, this is where things start to get really interesting; we’re in relatively uncharted territory at this point. One of the things I was interested in seeing was how countries’ labels change from a three-cluster to a five-cluster analysis. For instance, how many second-world countries stay second world in a five-cluster world, and how many of them change to third or fourth world? Let’s find out.
Here is a cross-tabulated table of 3-world and 5-world frequencies.
As you can see, of the 72 second-world countries in the three-cluster world, 33 are classified as second world in the five-cluster world, 35 are classified as third world, and only four fall under the fourth world. Two-thirds of the third-world countries in the three-cluster world are classified as fifth world, and the other third fall under the fourth world in the five-cluster world. Lastly, the 39 first-world countries from the three-cluster world are split almost evenly between the first and second worlds in the five-cluster world.
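A cross-tabulation of this kind can be built with pandas. The sketch below uses six made-up label pairs purely to show the mechanics; the column names `three_world` and `five_world` are assumptions, and the counts are not the article's.

```python
import pandas as pd

# Hypothetical cluster labels for six countries (the real analysis had 159).
labels = pd.DataFrame({
    "three_world": ["first", "first", "second", "second", "third", "third"],
    "five_world": ["first", "second", "second", "third", "fourth", "fifth"],
})

# Rows: label under three clusters. Columns: label under five clusters.
# margins=True adds "All" totals for each row and column.
table = pd.crosstab(labels["three_world"], labels["five_world"], margins=True)
print(table)
```

Each cell counts the countries that carry a given pair of labels, which is what makes it easy to see how one three-cluster group splits across the five-cluster groups.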
If you see any other interesting patterns in the maps above, please be sure to let us know on social media. Again, this article isn’t meant to be conclusive but rather to provoke a conversation on how categories can influence our understanding of the real world.
©ODSC2017