Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
Let’s use a real dataset from TRAI to analyze mobile dataspeeds and try to see the average speeds for a particular operator or state in that month. This will also show how easily Pandas could be used on any real world data to derive interesting results.
About Dataset –
Telecom Regulatory Authority of India (TRAI) releases a monthly dataset of the internet speeds it measures through the MySpeed (TRAI) app. This includes speed tests initiated by the user itself or periodic background tests done by the app. We will try to analyze this dataset and see the average speeds for a particular operator or state in that month.
Inspecting the raw structure of data:
- Go to TRAI MySpeed Portal and download the latest month’s csv file under the Download section. You can also download the csv file used in this article: sept18_publish.csv or sept18_publish_drive.csv
- Open this spreadsheet file.
NOTE: As the dataset is huge, the software may give you an warning that all rows could not be loaded. This is fine. Also if you are using Microsoft Excel, there might be a warning about opening of a SYLK file. This error could be ignored as it is a common bug in Excel.
Now, let’s take a look at the arrangement of the data-1st column is of the Network Operator – JIO, Airtel etc.
2nd column is of the Network Technology – 3G or 4G.
3rd column is the Type of Test initiated – upload or download.
4th column is the Speed Measured in Kilobytes per second.
5th column is the Signal Strength during the measurement.
6th column is the Local Service Area(LSA), or the circle where the test was done – Delhi, Orissa etc. We will refer to this as simply ‘states’.
NOTE: The Signal Strength may have na (Not Available)
values due to some devices unable to capture signal. We will ignore using this parameter in our calculations to make things simpler. However, it could be easily added as a condition while filtering.
Packages required –
Pandas – a popular data analysis toolkit. Very powerful for crunching large sets of data.
Numpy – provides fast and efficient operations on arrays of homogeneous data. We will use this to along with pandas and matplotlib.
Matplotlib – is a plotting library. We will use its bar plotting function to make bar graphs.
Lets start analyzing the data.
Step #1: Import the packages and define some constants.
import pandas as pd import numpy as np import matplotlib.pyplot as plt # we will define some constants # name of the csv dataset DATASET_FILENAME = 'sept18_publish.csv' # define the operator to be filtered upon. CONST_OPERATOR = 'JIO' # define the state to be filtered upon. CONST_STATE = 'Delhi' # define the the technology to be filtered upon CONST_TECHNOLOGY = '4G' |
Step #2: Define some lists that will store the final calculated results, so that it could be passed on to the bar plotting function easily. The state (or operator), download speed and upload speed will be stored serially so that for an index, the state (or operator), it’s corresponding download and upload speeds can be accessed.
For Example, final_states[2], final_download_speeds[2]
and final_upload_speeds[2]
will give the corresponding values for the 3rd state.
# define lists final_download_speeds = [] final_upload_speeds = [] final_states = [] final_operators = [] |
Step #3: Import the file using Pandas read_csv()
function and store it in ‘df’. This will create a DataFrame of the csv contents on which we will work on.
df = pd.read_csv(DATASET_FILENAME) # assign headers for each of the columns based on the data # this allows us to access columns easily df.columns = [ 'Service Provider' , 'Technology' , 'Test Type' , 'Data Speed' , 'Signal Strength' , 'State' ] |
Step #4: First lets find all the unique states and operators in this dataset and store them into corresponding states and operators list.
We will use the unique()
method of the Pandas dataframe.
# find and display the unique states states = df[ 'State' ].unique() print ( 'STATES Found: ' , states) # find and display the unique operators operators = df[ 'Service Provider' ].unique() print ( 'OPERATORS Found: ' , operators) |
Output:
STATES Found: ['Kerala' 'Rajasthan' 'Maharashtra' 'UP East' 'Karnataka' nan 'Madhya Pradesh' 'Kolkata' 'Bihar' 'Gujarat' 'UP West' 'Orissa' 'Tamil Nadu' 'Delhi' 'Assam' 'Andhra Pradesh' 'Haryana' 'Punjab' 'North East' 'Mumbai' 'Chennai' 'Himachal Pradesh' 'Jammu & Kashmir' 'West Bengal'] OPERATORS Found: ['IDEA' 'JIO' 'AIRTEL' 'VODAFONE' 'CELLONE']
Step #5: Define the function fixed_operator
, which will keep the operator constant and iterate through all the available states for that operator. We can construct a similar function for a fixed state.
# filter out the operator and technology # first as this will be common for all filtered = df[(df[ 'Service Provider' ] = = CONST_OPERATOR) & (df[ 'Technology' ] = = CONST_TECHNOLOGY)] # iterate through each of the states for state in states: # create new dataframe which contains # only the data of the current state base = filtered[filtered[ 'State' ] = = state] # filter only download speeds based on test type down = base[base[ 'Test Type' ] = = 'download' ] # filter only upload speeds based on test type up = base[base[ 'Test Type' ] = = 'upload' ] # calculate mean of speeds in Data Speed # column using the Pandas.mean() method avg_down = down[ 'Data Speed' ].mean() # calculate mean of speeds # in Data Speed column avg_up = up[ 'Data Speed' ].mean() # discard values if mean is not a number(nan) # and append only the valid ones if (pd.isnull(avg_down) or pd.isnull(avg_up)): down, up = 0 , 0 else : final_states.append(state) final_download_speeds.append(avg_down) final_upload_speeds.append(avg_up) # print output upto 2 decimal places print ( str (state) + ' -- Avg. Download: ' + str ( '%.2f' % avg_down) + ' Avg. Upload: ' + str ( '%.2f' % avg_up)) |
Output:
Kerala -- Avg. Download: 26129.27 Avg. Upload: 5193.46 Rajasthan -- Avg. Download: 27784.86 Avg. Upload: 5736.18 Maharashtra -- Avg. Download: 20707.88 Avg. Upload: 4130.46 UP East -- Avg. Download: 22451.35 Avg. Upload: 5727.95 Karnataka -- Avg. Download: 16950.36 Avg. Upload: 4720.68 Madhya Pradesh -- Avg. Download: 23594.85 Avg. Upload: 4802.89 Kolkata -- Avg. Download: 26747.80 Avg. Upload: 5655.55 Bihar -- Avg. Download: 31730.54 Avg. Upload: 6599.45 Gujarat -- Avg. Download: 16377.43 Avg. Upload: 3642.89 UP West -- Avg. Download: 23720.82 Avg. Upload: 5280.46 Orissa -- Avg. Download: 31502.05 Avg. Upload: 6895.46 Tamil Nadu -- Avg. Download: 16689.28 Avg. Upload: 4107.44 Delhi -- Avg. Download: 20308.30 Avg. Upload: 4877.40 Assam -- Avg. Download: 5653.49 Avg. Upload: 2864.47 Andhra Pradesh -- Avg. Download: 32444.07 Avg. Upload: 5755.95 Haryana -- Avg. Download: 7170.63 Avg. Upload: 2680.02 Punjab -- Avg. Download: 14454.45 Avg. Upload: 4981.15 North East -- Avg. Download: 6702.29 Avg. Upload: 2966.84 Mumbai -- Avg. Download: 14070.97 Avg. Upload: 4118.21 Chennai -- Avg. Download: 20054.47 Avg. Upload: 4602.35 Himachal Pradesh -- Avg. Download: 7436.99 Avg. Upload: 4020.09 Jammu & Kashmir -- Avg. Download: 8759.20 Avg. Upload: 4418.21 West Bengal -- Avg. Download: 16821.17 Avg. Upload: 3628.78
Plotting the data –
Use the arange()
method of Numpy which returns evenly spaced values within a given interval. Here, passing the length of the final_states
list, hence we get values from 0 to the number of states in the list like [0, 1, 2, 3 …]
We can then use these indices to plot a bar at that location. The second bar is plotted by offsetting the location of the first bar by the bar width.
fig, ax = plt.subplots() # the width of each bar bar_width = 0.25 # opacity of each bar opacity = 0.8 # store the positions index = np.arange( len (final_states)) # the plt.bar() takes in the position # of the bars, data to be plotted, # width of each bar and some other # optional parameters, like the opacity and colour # plot the download bars bar_download = plt.bar(index, final_download_speeds, bar_width, alpha = opacity, color = 'b' , label = 'Download' ) # plot the upload bars bar_upload = plt.bar(index + bar_width, final_upload_speeds, bar_width, alpha = opacity, color = 'g' , label = 'Upload' ) # title of the graph plt.title( 'Avg. Download/Upload speed for ' + str (CONST_OPERATOR)) # the x-axis label plt.xlabel( 'States' ) # the y-axis label plt.ylabel( 'Average Speeds in Kbps' ) # the label below each of the bars, # corresponding to the states plt.xticks(index + bar_width, final_states, rotation = 90 ) # draw the legend plt.legend() # make the graph layout tight plt.tight_layout() # show the graph plt.show() |
Comparing data of two months –
Let’s take some another month’s data as well and plot them together to observe the difference in the data speeds.
For this example, the previous month’s dataset will be same sept18_publish.csv and the next month’s dataset is oct18_publish.csv.
We just need to execute the same steps again. Read the another month’s data. Filter it out to subsequent dataframes and then plot it using a slightly different method. During plotting of the bars, we will increment the 3rd and 4th bars(corresponding to the second file’s upload and download) by 2 and 3 times the bar width, so that they are in their correct positions.
Below is the implementation for comparing 2-months of data:
import pandas as pd import numpy as np import matplotlib.pyplot as plt import time # older month # newer month CONST_OPERATOR = 'JIO' CONST_STATE = 'Delhi' CONST_TECHNOLOGY = '4G' # read file with Pandas and store as Dataframe df = pd.read_csv(DATASET_FILENAME) df2 = pd.read_csv(DATASET_FILENAME2) # assign column names df.columns = [ 'Service Provider' , 'Technology' , 'Test Type' , 'Data Speed' , 'Signal Strength' , 'State' ] df2.columns = [ 'Service Provider' , 'Technology' , 'Test Type' , 'Data Speed' , 'Signal Strength' , 'State' ] # find and display the unique states states = df[ 'State' ].unique() print ( 'STATES Found: ' , states) # find and display the unique operators operators = df[ 'Service Provider' ].unique() print ( 'OPERATORS Found: ' , operators) # define lists final_download_speeds = [] final_upload_speeds = [] final_download_speeds_second = [] final_upload_speeds_second = [] final_states = [] final_operators = [] # assign column names to the data df.columns = [ 'Service Provider' , 'Technology' , 'Test Type' , 'Data Speed' , 'Signal Strength' , 'State' ] df2.columns = [ 'Service Provider' , 'Technology' , 'Test Type' , 'Data Speed' , 'Signal Strength' , 'State' ] print ( '\n\nComparing data for' + str (CONST_OPERATOR)) filtered = df[(df[ 'Service Provider' ] = = CONST_OPERATOR) & (df[ 'Technology' ] = = CONST_TECHNOLOGY)] filtered2 = df2[(df2[ 'Service Provider' ] = = CONST_OPERATOR) & (df2[ 'Technology' ] = = CONST_TECHNOLOGY)] for state in states: base = filtered[filtered[ 'State' ] = = state] # calculate mean of download speeds avg_down = base[base[ 'Test Type' ] = = 'download' ][ 'Data Speed' ].mean() # calculate mean of upload speeds avg_up = base[base[ 'Test Type' ] = = 'upload' ][ 'Data Speed' ].mean() base2 = filtered2[filtered2[ 'State' ] = = state] # calculate mean of download speeds avg_down2 = base2[base2[ 'Test Type' ] = = 'download' ][ 'Data Speed' ].mean() # calculate mean of upload speeds avg_up2 = base2[base2[ 'Test Type' ] = = 'upload' ][ 'Data Speed' ].mean() # discard values if mean is not a number(nan) # and append only the needed speeds if (pd.isnull(avg_down) or pd.isnull(avg_up) or pd.isnull(avg_down2) or pd.isnull(avg_up2)): avg_down = 0 avg_up = 0 avg_down2 = 0 avg_up2 = 0 else : final_states.append(state) final_download_speeds.append(avg_down) final_upload_speeds.append(avg_up) final_download_speeds_second.append(avg_down2) final_upload_speeds_second.append(avg_up2) print ( 'Older: ' + str (state) + ' -- Download: ' + str ( '%.2f' % avg_down) + ' Upload: ' + str ( '%.2f' % avg_up)) print ( 'Newer: ' + str (state) + ' -- Download: ' + str ( '%.2f' % avg_down2) + ' Upload: ' + str ( '%.2f' % avg_up2)) # plot bargraph fig, ax = plt.subplots() index = np.arange( len (final_states)) bar_width = 0.2 opacity = 0.8 rects1 = plt.bar(index, final_download_speeds, bar_width, alpha = opacity, color = 'b' , label = 'Older Month\'s Download' ) rects2 = plt.bar(index + bar_width, final_upload_speeds, bar_width, alpha = opacity, color = 'g' , label = 'Older Month\'s Upload' ) rects3 = plt.bar(index + 2 * bar_width, final_download_speeds_second, bar_width, alpha = opacity, color = 'y' , label = 'Newer Month\'s Download' ) rects4 = plt.bar(index + 3 * bar_width, final_upload_speeds_second, bar_width, alpha = opacity, color = 'r' , label = 'Newer Month\'s Upload' ) plt.xlabel( 'States' ) plt.ylabel( 'Average Speeds' ) plt.title( 'Avg. Download/Upload speed for ' + str (CONST_OPERATOR)) plt.xticks(index + bar_width, final_states, rotation = 90 ) plt.legend() plt.tight_layout() plt.show() |
Output:
STATES Found: ['Kerala' 'Rajasthan' 'Maharashtra' 'UP East' 'Karnataka' nan 'Madhya Pradesh' 'Kolkata' 'Bihar' 'Gujarat' 'UP West' 'Orissa' 'Tamil Nadu' 'Delhi' 'Assam' 'Andhra Pradesh' 'Haryana' 'Punjab' 'North East' 'Mumbai' 'Chennai' 'Himachal Pradesh' 'Jammu & Kashmir' 'West Bengal'] OPERATORS Found: ['IDEA' 'JIO' 'AIRTEL' 'VODAFONE' 'CELLONE']
Comparing data forJIO Older: Kerala -- Download: 26129.27 Upload: 5193.46 Newer: Kerala -- Download: 18917.46 Upload: 4290.13 Older: Rajasthan -- Download: 27784.86 Upload: 5736.18 Newer: Rajasthan -- Download: 13973.66 Upload: 4721.17 Older: Maharashtra -- Download: 20707.88 Upload: 4130.46 Newer: Maharashtra -- Download: 26285.47 Upload: 5848.77 Older: UP East -- Download: 22451.35 Upload: 5727.95 Newer: UP East -- Download: 24368.81 Upload: 6101.20 Older: Karnataka -- Download: 16950.36 Upload: 4720.68 Newer: Karnataka -- Download: 33521.31 Upload: 5871.38 Older: Madhya Pradesh -- Download: 23594.85 Upload: 4802.89 Newer: Madhya Pradesh -- Download: 16614.49 Upload: 4135.70 Older: Kolkata -- Download: 26747.80 Upload: 5655.55 Newer: Kolkata -- Download: 23761.85 Upload: 5153.29 Older: Bihar -- Download: 31730.54 Upload: 6599.45 Newer: Bihar -- Download: 34196.09 Upload: 5215.58 Older: Gujarat -- Download: 16377.43 Upload: 3642.89 Newer: Gujarat -- Download: 9557.90 Upload: 2684.55 Older: UP West -- Download: 23720.82 Upload: 5280.46 Newer: UP West -- Download: 35035.84 Upload: 5797.93 Older: Orissa -- Download: 31502.05 Upload: 6895.46 Newer: Orissa -- Download: 31826.96 Upload: 6968.59 Older: Tamil Nadu -- Download: 16689.28 Upload: 4107.44 Newer: Tamil Nadu -- Download: 27306.54 Upload: 5537.58 Older: Delhi -- Download: 20308.30 Upload: 4877.40 Newer: Delhi -- Download: 25198.16 Upload: 6228.81 Older: Assam -- Download: 5653.49 Upload: 2864.47 Newer: Assam -- Download: 5243.34 Upload: 2676.69 Older: Andhra Pradesh -- Download: 32444.07 Upload: 5755.95 Newer: Andhra Pradesh -- Download: 19898.16 Upload: 4002.25 Older: Haryana -- Download: 7170.63 Upload: 2680.02 Newer: Haryana -- Download: 8496.27 Upload: 2862.61 Older: Punjab -- Download: 14454.45 Upload: 4981.15 Newer: Punjab -- Download: 17960.28 Upload: 4885.83 Older: North East -- Download: 6702.29 Upload: 2966.84 Newer: North East -- Download: 6008.06 Upload: 3052.87 Older: Mumbai -- Download: 14070.97 Upload: 4118.21 Newer: Mumbai -- Download: 26898.04 Upload: 5539.71 Older: Chennai -- Download: 20054.47 Upload: 4602.35 Newer: Chennai -- Download: 36086.70 Upload: 6675.70 Older: Himachal Pradesh -- Download: 7436.99 Upload: 4020.09 Newer: Himachal Pradesh -- Download: 9277.45 Upload: 4622.25 Older: Jammu & Kashmir -- Download: 8759.20 Upload: 4418.21 Newer: Jammu & Kashmir -- Download: 9290.38 Upload: 4533.08 Older: West Bengal -- Download: 16821.17 Upload: 3628.78 Newer: West Bengal -- Download: 9763.05 Upload: 2627.28
We just learned how to analyze some real world data and draw some interesting observations from it. But note that not all the data will be as nicely formatted and simple to deal with, Pandas makes it incredibly easy to work with such datasets.