Thursday, December 26, 2024
Google search engine
HomeLanguagesAnalyzing Mobile Data Speeds from TRAI with Pandas

Analyzing Mobile Data Speeds from TRAI with Pandas

Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.

Let’s use a real dataset from TRAI to analyze mobile dataspeeds and try to see the average speeds for a particular operator or state in that month. This will also show how easily Pandas could be used on any real world data to derive interesting results.

About Dataset –
Telecom Regulatory Authority of India (TRAI) releases a monthly dataset of the internet speeds it measures through the MySpeed (TRAI) app. This includes speed tests initiated by the user itself or periodic background tests done by the app. We will try to analyze this dataset and see the average speeds for a particular operator or state in that month.

Inspecting the raw structure of data:

  • Go to TRAI MySpeed Portal and download the latest month’s csv file under the Download section. You can also download the csv file used in this article: sept18_publish.csv or sept18_publish_drive.csv

  • Open this spreadsheet file.
    NOTE: As the dataset is huge, the software may give you an warning that all rows could not be loaded. This is fine. Also if you are using Microsoft Excel, there might be a warning about opening of a SYLK file. This error could be ignored as it is a common bug in Excel.
    Now, let’s take a look at the arrangement of the data-

    Column Names

    Column Names in the dataset

    1st column is of the Network OperatorJIO, Airtel etc.
    2nd column is of the Network Technology3G or 4G.
    3rd column is the Type of Test initiated – upload or download.
    4th column is the Speed Measured in Kilobytes per second.
    5th column is the Signal Strength during the measurement.
    6th column is the Local Service Area(LSA), or the circle where the test was done – Delhi, Orissa etc. We will refer to this as simply ‘states’.

  • NOTE: The Signal Strength may have na (Not Available) values due to some devices unable to capture signal. We will ignore using this parameter in our calculations to make things simpler. However, it could be easily added as a condition while filtering.

    Packages required –

    Pandas – a popular data analysis toolkit. Very powerful for crunching large sets of data.
    Numpy – provides fast and efficient operations on arrays of homogeneous data. We will use this to along with pandas and matplotlib.
    Matplotlib – is a plotting library. We will use its bar plotting function to make bar graphs.

    Lets start analyzing the data.

    Step #1: Import the packages and define some constants.




    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
      
    # we will define some constants
      
    # name of the csv dataset
    DATASET_FILENAME = 'sept18_publish.csv'
      
    # define the operator to be filtered upon.
    CONST_OPERATOR = 'JIO'
      
    # define the state to be filtered upon.
    CONST_STATE = 'Delhi'
      
    # define the the technology to be filtered upon
    CONST_TECHNOLOGY = '4G'

    
    

     
    Step #2: Define some lists that will store the final calculated results, so that it could be passed on to the bar plotting function easily. The state (or operator), download speed and upload speed will be stored serially so that for an index, the state (or operator), it’s corresponding download and upload speeds can be accessed.

    For Example, final_states[2], final_download_speeds[2] and final_upload_speeds[2] will give the corresponding values for the 3rd state.




    # define lists
    final_download_speeds = []
    final_upload_speeds = []
    final_states = []
    final_operators = []

    
    

    Step #3: Import the file using Pandas read_csv() function and store it in ‘df’. This will create a DataFrame of the csv contents on which we will work on.




    df = pd.read_csv(DATASET_FILENAME)
      
    # assign headers for each of the columns based on the data
    # this allows us to access columns easily
      
    df.columns = ['Service Provider', 'Technology', 'Test Type',
                       'Data Speed', 'Signal Strength', 'State']

    
    

    Step #4: First lets find all the unique states and operators in this dataset and store them into corresponding states and operators list.

    We will use the unique() method of the Pandas dataframe.




    # find and display the unique states
    states = df['State'].unique()
    print('STATES Found: ', states)
      
    # find and display the unique operators
    operators = df['Service Provider'].unique()
    print('OPERATORS Found: ', operators)

    
    

    Output:

STATES Found:  ['Kerala' 'Rajasthan' 'Maharashtra' 'UP East' 'Karnataka' nan
 'Madhya Pradesh' 'Kolkata' 'Bihar' 'Gujarat' 'UP West' 'Orissa'
 'Tamil Nadu' 'Delhi' 'Assam' 'Andhra Pradesh' 'Haryana' 'Punjab'
 'North East' 'Mumbai' 'Chennai' 'Himachal Pradesh' 'Jammu & Kashmir'
 'West Bengal']
OPERATORS Found:  ['IDEA' 'JIO' 'AIRTEL' 'VODAFONE' 'CELLONE']

 

Step #5: Define the function fixed_operator, which will keep the operator constant and iterate through all the available states for that operator. We can construct a similar function for a fixed state.




# filter out the operator and technology
# first as this will be common for all
filtered = df[(df['Service Provider'] == CONST_OPERATOR) 
               & (df['Technology'] == CONST_TECHNOLOGY)]
  
# iterate through each of the states
for state in states:
  
    # create new dataframe which contains
    # only the data of the current state
    base = filtered[filtered['State'] == state]
  
    # filter only download speeds based on test type
    down = base[base['Test Type'] == 'download']
  
    # filter only upload speeds based on test type
    up = base[base['Test Type'] == 'upload']
  
    # calculate mean of speeds in Data Speed
    # column using the Pandas.mean() method
    avg_down = down['Data Speed'].mean()
  
    # calculate mean of speeds
    # in Data Speed column
    avg_up = up['Data Speed'].mean()
  
    # discard values if mean is not a number(nan)
    # and append only the valid ones
    if (pd.isnull(avg_down) or pd.isnull(avg_up)):
        down, up = 0, 0
      
    else:
        final_states.append(state)
        final_download_speeds.append(avg_down)
        final_upload_speeds.append(avg_up)
  
        # print output upto 2 decimal places
        print(str(state) + ' -- Avg. Download: ' +
                          str('%.2f' % avg_down) + 
         '  Avg. Upload: ' + str('%.2f' % avg_up))


Output:

Kerala -- Avg. Download: 26129.27  Avg. Upload: 5193.46
Rajasthan -- Avg. Download: 27784.86  Avg. Upload: 5736.18
Maharashtra -- Avg. Download: 20707.88  Avg. Upload: 4130.46
UP East -- Avg. Download: 22451.35  Avg. Upload: 5727.95
Karnataka -- Avg. Download: 16950.36  Avg. Upload: 4720.68
Madhya Pradesh -- Avg. Download: 23594.85  Avg. Upload: 4802.89
Kolkata -- Avg. Download: 26747.80  Avg. Upload: 5655.55
Bihar -- Avg. Download: 31730.54  Avg. Upload: 6599.45
Gujarat -- Avg. Download: 16377.43  Avg. Upload: 3642.89
UP West -- Avg. Download: 23720.82  Avg. Upload: 5280.46
Orissa -- Avg. Download: 31502.05  Avg. Upload: 6895.46
Tamil Nadu -- Avg. Download: 16689.28  Avg. Upload: 4107.44
Delhi -- Avg. Download: 20308.30  Avg. Upload: 4877.40
Assam -- Avg. Download: 5653.49  Avg. Upload: 2864.47
Andhra Pradesh -- Avg. Download: 32444.07  Avg. Upload: 5755.95
Haryana -- Avg. Download: 7170.63  Avg. Upload: 2680.02
Punjab -- Avg. Download: 14454.45  Avg. Upload: 4981.15
North East -- Avg. Download: 6702.29  Avg. Upload: 2966.84
Mumbai -- Avg. Download: 14070.97  Avg. Upload: 4118.21
Chennai -- Avg. Download: 20054.47  Avg. Upload: 4602.35
Himachal Pradesh -- Avg. Download: 7436.99  Avg. Upload: 4020.09
Jammu & Kashmir -- Avg. Download: 8759.20  Avg. Upload: 4418.21
West Bengal -- Avg. Download: 16821.17  Avg. Upload: 3628.78

Plotting the data –

Use the arange() method of Numpy which returns evenly spaced values within a given interval. Here, passing the length of the final_states list, hence we get values from 0 to the number of states in the list like [0, 1, 2, 3 …]
We can then use these indices to plot a bar at that location. The second bar is plotted by offsetting the location of the first bar by the bar width.




fig, ax = plt.subplots()
   
# the width of each bar
bar_width = 0.25
   
# opacity of each bar
opacity = 0.8
   
# store the positions
index = np.arange(len(final_states))
   
# the plt.bar() takes in the position
# of the bars, data to be plotted,
# width of each bar and some other 
# optional parameters, like the opacity and colour
   
# plot the download bars
bar_download = plt.bar(index, final_download_speeds,
                       bar_width, alpha=opacity,
                       color='b', label='Download')
# plot the upload bars  
bar_upload = plt.bar(index + bar_width, final_upload_speeds, 
                        bar_width, alpha=opacity, color='g',
                                             label='Upload')
   
# title of the graph
plt.title('Avg. Download/Upload speed for '
                     + str(CONST_OPERATOR))
   
# the x-axis label
plt.xlabel('States')
   
# the y-axis label
plt.ylabel('Average Speeds in Kbps')
   
# the label below each of the bars,
# corresponding to the states
plt.xticks(index + bar_width, final_states, rotation=90)
   
# draw the legend
plt.legend()
   
# make the graph layout tight
plt.tight_layout()
   
# show the graph
plt.show()


Barplot Output

Bar plot of the calculated speeds

Comparing data of two months –

Let’s take some another month’s data as well and plot them together to observe the difference in the data speeds.

For this example, the previous month’s dataset will be same sept18_publish.csv and the next month’s dataset is oct18_publish.csv.

We just need to execute the same steps again. Read the another month’s data. Filter it out to subsequent dataframes and then plot it using a slightly different method. During plotting of the bars, we will increment the 3rd and 4th bars(corresponding to the second file’s upload and download) by 2 and 3 times the bar width, so that they are in their correct positions.

Bar Widths in Compare

Offsetting logic when plotting 4 bars

Below is the implementation for comparing 2-months of data:




import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
   
# older month
   
# newer month
   
CONST_OPERATOR = 'JIO'
CONST_STATE = 'Delhi'
CONST_TECHNOLOGY = '4G'
   
# read file with Pandas and store as Dataframe
df = pd.read_csv(DATASET_FILENAME)
df2 = pd.read_csv(DATASET_FILENAME2)
   
# assign column names
df.columns = ['Service Provider', 'Technology', 'Test Type',
                   'Data Speed', 'Signal Strength', 'State']
df2.columns = ['Service Provider', 'Technology', 'Test Type',
                    'Data Speed', 'Signal Strength', 'State']
   
# find and display the unique states
states = df['State'].unique()
print('STATES Found: ', states)
   
# find and display the unique operators
operators = df['Service Provider'].unique()
print('OPERATORS Found: ', operators)
   
# define lists
final_download_speeds = []
final_upload_speeds = []
   
final_download_speeds_second =[]
final_upload_speeds_second = []
       
final_states = []
final_operators = []
   
# assign column names to the data
df.columns = ['Service Provider', 'Technology', 'Test Type'
                   'Data Speed', 'Signal Strength', 'State']
df2.columns = ['Service Provider', 'Technology', 'Test Type',
                    'Data Speed', 'Signal Strength', 'State']
   
  
print('\n\nComparing data for' + str(CONST_OPERATOR))
filtered = df[(df['Service Provider'] == CONST_OPERATOR)
               & (df['Technology'] == CONST_TECHNOLOGY)]
  
filtered2 = df2[(df2['Service Provider'] == CONST_OPERATOR)
                 & (df2['Technology'] == CONST_TECHNOLOGY)]
   
for state in states:
    base = filtered[filtered['State'] == state]
  
    # calculate mean of download speeds
    avg_down = base[base['Test Type'] ==
               'download']['Data Speed'].mean()
  
    # calculate mean of upload speeds
    avg_up = base[base['Test Type'] ==
                 'upload']['Data Speed'].mean()
  
    base2 = filtered2[filtered2['State'] == state]
  
    # calculate mean of download speeds
    avg_down2 = base2[base2['Test Type'] == 
             'download']['Data Speed'].mean()
  
    # calculate mean of upload speeds
    avg_up2 = base2[base2['Test Type'] == 
                'upload']['Data Speed'].mean()
  
  
    # discard values if mean is not a number(nan)
    # and append only the needed speeds
    if (pd.isnull(avg_down) or pd.isnull(avg_up) or
        pd.isnull(avg_down2) or pd.isnull(avg_up2)):
        avg_down = 0
        avg_up = 0
        avg_down2 = 0
        avg_up2 = 0
    else:
        final_states.append(state)
        final_download_speeds.append(avg_down)
        final_upload_speeds.append(avg_up)
        final_download_speeds_second.append(avg_down2)
        final_upload_speeds_second.append(avg_up2)
          
        print('Older: ' + str(state) + ' -- Download: ' +
              str('%.2f' % avg_down) + '  Upload: ' +
              str('%.2f' % avg_up))
  
        print('Newer: ' + str(state) + ' -- Download: ' +
              str('%.2f' % avg_down2) + '  Upload: ' +
              str('%.2f' % avg_up2))
  
# plot bargraph
fig, ax = plt.subplots()
index = np.arange(len(final_states))
bar_width = 0.2
opacity = 0.8
  
rects1 = plt.bar(index, final_download_speeds, 
                 bar_width, alpha=opacity, color='b',
                     label='Older Month\'s Download')
  
rects2 = plt.bar(index + bar_width, final_upload_speeds, 
                    bar_width, alpha=opacity, color='g',
                          label='Older Month\'s Upload')
  
rects3 = plt.bar(index + 2 * bar_width, final_download_speeds_second, 
                                 bar_width, alpha=opacity, color='y',
                                     label='Newer Month\'s Download')
  
rects4 = plt.bar(index + 3 * bar_width, final_upload_speeds_second, 
                               bar_width, alpha=opacity, color='r',
                                     label='Newer Month\'s Upload')
  
plt.xlabel('States')
plt.ylabel('Average Speeds')
plt.title('Avg. Download/Upload speed for '
                     + str(CONST_OPERATOR))
  
plt.xticks(index + bar_width, final_states, rotation=90)
plt.legend()
plt.tight_layout()
  
plt.show()


Output:

STATES Found:  ['Kerala' 'Rajasthan' 'Maharashtra' 'UP East' 'Karnataka' nan
 'Madhya Pradesh' 'Kolkata' 'Bihar' 'Gujarat' 'UP West' 'Orissa'
 'Tamil Nadu' 'Delhi' 'Assam' 'Andhra Pradesh' 'Haryana' 'Punjab'
 'North East' 'Mumbai' 'Chennai' 'Himachal Pradesh' 'Jammu & Kashmir'
 'West Bengal']
OPERATORS Found:  ['IDEA' 'JIO' 'AIRTEL' 'VODAFONE' 'CELLONE']
Comparing data forJIO
Older: Kerala -- Download: 26129.27  Upload: 5193.46
Newer: Kerala -- Download: 18917.46  Upload: 4290.13
Older: Rajasthan -- Download: 27784.86  Upload: 5736.18
Newer: Rajasthan -- Download: 13973.66  Upload: 4721.17
Older: Maharashtra -- Download: 20707.88  Upload: 4130.46
Newer: Maharashtra -- Download: 26285.47  Upload: 5848.77
Older: UP East -- Download: 22451.35  Upload: 5727.95
Newer: UP East -- Download: 24368.81  Upload: 6101.20
Older: Karnataka -- Download: 16950.36  Upload: 4720.68
Newer: Karnataka -- Download: 33521.31  Upload: 5871.38
Older: Madhya Pradesh -- Download: 23594.85  Upload: 4802.89
Newer: Madhya Pradesh -- Download: 16614.49  Upload: 4135.70
Older: Kolkata -- Download: 26747.80  Upload: 5655.55
Newer: Kolkata -- Download: 23761.85  Upload: 5153.29
Older: Bihar -- Download: 31730.54  Upload: 6599.45
Newer: Bihar -- Download: 34196.09  Upload: 5215.58
Older: Gujarat -- Download: 16377.43  Upload: 3642.89
Newer: Gujarat -- Download: 9557.90  Upload: 2684.55
Older: UP West -- Download: 23720.82  Upload: 5280.46
Newer: UP West -- Download: 35035.84  Upload: 5797.93
Older: Orissa -- Download: 31502.05  Upload: 6895.46
Newer: Orissa -- Download: 31826.96  Upload: 6968.59
Older: Tamil Nadu -- Download: 16689.28  Upload: 4107.44
Newer: Tamil Nadu -- Download: 27306.54  Upload: 5537.58
Older: Delhi -- Download: 20308.30  Upload: 4877.40
Newer: Delhi -- Download: 25198.16  Upload: 6228.81
Older: Assam -- Download: 5653.49  Upload: 2864.47
Newer: Assam -- Download: 5243.34  Upload: 2676.69
Older: Andhra Pradesh -- Download: 32444.07  Upload: 5755.95
Newer: Andhra Pradesh -- Download: 19898.16  Upload: 4002.25
Older: Haryana -- Download: 7170.63  Upload: 2680.02
Newer: Haryana -- Download: 8496.27  Upload: 2862.61
Older: Punjab -- Download: 14454.45  Upload: 4981.15
Newer: Punjab -- Download: 17960.28  Upload: 4885.83
Older: North East -- Download: 6702.29  Upload: 2966.84
Newer: North East -- Download: 6008.06  Upload: 3052.87
Older: Mumbai -- Download: 14070.97  Upload: 4118.21
Newer: Mumbai -- Download: 26898.04  Upload: 5539.71
Older: Chennai -- Download: 20054.47  Upload: 4602.35
Newer: Chennai -- Download: 36086.70  Upload: 6675.70
Older: Himachal Pradesh -- Download: 7436.99  Upload: 4020.09
Newer: Himachal Pradesh -- Download: 9277.45  Upload: 4622.25
Older: Jammu & Kashmir -- Download: 8759.20  Upload: 4418.21
Newer: Jammu & Kashmir -- Download: 9290.38  Upload: 4533.08
Older: West Bengal -- Download: 16821.17  Upload: 3628.78
Newer: West Bengal -- Download: 9763.05  Upload: 2627.28
Compare - Bar Plot

Bar plot of the output

We just learned how to analyze some real world data and draw some interesting observations from it. But note that not all the data will be as nicely formatted and simple to deal with, Pandas makes it incredibly easy to work with such datasets.

Dominic Rubhabha-Wardslaus
Dominic Rubhabha-Wardslaushttp://wardslaus.com
infosec,malicious & dos attacks generator, boot rom exploit philanthropist , wild hacker , game developer,
RELATED ARTICLES

Most Popular

Recent Comments