Exploring and Predicting University Faculty Salaries

6 September 2024

0

Although data about faculty salaries at private universities can be difficult to find, getting data regarding faculty salaries at public universities is much easier. I’ve always been curious (even though I don’t want to go into academia myself!) which fields pay the best and why and if there is a way to predict your salary if you are going to be a professor at a certain university.

So, I decided to make this a project I would explore and use two of the very best data science tools to do it: Python and R! Thanks to the rpy2 library and IPython’s R magic, I can now include R code inside the notebook instead of having to switch between knitr and here. The first half will be about cleaning and EDA (exploratory data analysis): a key first step of any data science project dealing with structured data especially. The second half will be about machine learning and applying a model for regression.

While Python is making improvements in visualization, I still prefer R for this part. Combining the great tools of the plyr library for splitting the data into pieces with the beautiful visualization of the ggplot2 library makes this look much better than Python’s current offerings.

Getting the Data

For this project, I decided to choose my current university: Texas A&M. It’s a large university, which means it also has a lot of faculty members to look at! The Texas Tribune publishes salary data each year from the university here.

We only need the data from Texas A&M, no other extension agencies.

Here is part of the problem though: all of the files we can download are .xlsx Excel files. R doesn’t read these very well (although there are some workarounds). The best way is to save them as a .csv file first, then read them in using read.csv.

Pandas can read Excel files, however. Let’s get the files using Pandas, then save them as .csv files locally. Once that is done, we can load them into R.

import pandas as pd
texasAM_file = pd.read_excel('http://raw.texastribune.org.s3.amazonaws.com' + 
    '/texas_a%26m_university_system/salaries/2014-01/Hill%20-%20SO-14-003.xlsx') # Download the file.


texasAM_file.to_csv('TexasAM_Salaries.csv') # Save the data as a .csv

Cleaning It Up

It is time to play around with some of IPython Notebook’s lovely interaction with R using the rpy2 library and the R magic. Note to Windows users: this probably won’t work very well for you as rpy2 doesn’t play nicely with Windows! You could use a virtual machine with Linux to get around this if you wish.

Let’s load all of the libraries we will need in R first.

    %load_ext rpy2.ipython

    %%R # You need this to use a block of R code.
    library(plyr)
    library(ggplot2)
    library(scales)

Now that we have all of our libraries loaded, let’s load our data into R and take a look at it.

%%R
salaryDB <- read.csv('TexasAM_Salaries.csv', header = TRUE, stringsAsFactors = FALSE)
print(head(salaryDB))

    X                     MbrName      LastName FirstName     ShortTitleDesc
1 0 Prairie View A&M University        ABBASI    HASSAN  INFO RESOURCE SUP
2 1 Prairie View A&M University        ABBASI   HUSSAIN  INFO RESOURCE SUP
3 2 Prairie View A&M University         ABIBO    IBUOMI   SPEC WEB CONTENT
4 3 Prairie View A&M University ABROM-JOHNSON TENEINGER ADJUNCT INSTRUCTOR
5 4 Prairie View A&M University         ADAMS     BETTY        PROF & DEAN
6 5 Prairie View A&M University         ADAMS    GLORIA     ADMIN ASST III
        ShortAdlocDesc EEOMinorityCode Sex        CurrEmplDate
1 INFORMATION RESOURCE               4   M 2009-05-15 00:00:00
2      ADMIN & RECORDS               4   M 2010-07-15 00:00:00
3 ADLOC COOP EXTENSION               2   M 2013-02-15 00:00:00
4 AG. NUTR & HUMAN ECO               2   F 1990-12-17 00:00:00
5  OFF.OF ACADEMIC AFF               2   F 2000-06-01 00:00:00
6    UNIVERSITY POLICE               2   F 1988-11-01 00:00:00
            OrigEmplDate BudgetedSalary
1 2009-05-15 00:00:00          63664
2 2010-07-15 00:00:00          61800
3 2009-02-15 00:00:00          50364
4 1990-12-17 00:00:00          44378
5 2000-06-01 00:00:00         153372
6 1974-06-01 00:00:00          45968

Clearly there are some issues we will need to address. First of all, we can see other universities were included in the database that are NOT a part of Texas A&M. For this study we are only looking at Texas A&M faculty, so let’s only include rows under the ‘MbrName’ column that are either part of Texas A&M or the Health Science Center. Let’s also trim this down a bit and get rid of the ‘X’ and ‘CurrEmplDate’ columns as we won’t need them for our analysis.

%%R
salaryDB <- subset(salaryDB, MbrName == 'Texas A&M University' | MbrName == 'Texas A&M Health Science Center', 
                    select = -c(CurrEmplDate, X))

Let’s see a summary of our datatypes.

%%R
str(salaryDB)

'data.frame':	9234 obs. of  9 variables:
    $ MbrName        : chr  "Texas A&M Health Science Center" "Texas A&M Health Science Center" "Texas A&M Health Science Center" "Texas A&M Health Science Center" ...
    $ LastName       : chr  "ABBEY" "ABRAHAM" "ACOSTA" "ADAMS" ...
    $ FirstName      : chr  "COLETTE" "CELESTE" "IDALIA" "JOYCE" ...
    $ ShortTitleDesc : chr  "RESEARCH ASSOCIATE" "ASSOCIATE PROFESSOR" "ADMIN COORDINATOR" "BUS ADMIN I" ...
    $ ShortAdlocDesc : chr  "MOLECULAR & CELL MED" "PERIODONTICS" "HEALTH POL & MGMT" "REGENERATIVE MED" ...
    $ EEOMinorityCode: num  1 4 3 1 3 1 1 1 4 3 ...
    $ Sex            : chr  "F" "F" "F" "F" ...
    $ OrigEmplDate   : chr  "1992-09-10 00:00:00" "1993-06-21 00:00:00" "1995-08-14 00:00:00" "2008-09-15 00:00:00" ...
    $ BudgetedSalary : int  41600 107362 55782 52500 54000 185001 5111 35144 40315 15900 ...

For this project, we are looking at tenure-track faculty only. We want to get rid of other employees such as “BUS ADMIN I”. As a start, let’s remove all employees that don’t have “Professor” in their title (from the ShortTitleDesc field).

%%R
salaryDB <- salaryDB[grep('PROF', salaryDB$ShortTitleDesc),]

dim(salaryDB)

[1] 2818    9

We have reduced the number of employees a lot, from 9234 to 2818.

The next thing we want to do is change the OrigEmplDate column to something more meaningful. Instead, let’s make it the number of years they have worked at the university. We can use this as one of our features later.

The metadata at the website where we obtained the original salary data says that it was uploaded on Jan 15, 2014. Let’s find the difference between that date and the original employment date for all of our employees.

%%R    
origDate <- as.Date('2014-01-15') # This is the format R likes: YYYY-MM-DD.
salaryDB$OrigEmplDate <- as.Date(salaryDB$OrigEmplDate) # Get this into the Date form R likes for finding differences 
                                                                    # between dates.
salaryDB$OrigEmplDate <- as.numeric((origDate-salaryDB$OrigEmplDate)/365.) # Calculate the difference and divide by 365 for number of years.
colnames(salaryDB)[8] <- 'YearsWorked' # Use a more informative column title.

Let’s just check to make sure it worked.

%%R
head(salaryDB)

                              MbrName  LastName FirstName      ShortTitleDesc
5892 Texas A&M Health Science Center   ABRAHAM   CELESTE ASSOCIATE PROFESSOR
5896 Texas A&M Health Science Center     ADAMS       RAE ASSISTANT PROFESSOR
5897 Texas A&M Health Science Center     ADAMS     TERRY  CLINICAL ASST PROF
5905 Texas A&M Health Science Center AL ROUSAN     RABAA ASSISTANT PROFESSOR
5906 Texas A&M Health Science Center ALAJLOUNI  KHALDOUN     CLIN ASSOC PROF
5908 Texas A&M Health Science Center    ALANIZ    ROBERT ASSISTANT PROFESSOR
            ShortAdlocDesc EEOMinorityCode Sex YearsWorked BudgetedSalary
5892         PERIODONTICS               4   F  20.5835616         107362
5896 TX A&M FAMILY MED RE               1   F  10.5890411         185001
5897         ORTHODONTICS               1   M  17.3205479           5111
5905   PHARMACY-ACAD SUPP              NA   F   0.2712329          92500
5906 RESTORATIVE SCIENCES               1   M   8.0438356          16840
5908 MICRO PATH & IMMUNOL               3   M   5.9589041          88642

That seems correct. The next thing we are going to do is create another feature: college. All of the departments at universities are organized into colleges. At Texas A&M, that means 12 major colleges. This is somewhat tedious work, but we want to organize all of the faculty members by college as a feature.

Let’s get a list of all of the different departments included first.

%%R
colnames(salaryDB)[5] <- 'Dept' # Much better title for this column!
deptLevels <- as.factor(salaryDB$Dept)
print(summary(deptLevels))

             MATHEMATICS HEALTH & KINESIOLOGY         QATAR CAMPUS

                    102                   96                   66

    PHYSICS & ASTRONOMY   TAMU SCHOOL OF LAW DEPT HEAD-ELECT ENGR

                    64                   62                   58

    CIVIL ENGINEERING RESTORATIVE SCIENCES     LIBRARY DIRECTOR

                    57                   54                   51

                ENGLISH            CHEMISTRY  DEPT HEAD-MECH ENGR

                    48                   47                   47

   EDUCATIONAL PSYCHOLO              HISTORY              BIOLOGY

                    47                   47                   46

       VET INTEGRATIVE BIOS VET MED PATHOBIOLOGY           PSYCHOLOGY

                      44                   43                   41

      PHARMACY-ACAD SUPP   TEACHING, LEARNING         ARCHITECTURE

                      38                   35                   34

      BUSH SCHOOL OF GOV    POLITICAL SCIENCE VET MED SMALL ANIMAL

                      34                   34                   34

      DEPT OF STATISTICS     HD, COMPUTER SCI           MANAGEMENT

                      33                   33                   33

          ANIMAL SCIENCE LARGE ANIMAL CLINICA            SOCIOLOGY

                      31                   31                   29

      COLLEGE OF NURSING  PEDIATRIC DENTISTRY        AEROSPACE ENG

                      28                   28                   27

    BIOCHEM & BIOPHYSICS DEPT HEAD-PETRL ENGR   DEPT HEAD-IND ENGR

                      27                   27                   26

               ECONOMICS              GEOLOGY   HEAD OF ACCOUNTING

                      26                   26                   26

            AG ECONOMICS         ANTHROPOLOGY  DEPT HEAD-ENGR TECH

                      25                   25                   25

    DEPT INFO & OP MGMNT  LANDS ARCH URB PLAN   ED ADM & HR (EAHR)

                      25                   25                   23

              ENTOMOLOGY  SOIL & CROP SCIENCE  VET MED PHYS & PHAR

                      23                   23                   23

    ECOSYSTEM SCI & MGMT PHILOSO & HUMANITIES WILDLIFE & FISHERIES

                      22                   22                   22

    BIOLOG & AG ENGINEER DEAN OF

Exploring and Predicting University Faculty Salaries

Getting the Data

Cleaning It Up

Run Local AWS Cloud Stack using LocalStack on Linux

Learn Terraform Automation in 3 days using Video Courses

How To Expose Ansible AWX Service using Nginx Ingress

LEAVE A REPLY Cancel reply

Most Popular

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

Google Messages can now show your profile exactly how it’s supposed to be

Recent Comments

EDITOR PICKS

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

POPULAR POSTS

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

POPULAR CATEGORY

ABOUT US

FOLLOW US