Although data about faculty salaries at private universities can be difficult to find, getting data regarding faculty salaries at public universities is much easier. I’ve always been curious (even though I don’t want to go into academia myself!) which fields pay the best and why and if there is a way to predict your salary if you are going to be a professor at a certain university.
So, I decided to make this a project I would explore and use two of the very best data science tools to do it: Python and R! Thanks to the rpy2 library and IPython’s R magic, I can now include R code inside the notebook instead of having to switch between knitr and here. The first half will be about cleaning and EDA (exploratory data analysis): a key first step of any data science project dealing with structured data especially. The second half will be about machine learning and applying a model for regression.
While Python is making improvements in visualization, I still prefer R for this part. Combining the great tools of the plyr library for splitting the data into pieces with the beautiful visualization of the ggplot2 library makes this look much better than Python’s current offerings.
Getting the Data
For this project, I decided to choose my current university: Texas A&M. It’s a large university, which means it also has a lot of faculty members to look at! The Texas Tribune publishes salary data each year from the university here.
We only need the data from Texas A&M, no other extension agencies.
Here is part of the problem though: all of the files we can download are .xlsx Excel files. R doesn’t read these very well (although there are some workarounds). The best way is to save them as a .csv file first, then read them in using read.csv.
Pandas can read Excel files, however. Let’s get the files using Pandas, then save them as .csv files locally. Once that is done, we can load them into R.
import pandas as pd
texasAM_file = pd.read_excel('http://raw.texastribune.org.s3.amazonaws.com' +
'/texas_a%26m_university_system/salaries/2014-01/Hill%20-%20SO-14-003.xlsx') # Download the file.
texasAM_file.to_csv('TexasAM_Salaries.csv') # Save the data as a .csv
Cleaning It Up
It is time to play around with some of IPython Notebook’s lovely interaction with R using the rpy2 library and the R magic. Note to Windows users: this probably won’t work very well for you as rpy2 doesn’t play nicely with Windows! You could use a virtual machine with Linux to get around this if you wish.
Let’s load all of the libraries we will need in R first.
%load_ext rpy2.ipython
%%R # You need this to use a block of R code.
library(plyr)
library(ggplot2)
library(scales)
Now that we have all of our libraries loaded, let’s load our data into R and take a look at it.
%%R
salaryDB <- read.csv('TexasAM_Salaries.csv', header = TRUE, stringsAsFactors = FALSE)
print(head(salaryDB))
X MbrName LastName FirstName ShortTitleDesc
1 0 Prairie View A&M University ABBASI HASSAN INFO RESOURCE SUP
2 1 Prairie View A&M University ABBASI HUSSAIN INFO RESOURCE SUP
3 2 Prairie View A&M University ABIBO IBUOMI SPEC WEB CONTENT
4 3 Prairie View A&M University ABROM-JOHNSON TENEINGER ADJUNCT INSTRUCTOR
5 4 Prairie View A&M University ADAMS BETTY PROF & DEAN
6 5 Prairie View A&M University ADAMS GLORIA ADMIN ASST III
ShortAdlocDesc EEOMinorityCode Sex CurrEmplDate
1 INFORMATION RESOURCE 4 M 2009-05-15 00:00:00
2 ADMIN & RECORDS 4 M 2010-07-15 00:00:00
3 ADLOC COOP EXTENSION 2 M 2013-02-15 00:00:00
4 AG. NUTR & HUMAN ECO 2 F 1990-12-17 00:00:00
5 OFF.OF ACADEMIC AFF 2 F 2000-06-01 00:00:00
6 UNIVERSITY POLICE 2 F 1988-11-01 00:00:00
OrigEmplDate BudgetedSalary
1 2009-05-15 00:00:00 63664
2 2010-07-15 00:00:00 61800
3 2009-02-15 00:00:00 50364
4 1990-12-17 00:00:00 44378
5 2000-06-01 00:00:00 153372
6 1974-06-01 00:00:00 45968
Clearly there are some issues we will need to address. First of all, we can see other universities were included in the database that are NOT a part of Texas A&M. For this study we are only looking at Texas A&M faculty, so let’s only include rows under the ‘MbrName’ column that are either part of Texas A&M or the Health Science Center. Let’s also trim this down a bit and get rid of the ‘X’ and ‘CurrEmplDate’ columns as we won’t need them for our analysis.
%%R
salaryDB <- subset(salaryDB, MbrName == 'Texas A&M University' | MbrName == 'Texas A&M Health Science Center',
select = -c(CurrEmplDate, X))
Let’s see a summary of our datatypes.
%%R
str(salaryDB)
'data.frame': 9234 obs. of 9 variables:
$ MbrName : chr "Texas A&M Health Science Center" "Texas A&M Health Science Center" "Texas A&M Health Science Center" "Texas A&M Health Science Center" ...
$ LastName : chr "ABBEY" "ABRAHAM" "ACOSTA" "ADAMS" ...
$ FirstName : chr "COLETTE" "CELESTE" "IDALIA" "JOYCE" ...
$ ShortTitleDesc : chr "RESEARCH ASSOCIATE" "ASSOCIATE PROFESSOR" "ADMIN COORDINATOR" "BUS ADMIN I" ...
$ ShortAdlocDesc : chr "MOLECULAR & CELL MED" "PERIODONTICS" "HEALTH POL & MGMT" "REGENERATIVE MED" ...
$ EEOMinorityCode: num 1 4 3 1 3 1 1 1 4 3 ...
$ Sex : chr "F" "F" "F" "F" ...
$ OrigEmplDate : chr "1992-09-10 00:00:00" "1993-06-21 00:00:00" "1995-08-14 00:00:00" "2008-09-15 00:00:00" ...
$ BudgetedSalary : int 41600 107362 55782 52500 54000 185001 5111 35144 40315 15900 ...
For this project, we are looking at tenure-track faculty only. We want to get rid of other employees such as “BUS ADMIN I”. As a start, let’s remove all employees that don’t have “Professor” in their title (from the ShortTitleDesc field).
%%R
salaryDB <- salaryDB[grep('PROF', salaryDB$ShortTitleDesc),]
dim(salaryDB)
[1] 2818 9
We have reduced the number of employees a lot, from 9234 to 2818.
The next thing we want to do is change the OrigEmplDate column to something more meaningful. Instead, let’s make it the number of years they have worked at the university. We can use this as one of our features later.
The metadata at the website where we obtained the original salary data says that it was uploaded on Jan 15, 2014. Let’s find the difference between that date and the original employment date for all of our employees.
%%R
origDate <- as.Date('2014-01-15') # This is the format R likes: YYYY-MM-DD.
salaryDB$OrigEmplDate <- as.Date(salaryDB$OrigEmplDate) # Get this into the Date form R likes for finding differences
# between dates.
salaryDB$OrigEmplDate <- as.numeric((origDate-salaryDB$OrigEmplDate)/365.) # Calculate the difference and divide by 365 for number of years.
colnames(salaryDB)[8] <- 'YearsWorked' # Use a more informative column title.
Let’s just check to make sure it worked.
%%R
head(salaryDB)
MbrName LastName FirstName ShortTitleDesc
5892 Texas A&M Health Science Center ABRAHAM CELESTE ASSOCIATE PROFESSOR
5896 Texas A&M Health Science Center ADAMS RAE ASSISTANT PROFESSOR
5897 Texas A&M Health Science Center ADAMS TERRY CLINICAL ASST PROF
5905 Texas A&M Health Science Center AL ROUSAN RABAA ASSISTANT PROFESSOR
5906 Texas A&M Health Science Center ALAJLOUNI KHALDOUN CLIN ASSOC PROF
5908 Texas A&M Health Science Center ALANIZ ROBERT ASSISTANT PROFESSOR
ShortAdlocDesc EEOMinorityCode Sex YearsWorked BudgetedSalary
5892 PERIODONTICS 4 F 20.5835616 107362
5896 TX A&M FAMILY MED RE 1 F 10.5890411 185001
5897 ORTHODONTICS 1 M 17.3205479 5111
5905 PHARMACY-ACAD SUPP NA F 0.2712329 92500
5906 RESTORATIVE SCIENCES 1 M 8.0438356 16840
5908 MICRO PATH & IMMUNOL 3 M 5.9589041 88642
That seems correct. The next thing we are going to do is create another feature: college. All of the departments at universities are organized into colleges. At Texas A&M, that means 12 major colleges. This is somewhat tedious work, but we want to organize all of the faculty members by college as a feature.
Let’s get a list of all of the different departments included first.
%%R
colnames(salaryDB)[5] <- 'Dept' # Much better title for this column!
deptLevels <- as.factor(salaryDB$Dept)
print(summary(deptLevels))
MATHEMATICS HEALTH & KINESIOLOGY QATAR CAMPUS
102 96 66
PHYSICS & ASTRONOMY TAMU SCHOOL OF LAW DEPT HEAD-ELECT ENGR
64 62 58
CIVIL ENGINEERING RESTORATIVE SCIENCES LIBRARY DIRECTOR
57 54 51
ENGLISH CHEMISTRY DEPT HEAD-MECH ENGR
48 47 47
EDUCATIONAL PSYCHOLO HISTORY BIOLOGY
47 47 46
VET INTEGRATIVE BIOS VET MED PATHOBIOLOGY PSYCHOLOGY
44 43 41
PHARMACY-ACAD SUPP TEACHING, LEARNING ARCHITECTURE
38 35 34
BUSH SCHOOL OF GOV POLITICAL SCIENCE VET MED SMALL ANIMAL
34 34 34
DEPT OF STATISTICS HD, COMPUTER SCI MANAGEMENT
33 33 33
ANIMAL SCIENCE LARGE ANIMAL CLINICA SOCIOLOGY
31 31 29
COLLEGE OF NURSING PEDIATRIC DENTISTRY AEROSPACE ENG
28 28 27
BIOCHEM & BIOPHYSICS DEPT HEAD-PETRL ENGR DEPT HEAD-IND ENGR
27 27 26
ECONOMICS GEOLOGY HEAD OF ACCOUNTING
26 26 26
AG ECONOMICS ANTHROPOLOGY DEPT HEAD-ENGR TECH
25 25 25
DEPT INFO & OP MGMNT LANDS ARCH URB PLAN ED ADM & HR (EAHR)
25 25 23
ENTOMOLOGY SOIL & CROP SCIENCE VET MED PHYS & PHAR
23 23 23
ECOSYSTEM SCI & MGMT PHILOSO & HUMANITIES WILDLIFE & FISHERIES
22 22 22
BIOLOG & AG ENGINEER DEAN OF