As I am currently a postdoc, I was curious about two things: (1) What is the most frequent type of job posted, and who is the target audience of this website? (2) Do certain types of statistics jobs tend to be posted during a particular part of the academic year?
To answer these questions, I enlisted some wonderful R packages from Hadley Wickham for gathering the data (rvest), cleaning the data (stringr, lubridate) and visualizing the data (ggplot2). One caveat about this data is that the website only lists job postings from August 2014 until now. The R code is available below in Rmarkdown and Markdown and in a gist.
For simplicity, I grouped the type of positions into four categories:
1. faculty = tenured or non-tenured faculty position including chairs, deans and department heads.
2. postdoc = postdoctoral fellows
3. lecturer = lecturer or instructor
4. statistician = a statistician whose primary role is data analysis or managing other data analysts.
The majority of statistics jobs posted on the UF website since August 2014 have been faculty positions.
Statistics job postings are fairly uniformly posted Mon-Fri on this UF website.
The frequency of statistics job postings increases from September through November.
Stephanie Hicks
23 Feb 2015
This Rmd uses the UF Department of Statistics Job Postings website to determine the frequency of faculty, postdoc, lecturer and statistician jobs over the academic year.
One caveat: The website only has data starting from Aug 2014 up until now, so I cannot include the postings over the summer, but I am interested in seeing how these plots differ after including spring and summer of 2015.
library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)
First, we scrape the tables from the UF Statistics Jobs website.
I'm using the `rvest` package to parse the html page. The data is contained in tables in the html pages, so I'm using the `html()` and `html_table()` functions to parse the html and to extract the tables from it, respectively.
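# One list slot per page of job listings (17 pages when this was run)
pgs = vector("list", 17)
for(i in 1:17){
  # Download and parse page i of the listings
  jobs <- html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
  # html_table() returns a list of tables; stack them into one data frame
  pgs[[i]] = do.call(rbind, html_table(jobs))
}
# Stack the per-page data frames and name the columns
dat = do.call(rbind, pgs)
colnames(dat) = c("Location", "Description", "Date")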
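The `do.call(rbind, ...)` idiom does the stacking in the loop above. Here is a minimal sketch of that step on its own, using two made-up data frames in place of the scraped pages (the column names `X1` to `X3` and all values are assumptions for illustration, not data from the site):

```r
# Toy stand-ins for two scraped pages (values are invented).
page1 <- data.frame(X1 = c("Univ A", "Univ B"),
                    X2 = c("Assistant Professor", "Lecturer"),
                    X3 = c("02/17/2015", "02/12/2015"),
                    stringsAsFactors = FALSE)
page2 <- data.frame(X1 = "Company C",
                    X2 = "Biostatistician",
                    X3 = "12/15/2014",
                    stringsAsFactors = FALSE)

pages <- list(page1, page2)

# do.call() passes each list element as a separate argument,
# so this is equivalent to rbind(page1, page2).
stacked <- do.call(rbind, pages)
colnames(stacked) <- c("Location", "Description", "Date")

print(stacked)
```

The same call works for any number of pages, which is why the loop only needs to fill the list.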
These are the top 10 most frequent job description titles.
head(sort(table(dat$Description), decreasing = TRUE), 10)
##
##                Assistant Professor                Postdoctoral Fellow
##                                 26                                 18
##                    Biostatistician Assistant/Associate/Full Professor
##                                 17                                 11
##                       Statistician      Assistant/Associate Professor
##                                 10                                  8
##   Tenure Track Assistant Professor            Postdoctoral Fellowship
##                                  8                                  7
##   Assistant or Associate Professor  Assistant Professor of Statistics
##                                  6                                  6
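As a small illustration of how `table()`, `sort()` and `head()` combine to give a "top n" list, here is a sketch on a made-up vector of titles (the titles and their counts are invented for the example):

```r
# Made-up job titles; not the scraped data.
titles <- c("Assistant Professor", "Postdoctoral Fellow", "Assistant Professor",
            "Biostatistician", "Assistant Professor", "Postdoctoral Fellow")

# table() counts each distinct title, sort() orders the counts,
# and head() keeps the n largest.
counts <- table(titles)
top2 <- head(sort(counts, decreasing = TRUE), 2)

print(top2)
```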
Using the `str_detect()` function in the `stringr` R package, we can use regular expressions to subset the data frame for any jobs that match the pattern "Lecture".
head(dat[str_detect(dat$Description, "Lecture"),])
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014
Because `str_detect()` tests each description against a single pattern, we can use the `paste()` function to join several keywords with `|` and subset the rows matching either "Lecture" or "Instructor".
head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
##                                    Location
## 9   INDIANA UNIVERSITY / BLOOMINGTON CAMPUS
## 15                    Mount Holyoke College
## 19                    University of Glasgow
## 100                Department of Statistics
## 118                      Harvard Statistics
## 119                      Harvard Statistics
##                                           Description       Date
## 9                                            Lecturer 02/17/2015
## 15                    Visiting Lecturer in Statistics 02/12/2015
## 19  Lecturer / Senior Lecturer / Reader in Statistics 02/10/2015
## 100                       Full Time Lecturer Position 12/23/2014
## 118                                          Lecturer 12/15/2014
## 119                                   Senior Lecturer 12/15/2014
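To see what the combined pattern looks like, here is a small sketch with made-up titles. `paste(..., collapse = "|")` builds one regular expression in which `|` means "or", so a title matches if it contains any of the keywords:

```r
library(stringr)

keywords <- c("Lecture", "Instructor")

# Join the keywords into a single alternation pattern.
pattern <- paste(keywords, collapse = "|")
print(pattern)

# Made-up titles for illustration.
titles <- c("Senior Lecturer", "Instructor of Statistics",
            "Assistant Professor", "Visiting Lecturer in Statistics")

# TRUE where a title contains "Lecture" or "Instructor".
is_teaching <- str_detect(titles, pattern)
print(titles[is_teaching])
```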
For simplicity, I grouped the data into four categories:
- faculty = tenured or non-tenured faculty position including chairs, deans and department heads.
- postdoc = postdoctoral fellows
- lecturer = lecturer or instructor
- statistician = a statistician whose primary role is data analysis or managing other data analysts.
# Indicator vectors: TRUE where the description contains any keyword of that category.
# Matching is case-sensitive: an ignore.case() flag on a single term
# does not survive c()/paste(), so the terms are listed as plain strings.
I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty",
                                                "Assistant", "Chair", "Dean", "Department",
                                                "Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
I_statistician = str_detect(dat$Description, paste(c("Biostatistic",
                                                     "Statistician", "Scientist",
                                                     "Staff", "Professional", "Analyst",
                                                     "Researcher", "Programmer",
                                                     "Research Associate", "Master",
                                                     "Manager", "Director", "Investigator",
                                                     "Specialist", "Consultant", "VP",
                                                     "Bioinformatician", "Biometrician",
                                                     "Computational"), collapse='|'))
Now, let's create a new column variable called "Position" with the job categories.
dat$Position <- ifelse(I_postdoc, "Postdoc",
                ifelse(I_faculty, "Faculty",
                ifelse(I_lecturer, "Lecturer",
                ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]
##                             Location
## 56   IDEAS European training network
## 74                Aerojet Rocketdyne
## 143      Odyssey Reinsurance Company
## 184              NC State University
## 328               Indiana University
## 340  Univeristy of California, Davis
## 420 Applied Research Solutions, Inc.
## 448            Computational Biology
##                                   Description       Date Position
## 56                 14 Early stage researchers 01/26/2015    Other
## 74                          Summer Internship 01/16/2015    Other
## 143                    Underwriting Associate 12/02/2014    Other
## 184             Grants Proposal Administrator 11/11/2014    Other
## 328                        Bloomington Campus 10/03/2014    Other
## 340                                Statistics 09/30/2014    Other
## 420 Test and Evaluation Subject Matter Expert 09/04/2014    Other
## 448                  University of Pittsburgh 08/27/2014    Other
We see there are a few descriptions that were not able to be categorized using the regex patterns provided above. We'll use some google-fu next to determine where they belong.
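One way a title can slip through is letter case: stringr patterns are case-sensitive by default, so the keyword "Researcher" does not match "14 Early stage researchers". Below is a minimal sketch of the difference, assuming a stringr version that provides the `regex()` modifier. The other two titles are made up for illustration:

```r
library(stringr)

titles <- c("14 Early stage researchers", "Senior Researcher", "Biostatistician")

# Default matching is case-sensitive, so lowercase "researchers" is missed.
sensitive <- str_detect(titles, "Researcher")
print(sensitive)

# regex(ignore_case = TRUE) makes the whole combined pattern case-insensitive.
pattern <- paste(c("Researcher", "Biostatistic"), collapse = "|")
insensitive <- str_detect(titles, regex(pattern, ignore_case = TRUE))
print(insensitive)
```

The modifier has to wrap the combined pattern, because attributes attached to individual terms are lost when the terms pass through `c()` and `paste()`.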
Turns out the "University of Pittsburgh" advertisement is for a postdoc. The "Bloomington Campus" and "Statistics" advertisements are for faculty positions. The "14 Early stage researchers" are for statistician positions. I removed the last four ("Summer Internship", "Underwriting Associate", "Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert") as I don't think they are relevant to the analysis here.
dat[which(dat$Description == "University of Pittsburgh"),]$Position <- "Postdoc"
dat[which(dat$Description %in% c("Bloomington Campus", "Statistics")),]$Position <- "Faculty"
dat[which(dat$Description %in% c("14 Early stage researchers")),]$Position <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
                                   "Grants Proposal Administrator",
                                   "Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
## [1] Location Description Date Position
## <0 rows> (or 0-length row.names)
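One thing worth keeping in mind about the nested `ifelse()` used to build `Position`: a description that matches several keyword lists gets the first category in the order Postdoc, Faculty, Lecturer, Statistician. The sketch below shows this on made-up titles and also flags titles that matched more than one list. It uses shortened keyword lists and base R's `grepl()` as a stand-in for `str_detect()`, so treat it as an illustration rather than the exact classification above:

```r
# Made-up titles for illustration.
titles <- c("Postdoctoral Research Associate", "Assistant Professor",
            "Summer Internship")

# Shortened keyword lists (assumptions for this example).
is_postdoc      <- grepl("Post|Fellow", titles)
is_faculty      <- grepl("Professor|Assistant", titles)
is_statistician <- grepl("Statistician|Research Associate", titles)

# First match wins: the outermost test is checked first.
position <- ifelse(is_postdoc, "Postdoc",
            ifelse(is_faculty, "Faculty",
            ifelse(is_statistician, "Statistician", "Other")))
print(data.frame(titles, position))

# Titles matching more than one keyword list depend on that ordering.
n_matches <- is_postdoc + is_faculty + is_statistician
ambiguous <- titles[n_matches > 1]
print(ambiguous)
```

Listing the ambiguous titles is a cheap way to check whether the ordering of the tests affects the counts.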
OK, so now we have dealt with grouping all the positions. Let's use the `lubridate` R package to make the Date column more R friendly. I'm using the `mdy()` function to tell R this column contains dates in the form of "month/day/year". The `month()` function extracts the month from each of the rows.
table(month(mdy(dat$Date)))
##
##   1   2   8   9  10  11  12
##  53  44  53  96 111  78  48
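For a quick feel for what `mdy()` and `month()` return, here is a small sketch on three date strings written in the same "month/day/year" form as the Date column (the vector itself is just an example):

```r
library(lubridate)

raw <- c("02/17/2015", "12/15/2014", "09/30/2014")

# mdy() parses month/day/year strings into Date objects.
parsed <- mdy(raw)
print(class(parsed))

# month() extracts the month number from each date.
months <- month(parsed)
print(months)
```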
Let's add a few other columns to our data frame.
dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date)
dat$month = factor(month(dat$Date, label=TRUE),
                   levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week
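Setting the factor levels by hand is what puts the months in academic-year order (Aug to Feb) instead of calendar order in the tables and plots. A minimal sketch with four example dates follows. It uses base R's `month.abb` to get the English abbreviations, which is an assumption made here to keep the example independent of the locale:

```r
library(lubridate)

dates <- mdy(c("02/17/2015", "12/15/2014", "09/30/2014", "08/27/2014"))

# English month abbreviations, independent of the system locale.
month_abb <- month.abb[month(dates)]

# Levels listed in academic-year order rather than calendar order.
academic_levels <- c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb")
month_fac <- factor(month_abb, levels = academic_levels)

# Sorting and tabulating now follow the academic-year order.
print(sort(month_fac))
print(table(month_fac))
```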
The frequency of job postings by position, day of the week and month:
ggplot(dat, aes(x = Position)) + geom_bar() # frequency job by type
ggplot(dat, aes(x = dayOfWeek)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month)) + geom_bar() # frequency job by month
Job postings by date, day of the week and month (colors represent the type of position).
ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")
Most academic faculty positions are posted Sept-Nov and most postdoc positions are posted after that time period.
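To put numbers on a pattern like this, the counts behind the month-by-position bar chart can be cross-tabulated with `table()`. The sketch below uses a tiny made-up data frame (the rows and the `fall` grouping are assumptions for illustration, not the scraped data):

```r
# Made-up postings for illustration.
toy <- data.frame(
  month = factor(c("Sep", "Oct", "Oct", "Jan", "Feb", "Nov"),
                 levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb")),
  Position = c("Faculty", "Faculty", "Faculty", "Postdoc", "Postdoc", "Faculty")
)

# Rows are positions, columns are months in academic-year order.
counts <- table(toy$Position, toy$month)
print(counts)

# Share of each position's postings that fall in Sep-Nov.
fall <- c("Sep", "Oct", "Nov")
fall_share <- rowSums(counts[, fall]) / rowSums(counts)
print(fall_share)
```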
---
title: "UF Department of Statistics Job Postings"
author: "Stephanie Hicks"
date: "23 Feb 2015"
output: html_document
keep_md: TRUE
---

## Purpose

This Rmd uses the UF Department of Statistics Job Postings website to determine
the frequency of faculty, postdoc, lecturer and statistician jobs over the
academic year.

One caveat: The website only has data starting from Aug 2014 up until now,
so I cannot include the postings over the summer, but I am interested in seeing
how these plots differ after including spring and summer of 2015.

#### Load libraries

```{r, message=FALSE}
library(rvest)
library(stringr)
library(lubridate)
library(ggplot2)
```

#### Scrape data

First, we scrape the tables from the UF Statistics Jobs website.
I'm using the `rvest` package to parse the html page. The data is contained in
tables in the html pages, so I'm using the `html()` and `html_table()`
functions to parse the html and parse the tables in the html pages,
respectively.

```{r}
pgs = vector("list", 17)
for(i in 1:17){
  jobs <- html(paste0("http://www.stat.ufl.edu/jobs/?page=", i))
  pgs[[i]] = do.call(rbind, html_table(jobs))
}
dat = do.call(rbind, pgs)
colnames(dat) = c("Location", "Description", "Date")
```

These are the top 10 most frequent job description titles.

```{r}
head(sort(table(dat$Description), decreasing = TRUE), 10)
```

#### Data Cleaning

Using the `str_detect()` function in the `stringr` R package, we can
use regular expressions to subset the data frame for any jobs that match the
pattern "Lecture".

```{r}
head(dat[str_detect(dat$Description, "Lecture"),])
```

Because the `str_detect()` function can only accept one pattern, we can
use the `paste()` function to get around that fact and subset the rows matching
either "Lecture" or "Instructor".

```{r}
head(dat[str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|')),])
```

For simplicity, I grouped the data into four categories:

1. faculty = tenured or non-tenured faculty position including chairs, deans
   and department heads.
2. postdoc = postdoctoral fellows
3. lecturer = lecturer or instructor
4. statistician = a statistician whose primary role is data analysis or
   managing other data analysts.

```{r}
# Matching is case-sensitive: an ignore.case() flag on a single term
# does not survive c()/paste(), so the terms are listed as plain strings.
I_faculty = str_detect(dat$Description, paste(c("Professor", "Tenure", "tenure", "Faculty",
                                                "Assistant", "Chair", "Dean", "Department",
                                                "Head"), collapse='|'))
I_postdoc = str_detect(dat$Description, paste(c("Post", "Fellow"), collapse='|'))
I_lecturer = str_detect(dat$Description, paste(c("Lecture", "Instructor"), collapse='|'))
I_statistician = str_detect(dat$Description, paste(c("Biostatistic",
                                                     "Statistician", "Scientist",
                                                     "Staff", "Professional", "Analyst",
                                                     "Researcher", "Programmer",
                                                     "Research Associate", "Master",
                                                     "Manager", "Director", "Investigator",
                                                     "Specialist", "Consultant", "VP",
                                                     "Bioinformatician", "Biometrician",
                                                     "Computational"), collapse='|'))
```

Now, let's create a new column variable called "Position" with the job titles

```{r}
dat$Position <- ifelse(I_postdoc, "Postdoc",
                ifelse(I_faculty, "Faculty",
                ifelse(I_lecturer, "Lecturer",
                ifelse(I_statistician, "Statistician", "Other"))))
dat[which(dat$Position == "Other"),]
```

We see there are a few descriptions that were not able to be categorized using
the regex patterns provided above. We'll use some google-fu next to determine
where they belong.

Turns out the "University of Pittsburgh" advertisement is for a postdoc. The
"Bloomington Campus" and "Statistics" advertisements are for faculty positions.
The "14 Early stage researchers" are for statistician positions. I removed
the last four ("Summer Internship", "Underwriting Associate",
"Grants Proposal Administrator", "Test and Evaluation Subject Matter Expert")
as I don't think they are relevant to the analysis here.

```{r}
dat[which(dat$Description == "University of Pittsburgh"),]$Position <- "Postdoc"
dat[which(dat$Description %in% c("Bloomington Campus", "Statistics")),]$Position <- "Faculty"
dat[which(dat$Description %in% c("14 Early stage researchers")),]$Position <- "Statistician"
dat = dat[!(dat$Description %in% c("Summer Internship", "Underwriting Associate",
                                   "Grants Proposal Administrator",
                                   "Test and Evaluation Subject Matter Expert")),]
dat[which(dat$Position == "Other"),]
```

OK, so now we have dealt with grouping all the positions. Let's use the
`lubridate` R package to make the Date column more R friendly. I'm using the
`mdy()` function to tell R this column contains dates in the form of
"month/day/year". The `month()` function extracts the month from each of the
rows.

```{r}
table(month(mdy(dat$Date)))
```

Let's add a few other columns to our data frame.

```{r}
dat$Position = factor(dat$Position)
dat$Date = mdy(dat$Date)
dat$month = factor(month(dat$Date, label=TRUE),
                   levels = c("Aug", "Sep", "Oct", "Nov", "Dec", "Jan", "Feb"))
dat$dayOfWeek = wday(dat$Date, label = TRUE) # day of week
```

#### Data visualization

The frequency of job postings by position, day of the week and month:

```{r}
ggplot(dat, aes(x = Position)) + geom_bar() # frequency job by type
ggplot(dat, aes(x = dayOfWeek)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month)) + geom_bar() # frequency job by month
```

Job postings by date, day of the week and month (colors represent the type
of position).

```{r, message=FALSE}
ggplot(dat, aes(x = Date, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = dayOfWeek, fill = Position)) + geom_bar(position="dodge")
ggplot(dat, aes(x = month, fill = Position)) + geom_bar(position="dodge")
```

Most academic faculty positions are posted Sept-Nov and
most postdoc positions are posted after that time period.