Contributors: Michelle Lee, Julián Ponce, Adarsh Ramakrishnan, Allison Stewart, Aishwarya Anuraj (MPH candidates at Columbia University)
The central artery of New York City, the public transportation system is central to the city’s cultural and economic well-being. Among its usual million daily users such as hip hop dancers, Wall Street executives, and the famous pizzarat, the subway proves to be more than a means of transport; it is a way of life.
Unfortunately, the COVID-19 pandemic and the “New York State on PAUSE” executive order, issued on March 14, 2020 aimed to slow the spread of the virus, brought tremendous shock to the normal bustle of New York City. Closed museums, theaters, restaurants, and suspended commutes have created ghost town-like scenes and some people to proclaim that New York City is “over” as we know it.
This project moves forward with the aim of understanding the influence of the COVID-19 pandemic on mobility trends in NYC as measured by MTA ridership on subway trains and buses from March to November 2019 and 2020.
Understanding this important issue could provide insight to policymakers, transit authorities, and health departments as they make decisions related to transmission of COVID-19 and modifications to public transportation systems in light of stay-at-home orders.
We used a publicly available dataset on day-by-day ridership numbers from the MTA. The dataset contains information on total estimated ridership and percentage change from 2019 by day, beginning March 1, 2020. Estimates of ridership by subway, bus (local, limited, SBS, and Express), Long Island Rail Road, Metro-North Railroad, Access-A-Ride, and Bridges and Tunnels are all available, however, we decided to restrict our analysis to subway and bus ridership.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(patchwork)
library(readr)
library(broom)
library(dbplyr)
library(viridis)
library(reshape2)
library(plotly)
library(lubridate)
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
fig.width = 8,
fig.height = 6,
out.width = "90%"
)
options(
ggplot2.continuous.colour = "viridis",
ggplot2.continuous.fill = "viridis"
)
scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
theme_set(theme_minimal() + theme(legend.position = "bottom"))
mta_data = read_csv (file = "./data/MTA_recent_ridership_data_20201123_0.csv",
col_types = cols(
date = col_date(format = "%mm/%dd/%yy"),
`Subways: % Change From 2019 Equivalent Day` = col_number(),
`Buses: % Change From 2019 Equivalent Day` = col_number(),
`Bridges and Tunnels: % Change From 2019 Equivalent Day` = col_number()
) #only changed the formats of important variables
) %>%
janitor::clean_names()
skimr::skim(mta_data)
mta_data =
mta_data %>%
subset(select = -c(lirr_total_estimated_ridership, lirr_percent_change_from_2019_monthly_weekday_saturday_sunday_average, metro_north_total_estimated_ridership, metro_north_percent_change_from_2019_monthly_weekday_saturday_sunday_average, access_a_ride_total_scheduled_trips, access_a_ride_percent_change_from_2019_monthly_weekday_saturday_sunday_average, bridges_and_tunnels_total_traffic, bridges_and_tunnels_percent_change_from_2019_equivalent_day))
#exclude data for lirr, metronorth, access-a-ride, bridges & tunnel
#calculate 2019 ridership data
mta_data = mta_data %>%
mutate(
'subway_2019' = subways_total_estimated_ridership/(1+(subways_percent_change_from_2019_equivalent_day/100)),
'bus_2019'=
buses_total_estimated_ridership/(1+(buses_percent_change_from_2019_equivalent_day/100))
) %>%
rename(
"subway_2020" = subways_total_estimated_ridership,
"subway_pct_change" = subways_percent_change_from_2019_equivalent_day,
"bus_2020" = buses_total_estimated_ridership,
"bus_pct_change" = buses_percent_change_from_2019_equivalent_day
)
#change date to date format and order by date
plot_mta =
mta_data %>%
mutate(
date= as.Date(date,format = "%m/%d")) %>%
arrange(date)
#create a text_label label
text_label =
plot_mta %>%
mutate(text_label = str_c("percent change from 2019 to 2020: ", subway_pct_change)) %>%
select(date, text_label)
#graph for subway
#pivoting
plot_subway =
plot_mta %>%
select(subway_2020, subway_2019, date) %>%
melt(., id.vars = "date") %>%
#merge based on month
merge(text_label, by = "date") %>%
#plotting
plot_ly(
x = ~date, y = ~value, type = "scatter", mode = "markers",
color = ~variable, text = ~text_label) %>%
layout (
title = "Figure 2. Subway Ridership Trends 2019 - 2020",
xaxis = list(title ="Month/Day", tickformat = "%m/%d"), #drop year
yaxis = list(title="Ridership")) %>%
add_lines(x =as.Date("2020-03-01"), line = list(dash="dot", color = 'red', width=0.5, opacity = 0.5),name = 'First case on 3/1') %>%
add_lines(x =as.Date("2020-04-07"), line = list(dash="dot", color = 'red', width=0.5, alpha = 0.5),name = '100K cases in NYC on 04/07') %>%
add_lines(x =as.Date("2020-05-26"), line = list(dash="dot", color = 'red', width=0.5, alpha = 0.5),name = '200K cases in NYC on 05/26')
#graph for bus
#pivoting
plot_bus =
plot_mta %>%
select(bus_2020,bus_2019,date) %>%
melt(., id.vars = "date") %>%
#merge based on month
merge(text_label, by = "date") %>%
#plotting
plot_ly(
x = ~date, y = ~value, type = "scatter", mode = "markers",
color = ~variable, text = ~text_label) %>%
layout (
title = "Figure 1. Bus Ridership Trends 2019 - 2020",
xaxis = list(title ="Month/Day", tickformat = "%m/%d"),
yaxis = list(title="Ridership")) %>%
add_lines(x =as.Date("2020-03-01"), line = list(dash="dot", color = 'red', width=0.5, opacity = 0.5),name = 'First case on 3/1') %>%
add_lines(x =as.Date("2020-04-07"), line = list(dash="dot", color = 'red', width=0.5, alpha = 0.5),name = '100K cases in NYC on 04/07') %>%
add_lines(x =as.Date("2020-05-26"), line = list(dash="dot", color = 'red', width=0.5, alpha = 0.5),name = '200K cases in NYC on 05/26')
#ggplot for subway
plot_subway_2 =
plot_mta %>%
select(subway_2020, subway_2019, date) %>%
melt(., id.vars = "date") %>%
ggplot(aes(x=date, y=value, color=variable)) +
geom_point(alpha=0.5) +
geom_smooth(se = FALSE) +
xlab("Month/Day") +
ylab("Ridership") +
ggtitle("Subway Ridership Trends 2019 - 2020") +
geom_vline(xintercept=as.Date("2020-03-01"), linetype="dotted", color = 'red')+
geom_vline(xintercept=as.Date("2020-04-07"), linetype="dotted", color = 'red')+
geom_vline(xintercept=as.Date("2020-05-26"), linetype="dotted", color = 'red')+
geom_text(x=as.Date("2020-03-01"), y=6100000, label ="First case", angle=90, vjust = 1.2, size=3,color='black') +
geom_text(x=as.Date("2020-04-07"), y=6100000, label ='100K cases', angle=90, vjust = 1.2, size=3,color='black')+
geom_text(x=as.Date("2020-05-26"), y=6100000, label ='200K cases', angle=90, vjust = 1.2, size=3,color='black')
#ggplot for bus
plot_bus_2 =
plot_mta %>%
select(bus_2020, bus_2019,date) %>%
melt(., id.vars = "date") %>%
ggplot(aes(x=date, y=value, color=variable)) +
geom_point(alpha=0.5) +
geom_smooth(se = FALSE) +
xlab("Month/Day") +
ylab("Ridership") +
ggtitle("Bus Ridership Trends 2019 - 2020") +
geom_vline(xintercept=as.Date("2020-03-01"), linetype="dotted", color = 'red')+
geom_vline(xintercept=as.Date("2020-04-07"), linetype="dotted", color = 'red')+
geom_vline(xintercept=as.Date("2020-05-26"), linetype="dotted", color = 'red')+
geom_text(x=as.Date("2020-03-01"), y=2400000, label ="First case", angle=90, vjust = 1.2, size=3,color='black') +
geom_text(x=as.Date("2020-04-07"), y=2400000, label ='100K cases', angle=90, vjust = 1.2, size=3,color='black')+
geom_text(x=as.Date("2020-05-26"), y=2400000, label ='200K cases', angle=90, vjust = 1.2, size=3,color='black')
#see the two graphs together:
plot_bus
plot_subway
(plot_bus_2+plot_subway_2 + plot_annotation(
title = 'Figure 3. Bus and Subway Ridership Trends 2019-2020', theme = theme(plot.title = element_text(size = 20)
)))

Highlights:
Overall reduction in bus and subway ridership in 2020 compared to 2019
Substantial drop in bus and subway ridership from March to April 2020
Fewer people chose to use the subway as compared to buses after April, 2020
In Figure 1., which shows trends in bus ridership, there appears to be a continuous drop in ridership between the first COVID-19 case on March 1, 2020 and April 1, 2020. This trend is followed by an increase in ridership in April through July 2020 and then subsequently a broadening of the range of values around August 2020, with some reductions in values and others remaining generally constant. From the 2019 estimates, we also notice many outlying lower values, produced by decreased bus ridership on Thursdays and Fridays. This is likely a result of fewer commuters on those days.
In Figure 2., which shows trends in subway ridership, there appears to be similarities to the bus ridership trends from Figure 1. However, there are fewer overlapping values when comparing the years 2019 and 2020. The increase in subway ridership, beginning around April 2020 is also less extreme than the increase in bus ridership and there is a more narrow range of values in the 2020 subway ridership as compared to bus ridership. Finally, the 2020 subway ridership does not appear to be decreasing in recent months as with bus ridership. These trends suggest that as people began to use public transportation again during the pandemic, fewer people chose to use the subway as compared to buses.
In Figure 3. we observe that from mid-March onward there are overall reduced ridership trends in 2020 as compared to 2019 for both forms of transit.
We conducted a two-sample t-test to explore the difference in average subway ridership between 2019 and 2020.
#separate by month and day
mta_data =
mta_data %>%
arrange(date) %>%
separate(date, into = c("month", "day", "year"))%>%
mutate(month = as.numeric(month),
day = as.numeric(day)) %>%
select(-c(year)) #drop year column
mta_subway_ridership =
mta_data %>%
group_by(month)%>%
summarize(
avg_subway_2019 = mean(subway_2019),
avg_subway_2020 = mean(subway_2020)
)
mta_subway_ridership%>%knitr::kable()
mta_2019_sample =
mta_data%>%
select(month, subway_2019) %>%
nest(subway_2019)%>%
mutate("subway_2019_sample" = data)%>%
select(-data)
mta_2020_sample =
mta_data%>%
select(month, subway_2020)%>%
nest(subway_2020)%>%
mutate("subway_2020_sample" = data)%>%
select(-data)
mta_samples =
bind_cols(mta_2019_sample, mta_2020_sample)%>%
select(-month...3)%>%
rename(month = month...1)
mta_t_test =
mta_samples%>%
mutate(t_test = map2(.x = subway_2019_sample, .y = subway_2020_sample, ~t.test(.x , .y) ),
t_test_results = map(t_test, broom::tidy))%>%
select(month, t_test_results)%>%
unnest(t_test_results)%>%
select(month,p.value)%>%
mutate(difference = case_when(
p.value >= 0.05 ~ "insignificant",
p.value < 0.05 ~ "significant"),
p.value = ifelse(
p.value< 0.001,"<0.001",round(p.value, digits = 4))) %>%
arrange(month)
knitr::kable(mta_t_test, digits = 3)
Merged t-test results with the average ridership for each month.
mta_year_ttest =
bind_cols(mta_subway_ridership, mta_t_test)%>%
select(-month...4)%>%
rename(month = month...1)
knitr::kable(mta_year_ttest)
| month | avg_subway_2019 | avg_subway_2020 | p.value | difference |
|---|---|---|---|---|
| 3 | 4746353 | 2381477.6 | <0.001 | significant |
| 4 | 4873551 | 391671.7 | <0.001 | significant |
| 5 | 4682924 | 493526.1 | <0.001 | significant |
| 6 | 4883863 | 798982.6 | <0.001 | significant |
| 7 | 4516652 | 1050498.6 | <0.001 | significant |
| 8 | 4343573 | 1137601.2 | <0.001 | significant |
| 9 | 4886223 | 1428918.6 | <0.001 | significant |
| 10 | 4964620 | 1549459.3 | <0.001 | significant |
| 11 | 4879758 | 1522150.0 | <0.001 | significant |
The following plot was created using the final data frame to depict average ridership for each month comparing 2019 and 2020. The labels that appear at each data point as you hover over them provide additional information about the significance or lack of significance in ridership difference.
#create a text_label label
text_label =
mta_year_ttest %>%
mutate(text_label = str_c("p-value: ",p.value, "\nDifference: ", difference)) %>%
select(month, text_label)
#pivoting
plot_ttest =
mta_subway_ridership %>%
rename(
"2020"=avg_subway_2020,
"2019"=avg_subway_2019
) %>%
melt(., id.vars = "month") %>%
#merge based on month
merge(text_label, by = "month")
#plotting
plot_ttest %>%
plot_ly(
x = ~month, y = ~value, type = "scatter", mode = "lines+markers",
color = ~variable, text = ~text_label) %>%
layout (
title = "Monthly Average Ridership of Subway 2019 vs 2020",
xaxis = list(title ="Months",range=c(3,11)),
yaxis = list(title="Average Ridership"))
The results indicate that there is a statistically significant difference (p <.001) in subway ridership trends across all months(March-November)in 2019 and 2020.
Next, we proposed a regression model for subway ridership in 2020.
subway_ridership = lm(subway_2020 ~ month, data = mta_data)
subway_ridership %>%
broom::tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 965080. 160401. 6.02 0.00000000587
## 2 month 32473. 21860. 1.49 0.139
#to tidy the output and get only the intercept, slope and p-values
subway_ridership %>%
broom::tidy() %>%
select(term, estimate, p.value) %>%
knitr::kable(digits = 3)
| term | estimate | p.value |
|---|---|---|
| (Intercept) | 965079.87 | 0.000 |
| month | 32472.74 | 0.139 |
#to plot for regression line
ggplot(mta_data, aes(month, subway_2020)) +
geom_point() +
stat_smooth(method = lm)+
xlab("Month in 2020") +
ylab("Subway Ridership") +
ggtitle(" Figure 5. Subway Ridership 2020 Regression")

Then, we built a plot of model residuals against fitted values for subway ridership and bus in 2020.
mta_data %>%
modelr::add_predictions(subway_ridership) %>%
modelr::add_residuals(subway_ridership) %>%
ggplot(aes(x = pred, y = resid)) + geom_point() +
labs(x = "Predicted value",
y = "Residual") +
ggtitle("Figure 6. Residual vs. Predicted value of Subway 2020 Ridership Regression Model")

bus_ridership = lm(bus_2020 ~ month, data = mta_data)
bus_ridership %>%
broom::tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 717060. 64077. 11.2 4.46e-24
## 2 month 26325. 8733. 3.01 2.82e- 3
#Now we need to tidy the output and get only the intercept, slope and p-values
bus_ridership %>%
broom::tidy() %>%
select(term, estimate, p.value) %>%
knitr::kable(digits = 3)
| term | estimate | p.value |
|---|---|---|
| (Intercept) | 717060.27 | 0.000 |
| month | 26324.69 | 0.003 |
#to plot for regression line
ggplot(mta_data, aes(month, bus_2020)) +
geom_point() +
stat_smooth(method = lm)+
xlab("Month in 2020") +
ylab("Bus Ridership") +
ggtitle(" Figure 7. Bus Ridership 2020 Regression")

Next, we will built a plot of model residuals against fitted values for bus ridership in 2020.
mta_data %>%
modelr::add_predictions(bus_ridership) %>%
modelr::add_residuals(bus_ridership) %>%
ggplot(aes(x = pred, y = resid)) + geom_point() +
labs(x = "Predicted value",
y = "Residual")

ggtitle("Residual vs. Predicted value of Subway 2020 Ridership Regression Model")
## $title
## [1] "Residual vs. Predicted value of Subway 2020 Ridership Regression Model"
##
## attr(,"class")
## [1] "labels"
Analysis using t test support our hypothesis that the average MTA subway ridership in 2020 was significantly different from that of 2019. We further observe in the regression analysis that as months progressed there were increases in both subway and bus ridership.
From mid-March onward, there are overall reduced ridership trends in 2020 as compared to 2019 for both forms of transit.There is a steep decline in ridership starting late March/ Early April. This could be due to (1) the statewide PAUSE order where people who were not essential workers did not use MTA to commute to work,(2) closing of schools, in which millions of students use MTA to go to school, and (3) decline in tourism. We can observe an increase in ridership for both subway and bus, starting early June. This could be due to (1) people are going back to school and work, and (2) people feel comfortable using public transportation as the rates of infection in New York City slowed down, compared to March - late May, where we reached 200,000 cumulative cases for COVID-19 within 50 days.
The implications of our findings are valuable in understanding the impact of COVID-19 on usage of public transportation in NYC. Our findings could provide insight to transit authorities in regards to changes in schedules in light of recent budget cuts. Health departments and contact traces could benefit from using these findings about population movements and density to better understand future outbreaks of COVID-19. Most importantly, state officials can understand how measures such as stay at home orders can influence overall transit density and consumer spending.
Further studies should examine additional modes of transportation such as personal vehicles, air traffic, and bike usage. Analysis broken down by borough could provide further details on mobility trends among neighborhoods of different socio-economic status.