In this Section, we will look at the daily number of newly reported cases of COVID-19 worldwide, compiled by at the European Centre for Disease Prevention and Control (ECDC). We will fit an ARIMA (Integrated ARMA model) model to the time series, and make predictions.
In many situations, time series can be thought of as being composed of two components, a nonstationary trend component and a zero-mean stationary component. In fact, this is the general approach we will take to build integrated models for nonstationary data, a.k.a. the ARIMA (I stands for integrated) model.
A process \(x_t\) is said to be
ARIMA(p
, d
, q
) if
\[\nabla^d x_t= \mathrm{(1-B)}^d x_t\] is ARMA(p,q). Here differences of order \(d\) are defined as
\[\nabla^d = \mathrm{(1-B)}^d \] where we may expand the operator \(\mathrm{(1-B)}^d\) algebraically to evaluate for higher integer values of d. When \(d = 1\), we drop it from the notation. And the backshift operator \(\mathrm{B}\) by
\[\mathrm{B}^k x_t = x_{t-k} \]
In this part, we will learn the necessary steps to fitting ARIMA models to time series data:
plotting the data
possibly transforming the data
identifying the dependence orders of the model
parameter estimation
making predictions
1. Load the daily new cases data
Download the .csv
data from here,
take a quick glance at the file. This file contains COVID cases from
Dec. 31, 2019
to Nov. 09, 2020
.
Then move the file (COVID_2020_data.csv
, ~ 4MB) to your
working directory
, load the file using
read.csv()
, and convert it to a tibble
(name
it as COVID_tbl
) using the as_tibble()
function. If you don’t know how to handle tibble
data,
please check Section 02 Data
wrangling and quick plots.
# Read in the COVID-19 data
COVID_data <- read.csv(file = "COVID_2020_data.csv", header = T)
# Check the variable names
head(COVID_data)
# Convert the data.frame to a tibble
COVID_tbl <- as_tibble(COVID_data)
# Show data
COVID_tbl
2. Get the daily total number of newly reported cases worldwide
What you just read from COVID_2020_data.csv
is the daily
number of new cases of COVID-19 by country worldwide (the
cases
column). Notice we will try to get daily
total number of new cases of COVID-19 over the
world.
[Hint] You may find the following lines useful:
# Get global daily new cases
COVID_tbl <- COVID_tbl %>%
mutate(dateRep = as.Date(dateRep,format='%d/%m/%Y')) %>%
group_by(dateRep) %>%
summarize(global_cases = sum(cases))
# Show data
COVID_tbl
3. Plot the data
Plot the daily new cases, what do you observe? Do you see an exponential trend? Why is that?
# Quick plot
plot(COVID_tbl$dateRep,COVID_tbl$global_cases,
type="l", xlab="Date", ylab="Global cases")
4. Filter the data
As you can see from the above figure, daily new case starts to
increase exponentially since late March and early April. To make the
time series more reliable (a.k.a, not effected by initial fluctuations),
let’s only use the data after April 01, 2020
. Do this using
the filter()
function.
# Filter the data, only use data from April 01
COVID_tbl <- COVID_tbl %>%
filter(dateRep >= as.Date("2020-04-01"))
# Show data
COVID_tbl
5. Convert a vector to a time series
Convert the vector from the previous step to a time series using the
ts()
function. As we are dealing with daily data here, use
Julian Day as the indicator of the time series. You can return
the Julian Day of a certain date using the yday()
function
from the lubridate
package.
Plot the time series, what do you observe?
# Start date of the time series, read from the .csv file
Date_start <- as.Date("2020-04-01")
# End date of the time series, read from the .csv file
Date_end <- as.Date("2020-11-09")
# Get the Julian Day of the start and end date
JD_start <- yday(Date_start)
JD_end <- yday(Date_end)
N_days <- JD_end - JD_start + 1
# Convert the vector data to a time series
global_cases_ts <- ts(COVID_tbl$global_cases[1:N_days], start=c(2020,JD_start), frequency=365)
# The indicator of the time series
inds <- seq(Date_start, Date_end, by="day")
# Check structure
str(global_cases_ts)
# Plot time series
plot(inds, global_cases_ts, type="l")
6. Transform the time series
Take log
to the global_cases_ts
time
series, and assign the results to a new object
global_cases_ts_log
. Plot global_cases_ts_log
,
also check acf and pacf of global_cases_ts_log
.
Do you see a trend in the time series
global_cases_ts_log
? And is this time series
stationary?
# Data transform with log
global_cases_ts_log <- log(global_cases_ts)
# Plot time series
plot(inds, global_cases_ts_log, type="l")
# Check acf and pacf
acf(global_cases_ts_log)
pacf(global_cases_ts_log)
7. Take the difference
One way to make a non-stationary time series stationary is to take
the difference. For an
ARIMA(p
,d
,q
) model,
d=1
if difference is taken once, and d=2
if
difference is taken twice. In R, this is done with the
diff()
function.
What do you see from the time series plot? Does the time series look stationary now?
# Take the diff, d=1
global_cases_ts_log_d1 <- diff(global_cases_ts_log)
# Plot time series
plot(global_cases_ts_log_d1,type="l")
# Check acf and pacf
acf(global_cases_ts_log_d1)
pacf(global_cases_ts_log_d1)
8. Auto ARIMA fitting
In the forecast
package, R provides a very useful
function called auto.arima()
, which returns the best ARIMA
model according to either AIC, AICc or BIC value. The function conducts
a search over possible models within the order constraints provided.
Run the following lines:
# Automated forecasting using an ARIMA model
model <- auto.arima(global_cases_ts_log)
# Show details of the ARIMA model
model
What is the best ARIMA model? Can you write its math expression?
9. Make predictions
Once you get an ARIMA model, you can use it to predict values, new cases in this demo, in the future time steps.
Run the following lines:
# Number of days to predict
days_forecast <- 30
# Number of include in the plot
days_in_plot <- 30
# Make predictions using the forecast() function
forecast_30days <- forecast(model, days_forecast)
# Plot
plot(forecast_30days, include=days_in_plot,
xlab="Time", ylab="log(global cases)", type="o", lwd=2)
Here the blue line is the estimated values, the blue shaded area
shows the 80%
confidence interval, and the gray area shows
the 95%
confidence interval.
Change days_forecast
to, see how the predictions would
vary.
10. Get predicted values
What is the predicted value on Nov. 10, 2020
? And What
is the predicted value on Nov. 30, 2020
?
Report estimated values and their 80%
confidence
intervals.
# Get predicted values on Nov 10, 2020
day_forward <- yday(as.Date("2020-11-10")) - yday(Date_end)
exp(forecast_30days$mean[day_forward])
exp(forecast_30days$lower[day_forward,1])
exp(forecast_30days$upper[day_forward,1])
# Get predicted values on Nov 30, 2020
day_forward <- yday(as.Date("2020-11-30")) - yday(Date_end)
exp(forecast_30days$mean[day_forward])
exp(forecast_30days$lower[day_forward,1])
exp(forecast_30days$upper[day_forward,1])
11. Verify the predictions
Now download the latest data from ECDC,
compute the global total number of new cases of COVID-19 on
Nov. 10, 2020
and Nov. 30, 2020
.
Compare this number with the predicted value from the previous step. Do you get a reasonable prediction?
12. [optional] Even more: Fit a SARIMA model
In fact, the global_cases_ts_log
shows weekly cycles,
this is not well considered in the model. You can make an even better
ARIMA model by removing this pattern, or by fitting a SARIMA model,
where “S” stands for seasonal. Look at package sarima
for more.