---
title: "Introduction to dfms"
author: "Sebastian Krantz"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to dfms}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  cache = TRUE,
  fig.width = 7,
  fig.height = 5,
  comment = "#>"
)

opt <- options(max.print = 70)
```

*dfms* provides a user friendly and computationally efficient approach to estimate linear Gaussian Dynamic Factor Models in R. The package is not geared at any specific application, and can be used for dimensionality reduction, forecasting and nowcasting systems of time series. The use of the package is facilitated by a comprehensive set of methods to explore/plot models and extract results.

```{r setup, message=FALSE}
library(dfms)
library(xts)
```

This vignette walks through the main features of the package. The data provided in the package in *xts* format is taken from Banbura and Modugno (2014)^[Banbura, M., & Modugno, M. (2014). Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data. *Journal of Applied Econometrics, 29*(1), 133-160.], henceforth BM14, and covers the Euro Area from January 1980 through September 2009.

```{r}
# Using the monthly series from BM14
dim(BM14_M)
range(index(BM14_M))
head(colnames(BM14_M))
plot(scale(BM14_M), lwd = 1)
```

The data frame `BM14_Models` provides information about the series^[Both about the monthly ones and quarterly series contained in `BM14_Q`. The order of rows in `BM14_Models` matches the column order of series in `merge(BM14_M, BM14_Q)`.], and the various models estimated by BM14. 

```{r}
head(BM14_Models, 3)

# Using only monthly data
BM14_Models_M <- subset(BM14_Models, freq == "M")
```

Prior to estimation, all data is differenced by BM14, and some series are log, differenced, as indicated by the `log_trans` column in `BM14_Models`. In general, *dfms* uses a stationary Kalman Filter with time-invariant system matrices, and therefore expects data to be stationary. Data is also scaled and centered^[Data must be scaled and centered because the Kalman Filter has no intercept term.] in the main `DFM()` function, thus this does not need to be done by the user. 

```{r}
library(magrittr)
# log-transforming and first-differencing the data
BM14_M[, BM14_Models_M$log_trans] %<>% log()
BM14_M_diff = diff(BM14_M)
plot(scale(BM14_M_diff), lwd = 1)
```

## Determining the Structure of the Model

Before estimating a model, the `ICr()` function can be applied to determine the number of factors. It computes 3 information criteria proposed in Bai and NG (2002)^[Bai, J., Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. *Econometrica, 70*(1), 191-221. <doi:10.1111/1468-0262.00273>], whereby the second criteria generally suggests the most parsimonious model. 

```{r}
ic = ICr(BM14_M_diff)
print(ic)
plot(ic)
```

Another option is to use a Screeplot to gauge the number of factors by looking for a kink in the plot. A mathematical procedure for finding the kink was suggested by Onatski (2010)^[Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. *The Review of Economics and Statistics, 92*(4), 1004-1016.], but this is currently not implemented in `ICr()`. 

```{r}
screeplot(ic)
```

Based on both the information criteria and the Screeplot, I gauge that a model with 4 factors should be estimated, as factors, 5, 6 and 7 do not add much to the explanatory power of the model. Next to the number of factors, the lag order of the factor-VAR of the transition equation should be estimated (the default is 1 lag). This can be done using the `VARselect()` function from the *vars* package, with PCA factor estimates reported by `ICr()`.

```{r}
# Using vars::VARselect() with 4 principal components to estimate the VAR lag order
vars::VARselect(ic$F_pca[, 1:4])
```

The selection thus suggests we should estimate a factor model with ` r = 4` factors and `p = 3` lags^[Some authors like Doz, Giannone, and Reichlin (2012) additionally allow the number of transition error processes, termed 'dynamic factors' and given an extra parameter `q`, to be less than the number of 'static factors' `r`. I find this terminology confusing and the feature not very useful for most practical purposes, thus I have not implemented it in *dfms*.]. Before estimating the model I note that *dfms* does not deal with seasonality in series, thus it is recommended to also seasonally adjust data, e.g. using the *seasonal* package before estimation. BM14 only use seasonally adjusted series, thus this is not necessary with the example data provided. 

## Estimation and Exploration

Estimation can then simply be done using the `DFM()` function with parameters `r` and `p`^[Users can also choose from two different implementations of the EM algorithm using the argument `em.method`. The default is `em.method = "auto"`, which chooses the modified EM algorithm proposed by BM14 for missing data if `anyNA(X)`, and the plain EM implementation of Dos, Giannone and Reichlin (2012), henceforth DGR12, otherwise. The baseline implementation of DGR12 is typically more than 2x faster than BM14's method, and can also be used with data that has a few random missing values (< 5%), but gives wrong results with many and systematically missing data, such as ragged edges at the beginning of end of the sample or series at different frequencies. With complete datasets, both `em.method = "DGR"` and `em.method = "BM"` gives identical factor estimates, and in this case `em.method = "DGR"` should be used.]. 

```{r}
# Estimating the model with 4 factors and 3 lags using BM14's EM algorithm
model1 = DFM(BM14_M_diff, r = 4, p = 3)
print(model1)
plot(model1)
```

The model can be investigated using `summary()`, which returns an object of class 'dfm_summary' containing the system matrices and summary statistics of the factors and the residuals in the measurement equation, as well as the R-Squared of the factor model for individual series. The print method automatically adjusts the amount of information printed to the data size. For large databases with more than 40 series, no series-level statistics are printed. 

```{r}
dfm_summary <- summary(model1)
print(dfm_summary) # Large model with > 40 series: defaults to compact = 2

# Can request more detailed printouts
# print(dfm_summary, compact = 1)
# print(dfm_summary, compact = 0) 
```

Apart from the model summary, the *dfm* methods `residuals()` and `fitted()` return observation residuals and fitted values from the model. The default format is a plain matrix, but the functions also have an argument to return data in the original (input) format.

```{r}
plot(resid(model1, orig.format = TRUE))
plot(fitted(model1, orig.format = TRUE))
```


Another way to examine the factor model visually is to plot the Quasi-Maximum-Likelihood (QML) factor estimates against PCA and Two-Step estimates following Doz, Giannone and Reichlin (2011)^[Doz, C., Giannone, D., & Reichlin, L. (2011). A two-step estimator for large approximate dynamic factor models based on Kalman filtering. *Journal of Econometrics, 164*(1), 188-205.], where the Kalman Filter and Smoother is run only once. Both estimates are also computed by `DFM()` during EM estimation and can also be visualized with `plot.dfm`.

```{r}
plot(model1, method = "all", type = "individual")
```

The plot with the various estimates shows that the QML estimates are more volatile in the initial periods where there are many missing series, but less volatile in the latter periods. In general, QML estimates may now always be superior across the entire data range to Two-Step and PCA estimates. Often, Two-Step estimates also provide similar forecasting performance, and are much faster to estimate using `DFM(BM14_M_diff, r = 4, p = 3, em.method = "none")`.  

The factor estimates themselves can be extracted in a data frame using `as.data.frame()`, which also provides various options regarding the estimates retained and the format of the frame. It is also possible to add a time variable from the original data (the default is a sequence of integers).

```{r}
# Default: all estimates in long format
head(as.data.frame(model1, time = index(BM14_M_diff)))
```

## Forecasting

DFM forecasts can be obtained with the `predict()` method, which dynamically forecasts the factors using the transition equation (default 10 periods), and then also predicts data forecasts using the observation equation. Objects are of class 'dfm_forecast'

```{r}
# 12-period ahead DFM forecast
fc = predict(model1, h = 12)
print(fc)
```

These forecasts can also be visualized using a plot method. By default the entire series history is plotted along with the forecasts, thus it is often helpful to restrict the plot range. As with any stationary autoregressive model, the forecasts tend to zero quite quickly^[Depending also on the lag-order of the factor-VAR. Higher lag-orders product more interesting forecast dynamics.].  
```{r}
# Setting an appropriate plot range to see the forecast
plot(fc, xlim = c(320, 370))
# Predicting with Two-Step estimates
# plot(predict(model1, h = 12, method = "2s"), xlim = c(320, 370))
```


The forecasts can also be retrieved in data frame using `as.data.frame()`. Again the method has various arguments to control the forecasts retained (factors, data or both, default factors), and the format of the frame. 

```{r}
# Factor forecasts in wide format
head(as.data.frame(fc, pivot = "wide"))
```

## Estimation with Mixed Frequency

*dfms* currently provides no specific adjustments for data at different frequencies. An algorithm that accommodates monthly and quarterly series is planned for summer 2023. In the meantime, users may choose to block the data (creating multiple quarterly series from a monthly series, and duplicating quarterly series to maintain equal representation).

<!-- 
The baseline algorithm of BM14, being designed for arbitrary patterns of missing data, allows lower frequency series to be included in the estimation. The problem here is that series at lower frequency contain many missing values, which effectively lets these series have less weight in the estimation.

BM14 use the approximation of Mariano and Murasawa (2003) to construct a monthly equivalent for each quarterly series, with appropriate factor loadings in the estimation. 

In the absence of such adjustments to the algorithm, or other methods of interpolating quarterly data, a very simple way to increase the weight of these series in the estimation is to duplicate them in the dataset. Such duplication can be mechanical (e.g. duplicate quarterly series 2 times in monthly dataset), but should ideally be based on considerations about the quality of the signal stemming from different quarterly series (i.e. informative series should be duplicated more).  -->

```{r, eval = FALSE, include=FALSE}
# Quarterly series from BM14
head(BM14_Q, 3)
# Pre-processing the data
BM14_Q[, BM14_Models[BM14_Models$freq == "Q", ]$log_trans] %<>% log()
BM14_Q_diff = diff(BM14_Q)
# Merging to monthly data and duplicating 2 times
BM14_diff = cbind(BM14_M_diff, BM14_Q_diff, BM14_Q_diff, BM14_Q_diff)

# Estimating the model with 5 factors and 3 lags using BM14's EM algorithm
model2 = DFM(BM14_diff, r = 5, p = 3)
print(model2)
plot(model2)
```

<!--
Another possibility is to estimate different factor models for monthly and quarterly series, and combine them for the final prediction (e.g. by aggregating or blocking the monthly factor estimates). 

If performance is not critical, I also refer the user to the [*nowcasting*](<https://github.com/nmecsys/nowcasting>) and [*nowcastDFM*](<https://github.com/dhopp1/nowcastDFM>) packages, which provide elaborate models for mixed-frequency nowcasting, and handle mixed frequencies like BM14 (i.e. drawing on Mariano and Murasawa (2003)). -->

## Additional Functions

*dfms* also exports central functions that help with DFM estimation, such as imputing missing values with `tsnarmimp()`, estimating a VAR with `.VAR()`, or Kalman Filtering and Smoothing with `SKFS()`, or separately with `SKF()` followed by `FIS()`. To my knowledge these are the fastest routines for simple stationary Kalman Filtering and Smoothing currently available in R. The function `em_converged()` can be used to check convergence of the log-likelihood in EM estimation. 

*dfms* also exports a matrix inverse and pseudo-inverse from the Armadillo C++ library through the functions `ainv()` and `apinv()`. These are often faster than `solve()`, and somewhat more robust in near-singularity cases. 

## References

Doz, C., Giannone, D., & Reichlin, L. (2011). A two-step estimator for large approximate dynamic factor models based on Kalman filtering. *Journal of Econometrics, 164*(1), 188-205.

Doz, C., Giannone, D., & Reichlin, L. (2012). A quasi-maximum likelihood approach for large, approximate dynamic factor models. *Review of Economics and Statistics, 94*(4), 1014-1024.

Banbura, M., & Modugno, M. (2014). Maximum likelihood estimation of factor models on datasets with arbitrary pattern of missing data. *Journal of Applied Econometrics, 29*(1), 133-160.

Mariano, R. S., & Murasawa, Y. (2003). A new coincident index of business cycles based on monthly and quarterly series. *Journal of Applied Econometrics, 18*(4), 427-443.

Bai, J., Ng, S. (2002). Determining the Number of Factors in Approximate Factor Models. *Econometrica, 70*(1), 191-221. 

Onatski, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. *The Review of Economics and Statistics, 92*(4), 1004-1016.

Stock, J. H., & Watson, M. W. (2016). Dynamic Factor Models, Factor-Augmented Vector Autoregressions, and Structural Vector Autoregressions in Macroeconomics. *Handbook of Macroeconomics, 2*, 415–525. 

```{r, include=FALSE}
options(opt)
```