Tax collection forecast for Minas Gerais — Brazil

Rodrigo Salles · BrData · 11 min read · Jan 28, 2021



Photo by Marcelo Casal Jr (Agência Brasil)

One of the central interests of economics is to produce efficient forecasts of variables that play a key role in the decision making of economic agents. To plan the allocation of resources, governments need a tax collection forecast, and in Brazil the tax on the circulation of goods and services (ICMS) represents an important source of funds.

ICMS is an important source of funds for the State of Minas Gerais. Located in the Southeast region of Brazil, Minas Gerais is the second most populous state in the country, with approximately 20.7 million inhabitants distributed across 853 municipalities.

Belo Horizonte(https://pt.wikipedia.org/wiki/Belo_Horizonte)
Minas Gerais(https://pt.mapsofworld.com/brasil/estados/minas-gerais.html)

In a state of such dimensions, efficient resource management is extremely important, and every available tool should be used to assist in this task. In this context, time series analysis can be of great value for revenue forecasting.

This study was developed in R, and the code used can be found here.

Materials and methods

The data are publicly available, provided by the Institute for Applied Economic Research (IPEA), and the analyses were carried out in the R language with the help of the RStudio software.

The series consists of 276 monthly observations, covering January 1993 through December 2015. Each observation is the monthly ICMS collection for the State of Minas Gerais. A sample of the data can be seen in the following table:
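The series can be loaded into R as a `ts` object. A minimal sketch, assuming the IPEA export was saved as a CSV with a `value` column (the file name and column name are hypothetical):

```r
# Sketch: build the monthly ICMS series as a ts object.
# "icms_mg.csv" and the `value` column are assumptions;
# the actual IPEA export may use different names.
icms_df <- read.csv("icms_mg.csv")
icms <- ts(icms_df$value, start = c(1993, 1), frequency = 12)

length(icms)   # 276 observations: Jan/1993 through Dec/2015
summary(icms)  # descriptive statistics shown in the table
```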

The methodology adopted consists of first carrying out an exploratory analysis of the data, checking for trend, cycle and seasonality effects. Tests will then be performed to check the stationarity of the series; if necessary, transformations will be applied to make the series stationary. Next, several models will be analyzed, and those that best fit the data will be chosen. Finally, the chosen models will be used to produce forecasts for the series.
The models will be compared on the absolute errors of their forecasts.

Exploratory Data Analysis

As previously explained, the series consists of observations of the amount collected by ICMS in the State of Minas Gerais. The data begins with the value for the month of January 1993, and ends with the collection for the month of December 2015.

Some statistical information of the series can be seen in the following table:

The previous graph shows that the series has an increasing trend, and thus it can be concluded that it is not stationary. It is also noted that the variance is not constant, and the series has shown a particularly turbulent behavior in recent years. This behavior can be partially explained by the economic crisis that the country was going through during that period.
The graph of the autocorrelation function, which can be seen in the following graph, shows the typical shape of a non-stationary series.

The autocorrelation function decreases very slowly to zero, indicating the effects of the trend and possibly seasonality.
The non-stationarity of the series can be confirmed by the unit root test:

The Augmented Dickey-Fuller test works with the null hypothesis that a unit root exists, i.e., that the series is non-stationary. The p-value is above 0.05 (the significance level), so the test indicates that the series is not stationary.

One more test can be applied to confirm non-stationarity. The Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test adopts as its null hypothesis that the series is stationary. The result for the ICMS series is as follows:

The p-value of 0.01 indicates the rejection of the null hypothesis, and thus, according to the test, the series is not stationary.
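Both unit-root tests are available in the tseries package. A sketch, assuming `icms` is the ts object built from the IPEA data:

```r
library(tseries)

# H0: the series has a unit root (non-stationary).
# A p-value above 0.05 means H0 cannot be rejected.
adf.test(icms)

# H0: the series is stationary.
# A p-value of 0.01 rejects stationarity.
kpss.test(icms)
```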

Time series stationarity

Stationarity is an important characteristic because it indicates that probabilistic and statistical properties of the time series remain unchanged over time.
The tests applied previously indicate that the series is not stationary, and therefore, we must resort to some transformations to achieve the objective of making the series stationary.
Assuming an additive model, the series can be broken down into its seasonal, random and trend components. This decomposition can be seen in the following figure:

When analyzing the decomposition of the series, it is possible to notice a strong trend component and a practically negligible seasonal effect.
Taking the previous results into account, differencing will be used to try to achieve stationarity.
The ndiffs function indicates the number of differences needed. Applied to this series, it indicates that a single difference is enough. There is also the nsdiffs function, which indicates the number of seasonal differences required; applying it returns zero, confirming the initial analysis of a negligible seasonal effect.
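The decomposition comes from base R and the differencing diagnostics from the forecast package; a sketch assuming the same `icms` object:

```r
library(forecast)

# Additive decomposition into trend, seasonal and random components
plot(decompose(icms))

# Number of regular differences suggested to reach stationarity
ndiffs(icms)    # returned 1 for this series

# Number of seasonal differences suggested
nsdiffs(icms)   # returned 0: negligible seasonal effect
```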

Building the forecasting models

To start modeling, a first difference was taken, as previously suggested. The autocorrelation and partial autocorrelation functions can be seen in the figures below, where the ACF values fall rapidly to the region close to zero.

The unit root test has a low p-value, confirming that one difference was, in fact, sufficient to make the series stationary:

The KPSS test, with a p-value above 0.05, indicates that the null hypothesis that the series is stationary cannot be rejected:

To start the tests with the models, the auto.arima function will initially be used to determine the ARIMA model that best fits the series. The result was an ARIMA(1,1,1) model with drift, which confirms the need for differencing.
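A sketch of the automatic model selection with the forecast package, assuming the `icms` object:

```r
library(forecast)

# auto.arima searches over (p,d,q) orders; for this series it
# selected ARIMA(1,1,1) with drift
fit_auto <- auto.arima(icms)
summary(fit_auto)  # coefficients, AIC and BIC
```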

The following figure shows the AIC and BIC values used for model evaluation. In the same figure it can be seen that the estimated parameters ar1 and ma1 are statistically significant:

From the analysis of the figure below, it can be concluded that the residuals are uncorrelated, independent and identically distributed. Despite the shape of the Q-Q plot, it can be said that this model fits the data well.

In order to make comparisons and possibly obtain better models, several tests were performed with different AR and MA orders, always keeping one difference (d = 1). The results are grouped in the following table:

The previous table shows the AIC and BIC values for each model, including the one obtained by the auto.arima function. The ARIMA(3,1,3) model obtained the lowest AIC and BIC values.
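The manual search can be sketched as a small grid over the AR and MA orders, keeping d = 1; the loop below illustrates the procedure and is not necessarily the exact code used in the study:

```r
library(forecast)

# Fit ARIMA(p,1,q) with drift for several (p, q) combinations
# and record AIC/BIC for comparison.
results <- data.frame()
for (p in 0:3) {
  for (q in 0:3) {
    fit <- Arima(icms, order = c(p, 1, q), include.drift = TRUE)
    results <- rbind(results,
                     data.frame(p = p, q = q, AIC = AIC(fit), BIC = BIC(fit)))
  }
}
results[order(results$AIC), ]  # ARIMA(3,1,3) came out on top in this study
```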

An analysis of the parameters and residuals of the ARIMA(3,1,3) model can be seen in the figure below:

The figure above suggests that the parameters are statistically significant and that the residuals are uncorrelated, independent and identically distributed, despite the Q-Q plot showing that the extremes deviate from normality.

Model with seasonality effects

Despite the indication that the seasonal effects are insignificant, a model including them was also tested.

The model found was ARIMA(1,1,1) x (2,0,1)_12. Other possibilities were tested, but there were problems related to the stationarity of the seasonal AR part.

It can be seen that the term sar2, with a p-value above 0.05, is not statistically significant.

Residual analysis reveals some problems: the autocorrelation plot shows some values outside the confidence limits, and the p-values of the Ljung-Box test oscillate, with some values close to zero.
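The seasonal candidate and its residual diagnostics can be sketched as follows, again assuming the `icms` object:

```r
library(forecast)

# Seasonal candidate: ARIMA(1,1,1)(2,0,1)[12]
fit_seas <- Arima(icms, order = c(1, 1, 1), seasonal = c(2, 0, 1))
summary(fit_seas)         # the sar2 term was not significant here

# Residual ACF plot and Ljung-Box test in a single call
checkresiduals(fit_seas)
```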

Model with Box Cox transformation

To finish exploring the possibilities of the ARIMA models, a model was considered in which the data underwent a Box-Cox transformation.

Looking at the graph of the original series, a possible variance problem can be seen. To try to stabilize the variance, a Box-Cox transformation was used. The value of λ was obtained with the aid of the BoxCox.lambda() function, and the value found was 0.5993. This value was used, together with the arima() function, to produce the forecasts.
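A sketch of the Box-Cox variant; the non-seasonal order shown is an assumption, since the text only mentions the arima() function:

```r
library(forecast)

lambda <- BoxCox.lambda(icms)  # about 0.5993 for this series

# Passing lambda makes Arima() model the transformed data;
# forecasts are automatically back-transformed.
fit_bc <- Arima(icms, order = c(1, 1, 1), include.drift = TRUE,
                lambda = lambda)
summary(fit_bc)
```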

In the following figure, the analysis suggests that the residuals are uncorrelated:

Forecasts

The previously fitted models will be used to forecast ICMS collection for the year 2016. The actual values for that year are known and will be used for performance evaluation. Another model, which combines predictors, will also be evaluated. The absolute error of each forecast will be calculated in order to compare performance.
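The evaluation loop can be sketched as follows; `fit` stands for any of the fitted models, and `icms_2016` is a hypothetical vector holding the twelve observed values for 2016:

```r
library(forecast)

# 12-month-ahead forecast for 2016
fc <- forecast(fit, h = 12)

# Absolute percentage error per month, and its mean for the year
abs_err <- 100 * abs(icms_2016 - as.numeric(fc$mean)) / icms_2016
mean(abs_err)
```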

ARIMA(1,1,1) model with drift

The first model used will be the ARIMA(1,1,1) with drift, automatically fitted by the auto.arima function. The figure below shows the graph for this model.
The model seems to capture the variation pattern of the series. The errors associated with the forecast can be seen in the following table.

It can be seen that the smallest error was related to the month of March, and the biggest error occurred in the last month, December, with a value close to 15%. The average absolute error for that year was 5.15%.

ARIMA(3,1,3) model with drift

The next model tested was the manually fitted ARIMA(3,1,3). In the next graph it can be seen that the model seems to capture the pattern of change in the series.
The table below shows the observed (real) values and the predicted values for each month, as well as the error associated with each forecast.

ARIMA(3,1,3)

Similar to the ARIMA(1,1,1) model, the smallest error found was for the month of March, and the largest for the month of December. The average absolute error was 5.08%.

Model with seasonality: ARIMA(1,1,1) x (2,0,1)_12

The forecasts for a model that considers seasonality can be seen in the following graph:

ARIMA(1,1,1) with seasonality

The predictions made, and the model errors can be seen in the following table:

It can be seen that the smallest error was related to the month of August, and the biggest error occurred in the last month, with a value close to 15%. The average absolute error for this model was 5.41%.

Model with Box Cox transformation

The forecast graph and performance tables for the model that used the Box-Cox transformation can be seen below:

The average absolute error for the predictor with the Box-Cox transformation was 6.14%, with the smallest error in February and the largest, as in the previous models, in December.

The next model tested will be the combination of predictors.

Combination of predictors

An interesting possibility for improving forecast performance is the combination of forecasts: each predictor produces a result for the same series, and the final result is a combination of the individual results.
To try to approximate the ICMS values for 2016, the following models were chosen: ETS, ARIMA and TBATS. As the ARIMA models have already been explained, the ETS and TBATS models will be briefly described.

ETS (Error, Trend, Seasonality)

Exponential smoothing is a forecasting method for univariate time series.
Methods such as the Box-Jenkins family develop a model in which the forecast is a weighted linear sum of recent observations (lags).

Exponential smoothing methods are similar in that the forecast is a weighted sum of past observations, but the model explicitly assigns exponentially decreasing weights to older observations. Collectively, these methods are sometimes called ETS models, referring to modeling based on Error, Trend and Seasonality.

TBATS

The BATS model is an exponential smoothing method combined with a Box-Cox transformation and an ARMA model. The Box-Cox transformation is intended to handle non-linear data. De Livera (2010) demonstrated that the BATS model can improve forecast performance compared to a simple state space model.

BATS models do not perform well when the series has complex or high-frequency seasonality, so De Livera (2011) proposed the TBATS model, which is a BATS model combined with trigonometric seasonality. In short, the exponential smoothing method led to a state space model, which in turn led to the BATS model, which was finally complemented with trigonometric seasonality, forming the TBATS model.

Combination of predictors — forecasts

The forecasts described above were combined, and the final forecast is a simple arithmetic mean of the individual results.
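The combination can be sketched as a simple average of the three point forecasts, assuming the `icms` object:

```r
library(forecast)

h <- 12  # forecast horizon: the 12 months of 2016

# Individual point forecasts from the three models
fc_ets   <- forecast(ets(icms), h = h)$mean
fc_arima <- forecast(auto.arima(icms), h = h)$mean
fc_tbats <- forecast(tbats(icms), h = h)$mean

# Combined forecast: simple arithmetic mean
fc_comb <- (fc_ets + fc_arima + fc_tbats) / 3
```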

The models were trained with the same series used for the previous models. The results can be seen in the following table:

It can be seen that the smallest forecast error was for the month of February, and the highest for the month of December. Contrary to expectations, the model combination underperformed the ARIMA models, with an average absolute error of 7.29%.

Conclusions

Among the models analyzed, the best performance was achieved by the ARIMA models, with an average absolute error of approximately 5.1%. As suggested by the auto.arima function, the model that considers seasonality did not improve performance. The model with the Box-Cox transformation, used to stabilize the variance, also showed no significant improvement over the ARIMA(1,1,1) and ARIMA(3,1,3) models.

Financial crises and changes in legislation have made forecasts less reliable in recent years. Despite all the difficulties and instabilities, the ARIMA models proved efficient and would be the options of choice for forecasting beyond the series considered, in particular the ARIMA(3,1,3) model, which had the lowest mean absolute error.



Mechatronic Engineer — Physicist — Data Scientist. Always learning.