ARIMA Model in Time Series 

Updated on Jun 8, 2022 12:10 IST

This article explains the ARIMA model in time series analysis in detail, covering its components and workings along with worked examples.


ARIMA Model in Time Series 

As we covered in one of our previous articles, “A time series is just a sequence of several data points or observations (sales, demand, stock price, and so on) that occurred in consecutive order over a given period of time.” So, the next time you plan to guess the share price of Infosys or the sales volume of Ferrero Rocher chocolates in a certain region, time series analysis is going to be the tool of choice.

Time series analysis is a forecasting method that takes into consideration previous data points, and sometimes independent drivers as well, and uses them to predict what the future will look like.


Components of a Time Series

Trend: The persistent upward or downward movement of the time series data. In economic terms, the period over which to observe a trend should be at least 10 years.

Seasonality: The seasonal component of a time series may be defined as the repetitive upward or downward movement (or fluctuation) within some fixed interval. For example, in India, sales of garments and apparel go up every year during festive seasons such as Diwali and Christmas. Those periodic upward movements, as depicted in the figure below, are termed seasonality.

Cyclicality: Cyclical variations also have recurring patterns, but with a longer and more erratic time scale than seasonal variations. They arise mostly from macroeconomic changes such as recession and unemployment. These cycles can be far from regular, and it is usually impossible to predict just how long periods of expansion or contraction will be; they typically last 2–10 years (economic cycles, etc.).

Irregular component: The irregular component of a time series may be defined as unexpected situations, events, or spikes occurring in a short time span. Examples include sudden changes in interest rates, the collapse of companies, natural disasters, and shifts in government policies.

[Figure: the components of a time series]

Time Series Methods

Time series forecasting can be done in various ways. Some of them are straightforward, using a simple average, a moving average, or exponential smoothing, while others rely on advanced statistical modelling using lagged terms. We are going to discuss one of those methodologies today: ARIMA, or the Auto-Regressive Integrated Moving Average method, for time series forecasting.
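As a quick illustration of the simpler end of that spectrum, here is a minimal moving-average sketch with made-up sales numbers (the figures are purely hypothetical):

```python
import pandas as pd

# Hypothetical sales figures for eight periods
sales = pd.Series([12, 15, 14, 18, 21, 19, 24, 26])

# 3-period simple moving average; its last value serves as a
# naive forecast for the next period
sma = sales.rolling(window=3).mean()
print(sma.iloc[-1])  # (19 + 24 + 26) / 3 = 23.0
```

ARIMA goes well beyond this, but the idea of building the next value out of recent history is the same.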

Decoding ARIMA

Let’s first try to understand the individual components of an ARIMA model. Broadly, an ARIMA model has two major components: the Autoregressive (AR) part and the Moving Average (MA) part. Let’s try to understand what these individual components are all about and how they come together to form the final forecasting model.

Let’s first talk about the autoregressive part. As the name suggests, the autoregression method involves forming a linear regression that uses lagged versions of the dependent variable as independent drivers. To simplify: say at any point in time the dependent time series variable is yt. We will create a linear regression equation to explain yt using its lagged versions, e.g., yt-1, yt-2, and so on.

An AR model with lag p is represented as:

yt = C + b1*yt-1 + b2*yt-2 + ... + bp*yt-p + Et

where yt is the observation at time t, yt-1 is the observation just before yt, C is the constant, and Et is the error term. So, in simple terms, yt here is a function of its own lags.

Let’s now talk about the MA, or Moving Average, part of the ARIMA model. Similar to AR, the MA part is a linear function of lagged terms, but of the lagged errors rather than of yt itself. The only difference between AR and MA lies in the independent driver: for AR it is the lagged version of the dependent variable yt, and for MA it is the lagged version of the error term Et.

A typical Moving Average equation of lag q would therefore look like:

yt = α + Et + µ1*Et-1 + µ2*Et-2 + ... + µq*Et-q

An ARIMA model, as stated earlier is the combination of both the AR and MA part. An ARIMA model therefore in plain English looks like:

Predicted Yt = Constant + Linear combination Lags of Y (up to p lags) + Linear Combination of Lagged forecast errors (up to q lags)

Writing the same using mathematical notation:

yt = C + b1*yt-1 + b2*yt-2 + ... + bp*yt-p + Et + µ1*Et-1 + µ2*Et-2 + ... + µq*Et-q
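Plugging hypothetical numbers into that equation makes the mechanics concrete. The coefficients and observations below are made up purely for illustration (an ARMA(2, 1)-style one-step forecast):

```python
# One-step forecast: yt = C + b1*yt-1 + b2*yt-2 + mu1*Et-1
C, b1, b2, mu1 = 10.0, 0.5, -0.2, 0.3   # hypothetical coefficients
y_prev1, y_prev2 = 120.0, 115.0          # last two observed values
e_prev1 = 2.0                            # last one-step forecast error

y_hat = C + b1 * y_prev1 + b2 * y_prev2 + mu1 * e_prev1
print(y_hat)  # 10 + 60 - 23 + 0.6 = 47.6
```

The AR terms pull the forecast toward recent levels, while the MA term corrects for how wrong the previous forecast was.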

Let us now talk about the final part of the ARIMA model, the “I”.

What is “I” in ARIMA?

The “I” stands for Integrated or the differencing component which is represented by d. Broadly speaking, the value of d determines the degree of difference that is required to make the series stationary.

That brings us to our next question. 

What is Stationarity?

Stationarity means that the statistical properties of the time series do not change with time. That doesn’t mean the series is static and never changing; it means that the inherent properties of the series, such as its central tendency and variance, stay the same as time moves forward.

The algebraic analogue is a linear function rather than a constant one: the value of a linear function changes as t grows, but the way it changes remains constant. It has a constant slope, one value that captures the rate of change.

How do we assess the stationarity?

There are a few ways of assessing whether a series is stationary. The most important and widely used one is a statistical test called the Augmented Dickey-Fuller (ADF) test.

The ADF test is a unit root test: it checks for the presence of a unit root, which is the characteristic of a time series that makes it non-stationary. If the test cannot rule out a unit root, its p-value comes out greater than the significance level, and the series is treated as non-stationary.

The hypotheses in case of an ADF test is:

H0 (Null): Presence of Unit Root i.e., series is non-stationary 

H1 (Alternate): Unit root is not present; i.e., series is stationary 

There are standard packages available in Python, such as statsmodels, through which the ADF test can be performed.

When the ADF test finds non-stationarity, the integrated part of the ARIMA model comes into play: the series is differenced with its lagged terms to make it stationary, or, in other words, to remove the unit root component.

The next question that comes to mind is: now that we know we need to difference, how do we select the degree of differencing required to make the series stationary?

The answer lies in the ACF plot, or auto-correlation function plot, which provides the auto-correlation values of a time series with its lagged values. In simple terms, the ACF plot is a wonderful visualization of how the present values of a time series relate to its past values.

To put things into context: while building models, we should technically avoid multicollinearity in the system. The ACF plot serves that purpose here. By providing a visual picture of how past values relate to present values, it gives a sense of which lags can be considered for model building.


The figure below shows what an ACF plot looks like. As can be seen, the blue shaded area is the acceptable zone; in this specific case, from the first lag onwards the time series does not have an autocorrelation problem.

[Figure: sample ACF plot; the blue shaded band marks the acceptable zone]

Hopefully, the discussion so far has built a foundational understanding to begin with. Let us now see a real-world implementation of the ARIMA model in Python.

The ARIMA function from the statsmodels package in Python requires three parameters:

  1. p = no. of lags required for the AR part
  2. d = no. of differencings required to reach a near-stationary series (chosen using ADF testing and differencing)
  3. q = no. of lagged forecast errors required for the MA part (chosen using the ACF plot of the differenced series)

To further simplify: the right degree of differencing is the minimum differencing required to obtain a near-stationary series that hovers around a defined mean, whose ACF plot does not have too many lags outside the acceptable region.

To start with, let us first import the data from an open-source system and process it into a series format with an associated timestamp.

import pandas as pd
from matplotlib import pyplot

# The shampoo-sales dataset stores its dates as '1-01', '1-02', ...,
# so we prepend '190' to get proper year-month timestamps.
# (pandas.datetime and the squeeze/date_parser read_csv arguments used in
# older tutorials have been removed from recent pandas versions.)
df = pd.read_csv('C:/Users/KINSUK/Desktop/DS_Teaching/Blog_Writing_Naukri/ARIMA_Model/shampoo-sales.csv', header=0, index_col=0)
df.index = pd.to_datetime('190' + df.index, format='%Y-%m')
series = df.squeeze('columns')  # collapse the single-column DataFrame into a Series

print(series.head())
series.plot()
pyplot.show()

The dataset head along with the line chart looks like the below:

[Output: first five rows of the series and the monthly sales line chart]

From the graph, it’s quite evident that the series is not stationary. Let us further validate our thought by ADF testing.

from statsmodels.tsa.stattools import adfuller

def adfuller_test(series):
    result = adfuller(series)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations']
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))

    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis (H0): reject H0. The data is stationary.")
    else:
        print("Weak evidence against the null hypothesis: the data is non-stationary.")

adfuller_test(series)

As guessed at the beginning, the series is not stationary. 

Let’s take the first level difference and see if the series is converted to a stationary form.

series_diff_1 = series - series.shift(1)  # first-order differencing
series_diff_1 = series_diff_1.fillna(0)
adfuller_test(series_diff_1)

[Output: ADF test results for the first-differenced series, which now comes out stationary]

Post first differencing, the series is indeed stationary, so d = 1. Let us now check with the ACF plot what the ideal value for the q parameter (the MA order) should be.

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(series_diff_1)
pyplot.show()

[Output: ACF plot of the first-differenced series]

From the ACF plot, it is clearly visible that most of the spikes beyond lag 2 fall within the acceptable zone. Hence, as a starting point, let's build the model with q = 2.

from statsmodels.tsa.arima.model import ARIMA

# order = (p, d, q); with d = 1 the differencing is applied internally,
# so we fit on the original series rather than on series_diff_1
model = ARIMA(series, order=(1, 1, 2))
model_fit = model.fit()
print(model_fit.summary())
 

The result and fit of the model is presented below:

[Output: ARIMA(1, 1, 2) model summary]

Endnotes: 

In this article, we have introduced the concept of time series analysis using the ARIMA model for forecasting. We have also briefly introduced the concepts of stationarity and autocorrelation analysis using ACF plots. Always remember that we need to tune the three parameters p, d, and q of the ARIMA model in order to get the best-fit model.

We have also showcased how an ARIMA model is developed using the statsmodels package in Python. We hope this serves as a useful introduction to the endless world of time series forecasting.

