ARIMA Model in Time Series 

Updated on Jun 8, 2022 12:10 IST

This article explains the ARIMA model in time series analysis in detail, covering its components and workings along with worked examples.


ARIMA Model in Time Series 

As we covered in one of our previous articles, “A time series is just a sequence of several data points or observations (sales, demand, stock price, and so on) that occurred in consecutive order over a given period of time.” So, the next time you plan to guess the share price of Infosys or the sales volume of Ferrero Rocher chocolates in a certain region, time series analysis is going to be the tool of choice.

Time series analysis is a forecasting method that takes into consideration previous data points, and sometimes independent drivers as well, and uses them to predict what the future will look like.


Components of a Time Series

Trend: The persistent upward or downward movement of the time series data. In economic terms, the period over which to observe a trend should be at least 10 years.

Seasonality: The seasonal component of a time series may be defined as the repetitive upward or downward movement (or fluctuation) within some fixed interval. For example, in India, sales of garments and apparel go up every year during festive seasons such as Diwali and Christmas. Those periodic upward movements, as depicted in the figure below, are termed seasonality.

Cyclicality: Cyclical variations also have recurring patterns, but with a longer and more erratic time scale than seasonal variations. They arise mostly from macroeconomic changes such as recession and unemployment. These cycles can be far from regular, and it is usually impossible to predict just how long periods of expansion or contraction will be; they typically last 2–10 years (economic cycles, etc.).

Irregular component: The irregular component of a time series may be defined as unexpected situations, events, or spikes occurring in a short time span. Examples include sudden changes in interest rates, the collapse of companies, natural disasters, and shifts in government policies.

[Figure: the components of a time series]

Time Series Methods

Time series forecasting can be done in various ways. Some of them are straightforward, using a simple average, a moving average, or exponential smoothing, while others rely on advanced statistical modelling using lagged terms. We are going to discuss one of those methodologies today: ARIMA, or the Auto-Regressive Integrated Moving Average method, for time series forecasting.
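As a quick illustration of the simpler end of that spectrum, here is a minimal moving-average sketch with made-up sales numbers (the figures are purely hypothetical):

```python
import pandas as pd

# Hypothetical sales figures for eight periods
sales = pd.Series([12, 15, 14, 18, 21, 19, 24, 26])

# 3-period simple moving average; its last value serves as a
# naive forecast for the next period
sma = sales.rolling(window=3).mean()
print(sma.iloc[-1])  # (19 + 24 + 26) / 3 = 23.0
```

ARIMA goes well beyond this, but the idea of building the next value out of recent history is the same.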

Decoding ARIMA

Let’s first try to understand the individual components of an ARIMA model. Broadly, an ARIMA model has two major components: the Autoregressive (AR) part and the Moving Average (MA) part. Let’s try to understand what these individual components are all about and how they come together to form the final forecasting model.

Let’s first talk about the autoregressive part. As the name suggests, the autoregression method involves forming a linear regression that uses lagged versions of the dependent variable as independent drivers. To simplify: say at any point in time the dependent time series variable is yt. We will create a linear regression equation to explain yt using its lagged versions, e.g., yt-1, yt-2, and so on.

An AR model with lag p is represented as:

yt = C + b1*yt-1 + b2*yt-2 + ... + bp*yt-p + Et

where yt is the observation at time t, yt-1 is the observation just before yt, C is the constant, and Et is the error term. So, in simple terms, yt here is a function of its own lags.

Let’s now talk about the MA, or Moving Average, part of the ARIMA model. Similar to AR, the MA part is a linear function of lagged terms, but of the lagged errors rather than of yt itself. The only difference between AR and MA lies in the independent driver: for AR it is the lagged version of the dependent variable yt, and for MA it is the lagged version of the error term Et.

A typical Moving Average equation of lag q would therefore look like:

yt = α + Et + µ1*Et-1 + µ2*Et-2 + ... + µq*Et-q

An ARIMA model, as stated earlier is the combination of both the AR and MA part. An ARIMA model therefore in plain English looks like:

Predicted Yt = Constant + Linear combination Lags of Y (up to p lags) + Linear Combination of Lagged forecast errors (up to q lags)

Writing the same using mathematical notation:

yt = C + b1*yt-1 + b2*yt-2 + ... + bp*yt-p + Et + µ1*Et-1 + µ2*Et-2 + ... + µq*Et-q
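Plugging hypothetical numbers into that equation makes the mechanics concrete. The coefficients and observations below are made up purely for illustration (an ARMA(2, 1)-style one-step forecast):

```python
# One-step forecast: yt = C + b1*yt-1 + b2*yt-2 + mu1*Et-1
C, b1, b2, mu1 = 10.0, 0.5, -0.2, 0.3   # hypothetical coefficients
y_prev1, y_prev2 = 120.0, 115.0          # last two observed values
e_prev1 = 2.0                            # last one-step forecast error

y_hat = C + b1 * y_prev1 + b2 * y_prev2 + mu1 * e_prev1
print(y_hat)  # 10 + 60 - 23 + 0.6 = 47.6
```

The AR terms pull the forecast toward recent levels, while the MA term corrects for how wrong the previous forecast was.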

Let us now talk about the final part of the ARIMA model, the “I”.

What is “I” in ARIMA?

The “I” stands for Integrated or the differencing component which is represented by d. Broadly speaking, the value of d determines the degree of difference that is required to make the series stationary.

That brings us to our next question. 

What is Stationarity?

Stationarity means that the statistical properties of the time series do not change with time. That doesn’t mean the series is static and never changing; it means that the inherent properties of the series, such as its central tendency and variance, stay the same as time moves forward.

The algebraic analogue is a linear function rather than a constant one: the value of a linear function changes as t grows, but the way it changes remains constant. It has a constant slope, one value that captures the rate of change.

How do we assess the stationarity?

There are a few ways of assessing whether a series is stationary. The most important and widely used one is a statistical test called the Augmented Dickey-Fuller (ADF) test.

The ADF test is a unit root test: it checks for the presence of a unit root, which is the characteristic of a time series that makes it non-stationary. If the test cannot rule out a unit root, its p-value comes out greater than the significance level, and the series is treated as non-stationary.

The hypotheses in case of an ADF test is:

H0 (Null): Presence of Unit Root i.e., series is non-stationary 

H1 (Alternate): Unit root is not present; i.e., series is stationary 

There are standard packages available in Python, such as statsmodels, through which the ADF test can be performed.

When the ADF test finds non-stationarity, the integrated part of the ARIMA model comes into play: the series is differenced with its lagged terms to make it stationary, or, in other words, to remove the unit root component.

The next question that comes to mind is: now that we know we need to difference, how do we select the degree of differencing required to make the series stationary?

The answer lies in the ACF plot, or auto-correlation function plot, which provides the auto-correlation values of a time series with its lagged values. In simple terms, the ACF plot is a wonderful visualization of how the present values of a time series relate to its past values.

To put things into context: while building models, we should technically avoid multicollinearity in the system. The ACF plot serves that purpose here. By providing a visual picture of how past values relate to present values, it gives a sense of which lags can be considered for model building.


The figure below shows what an ACF plot looks like. As can be seen, the blue shaded area is the acceptable zone; in this specific case, from the first lag onwards the time series does not have an autocorrelation problem.

[Figure: sample ACF plot; the blue shaded band marks the acceptable zone]

Hopefully, the discussion so far has built a foundational understanding to begin with. Let us now see a real-world implementation of the ARIMA model in Python.

The ARIMA function from the statsmodels package in Python requires three parameters:

  1. p = no. of lags required for the AR part
  2. d = no. of differencings required to reach a near-stationary series (chosen using ADF testing and differencing)
  3. q = no. of lagged forecast errors required for the MA part (chosen using the ACF plot of the differenced series)

To further simplify: the right degree of differencing is the minimum differencing required to obtain a near-stationary series that hovers around a defined mean, whose ACF plot does not have too many lags outside the acceptable region.

To start with, let us first import the data from an open-source system and process it into a series format with an associated timestamp.

import pandas as pd
from matplotlib import pyplot

# The shampoo-sales dataset stores its dates as '1-01', '1-02', ...,
# so we prepend '190' to get proper year-month timestamps.
# (pandas.datetime and the squeeze/date_parser read_csv arguments used in
# older tutorials have been removed from recent pandas versions.)
df = pd.read_csv('C:/Users/KINSUK/Desktop/DS_Teaching/Blog_Writing_Naukri/ARIMA_Model/shampoo-sales.csv', header=0, index_col=0)
df.index = pd.to_datetime('190' + df.index, format='%Y-%m')
series = df.squeeze('columns')  # collapse the single-column DataFrame into a Series

print(series.head())
series.plot()
pyplot.show()

The dataset head along with the line chart looks like the below:

[Output: first five rows of the series and the monthly sales line chart]

From the graph, it’s quite evident that the series is not stationary. Let us further validate our thought by ADF testing.

from statsmodels.tsa.stattools import adfuller

def adfuller_test(series):
    result = adfuller(series)
    labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations']
    for value, label in zip(result, labels):
        print(label + ' : ' + str(value))

    if result[1] <= 0.05:
        print("Strong evidence against the null hypothesis (H0): reject H0. The data is stationary.")
    else:
        print("Weak evidence against the null hypothesis: the data is non-stationary.")

adfuller_test(series)

As guessed at the beginning, the series is not stationary. 

Let’s take the first level difference and see if the series is converted to a stationary form.

series_diff_1 = series - series.shift(1)  # first-order differencing
series_diff_1 = series_diff_1.fillna(0)
adfuller_test(series_diff_1)

[Output: ADF test results for the first-differenced series, which now comes out stationary]

Post first differencing, the series is indeed stationary, so d = 1. Let us now check with the ACF plot what the ideal value for the q parameter (the MA order) should be.

from statsmodels.graphics.tsaplots import plot_acf

plot_acf(series_diff_1)
pyplot.show()

[Output: ACF plot of the first-differenced series]

From the ACF plot, it is clearly visible that most of the spikes beyond lag 2 fall within the acceptable zone. Hence, as a starting point, let's build the model with q = 2.

from statsmodels.tsa.arima.model import ARIMA

# order = (p, d, q); with d = 1 the differencing is applied internally,
# so we fit on the original series rather than on series_diff_1
model = ARIMA(series, order=(1, 1, 2))
model_fit = model.fit()
print(model_fit.summary())
 

The result and fit of the model is presented below:

[Output: ARIMA(1, 1, 2) model summary]

Endnotes: 

In this article, we have introduced the concept of time series analysis using the ARIMA model for forecasting. We have also briefly introduced the concepts of stationarity and autocorrelation analysis using ACF plots. Always remember that we need to tune the three parameters p, d, and q of the ARIMA model in order to get the best-fit model.

We have also showcased how an ARIMA model is developed using the statsmodels package in Python. We hope this serves as a useful introduction to the endless world of time series forecasting.

