Market Basket Analysis in Python

6 mins read3.7K Views Comment

Updated on Oct 3, 2023 12:19 IST

In this article, we will discuss the application of machine learning in retail industry for predicting sales and customer segmentation using Apriori Algorithm.

Introduction

Machine Learning finds various applications in the Retail Industry from predicting product sales to customer segmentation, etc. One of such popular applications is the Market Basket Analysis – an unsupervised learning technique based on the method of Association.

In this article, we will discuss Association Rules and use Apriori Algorithm, commonly used in data mining, to solve the problem of generating frequently bought item sets. We will implement this entire process in Python using the Pandas and Mlxtend libraries on a large-scale dataset.

Recommended online courses

Best-suited Python courses for you

Learn Python with these high-rated online courses

Programming Languages

Indian Institute of Hardware Technology, GurgaonCertificate

Total Fees

– / –

Duration

40 hours

Python Course

GKTCS InnovationsCertificate

Total Fees

– / –

Duration

5 days

Certificate in Jython

GKTCS InnovationsCertificate

Total Fees

– / –

Duration

3 days

PLC Programming

CRISPCertificate

Total Fees

₹3 K

Duration

3 weeks

Django - High Level Web framework

GKTCS InnovationsCertificate

Total Fees

– / –

Duration

4 days

Certificate in Python

Techdata Solution, PuneCertificate

Total Fees

– / –

Duration

20 hours

Scripting - Python

BlackBox Digital Lab - School of Visual EffectsCertificate

Total Fees

– / –

Duration

2 months

Certificate in Python

High Technologies Solutions (Delhi I Noida I Gurgaon) - HTS, KalkajiCertificate

Total Fees

– / –

Duration

1 year

Python Training

IIT BombayCertificate

4.2

Total Fees

Free

Duration

6 weeks

Databases and SQL for Data Science with Python

IBM - Institute of Business ManagementCertificate

Total Fees

– / –

Duration

3 months

Table of Content

What is Market Basket Analysis?
Association Rule Mining
Apriori Algorithm
- How Apriori Works
Performing Market Basket Analysis with Apriori Algorithm

What is Market Basket Analysis?

Do you recall ever noticing the upsell features such as ‘frequently bought together or ‘customers also bought’ on any of the e-commerce websites?

Maybe something like this on Amazon:

This is a classic example of Market Basket Analysis and how it is used to uncover the association between items that are frequently bought together.

As shown in the image above, retailers try to upsell their items by giving a discount when bought as a pack. This baits the customers to buy all three items instead of just one because people sure do love discounts (and cookies)!

The entire process of analyzing the buying trend of customers to identify the relationship between items can be represented by the following equation:

This equation is called the Association Rule. If item X is bought by a customer, there is a likelihood of item Y being bought in the same transaction.

Here, X is the Antecedent and Y is the Consequent.

Must Check: What is Machine Learning?

Must Check: Machine Learning Online Courses & Certifications

Let’s understand this in detail below.

Bagging Technique in Ensemble Learning

In this article, we will discuss the concept how to solve machine learning problems using the ensemble learning bagging.

Read Later

Ridge Regression vs Lasso Regression

Regularization techniques (like Lasso and ridge regression) are used to enhance the performance and interpretability of the machine-learning model, especially in situations where the traditional linear regression method fails. Lasso...read more

Read Later

Association Rule Mining

Association Rule Mining or Frequent Itemset Mining is the technique used to perform Market Basket Analysis. It helps in discovering associations and correlations between items in huge transactional datasets. Finding such correlational relationships helps businesses in decision-making processes, marketing campaigns, and customer segmentation.

As we discussed in the above section, the association rule states: X -> Y, where

X and Y are item sets in a given set of items I = {X1, X2, X3, …. , Xn}, where n >= 2.
An itemset X can represent a set of items in I and is called an n-item setif it contains n items.
So, if the transaction database consists of n items, then the number of possible large item sets is 2ⁿ.

Association Rule Mining identifies the relationship between the items purchased in different transactions with the help of the Apriori algorithm. But before getting into it, let’s discuss the three main association rule parameters –

Support

Support is the fraction of transactions in a database that indicates the frequency of the items in the data.

The support of item X can be given as:

Confidence

Confidence is the likelihood of buying item Y when item X has already been purchased.

The confidence of X -> Y can be given as:

Lift

Lift is the correlation measure that defines the importance of the association rule. It basically compares confidence to expected confidence. The lift can be given as:

Let’s understand with the help of a simple example.

Consider the following transaction data:

Using the formulas for Support, Confidence, Lift we calculate the values for each of the association rules:

So, in Association Rule Mining we perform the following steps:

Find all frequent itemsets:

A frequent itemset is one whose support is greater than or equal to the minimum support threshold (minSup).

Generate Association Rules from the frequent itemsets:

A strong association rule must satisfy both – a minimum support threshold (minSup), and a minimum confidence threshold (minConf).

Predicting Categorical Data Using Classification Algorithms

This article will demonstrate how you can build classification models using ML’s favorite programming language – Python.

Read Later

Top 10 concepts and Technologies in Machine learning

This blog will make you acquainted with new technologies in machine learning

Read Later

Movie Recommendation System using Machine Learning

Every time you open up YouTube just to figure out the solution to your problem or just get the latest news, you end up spending more time. A similar thing...read more

Read Later

Apriori Algorithm

Apriori algorithm is the most commonly used algorithm used for Market Basket Analysis.

It is used for mining frequent itemsets and generating relevant association rules.
It works better on large databases containing a lot of transactions.

The main idea behind Apriori is that all subsets of a frequent itemset must also be frequent. Similarly, for an infrequent itemset, all its supersets must also be infrequent.

How Apriori Works

Apriori algorithm scans the transaction database D to count the support of each item i in a set of items I in and determines the set of large 1-itemsets.

After that, iterations are performed to compute support for the set of 2-itemsets, 3-itemsets, and so on…

The n^th iteration consists of two steps:

1. Generate a candidate set C_n from the set of large (k-1)-itemsets, that has a length L _(n-1)

2. Scan the database to compute the support of each itemset in C_n

The number of scans over the transaction database is the same as the length of the possible maximal itemset whose superset is infrequent.

Performing Market Basket Analysis with Apriori Algorithm

Problem Statement:

For demonstration, we are going to perform Market Basket Analysis on retail data using the Apriori Algorithm. The dataset provided needs to be pre-processed beforehand to make it usable.

Let’s get started!

The dataset has 8 features as given below:

InvoiceNo – Invoice number
StockCode – Stock code of the item
Description – Description of item
Quantity – Quantity of item
InvoiceDate – Data of invoice generation
UnitPrice – Price of item per unit
CustomerID – ID of the customer
Country – Country of residence

Tasks to be performed:

Importing required libraries
Loading the data
Cleaning the data
Creating basket
Encoding
Generating frequent itemsets
Generating association rules
Filtering rules with high Confidence and Lift

Step 1 – Importing required libraries

We will be using the mlxtend Apriori library in Python, as shown below:

#Import required libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Step 2 – Loading the data

#Load the data
data = pd.read_csv('data.csv', encoding= 'unicode_escape')

Step 3 – Cleaning the data

#Remove spaces from the description column
data['Description'] = data['Description'].str.strip()
 
#Drop rows without invoice number
data.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
data['InvoiceNo'] = data['InvoiceNo'].astype('str')
 
#Remove the credit transaction with invoice numbers containing 'C'
data = data[~data['InvoiceNo'].str.contains('C')]
data.head()

Step 4 – Creating basket

We are going to create a basket matrix by grouping multiple items within the same order and then unstack our DataFrame:

basket = (data[data['Country'] =="France"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum()
          .unstack()
          .reset_index()
          .fillna(0)
          .set_index('InvoiceNo'))
basket.head(10)

Step 5 – Encoding

We now need to encode the values in our matrix to 1’s and 0’s. We will do this in the following manner:

Set the value to 0 if it is less than or equal to 0.
Set the value to 1 if it is greater than or equal to 1.

def encode_units(x):
    if x <= 0:
        return 0    
    if x >= 1:
        return 1
basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)
basket_sets.head()

Step 6 – Generating frequent item sets

We will generate frequent item sets that have a support of at least 7%:

#Generate frequent itemsets
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

Step 7 – Generating association rules

We will generate association rules with their corresponding support, confidence, and lift:

#Generating the rules
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
 
rules.head()

Step 8 – Filtering rules with high Confidence and Lift

As we already know, the greater the lift ratio, the more significant the association between the items. Also, the higher the confidence, the more reliable are the rules.

So, we are looking for rules with high confidence (>=0.8) and high lift (>=6). For this we filter the records as shown:

#Filtering out the values with lift > = 6 and confidence > = 0.8
rules[ (rules['lift'] >= 6) & (rules['confidence'] >= 0.8) ]

So, we have our antecedents and consequent items along with their parameter values.

The result is giving us a lot of information about item grouping such as customers are 84% likely to buy the green alarm clock if they have already bought the red alarm clock!

Conclusion

Association Rule Mining through the Apriori algorithm is a widely known technique in machine learning to perform Market Basket Analysis. This analyzing process is used to discover the association between products in a manner that will be helpful for retailers or marketers in developing useful marketing strategies and business plans.

About the Author

Shiksha Online

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio

Market Basket Analysis in Python

Introduction

Best-suited Python courses for you

Programming Languages

Python Course

Certificate in Jython

PLC Programming

Django - High Level Web framework

Certificate in Python

Scripting - Python

Certificate in Python

Python Training

Databases and SQL for Data Science with Python

Table of Content

What is Market Basket Analysis?

Association Rule Mining

Support

Confidence

Lift

Apriori Algorithm

How Apriori Works

Performing Market Basket Analysis with Apriori Algorithm

Conclusion

Top Picks & New Arrivals