Cluster Analysis: Overview, Questions, Preparation


Rachit Kumar Saxena, Manager-Editorial

Updated on Aug 5, 2021 08:59 IST

What is Cluster Analysis?

Cluster analysis is a statistical technique in which samples are placed into similar “groups” or “clusters” so that conclusions can be drawn about them. This process is called clustering. Several criteria can be used to form these clusters, and clustering is an integral part of data management in statistical analysis.
In a population of “n” individuals, studying the characteristics of every individual is often not feasible. If the “n” individuals are divided into groups of similar members, their characteristics become easier to understand.

A good cluster plot shows high internal homogeneity (points within a cluster are similar to one another) and high external heterogeneity (points in different clusters are dissimilar).
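The idea of internal homogeneity and external heterogeneity can be checked numerically. The sketch below is only an illustration: the points and the cluster names are invented, and straight-line (Euclidean) distance is assumed as the measure of dissimilarity.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical 2-D points, invented for illustration only.
clusters = {
    "pink": [(1.0, 1.0), (1.5, 1.2), (1.2, 0.8)],
    "green": [(8.0, 8.0), (8.4, 7.7), (7.9, 8.3)],
}

def mean_within_distance(points):
    """Average distance between every pair of points in one cluster."""
    return mean(dist(p, q) for p, q in combinations(points, 2))

def mean_between_distance(points_a, points_b):
    """Average distance between points taken from two different clusters."""
    return mean(dist(p, q) for p in points_a for q in points_b)

within = {name: mean_within_distance(pts) for name, pts in clusters.items()}
between = mean_between_distance(clusters["pink"], clusters["green"])

print(within)   # small values: points inside a cluster are alike
print(between)  # large value: the two clusters are unlike each other
```

A small within-cluster distance together with a large between-cluster distance is what the definition describes.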

Figure: Cluster plot

In the cluster plot, the colours represent different sets of data. The relatedness of two sets can be judged from how close they lie to each other: the pink and blue sets are more closely related to each other than either is to the green set.

Types of Cluster Analysis

1. Hierarchical Cluster Analysis

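Here, each cluster is grouped with the cluster most similar to it, the merged cluster is grouped with another, and so on until one large agglomerated cluster is formed. This is called Agglomerative Clustering. The opposite approach, which starts with one large cluster and splits it step by step, is Divisive Clustering.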
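The merging process can be sketched in a few lines. This is a minimal illustration and not a standard library routine: the function name `agglomerate`, the sample values, and the choice of "single linkage" (the distance between two clusters is taken as the gap between their closest members) are all assumptions made for the example.

```python
def agglomerate(values, target_clusters):
    """Repeatedly merge the two closest clusters until only
    `target_clusters` clusters remain (single linkage, 1-D data)."""
    clusters = [[v] for v in values]  # every value starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: gap between the closest members of the two clusters.
                gap = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

marks = [1, 2, 9, 10, 25]  # hypothetical data
print(agglomerate(marks, 3))
print(agglomerate(marks, 1))  # everything ends up in one agglomerated cluster
```

Stopping at different values of `target_clusters` gives coarser or finer groupings of the same data.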

2. Centroid-Based Clustering

Here, each cluster has one central point (the centroid) around which similar data are grouped. The K-Means method of clustering is often used.
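A minimal sketch of the K-Means idea is given below. The points and the starting centroids are invented for illustration, and the function name `k_means` is our own. Real implementations usually choose the starting centroids at random and stop when the centroids no longer move.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            groups[nearest].append(p)
        # New centroid = mean of the group; an empty group keeps its old centroid.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids, groups

# Hypothetical data and starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, groups = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(groups)
```

In this example neither final centroid coincides with any of the data points, because a centroid is an average and not a member of the data.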

3. Distribution-Based Clustering

Objects that most likely belong to the same probability distribution are put into a single cluster. This type of clustering can help in analysing the correlation and dependence between attributes.
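As a rough illustration, suppose the distributions are already known. Each value can then be assigned to the distribution under which it is most likely. The two normal distributions, their names and the height values below are assumptions made for the example. In practice the distributions are usually estimated from the data, for instance by fitting a Gaussian mixture model.

```python
from statistics import NormalDist

# Hypothetical distributions of heights (cm), assumed known in advance.
distributions = {
    "short": NormalDist(mu=150, sigma=5),
    "tall": NormalDist(mu=175, sigma=5),
}

def assign(value):
    """Return the name of the distribution with the highest density at `value`."""
    return max(distributions, key=lambda name: distributions[name].pdf(value))

heights = [148, 153, 160, 171, 178]  # invented sample
labels = {h: assign(h) for h in heights}
print(labels)
```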

4. Density-Based Clustering

Here, clusters are defined as areas of higher density than the rest of the data set. Objects in sparse areas are considered noise or border points. This helps eliminate out-of-range data.
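The sketch below labels points as core, border or noise in the spirit of density-based methods such as DBSCAN. It is a simplified illustration and not the full algorithm: the function name, the points, and the `radius` and `min_neighbours` values are assumptions, and the points are assumed to be distinct.

```python
from math import dist

def classify_by_density(points, radius, min_neighbours):
    """Label each point 'core', 'border' or 'noise' from its local density."""
    neighbours = {
        p: [q for q in points if q != p and dist(p, q) <= radius] for p in points
    }
    # A core point has enough other points within `radius`.
    core = {p for p in points if len(neighbours[p]) >= min_neighbours}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbours[p]):
            labels[p] = "border"  # sparse itself, but next to a dense area
        else:
            labels[p] = "noise"   # isolated point
    return labels

# Hypothetical data: a dense square, one point at its edge, one far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (9, 9)]
labels = classify_by_density(points, radius=1.5, min_neighbours=3)
print(labels)
```

Points labelled as noise can then be set aside before further analysis.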

Weightage of Clustering

This topic is meant to make Class 12 students aware of this method of grouping data. Its weightage in the exams is indirect: few practical questions are asked, but definitions are. It may carry only 1 or 2 marks, but the method is used in the analytics sections of national entrance exams and, later on, in biostatistics and research.

Illustrated Examples on Clustering

1. Four sets of data A, B, C and D are represented in a cluster plot. Cluster B is found inside cluster D, and the other two clusters are far apart. What can you conclude?

Solution.

  • Clusters B and D agglomerate into one, which shows that they have similar properties.
  • There are three distinct clusters in the cluster plot, not four.
  • The difference between clusters B and D is that B has all the properties of D, whereas the points of D that lie outside B have properties different from those of B.

2. A strain of bacteria ‘a’ is resistant to the antibiotics Amp and Tet. Another strain ‘b’ is resistant only to Amp, while a third strain ‘c’ is resistant to Tet and Azi. How would you represent this?

Solution.

The strain ‘b’ cluster can be represented within or very close to the strain ‘a’ cluster, and the strain ‘c’ cluster will overlap a little with the strain ‘a’ cluster.
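One way to check this answer is to compare the resistance profiles as sets. The sketch below uses the Jaccard similarity (shared antibiotics divided by all antibiotics that appear in either profile). This measure is our choice for illustration and is not prescribed by the question.

```python
from itertools import combinations

# Resistance profiles taken from the question.
resistance = {
    "a": {"Amp", "Tet"},
    "b": {"Amp"},
    "c": {"Tet", "Azi"},
}

def jaccard(x, y):
    """Shared items divided by all items that appear in either set."""
    return len(x & y) / len(x | y)

similarity = {
    (s, t): jaccard(resistance[s], resistance[t])
    for s, t in combinations(sorted(resistance), 2)
}
print(similarity)

# 'b' has no resistance that 'a' lacks, so 'b' sits inside 'a'.
b_inside_a = resistance["b"] <= resistance["a"]
print(b_inside_a)
```

Strains ‘a’ and ‘b’ come out as the most similar pair, ‘a’ and ‘c’ share only Tet, and ‘b’ and ‘c’ share nothing, which agrees with the representation described in the solution.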

3. A group of data points i, ii, iii, iv and v was plotted. The first four formed a distinct cluster, whereas v was found far away. What could be the reasons?

Solution.

The data point v is probably an outlier or a border value, or it could be an error in recording.
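A point such as v can be flagged by comparing each point's distance to its nearest neighbour. The coordinates below are invented, and the cut-off of three times the median nearest-neighbour distance is an arbitrary assumption made for this sketch, not a standard rule.

```python
from math import dist
from statistics import median

# Hypothetical coordinates for the five data points.
points = {
    "i": (1.0, 1.0),
    "ii": (1.2, 0.9),
    "iii": (0.9, 1.3),
    "iv": (1.1, 1.1),
    "v": (6.0, 7.0),
}

def nearest_neighbour_distance(name):
    """Distance from the named point to the closest other point."""
    return min(dist(points[name], q) for other, q in points.items() if other != name)

nn = {name: nearest_neighbour_distance(name) for name in points}
threshold = 3 * median(nn.values())  # assumed cut-off, for illustration only
outliers = [name for name, d in nn.items() if d > threshold]
print(outliers)
```

Whether a flagged point is a genuine outlier or a recording error still has to be decided by going back to the source of the data.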

FAQs on Cluster Analysis

Q: Define Cluster Analysis

A: Cluster analysis is a multivariate data mining technique that groups objects based on a set of user-selected characteristics or attributes.

Q: What does a Cluster Plot of similar attributes look like?

A: When plotted geometrically, objects within a cluster should lie very close together, and different clusters should lie far apart.

Q: What are outliers?

A: These are data points that lie outside every cluster, far from the rest of the data.

Q: What type is the K-means clustering method? Is the centroid value a part of the data?

A: It is a Centroid-based clustering method where the central value may or may not be a part of the data.

Q: What is Cluster Analysis used for?

A: It is used to group data so that results can be characterised and conclusions drawn more easily.