Basics of Statistics for Data Science

Updated on Dec 7, 2023 14:26 IST

As a data scientist, you collect large sets of data, then clean, validate, and analyse them, and finally make decisions using the data and analytical tools. A decision is a choice made among several possibilities, each of which carries some uncertainty.

A decision can be made in two ways:

Intuition: what we think or feel, without any logical approach
Data or information: based purely on a logical and scientific approach

The quantitative approach to making decisions through logical and scientific methods is central to data science, and that quantitative approach is called statistics.

In this article, we will discuss the basic concepts of statistics like data type, random variable and probability distributions.

1. Data and Types:

Data is classified into Population and Sample:

1. Population

Collection of all items of interest
A number computed from the population is called a parameter

2. Sample

Subset of Population
A number computed from the sample is called a statistic

Populations are hard to define and hard to observe in real life, while samples are less time-consuming and less costly to collect. For conclusions drawn from a sample to carry over to the population, however, the sample must be both random and representative.

Randomness:

A random sample is collected when each sample member is chosen from the population strictly by chance.

Representativeness:

A representative sample is a subset of the population that accurately reflects the members of the entire population.
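
The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is a minimal illustration, not part of the original article: the population of 10,000 weights is synthetic, and the mean of 70 kg, the spread of 12 kg, and the sample size of 500 are all assumed values chosen for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: weights (kg) of 10,000 people (synthetic data).
population = [random.gauss(70, 12) for _ in range(10_000)]

# Parameter: a number computed from the whole population.
population_mean = statistics.fmean(population)

# Simple random sample: every member has the same chance of being chosen.
sample = random.sample(population, k=500)

# Statistic: the same number computed from the sample only.
sample_mean = statistics.fmean(sample)

print(f"parameter (population mean): {population_mean:.2f}")
print(f"statistic (sample mean):     {sample_mean:.2f}")
```

Because the sample is drawn strictly by chance, the sample mean should land close to the population mean, which is the practical payoff of randomness and representativeness.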

2. Random Variable and Properties

A random variable is a numerical description of the outcome of a random experiment.

Examples:

The number of bikes sold by a particular Hero dealer (a discrete random variable).
The weight of a person in kg (a continuous random variable).

The expectation for a discrete random variable:

Let X be a discrete random variable that takes values x₁, x₂, x₃, …, x_n, with PMF (probability mass function) f(x) = P(X = x). Then the expected value, or mean, is given by:

$$E[X] = \mu = \sum_{i=1}^{n} x_i \, f(x_i)$$

The expectation for a continuous random variable:

Let X be a continuous random variable with PDF (probability density function) f(x). Then the expected value, or mean, is given by:

$$E[X] = \mu = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
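
As a rough illustration of both definitions of expectation, the sketch below computes the mean of a fair six-sided die (discrete) and of a variable that is uniform on [0, 2] (continuous). Both examples, the helper name `expectation`, and the midpoint-rule integration are assumptions made for this illustration rather than anything prescribed by the article.

```python
from fractions import Fraction

# Discrete case: a fair six-sided die, f(x) = 1/6 for x = 1..6.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}
expected_die = sum(x * p for x, p in die_pmf.items())  # sum of x * f(x)

# Continuous case: X uniform on [0, 2], so f(x) = 1/2 on that interval.
def density(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0

def expectation(pdf, low, high, steps=100_000):
    """Approximate the integral of x * pdf(x) over [low, high] (midpoint rule)."""
    width = (high - low) / steps
    total = 0.0
    for i in range(steps):
        mid = low + (i + 0.5) * width
        total += mid * pdf(mid) * width
    return total

expected_uniform = expectation(density, 0.0, 2.0)

print("E[die]     =", expected_die)
print("E[uniform] =", round(expected_uniform, 6))
```

The discrete case uses exact fractions, so the sum matches the formula term by term; the continuous case is only a numerical approximation of the integral.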

Variance:

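The variance of a random variable (discrete or continuous) with mean μ = E[X] is given as:

$$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

Standard Deviation:

The standard deviation of a random variable X is defined as the square root of the variance:

$$\sigma = \sqrt{\mathrm{Var}(X)}$$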
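
To make variance and standard deviation concrete, here is a small sketch that again assumes a fair six-sided die as the random variable (an example chosen for illustration). It computes the variance both from the definition E[(X − μ)²] and from the shortcut E[X²] − (E[X])², which should agree.

```python
import math
from fractions import Fraction

# Fair six-sided die: f(x) = 1/6 for x = 1..6 (illustrative example).
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in die_pmf.items())

# Definition: expected squared deviation from the mean.
variance = sum((x - mean) ** 2 * p for x, p in die_pmf.items())

# Shortcut: E[X^2] - (E[X])^2.
variance_shortcut = sum(x**2 * p for x, p in die_pmf.items()) - mean**2

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print("mean     =", mean)
print("variance =", variance)
print("std dev  =", round(std_dev, 4))
```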

Covariance and Correlation

Covariance measures how much two random variables vary together. It describes the relationship between two variables, and its value can lie anywhere between −∞ and ∞.

Correlation indicates how strongly two variables are linearly related. It is the covariance divided by the product of the two standard deviations, so its value always lies between −1 and 1. It can also be extended to more than two variables.
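
One way to see why covariance is unbounded while correlation stays between −1 and 1 is to rescale one variable. The sketch below uses made-up data (hours studied and test scores are hypothetical values) and hand-written helper functions for the sample covariance and the Pearson correlation.

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data, invented for this illustration.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
scores_scaled = [s * 10 for s in scores]  # same scores in different units

cov = covariance(hours, scores)
cov_scaled = covariance(hours, scores_scaled)
corr = correlation(hours, scores)
corr_scaled = correlation(hours, scores_scaled)

print("covariance:", cov, "->", cov_scaled)
print("correlation:", round(corr, 4), "->", round(corr_scaled, 4))
```

Multiplying the scores by 10 multiplies the covariance by 10 but leaves the correlation unchanged, which is why correlation is the easier of the two to compare across datasets.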

3. Distribution

A probability distribution is a function that shows the possible values for a variable and how often they occur.

Binomial Distribution:

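The distribution of the number of successes in a fixed number of trials, where each trial has only two possible outcomes, success or failure. The probability of success is the same on every trial (it does not have to equal the probability of failure), and each trial is independent.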
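
For n independent trials with success probability p, the probability of exactly k successes is C(n, k) · p^k · (1 − p)^(n − k). The sketch below evaluates that formula; the values n = 10 and p = 0.3 and the function name `binomial_pmf` are assumptions picked for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values: 10 trials, 30% chance of success each

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                # probabilities sum to 1
mean = sum(k * pk for k, pk in enumerate(pmf))  # equals n * p

for k, pk in enumerate(pmf):
    print(f"P(X = {k:2d}) = {pk:.4f}")
```

Note that p = 0.3 here, so success and failure are not equally likely; the binomial model only requires that p stays fixed from trial to trial.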

Uniform Distribution:

A distribution in which every possible outcome is equally likely on every trial, for example the roll of a fair die.
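
A quick simulation can show what "equally likely" looks like in practice. The sketch below assumes a fair six-sided die and 60,000 simulated rolls (both arbitrary choices for illustration) and compares the observed relative frequencies with the theoretical value of 1/6.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

rolls = 60_000
counts = Counter(random.randint(1, 6) for _ in range(rolls))

# Relative frequency of each face; each should be close to 1/6.
frequencies = {face: counts[face] / rolls for face in range(1, 7)}

for face, freq in frequencies.items():
    print(f"face {face}: {freq:.3f} (theoretical {1 / 6:.3f})")
```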

Normal Distribution:

The normal distribution is often regarded as the most important distribution in statistics.

It is bell-shaped and symmetric about the mean; the total area under the curve is 1.

All measures of central tendency (mean, median, and mode) coincide.

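Standard Normal Distribution: A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution.

A normal random variable X with mean μ and standard deviation σ can be standardized using:

$$Z = \frac{X - \mu}{\sigma}$$

The standardized variable is written Z ~ N(0, 1).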
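
Standardizing a normal variable X with mean μ and standard deviation σ means computing z = (x − μ) / σ. The sketch below does this with Python's standard-library `statistics.NormalDist`; the mean of 70, standard deviation of 12, and observed value of 88 are assumed numbers used only for illustration.

```python
from statistics import NormalDist

# Illustrative values: weights with mean 70 kg and standard deviation 12 kg.
mu, sigma = 70.0, 12.0
x = 88.0

# Standardize: how many standard deviations x lies from the mean.
z = (x - mu) / sigma

standard_normal = NormalDist(mu=0.0, sigma=1.0)
original = NormalDist(mu=mu, sigma=sigma)

# P(X <= x) on the original scale equals P(Z <= z) on the standard scale.
p_standard = standard_normal.cdf(z)
p_original = original.cdf(x)

print("z-score:", z)
print("P(Z <= z):", round(p_standard, 4))
print("P(X <= x):", round(p_original, 4))
```

The two probabilities agree, which is the reason a single standard normal table (or function) is enough for any normal distribution.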

Conclusion:

Statistics is one of the most important tools of data science. This article highlighted some of its important concepts, though not all of them, and should help you build a basic understanding.

FAQs

What are Sample and Population?

A population is the collection of all items of interest; a number computed from the population is called a parameter. A sample is a subset of the population; a number computed from the sample is called a statistic.

What is a Random Variable?

A random variable is a numerical description of the outcome of a random experiment. Examples: the number of bikes sold by a particular Hero dealer, or the weight of a person in kg.

What is Randomness in Data?

A random sample is collected when each member of the sample is chosen from the population strictly by chance.

What is Representativeness in Data?

A representative sample is a subset of the population that accurately reflects the members of the entire population.

About the Author

Vikram Singh
