Basics of Statistics for Data Science

Vikram Singh
Assistant Manager - Content
Updated on Dec 7, 2023 14:26 IST

As a data scientist, you collect large sets of data, clean and validate them, analyse them, and finally make decisions using the data and analytical tools. A decision is a choice made among several possibilities under uncertainty.

A decision can be made in two ways:

  • Intuition: what we think or feel, without any logical approach
  • Data or information: based purely on a logical and scientific approach

Making decisions quantitatively, using a logical and scientific approach, is the key to data science, and the discipline that provides this quantitative approach is statistics.

In this article, we will discuss the basic concepts of statistics, such as data types, random variables, and probability distributions.

1. Data and Types:

Data is classified into Population and Sample:

1. Population

  • Collection of all items of interest
  • A number computed from the population is called a parameter

2. Sample

  • Subset of Population
  • A number computed from a sample is called a statistic
[Figure: Population vs. sample]

Populations are hard to define and hard to observe in real life, whereas samples take less time and cost less to collect. However, a sample must be both random and representative for conclusions drawn from it to carry over to the population.

Randomness:

A sample is random when each of its members is chosen from the population strictly by chance.

Representativeness:

A representative sample is a subset of the population that accurately reflects the members of the entire population.
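
The difference between a parameter and a statistic can be sketched in a few lines of Python. The population below is synthetic (hypothetical heights in cm, generated only for illustration), and the sample size of 500 is an arbitrary assumption. The sample is drawn by simple random sampling, so every member has the same chance of being chosen.

```python
import random
import statistics

# Hypothetical population: 10,000 synthetic heights in cm (illustrative data only).
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a number computed from the whole population.
population_mean = statistics.fmean(population)

# Simple random sample: each member is chosen strictly by chance, without replacement.
sample = rng.sample(population, k=500)

# Statistic: the same quantity computed from the sample only.
sample_mean = statistics.fmean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

With a random sample of this size, the statistic usually lands close to the parameter, which is why a well-chosen sample can stand in for a population that is too large or too costly to observe in full.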

[Figure: Data types and levels of measurement]

2. Random Variable and Properties

A random variable is a numerical description of the outcome of a random experiment.

Examples:

  • The number of bikes sold by a particular Hero dealer.
  • Weight of a person in kg.
[Figure: Random variables]

The expectation for discrete Random Variable:

Let X be a discrete random variable that takes the values x_1, x_2, x_3, ..., x_n, with PMF (probability mass function) f(x) = P(X = x). Then the expected value, or mean, is given by:

$$E[X] = \sum_{i=1}^{n} x_i \, f(x_i)$$
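
As a quick illustration of the expected value of a discrete random variable, the sketch below sums x · P(X = x) over a PMF stored as a dictionary. The fair six-sided die and the helper name `expected_value` are assumptions chosen for the example, not part of the original article.

```python
from fractions import Fraction

def expected_value(pmf):
    """Return E[X] = sum of x * P(X = x) for a PMF given as {value: probability}."""
    if sum(pmf.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

die_mean = expected_value(die_pmf)
print(die_mean)  # 7/2
```

Exact fractions are used instead of floats so that the check that the probabilities sum to 1 is not affected by rounding.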

The expectation for continuous Random Variable:

Let X be a continuous random variable with PDF (probability density function) f(x). Then the expected value, or mean, is given by:

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
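
For a continuous random variable, the sum becomes an integral of x · f(x). The sketch below approximates that integral numerically with the midpoint rule. The exponential density with rate 2 (true mean 0.5), the integration range, and the function names are all assumptions made for illustration.

```python
import math

def expectation(pdf, lower, upper, steps=100_000):
    """Approximate E[X] = integral of x * f(x) dx over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th slice
        total += x * pdf(x) * width
    return total

def exponential_pdf(x, rate=2.0):
    """Density of an exponential distribution; its mean is 1 / rate."""
    return rate * math.exp(-rate * x)

# The density is negligible beyond x = 50, so the range is truncated there.
mean_estimate = expectation(exponential_pdf, 0.0, 50.0)
print(f"{mean_estimate:.4f}")
```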

Variance:

The variance of a random variable (discrete or continuous) is given as:

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$$

Standard Deviation:

The standard deviation of a random variable (X) is defined as the square root of variance.

$$\sigma_X = \sqrt{\mathrm{Var}(X)}$$
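
A small worked example may help here. Assuming a fair six-sided die (an illustrative choice), the sketch below computes the mean, the variance as the expected squared deviation from the mean, and the standard deviation as the square root of the variance.

```python
import math
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6 (illustrative assumption).
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

mean = sum(x * p for x, p in die_pmf.items())                    # E[X]
variance = sum((x - mean) ** 2 * p for x, p in die_pmf.items())  # E[(X - E[X])^2]
std_dev = math.sqrt(variance)                                    # square root of the variance

print(mean, variance)  # 7/2 35/12
print(f"{std_dev:.4f}")
```

The standard deviation is expressed in the same units as the variable itself, which is why it is often easier to interpret than the variance.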

Covariance and Correlation

Covariance measures how much two random variables vary together. It describes the direction of the relationship between the two variables, and its value lies between -∞ and ∞.

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$

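Correlation indicates how strongly, and in which direction, two variables are linearly related, and its value lies between -1 and 1. With more than two variables, the pairwise correlations are collected in a correlation matrix.

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$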
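
The practical difference between the two measures is that covariance changes with the units of the data while correlation does not. The sketch below illustrates this with made-up numbers (hours studied and test scores are hypothetical), using population formulas that divide by n.

```python
import math

def covariance(xs, ys):
    """Population covariance: the mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]
score_x10 = [s * 10 for s in score]  # same scores on a 10x larger scale

print(covariance(hours, score), covariance(hours, score_x10))
print(correlation(hours, score), correlation(hours, score_x10))
```

Rescaling the scores multiplies the covariance by 10 but leaves the correlation essentially unchanged, which is why correlation is the usual choice for comparing the strength of relationships.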

3. Distribution

A probability distribution is a function that shows the possible values for a variable and how often they occur.

Binomial Distribution:

The binomial distribution describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes, success or failure, and the probability of success is the same on every trial.

[Figure: Binomial distribution]
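
For readers who want to compute binomial probabilities directly, here is a minimal sketch of the standard binomial PMF, P(X = k) = C(n, k) · p^k · (1 − p)^(n − k). The values n = 10 and p = 0.3 and the function name `binomial_pmf` are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # illustrative values
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# The probabilities sum to 1, and the mean equals n * p.
total = sum(pmf)
mean = sum(k * prob for k, prob in enumerate(pmf))

print(f"P(X = 3) = {pmf[3]:.4f}")
print(f"total = {total:.4f}, mean = {mean:.4f}")
```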

Uniform Distribution:

A distribution in which every possible outcome is equally likely on each trial, such as the roll of a fair die.

[Figure: Uniform distribution]
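
A simulation can make the idea of equally likely outcomes tangible. The sketch below rolls a simulated fair die 60,000 times (the seed and the number of rolls are arbitrary assumptions) and compares the observed frequency of each face with the theoretical value of 1/6.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable
rolls = [rng.randint(1, 6) for _ in range(60_000)]

counts = Counter(rolls)
frequencies = {face: counts[face] / len(rolls) for face in range(1, 7)}

for face, freq in frequencies.items():
    print(f"face {face}: observed {freq:.3f}, expected {1/6:.3f}")
```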

Normal Distribution:

The normal distribution is the most widely used distribution in statistics.

It is bell-shaped and symmetric about the mean; the total area under the curve is 1.

All measures of central tendency (mean, median, and mode) coincide.

[Figure: Normal distribution curve]
  • Standard Normal Distribution: A normal distribution with a mean of 0 and a standard deviation of 1 is called a standard normal distribution.

A normal distribution with mean μ and standard deviation σ can be standardized using:

$$Z = \frac{X - \mu}{\sigma}$$

The standard normal distribution is denoted by Z ~ N(0, 1).
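
To make standardization concrete, the sketch below converts a value x from a normal distribution with mean μ and standard deviation σ into z = (x − μ) / σ, then checks that the probability of falling below x is the same as the standard normal probability of falling below z. The mean of 170 and standard deviation of 10 are hypothetical; `NormalDist` comes from Python's standard `statistics` module.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 170, standard deviation 10.
heights = NormalDist(mu=170, sigma=10)
x = 185

# Standardize: how many standard deviations x lies above the mean.
z = (x - heights.mean) / heights.stdev

# P(X <= x) under the original distribution equals P(Z <= z) under N(0, 1).
p_original = heights.cdf(x)
p_standard = NormalDist().cdf(z)

print(z)  # 1.5
print(f"{p_original:.4f} {p_standard:.4f}")
```

Because every normal distribution reduces to the same standard form, a single table or function for N(0, 1) is enough to answer probability questions about any of them.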

[Figure: Standard normal distribution and its graph]

Conclusion:

Statistics is one of the most important tools in data science. This article covered some, but not all, of its basic concepts, and should give you a basic understanding of them.

FAQs

What are Sample and Population?

1. Population

  • Collection of all items of interest
  • A number computed from the population is called a parameter

2. Sample

  • Subset of the population
  • A number computed from a sample is called a statistic

What is a Random Variable?

A random variable is a numerical description of the outcome of a random experiment. Examples:

  • The number of bikes sold by a particular Hero dealer.
  • Weight of a person in kg.

What is Randomness in Data?

A random sample is collected when each member of the sample is chosen from the population strictly by chance.

What is Representativeness in Data?

A representative sample is a subset of the population that accurately reflects the members of the entire population.

About the Author
Vikram Singh
Assistant Manager - Content

Vikram has a postgraduate degree in Applied Mathematics, with a keen interest in Data Science and Machine Learning. He has 2+ years of experience in content creation in Mathematics, Statistics, Data Science, and Mac...