A Comprehensive Guide to Probability & Statistics for Data Science
This is a long blog post, so I have divided it into parts. This first part sets the context and gives the table of contents for the overall post. We will mainly cover these topics:
- Probability
- Descriptive Statistics
- Inferential Statistics
- Bayesian Statistics
- Statistical Learning
When I look at the literature available on probability & statistics, I find it too theoretical and generalized. I have felt that there should be some content on probability & statistics specifically focused on data science.
I want to cover here everything about probability & statistics from basics to statistical learning. I would like to mention that my focus in these posts would be to give intuition on every topic and how it relates to data science rather than going deep into mathematical formulas or their implementation in the real world.
This blog post series contains six parts; this first one gives an overview and sets the context for the subsequent parts.
The second part will cover probability & its types, random variables & probability distributions, and how they are important from a data science perspective.
Probability
- Introduction
- Conditional Probability
- Random Variables
- Probability Distributions
The third, fourth & fifth parts will cover every topic related to statistics & its significance in data science.
Statistics
- Introduction
- Descriptive Statistics
- Inferential Statistics
- Bayesian Statistics
The sixth (final) part will cover statistical learning; it will look at machine learning and data science from a statistical perspective.
Statistical Learning
- Introduction
- Prediction & Inference
- Parametric & Non-parametric methods
- Prediction Accuracy and Model Interpretability
- Bias-Variance Trade-Off
This is the second part of the blog post series ‘Probability & Statistics for Data Science’; it covers the following topics related to probability and their significance in data science.
- Introduction
- Conditional Probability
- Random Variables
- Probability Distributions
Probability
Probability is the chance that something will happen — how likely it is that some event will occur.
Probability of an event E: P(E) = n(E) / n(T), where n(E) is the number of ways the event can happen and n(T) is the total number of possible outcomes.
Probability is the measure of the likelihood that an event will occur. It is quantified as a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty.
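To make the counting formula concrete, here is a minimal sketch (in Python, a tooling choice of this illustration rather than something the post prescribes): the probability of rolling an even number with a fair die, computed from counts and checked against a quick simulation.

```python
# Minimal sketch: P(E) = n(E) / n(T) for rolling an even number with a fair die.
import random

outcomes = [1, 2, 3, 4, 5, 6]                      # n(T) = 6 equally likely outcomes
event = [x for x in outcomes if x % 2 == 0]        # n(E) = 3 ways to roll an even number

p_theoretical = len(event) / len(outcomes)         # 3 / 6 = 0.5

rolls = [random.choice(outcomes) for _ in range(100_000)]
p_simulated = sum(r % 2 == 0 for r in rolls) / len(rolls)

print(p_theoretical, round(p_simulated, 3))        # both ≈ 0.5
```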
Why is probability important?
Uncertainty and randomness occur in many aspects of our daily lives, and a good knowledge of probability helps us make sense of these uncertainties. Learning about probability helps us make informed judgments about what is likely to happen, based on a pattern of data collected previously or an estimate.
How is probability used in data science?
Data science often uses statistical inference to predict or analyze trends from data, and statistical inference relies on the probability distributions of data. Hence, knowing probability and its applications is important for working effectively on data science problems.
Conditional Probability
Conditional probability is a measure of the probability of an event (some particular situation occurring) given that (by assumption, presumption, assertion, or evidence) another event has occurred.
The probability of event B given event A equals the probability of both events occurring divided by the probability of event A: P(B|A) = P(A and B) / P(A).
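For instance, here is a tiny sketch with invented numbers (a hypothetical class of students, not data from the post):

```python
# Hypothetical example: in a class of 100 students, 40 play football (event A)
# and 15 play both football and cricket (event A and B).
p_a = 40 / 100          # P(A)
p_a_and_b = 15 / 100    # P(A and B)

p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)
print(p_b_given_a)              # 0.375
```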
How is conditional probability used in data science?
Many data science techniques (i.e. Naive Bayes) rely on Bayes’ theorem.
Bayes’ theorem is a formula that describes how to update the probabilities of hypotheses when given evidence.
Using Bayes’ theorem, it is possible to build a learner that predicts the probability of the response variable belonging to some class, given a new set of attributes.
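As a hedged illustration of that update, the sketch below plugs invented rates into Bayes’ theorem to get the probability that a message is spam given that it contains a particular word:

```python
# Hypothetical sketch of Bayes' theorem: P(spam | word) from assumed rates.
p_spam = 0.2                 # prior P(spam)
p_word_given_spam = 0.6      # likelihood P(word | spam)
p_word_given_ham = 0.05      # likelihood P(word | not spam)

# Evidence P(word) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))   # ≈ 0.75
```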
Random Variables
A random variable is a set of possible values from a random experiment.
A random variable (random quantity, aleatory variable, or stochastic variable) is a variable whose possible values are outcomes of a random phenomenon.
Random variables can be discrete or continuous. Discrete random variables can only take certain values while continuous random variables can take any value (within a range).
Probability Distributions
The probability distribution for a random variable describes how the probabilities are distributed over the values of the random variable.
For a discrete random variable, x, the probability distribution is defined by a probability mass function, denoted by f(x). This function provides the probability for each value of the random variable.
For a continuous random variable, since there is an infinite number of values in any interval, the probability that a continuous random variable will lie within a given interval is considered. So here, the probability distribution is defined by the probability density function, also denoted by f(x).
Both probability functions must satisfy two requirements:
- (1) f(x) must be non-negative for each value of the random variable, and
- (2) the sum of the probabilities for each value (or the integral over all values) of the random variable must equal one.
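A quick sketch of both requirements, assuming scipy is available (a tooling choice of this illustration, not something the post prescribes):

```python
# Sketch: checking that a PMF is non-negative and sums to 1, and that a PDF integrates to 1.
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Discrete case: binomial PMF values
x = np.arange(0, 11)
pmf = stats.binom.pmf(x, n=10, p=0.3)
print(pmf.min() >= 0, np.isclose(pmf.sum(), 1.0))   # True True

# Continuous case: the standard normal PDF
area, _ = quad(stats.norm.pdf, -np.inf, np.inf)
print(np.isclose(area, 1.0))                        # True
```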
Types of probability distributions
A binomial distribution describes the number of successes in a statistical experiment that has the following properties:
- The experiment consists of n repeated trials.
- Each trial can result in just two possible outcomes; we call one of these outcomes a success and the other a failure.
- The probability of success, denoted by P, is the same on every trial.
The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. It has the following properties:
- The normal curve is symmetrical about the mean μ;
- The mean is at the middle and divides the area into halves;
- The total area under the curve is equal to 1;
- It is completely determined by its mean μ and standard deviation σ (or variance σ²).
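As a small sanity check (again assuming numpy, which the post does not prescribe), we can draw samples from both distributions and confirm that the sample statistics match the parameters:

```python
# Sketch: sampling from binomial and normal distributions and checking the parameters.
import numpy as np

rng = np.random.default_rng(42)

# Binomial: n = 10 trials, success probability p = 0.3 on every trial
binom_samples = rng.binomial(n=10, p=0.3, size=100_000)
print(binom_samples.mean())                           # ≈ n * p = 3.0

# Normal: completely determined by its mean and standard deviation
normal_samples = rng.normal(loc=5.0, scale=2.0, size=100_000)
print(normal_samples.mean(), normal_samples.std())    # ≈ 5.0 and ≈ 2.0
```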
Other common probability distributions are Bernoulli, Uniform, Poisson, and Exponential distributions, which are not in the scope of this blog post.
How are random variables & probability distributions used in data science?
Data science often uses statistical inference to predict or analyze trends from data, and statistical inference relies on the probability distributions of data. Hence, knowing random variables & their probability distributions is important for working effectively on data science problems.
This is the third part of the blog post series ‘Probability & Statistics for Data Science’; it covers the following topics related to descriptive statistics and their significance in data science.
- Introduction to Statistics
- Descriptive Statistics
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
- Function Models
- Significance in Data Science
Statistics Introduction
Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data.
Statistics, in short, is the study of data. It includes descriptive statistics (the study of methods and tools for collecting data, and mathematical models to describe and interpret data) and inferential statistics (the systems and techniques for making probability-based decisions and accurate predictions).
Population vs Sample
Population means the aggregate of all elements under study having one or more common characteristics, while a sample is a part of the population chosen at random for participation in the study.
Descriptive Statistics
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information. Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand.
Types of Variable
Dependent and Independent Variables: An independent variable (experimental or predictor) is a variable that is being manipulated in an experiment in order to observe the effect on a dependent variable (outcome).
Categorical and Continuous Variables: Categorical variables (qualitative) represent types of data that may be divided into groups. Categorical variables can be further categorized as either nominal, ordinal or dichotomous. Continuous variables (quantitative) can take any value. Continuous variables can be further categorized as either interval or ratio variables.
Central Tendency
Central tendency is a central or typical value for distribution. It may also be called a center or location of the distribution. The most common measures of central tendency are the arithmetic mean, the median, and the mode.
The mean is the numerical average of all values, the median is the value directly in the middle of the data set, and the mode is the most frequent value in the data set.
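A minimal sketch with Python’s built-in statistics module on an invented data set:

```python
# Sketch: the three measures of central tendency.
import statistics

data = [2, 3, 3, 5, 7, 8, 9]

print(statistics.mean(data))     # arithmetic mean ≈ 5.29
print(statistics.median(data))   # middle value = 5
print(statistics.mode(data))     # most frequent value = 3
```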
Spread or Variance
Spread (dispersion or variability) is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and inter-quartile range (IQR).
The inter-quartile range (IQR) is the distance between the 1st quartile and the 3rd quartile and gives us the range of the middle 50% of our data. Variance is the average of the squared differences from the mean, while the standard deviation is the square root of the variance.
Upper outliers: values above Q3 + 1.5 · IQR
Lower outliers: values below Q1 − 1.5 · IQR
Standard Score or Z score: for an observed value x, the Z score gives the number of standard deviations x is away from the mean: z = (x − μ) / σ.
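The sketch below computes these spread measures, the outlier fences, and z-scores with numpy (assumed tooling) on an invented data set:

```python
# Sketch: spread measures, outlier fences, and z-scores.
import numpy as np

data = np.array([4, 7, 8, 10, 12, 13, 15, 18, 21, 40])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print("IQR:", iqr)
print("Outlier fences:", q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print("Variance:", data.var(ddof=1), "Std dev:", data.std(ddof=1))

# Z-score: how many standard deviations each value is away from the mean
z = (data - data.mean()) / data.std(ddof=1)
print("Z-score of 40:", round(z[-1], 2))
```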
Univariate Analysis
In univariate analysis, the appropriate statistics depend on the level of measurement:
- For nominal variables, a frequency table and a listing of the mode(s) are sufficient.
- For ordinal variables, the median can be calculated as a measure of central tendency and the range (and variations of it) as a measure of dispersion.
- For interval-level variables, the arithmetic mean (average) and standard deviation are added to the toolbox.
- For ratio-level variables, we add the geometric mean and harmonic mean as measures of central tendency and the coefficient of variation as a measure of dispersion.
For interval and ratio level data, further descriptors include the variable’s skewness and kurtosis. Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.
Mainly, bar graphs, pie charts, and histograms are used for univariate analysis.
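A short pandas sketch of these univariate summaries (pandas is an assumed tooling choice, and the data frame and column names are made up for illustration):

```python
# Sketch: univariate summaries for a nominal and a ratio-level variable.
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A", "B"],     # nominal variable
    "income": [42, 55, 61, 48, 52, 70],         # ratio-level variable
})

print(df["city"].value_counts())                # frequency table for the nominal variable
print(df["city"].mode())                        # its mode(s)
print(df["income"].describe())                  # mean, std, quartiles for the numeric variable
print(df["income"].skew(), df["income"].kurt()) # skewness and kurtosis
```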
Bivariate Analysis
Bivariate analysis involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.
For two continuous variables, a scatter plot is a common graph. When one variable is categorical and the other continuous, a box-plot or violin-plot (also Z-test and t-test) is common, and when both are categorical, a mosaic plot is common (also chi-square test).
Multivariate Analysis
Multivariate analysis involves observation and analysis of more than one statistical outcome variable at a time. Multivariate scatter plots, grouped box-plots (or grouped violin-plots), and heat-maps are used for multivariate analysis.
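A sketch of these common plots, assuming seaborn/matplotlib and seaborn’s example ‘tips’ dataset (library and dataset are choices made for this illustration, not by the post):

```python
# Sketch: bivariate and multivariate plots on an example dataset.
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

sns.scatterplot(data=tips, x="total_bill", y="tip")          # two continuous variables
plt.show()

sns.boxplot(data=tips, x="day", y="total_bill")              # categorical vs continuous
plt.show()

sns.heatmap(tips[["total_bill", "tip", "size"]].corr(), annot=True)   # multivariate view
plt.show()
```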
Function Models
A function can be expressed as an equation, for example y = f(x), where f represents the function name, x the independent variable, and y the dependent variable.
A linear function has the same average rate of change on every interval. When a linear model is used to describe data, it assumes a constant rate of change.
Exponential functions have a variable that appears in the exponent (or power) instead of the base.
The logistic function has an upper bound that limits growth: the curve grows roughly exponentially at first, then slows down and hardly grows at all.
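A small numpy sketch of the three function models, with arbitrary parameter values chosen only for illustration:

```python
# Sketch: linear, exponential, and logistic function models.
import numpy as np

x = np.linspace(0, 10, 11)

linear = 2 * x + 1                       # constant rate of change
exponential = 3 * np.exp(0.5 * x)        # the variable appears in the exponent
logistic = 100 / (1 + np.exp(-(x - 5)))  # grows quickly at first, then levels off near 100

print(linear[:3], exponential[:3], logistic[:3])
```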
Significance in Data Science
Descriptive Statistics helps you to understand your data and is the initial & very important step of Data Science. This is due to the fact that Data Science is all about making predictions and you can’t predict if you can’t understand the patterns in existing data.
This is the fourth part of the blog post series ‘Probability & Statistics for Data Science’; it covers the following topics related to inferential statistics and their significance in data science.
- Inferential Statistics
- Sampling Distributions & Estimation
- Hypothesis Testing (One and Two Group Means)
- Hypothesis Testing (Categorical Data)
- Hypothesis Testing (More Than Two Group Means)
- Quantitative Data (Correlation & Regression)
- Significance in Data Science
Inferential Statistics
Inferential statistics allows you to make inferences about the population from the sample data.
Population & Sample
A sample is a representative subset of a population. Conducting a census of the entire population is ideal but impractical in most cases. Sampling is much more practical; however, it is prone to sampling error. When a sample is not representative of the population, it is said to be biased, and a sampling method that systematically produces such samples suffers from sampling bias. Convenience bias, judgment bias, size bias, and response bias are the main types of sampling bias. The best technique for reducing bias in sampling is randomization. Simple random sampling is the simplest randomization technique; cluster sampling & stratified sampling are other common techniques.
Sampling Distributions
Sample means become more and more normally distributed around the true mean (the population parameter) as we increase our sample size. The variability of the sample means decreases as the sample size increases.
Central Limit Theorem
The Central Limit Theorem is used to help us understand the following facts regardless of whether the population distribution is normal or not:
- the mean of the sample means is the same as the population mean.
- the standard deviation of the sample means is always equal to the standard error.
- the distribution of sample means will become increasingly more normal as the sample size increases.
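A simulation sketch of these statements, assuming numpy (a tooling choice of this illustration): sample means drawn from a decidedly non-normal (exponential) population still behave as the CLT describes.

```python
# Sketch: the sampling distribution of the mean from a skewed population.
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=1_000_000)    # decidedly non-normal

n = 50
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

print(population.mean(), sample_means.mean())              # ≈ equal (≈ 2.0)
print(population.std() / np.sqrt(n), sample_means.std())   # std of sample means ≈ standard error
```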
Confidence Intervals
A sample mean can be referred to as a point estimate of a population mean. A confidence interval is always centered around the mean of your sample. To construct the interval, you add a margin of error. The margin of error is found by multiplying the standard error of the mean by the z-score of the chosen confidence level: margin of error = z* · σ/√n.
The confidence level indicates the number of times out of 100 that the mean of the population will be within the given interval of the sample mean.
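A minimal sketch of constructing a 95% confidence interval, assuming numpy/scipy and synthetic data (choices made for this illustration):

```python
# Sketch: a 95% confidence interval for a mean.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=100, scale=15, size=200)

mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))   # standard error of the mean
z = stats.norm.ppf(0.975)                        # z-score for 95% confidence

print(mean - z * se, mean + z * se)              # interval centered on the sample mean
```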
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tell us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H₀. We test the null hypothesis against an alternative hypothesis, which is given the symbol Hₐ.
When a hypothesis is tested, we must decide on how much of a difference between means is necessary in order to reject the null hypothesis. Statisticians first choose a level of significance or alpha(α) level for their hypothesis test.
Critical values are the values that indicate the edge of the critical region. Critical regions describe the entire area of values that indicate you reject the null hypothesis.
These are the four basic steps we follow for (one & two group means) hypothesis testing:
- State the null and alternative hypotheses.
- Select the appropriate significance level and check the test assumptions.
- Analyze the data and compute the test statistic.
- Interpret the result.
Hypothesis Testing (One and Two Group Means)
Hypothesis Test on One Sample Mean When the Population Parameters are Known
We find the z-statistic of our sample mean in the sampling distribution and determine if that z-score falls within the critical (rejection) region or not. This test is only appropriate when you know the true mean and standard deviation of the population.
Hypothesis Tests When You Don’t Know Your Population Parameters
The Student’s t-distribution is similar to the normal distribution, except it is more spread out and wider in appearance, and has thicker tails. The differences between the t-distribution and the normal distribution are more exaggerated when there are fewer data points and therefore fewer degrees of freedom.
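A sketch of a one-sample t-test with scipy (a tooling choice of this illustration; the sample values and the hypothesized mean of 50 are invented):

```python
# Sketch: one-sample t-test against a hypothesized population mean of 50.
import numpy as np
from scipy import stats

sample = np.array([52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 54.2])

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(t_stat, p_value)   # reject H0 at alpha = 0.05 only if p_value < 0.05
```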
Estimation as a follow-up to a Hypothesis Test
When a hypothesis is rejected, it is often useful to turn to estimation to try to capture the true value of the population mean.
Two-Sample T-Tests
Independent Vs Dependent Samples
When we have independent samples we assume that the scores of one sample do not affect the other.
In two dependent samples of data, each score in one sample is paired with a specific score in the other sample.
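A sketch of both cases with scipy (the scores are invented for illustration):

```python
# Sketch: independent vs dependent (paired) two-sample t-tests.
import numpy as np
from scipy import stats

group_a = np.array([23, 25, 28, 30, 27, 26])
group_b = np.array([31, 29, 34, 33, 30, 32])
print(stats.ttest_ind(group_a, group_b))   # independent samples

before = np.array([80, 75, 90, 85, 70])
after = np.array([85, 78, 93, 88, 74])
print(stats.ttest_rel(before, after))      # dependent (paired) samples
```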
Hypothesis Testing (Categorical Data)
Chi-square test is used for categorical data and it can be used to estimate how closely the distribution of a categorical variable matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical variables are independent of one another (the test of independence).
Goodness-of-fit test: degrees of freedom (df) = number of categories (c) − 1
Test of independence: degrees of freedom (df) = (rows − 1)(columns − 1)
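A sketch of both chi-square tests with scipy, using invented counts:

```python
# Sketch: chi-square goodness-of-fit and test of independence.
from scipy import stats

# Goodness-of-fit: do observed counts match the expected distribution? (df = c - 1)
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
print(stats.chisquare(f_obs=observed, f_exp=expected))

# Test of independence on a contingency table (df = (rows - 1)(columns - 1))
table = [[30, 10],
         [20, 40]]
chi2, p, dof, expected_counts = stats.chi2_contingency(table)
print(chi2, p, dof)
```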
Hypothesis Testing (More Than Two Group Means)
Analysis of Variance (ANOVA) allows us to test the hypothesis that multiple population means are equal (assuming equal variances of scores). We could conduct a series of t-tests instead of ANOVA, but that would be tedious and would inflate the chance of a Type I error.
We follow a series of steps to perform ANOVA:
- Calculate the total sum of squares (SST).
- Calculate the sum of squares between groups (SSB).
- Find the sum of squares within groups (SSW) by subtraction: SSW = SST − SSB.
- Next, solve for the degrees of freedom for the test.
- Using these values, calculate the Mean Squares Between (MSB = SSB / df between) and Mean Squares Within (MSW = SSW / df within).
- Finally, calculate the F statistic: F = MSB / MSW.
- It is easy to fill in the ANOVA table from here — once the SS and df columns are filled in, the remaining MS and F values are simple calculations.
- Find F-critical.
If the F-value from the ANOVA test is greater than the F-critical value, we reject the null hypothesis.
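A sketch of a one-way ANOVA with scipy (assumed tooling), which computes the F statistic and p-value directly from invented group scores:

```python
# Sketch: one-way ANOVA across three groups.
from scipy import stats

group1 = [85, 86, 88, 75, 78, 94]
group2 = [91, 92, 93, 85, 87, 84]
group3 = [79, 78, 88, 94, 92, 85]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)   # reject the null hypothesis of equal means if p_value < alpha
```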
One-Way ANOVA
The one-way ANOVA method is the procedure for testing the null hypothesis that the population means across the levels of a single independent variable are equal.
Two-Way ANOVA
The two-way ANOVA method is the procedure for testing the null hypothesis that the population means across the levels of two independent variables are equal. With this method, we are not only able to study the effect of each independent variable, but also the interaction between these variables.
We could also run two separate one-way ANOVAs, but two-way ANOVA gives us efficiency, control & the ability to study interaction.
Quantitative Data (Correlation & Regression)
Correlation
Correlation refers to a mutual relationship or association between quantitative variables. It can help in predicting one quantity from another. It may suggest, but does not by itself establish, a causal relationship. It is used as a basic quantity and foundation for many other modeling techniques.
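A sketch of the Pearson correlation coefficient with scipy, on invented data:

```python
# Sketch: Pearson correlation between two quantitative variables.
import numpy as np
from scipy import stats

hours_studied = np.array([1, 2, 3, 4, 5, 6, 7, 8])
exam_score = np.array([52, 55, 61, 64, 70, 72, 80, 85])

r, p_value = stats.pearsonr(hours_studied, exam_score)
print(r, p_value)        # r close to +1 indicates a strong positive association
```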
Regression
Regression analysis is a set of statistical processes for estimating the relationships among variables.
Simple Regression
This method uses a single independent variable to predict a dependent variable by fitting the best relationship.
Multiple Regression
This method uses more than one independent variable to predict a dependent variable by fitting the best relationship.
It works best when multicollinearity is absent. Multicollinearity is a phenomenon in which two or more predictor variables are highly correlated.
Nonlinear Regression
In this method, observational data are modeled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables.
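A sketch of simple and multiple linear regression with scikit-learn (a tooling choice of this illustration), fitted on synthetic data whose true coefficients we know:

```python
# Sketch: simple (one predictor) and multiple (two predictors) linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 2))              # two predictor variables
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(0, 1, 100)

simple = LinearRegression().fit(X[:, [0]], y)      # single independent variable
multiple = LinearRegression().fit(X, y)            # more than one independent variable

print(simple.coef_, simple.intercept_)
print(multiple.coef_, multiple.intercept_)         # ≈ [3, -2] and ≈ 5
```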
Significance in Data Science
In data science, inferential statistics is used in many ways:
- Making inferences about the population from the sample.
- Concluding whether a sample is significantly different from the population.
- Determining whether adding or removing a feature from a model will really help to improve the model.
- Determining whether one model is significantly better than another.
- Hypothesis testing in general.
This is the fifth part of the blog post series ‘Probability & Statistics for Data Science’; it covers the following topics related to Bayesian statistics and their significance in data science.
- Frequentist Vs Bayesian Statistics
- Bayesian Inference
- Test for Significance
- Significance in Data Science
Frequentist Vs Bayesian Statistics
Frequentist statistics tests whether an event (hypothesis) occurs or not. It calculates the probability of an event in the long run of the experiment. A very common flaw in the frequentist approach is the dependence of the result of an experiment on the number of times the experiment is repeated.
Frequentist statistics suffers from some serious flaws in its design and interpretation, which pose a concern in many real-life problems:
- p-value & Confidence Interval (C.I) depend heavily on the sample size.
- Confidence Intervals (C.I) are not probability distributions
Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides the tools to update beliefs in the light of new data (evidence).
Bayesian Inference
To understand Bayesian Inference, you need to understand Conditional Probability & Bayes Theorem, if you want to review these concepts, please refer to my earlier post in this series.
Bayesian inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available.
An important part of Bayesian inference is the establishment of parameters and models. Models are the mathematical formulation of the observed events. Parameters are the factors in the models affecting the observed data. To define our model correctly, we need two mathematical models beforehand: one represents the likelihood function, and the other represents the distribution of prior beliefs. The product of these two gives the posterior belief distribution.
Likelihood Function
A likelihood function is a function of the parameters of a statistical model, given specific observed data. Probability describes the plausibility of a random outcome, without reference to any observed data while Likelihood describes the plausibility of a model parameter value, given specific observed data.
Prior & Posterior Belief distribution
The prior belief distribution represents the strength of our beliefs about the parameters based on previous experience. The posterior belief distribution is derived by multiplying the likelihood function & the prior belief distribution.
As we collect more data, our posterior belief distribution moves away from the prior belief and towards the likelihood (the evidence from the data).
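A sketch of this updating with a Beta-Binomial model, assuming scipy (a tooling choice of this illustration); the prior parameters and the observed flips are invented:

```python
# Sketch: Beta-Binomial update of beliefs about a coin's bias theta.
from scipy import stats

# Prior belief: Beta(2, 2), mildly centered on fairness (theta = 0.5)
prior_a, prior_b = 2, 2

# Observed data: 8 heads in 10 flips (the likelihood comes from the binomial model)
heads, flips = 8, 10

# Posterior belief: Beta(prior_a + heads, prior_b + tails)
post_a, post_b = prior_a + heads, prior_b + (flips - heads)
posterior = stats.beta(post_a, post_b)

print(posterior.mean())          # ≈ 0.71, pulled from the prior (0.5) toward the data (0.8)
print(posterior.interval(0.95))  # a 95% credible interval for theta
```

With more flips, the posterior mean would move still closer to the observed proportion of heads.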
Test for Significance
Bayes factor
The Bayes factor is the equivalent of the p-value in the Bayesian framework. The null hypothesis in the Bayesian framework places all of its probability mass at a particular value of a parameter (say θ = 0.5) and zero probability elsewhere. The alternative hypothesis is that all values of θ are possible, hence a flat curve representing the distribution.
Using Bayes Factor instead of p-values is more beneficial in many cases since they are independent of intentions and sample size.
High Density Interval (HDI)
The High Density Interval (HDI), or credibility interval, is the equivalent of the Confidence Interval (CI) in the Bayesian framework. The HDI is formed from the posterior distribution after observing the new data.
Using High Density Interval (HDI) instead of Confidence Interval (CI) is more beneficial since they are independent of intentions and sample size.
Moreover, there is a nice article published on AnalyticsVidhya that elaborates on these concepts with examples.
Significance in Data Science
Bayesian statistics encompasses a specific class of models that could be used for Data Science. Typically, one draws on Bayesian models for one or more of a variety of reasons, such as:
- having relatively few data points
- having strong prior intuitions
- having high levels of uncertainty
This is the sixth & last post of the blog post series ‘Probability & Statistics for Data Science’; it covers the following topics related to statistical learning and their significance in data science.
- Introduction
- Prediction & Inference
- Parametric & Non-parametric methods
- Prediction Accuracy and Model Interpretability
- Bias-Variance Trade-Off
Introduction
Statistical learning is a framework for understanding data based on statistics, which can be classified as supervised or unsupervised. Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs, while in unsupervised statistical learning, there are inputs but no supervising output; but we can learn relationships and structure from such data.
One of the simple ways to understand statistical learning is to determine the association between the predictors (independent variables, features) & the response (dependent variable) and to develop an accurate model that can predict the response variable (Y) on the basis of the predictor variables (X).
Y = f(X) + ε, where X = (X₁, X₂, . . ., Xₚ), f is an unknown function & ε is a random error term; the error in estimating f is reducible, while the error due to ε is irreducible.
Prediction & Inference
In situations where a set of inputs X are readily available, but the output Y is not known, we often treat f as a black box (not concerned with the exact form of f), as long as it yields accurate predictions for Y. This is prediction.
There are situations where we are interested in understanding the way that Y is affected as X changes. In this situation we wish to estimate f, but our goal is not necessarily to make predictions for Y; here we are more interested in understanding the relationship between X and Y. Now f cannot be treated as a black box, because we need to know its exact form. This is inference.
In real life, we will see a number of problems that fall into the prediction setting, the inference setting, or a combination of the two.
Parametric & Non-parametric methods
When we make an assumption about the functional form of f and try to estimate f by estimating the set of parameters, these methods are called parametric methods.
f(X) = β₀ + β₁X₁ + β₂X₂ + . . . + βₚXₚ
Non-parametric methods do not make explicit assumptions about the form of f, instead, they seek an estimate of f that gets as close to the data points as possible.
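A sketch contrasting the two approaches with scikit-learn (a tooling choice of this illustration): a parametric linear regression and a non-parametric k-nearest-neighbors fit on the same synthetic data.

```python
# Sketch: parametric (linear regression) vs non-parametric (k-nearest neighbors) fits.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)     # a clearly non-linear f

parametric = LinearRegression().fit(X, y)                       # assumes a fixed functional form
non_parametric = KNeighborsRegressor(n_neighbors=5).fit(X, y)   # follows the data points

print(parametric.score(X, y), non_parametric.score(X, y))       # R^2; KNN tracks this shape better
```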
Prediction Accuracy and Model Interpretability
Of the many methods that we use for statistical learning, some are less flexible, or more restrictive. When inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods. When we are only interested in prediction, we can use the most flexible models available.
Assessing Model Accuracy
There is no free lunch in statistics, which means no one method dominates all others over all possible data sets. In the regression setting, the most commonly-used measure is the mean squared error (MSE). In the classification setting, the most commonly-used measure is the confusion matrix. The fundamental property of statistical learning is that, as model flexibility increases, training error will decrease, but the test error may not.
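A sketch of that property, assuming scikit-learn and synthetic data: as polynomial degree (flexibility) increases, training MSE keeps falling while test MSE typically rises again for very flexible fits.

```python
# Sketch: training vs test MSE as model flexibility (polynomial degree) increases.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 1, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 2, 10, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error keeps decreasing
          mean_squared_error(y_te, model.predict(X_te)))   # test error typically rises again
```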
Bias & Variance
Bias is the set of simplifying assumptions made by a model to make the target function easier to learn. Parametric models have a high bias, making them fast to learn and easier to understand but generally less flexible. Decision Trees, k-Nearest Neighbors, and Support Vector Machines are low-bias machine learning algorithms. Linear Regression, Linear Discriminant Analysis, and Logistic Regression are high-bias machine learning algorithms.
Variance is the amount that the estimate of the target function will change if different training data was used. Non-parametric models that have a lot of flexibility have a high variance. Linear Regression, Linear Discriminant Analysis, and Logistic Regression are low-variance machine learning algorithms. Decision Trees, k-Nearest Neighbors, and Support Vector Machines are high-variance machine learning algorithms.
Bias-Variance Trade-Off
The relationship between bias and variance in statistical learning is such that:
- Increasing bias will decrease variance.
- Increasing variance will decrease bias.
There is a trade-off at play between these two concerns: the models we choose, and the way we configure them, strike different balances in this trade-off for our problem.
In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance trade-off, and the resulting U-shape in the test error, can make this a difficult task.
Ankit Rathi is a Principal Data Scientist, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.