Mathematics for Data Science

Ankit Rathi
11 min read · Jun 19, 2021

This is the first and introductory part of the blog post ‘Mathematics for Data Science’; it sets the context and lays out the topic-wise table of contents for the upcoming parts:

  1. Context & Introduction
  2. Linear Algebra for Data Science
  3. Multivariate Calculus for Data Science
  4. Probability & Statistics for Data Science

For Data Science & Machine Learning beginners, it is essential to develop a mathematical understanding of the underlying concepts. Data Science is simply an evolved version of statistics and mathematics (combined with programming and business logic). Many data scientists struggle to explain the intrinsic details of predictive models. It is not enough to report an accuracy figure; understanding and interpreting every metric, and the calculation behind that accuracy, is just as important.

Even many higher-level courses/books on Machine Learning/Data Science require you to brush up on the basics of mathematics, but if you look these topics up in textbooks, you will find the Data Science context missing. This blog post aims to bridge that gap: it covers the underlying mathematics to build an intuitive understanding, so that you can relate it back to Machine Learning and Data Science.

Please note that this blog post covers the mathematical concepts used in data science at an intuitive level; it does not cover the details or implementations exhaustively.

This is the 2nd part of the blog post ‘Mathematics for Data Science’; it covers the following topics related to Linear Algebra:

  • What is Linear Algebra?
  • Why is Linear Algebra important in Data Science?
  • How is Linear Algebra applied in Data Science?

What is Linear Algebra?

Linear algebra is the branch of mathematics concerning linear equations, linear functions, and their representations through matrices and vector spaces. ~Wikipedia

The word algebra comes from the Arabic word “al-jabr”, which means “the reunion of broken parts”. Algebra is a collection of methods for deriving unknowns from knowns in mathematics. Linear Algebra is the branch that deals with linear equations and linear functions, which are represented through matrices and vectors. In simpler words, it helps us understand geometric objects such as planes in higher dimensions and perform mathematical operations on them. By definition, algebra deals primarily with scalars (one-dimensional entities), whereas Linear Algebra works with vectors and matrices (entities with two or more components) to handle linear equations and functions.

Why is Linear Algebra important in Data Science?

Linear Algebra is central to almost all areas of mathematics like geometry and functional analysis. Its concepts are a crucial prerequisite for understanding the theory behind Data Science. You don’t need to understand Linear Algebra before getting started in Data Science, but at some point, you may want to gain a better understanding of how the different algorithms really work under the hood. So if you really want to be a professional in this field, you will have to master the parts of Linear Algebra that are important for Data Science.

How is Linear Algebra applied in Data Science?

Scalars, Vectors, Matrices and Tensors

  • A scalar is a single number.
  • A vector is a 1-D array of numbers.
  • A matrix is a 2-D array of numbers.
  • A tensor is an n-dimensional array with n > 2.

With transposition, we can convert a row vector to a column vector and vice versa.
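
For concreteness, here is a minimal NumPy sketch (the values are purely illustrative) of these four objects and of transposition:

```python
import numpy as np

scalar = 3.0                          # a single number
vector = np.array([1.0, 2.0, 3.0])    # an array of numbers
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])       # a 2-D array
tensor = np.zeros((2, 3, 4))          # an n-dimensional array with n > 2

row = vector.reshape(1, -1)           # row vector, shape (1, 3)
col = row.T                           # transposition turns it into a column vector, shape (3, 1)
print(row.shape, col.shape)           # (1, 3) (3, 1)
```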

Multiplying Matrices and Vectors

The dot product of matrices & vectors appears in virtually every equation describing data science algorithms. Matrix multiplication is distributive and associative, but NOT commutative, whereas the dot product of two vectors is commutative.
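
A minimal NumPy sketch (with arbitrary example matrices) that checks these properties:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
x = np.array([1, 2])
y = np.array([3, 4])

print(np.allclose(A @ (B + B), A @ B + A @ B))  # True: distributive
print(np.allclose((A @ B) @ A, A @ (B @ A)))    # True: associative
print(np.allclose(A @ B, B @ A))                # False: NOT commutative
print(np.dot(x, y) == np.dot(y, x))             # True: the vector dot product is commutative
```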

Identity and Inverse Matrices

The identity matrix Iₙ is a special matrix of shape (n×n) filled with 0s, except for the diagonal, which is filled with 1s. The inverse of a matrix A is the matrix A⁻¹ that results in the identity matrix when it is multiplied by A.
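
For example (an illustrative NumPy sketch; the matrix is arbitrary but invertible):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
I = np.eye(2)                      # identity matrix: 1s on the diagonal, 0s elsewhere
A_inv = np.linalg.inv(A)           # inverse exists because A is square and non-singular

print(np.allclose(A @ A_inv, I))   # True
print(np.allclose(A_inv @ A, I))   # True
```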

Linear Dependence and Span

This section covers how to represent systems of equations graphically, how to interpret the number of solutions of a system, and what linear combination, linear dependence, and span are.
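
As a rough sketch, linear dependence can be checked numerically via the matrix rank (the example matrix below is constructed so that its third column is the sum of the first two):

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])    # third column = first column + second column

rank = np.linalg.matrix_rank(A)
print(rank)                        # 2: the columns span only a 2-D subspace of R^3
print(rank < A.shape[1])           # True: the columns are linearly dependent
```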

Norms

The norm is what is generally used to evaluate the error of a model. For instance, it is used to calculate the error between the output of a neural network and what is expected (the actual label or value).
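
A minimal sketch of this idea (the labels and predictions below are made up for illustration):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])    # expected values (actual labels)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])    # model output

error = y_pred - y_true
l1 = np.linalg.norm(error, ord=1)   # L1 norm: sum of absolute errors
l2 = np.linalg.norm(error)          # L2 (Euclidean) norm of the error vector
print(l1, l2)                       # 2.0 and ~1.22
```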

Special Kinds of Matrices and Vectors

This section covers different interesting types of matrices with specific properties, e.g. diagonal, symmetric, and orthogonal matrices.
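
A short NumPy sketch of what these properties mean (the rotation matrix is a standard example of an orthogonal matrix):

```python
import numpy as np

D = np.diag([1.0, 2.0, 3.0])                     # diagonal matrix
S = np.array([[2.0, 1.0],
              [1.0, 3.0]])                       # symmetric: S equals its transpose
theta = np.pi / 4
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation matrix, orthogonal: Q^T Q = I

print(np.allclose(S, S.T))               # True
print(np.allclose(Q.T @ Q, np.eye(2)))   # True
```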

Eigendecomposition

The eigendecomposition is one form of matrix decomposition. Decomposing a matrix means finding a product of matrices that is equal to the initial matrix. In the case of the eigendecomposition, we decompose the initial matrix into a product formed from its eigenvectors and eigenvalues.

The eigendecomposition of the matrix corresponding to a quadratic form can be used to find the minimum and maximum of that function.
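
A minimal sketch of the decomposition and reconstruction (the matrix is an arbitrary diagonalizable example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)     # A = Q diag(eigvals) Q^-1
Q = eigvecs
Lambda = np.diag(eigvals)

reconstructed = Q @ Lambda @ np.linalg.inv(Q)
print(np.allclose(reconstructed, A))    # True: the product recovers the original matrix
```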

Singular Value Decomposition (SVD)

For matrices that can’t be decomposed with the eigendecomposition (for example, non-square matrices), the way to go is the SVD. With the SVD, we decompose a matrix into three other matrices. We can see these new matrices as sub-transformations of the space: instead of applying the transformation in one movement, we decompose it into three movements.
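
A minimal sketch with a non-square matrix (eigendecomposition does not apply to it, but the SVD does):

```python
import numpy as np

A = np.array([[3.0, 2.0,  2.0],
              [2.0, 3.0, -2.0]])         # 2x3, so it has no eigendecomposition

U, s, Vt = np.linalg.svd(A)              # A = U Sigma V^T
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ Sigma @ Vt, A))    # True: the three factors recover A
```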

The Moore-Penrose Pseudoinverse

The inverse is used to solve a system of equations but not all matrices have an inverse. In some cases, a system of equations has no solution, and thus the inverse doesn’t exist. However, it can be useful to find a value that is almost a solution (in terms of minimizing the error). We can find the best-fit line of a set of data points with the pseudoinverse.
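
For instance, here is a minimal sketch of a best-fit line computed with the pseudoinverse (the data points are made up and lie roughly on y = 2x + 1):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])          # noisy observations, so no exact solution exists

X = np.column_stack([x, np.ones_like(x)])   # design matrix: one column for the slope, one for the intercept
w = np.linalg.pinv(X) @ y                   # the pseudoinverse gives the least-squares solution
print(w)                                    # ~[1.94, 1.09], close to the true slope 2 and intercept 1
```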

The Trace Operator

The trace is the sum of all the values on the diagonal of a square matrix. It can be used to express the Frobenius norm of a matrix, which is needed for Principal Component Analysis (PCA).
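
A quick check of both facts with NumPy (arbitrary example matrix):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(np.trace(A))                                      # 5.0: sum of the diagonal values
frobenius = np.sqrt(np.trace(A @ A.T))                  # Frobenius norm written with the trace operator
print(np.isclose(frobenius, np.linalg.norm(A, 'fro')))  # True
```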

The Determinant

The determinant of a matrix A is a number corresponding to the multiplicative change in area (or volume) you get when you transform space with this matrix. A negative determinant means that there is a change in orientation (and not just a rescaling and/or a rotation).
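
A small illustration (a pure scaling matrix and an axis swap):

```python
import numpy as np

scale = np.array([[2.0, 0.0],
                  [0.0, 3.0]])   # stretches the plane by 2 along x and by 3 along y
flip = np.array([[0.0, 1.0],
                 [1.0, 0.0]])    # swaps the two axes

print(np.linalg.det(scale))      # 6.0: areas are multiplied by 6
print(np.linalg.det(flip))       # -1.0: the negative sign signals a change in orientation
```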

Principal Component Analysis

When a data set is high-dimensional, it would be nice to have a way to reduce the number of dimensions while keeping as much of the information present in the data set as possible. The aim of Principal Component Analysis (PCA) is to reduce the number of dimensions of a data set whose dimensions are not completely decorrelated, i.e. are correlated with one another.
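
Here is a minimal sketch of the PCA idea via the eigendecomposition of the covariance matrix; the synthetic data, random seed, and variable names are all illustrative, and in practice you would typically use a library implementation such as scikit-learn’s PCA:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])  # two correlated dimensions

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvectors of the covariance = principal directions

order = np.argsort(eigvals)[::-1]
print(eigvals[order] / eigvals.sum())           # first component explains ~99% of the variance

reduced = centered @ eigvecs[:, order[:1]]      # project the 2-D data onto the first component
print(reduced.shape)                            # (200, 1)
```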

This is the 3rd part of the blog post ‘Mathematics for Data Science’; it covers the following topics related to Multivariate Calculus:

  • What is Multivariate Calculus?
  • Why is Multivariate Calculus important in Data Science?
  • How is Multivariate Calculus applied in Data Science?

What is Multivariate Calculus?

Multivariate Calculus (also known as multivariable calculus) is the extension of calculus in one variable to calculus with functions of several variables: the differentiation and integration of functions involving multiple variables, rather than just one. ~Wikipedia

Calculus is a set of tools for analyzing the relationship between functions and their inputs. In Multivariate Calculus we can take a function with multiple inputs and determine the influence of each of them separately.

Why is Multivariate Calculus important in Data Science?

In data science, we try to find the inputs which enable a function to best match the data. The slope or gradient describes the rate of change of the output with respect to an input. Determining the influence of each input on the output is also one of the critical tasks. All of this requires a solid understanding of Multivariate Calculus.

How is Multivariate Calculus applied in Data Science?

First, let's cover the core concepts of Calculus:

Functions

An equation defines a function if, for any x in the domain of the equation, evaluating the equation at that x yields exactly one value of y.

y = f(x)

Derivatives

The derivative of f(x) with respect to x is the function f′(x) and is defined as:

f′(x) = lim(h→0) [f(x + h) − f(x)] / h
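
As a quick numerical illustration of this limit (the function and the point are arbitrary choices):

```python
def f(x):
    return x ** 2

# Approximate f'(3) using the limit definition with ever smaller h; the exact derivative is 2x = 6.
for h in (1e-1, 1e-3, 1e-5):
    print(h, (f(3 + h) - f(3)) / h)   # approaches 6 as h shrinks
```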

Product Rule

If the two functions f(x) and g(x) are differentiable (i.e. their derivatives exist), then the product is differentiable and:

(f · g)′(x) = f′(x) g(x) + f(x) g′(x)

Chain Rule

Suppose that we have two functions f(x) and g(x) and they are both differentiable.

  • If we define F(x) = (f∘g)(x), then the derivative of F(x) is F′(x) = f′(g(x)) · g′(x).
  • If we have y = f(u) and u = g(x), then the derivative of y is dy/dx = (dy/du) · (du/dx).

Integrals

If F(x) is any anti-derivative of f(x), then the most general anti-derivative of f(x) is called an indefinite integral and is denoted:

∫ f(x) dx = F(x) + c

where c is any constant.

Given a function f(x) that is continuous on the interval [a, b], we divide the interval into n subintervals of equal width Δx and from each subinterval choose a point xᵢ*. Then the definite integral of f(x) from a to b is:

∫ₐᵇ f(x) dx = lim(n→∞) Σᵢ₌₁ⁿ f(xᵢ*) Δx
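
A minimal numerical sketch of this definition (a Riemann sum for the integral of x² from 0 to 1, whose exact value is 1/3):

```python
import numpy as np

def f(x):
    return x ** 2

a, b, n = 0.0, 1.0, 100_000
x = np.linspace(a, b, n, endpoint=False)   # left endpoint of each of the n subintervals
dx = (b - a) / n
print(np.sum(f(x) * dx))                   # ~0.33333, approaching the exact value 1/3
```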

Partial Derivatives

A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary).
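
For example (the function below is an arbitrary illustration), partial derivatives can be approximated numerically by holding all but one variable constant:

```python
def f(x, y):
    return x ** 2 * y + y ** 3

h = 1e-6
df_dx = (f(1 + h, 2) - f(1, 2)) / h   # partial derivative w.r.t. x at (1, 2); exact value 2xy = 4
df_dy = (f(1, 2 + h) - f(1, 2)) / h   # partial derivative w.r.t. y at (1, 2); exact value x^2 + 3y^2 = 13
print(df_dx, df_dy)
```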

Now, let's look at the core concepts of Multivariate Calculus & how those relate to Data Science:

Gradient, Jacobian, and Hessian

The Gradient

Gradients describe how a function changes along a particular variable (or set of variables) as we move through the input space, and they are very easy to compute within optimization frameworks. Optimization algorithms find minima/maxima of these functions and base their updates on gradient estimates.
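
A minimal sketch of gradient descent on a simple quadratic (the function, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = (w0 - 3)^2 + (w1 + 1)^2, which is minimised at (3, -1).
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * grad(w)   # step against the gradient to reduce f
print(w)                              # close to [3, -1]
```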

The Jacobian

The Jacobian of a set of functions is a matrix of partial derivatives of the functions. If you have just one function instead of a set of functions, the Jacobian is the gradient of the function.

The Hessian

The Hessian is, in some crude sense, the rate of change of the gradient of the function (its matrix of second partial derivatives). In optimization, the Hessian being positive definite, negative definite, or indefinite tells you whether the critical point in question is a minimum, a maximum, or a saddle point.
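
For instance, for the quadratic f(x, y) = x² + 3y² the Hessian is constant, and its eigenvalues reveal the nature of the critical point at the origin:

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, 6.0]])        # Hessian of f(x, y) = x^2 + 3y^2

print(np.linalg.eigvalsh(H))      # [2. 6.]: all positive -> positive definite -> (0, 0) is a minimum
```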

Multivariate Chain Rule

Suppose that z = f(x, y), where x and y themselves depend on one or more variables. The multivariate chain rule allows us to differentiate z with respect to any of the variables involved; for example, if x = x(t) and y = y(t), then:

dz/dt = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt)

Approximate Functions

A function approximation problem asks us to select a function among a well-defined class that closely matches (“approximates”) a target function.

Statistical and connectionist approaches to machine learning are related to function approximation methods in mathematics.

Common techniques include the Taylor series and the Fourier series approximations. Recall that, given enough terms, a Taylor series can approximate a sufficiently smooth function to a desired level of precision about a given point, while a Fourier series can approximate any periodic function.
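
A small sketch of the Taylor idea for cos(x) around 0 (the number of terms shown is arbitrary):

```python
import math
import numpy as np

def cos_taylor(x, terms):
    # Partial sum of the Taylor series of cos(x) about 0: sum of (-1)^k x^(2k) / (2k)!
    return sum((-1) ** k * x ** (2 * k) / math.factorial(2 * k) for k in range(terms))

x = 1.0
for terms in (1, 2, 4, 6):
    print(terms, cos_taylor(x, terms))   # converges towards the exact value as terms are added
print(np.cos(x))                         # ~0.5403, the exact value
```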

Power Series

A power series is any series that can be written in the form Σₙ cₙ(x − a)ⁿ (summing over n from 0 to ∞), where a and the cₙ are numbers. The cₙ are often called the coefficients of the series. The first thing to notice about a power series is that it is a function of x.

In data science, power series can be used to give you some indication of the size of the error that results from using these approximations.

Linearization

Linearization is finding the linear approximation to a function at a given point. The linear approximation of a function is the first-order Taylor expansion around the point of interest:

L(x) = f(a) + f′(a)(x − a)

The above equation is called the ‘linearization of f at a’.
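
As a quick illustration (the function and the point a = 4 are arbitrary choices), the linearization of f(x) = √x at a = 4 is L(x) = 2 + (x − 4)/4, and it is accurate close to a:

```python
import numpy as np

def L(x):
    # Linearization of sqrt(x) at a = 4: f(4) + f'(4) * (x - 4) = 2 + 0.25 * (x - 4)
    return 2 + 0.25 * (x - 4)

print(np.sqrt(4.2), L(4.2))   # ~2.0494 vs 2.05: the linear approximation is close near a = 4
```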

Multivariate Taylor

A multivariate Taylor series is an idea used in data science and other kinds of higher-level mathematics. It is a series used to create an estimate (approximation) of what a function with multiple inputs looks like around a point.

The multivariate Taylor series generalizes the power series idea to its more general multivariate form. To second order:

f(x + Δx) ≈ f(x) + Jᵀ Δx + (1/2) Δxᵀ H Δx

where J and H are the Jacobian and Hessian of f.

This is the 4th and last part of the blog post ‘Mathematics for Data Science’. Probability and Statistics is such a vast subject that it requires a separate blog post to discuss all the relevant topics.

These are the topics covered in the attached post:

  • Probability
  • Descriptive Statistics
  • Inferential Statistics
  • Bayesian Statistics
  • Statistical Learning

References & Resources

Khan Academy Calculus series

Khan Academy Linear Algebra series

3Blue1Brown Calculus series

3Blue1Brown Linear Algebra series

Ankit Rathi is a Principal Data Scientist, published author & well-known speaker. His interest lies primarily in building end-to-end AI applications/products following best practices of Data Engineering and Architecture.
