Probability Theory
Probability Space
Formally, a probability space is defined by the triple $(\Omega, F, P)$, where
- $\Omega$ is the space of possible outcomes (or outcome space)
- $F \subseteq 2^\Omega$ (the power set of $\Omega$) is the space of (measurable) events (or event space)
- $P: F \to [0, 1]$ is the probability measure (or probability distribution) that maps an event to a real value between $0$ and $1$ (think of $P$ as a function)
Given the outcome space $\Omega$, there are some restrictions as to what subset of $2^\Omega$ can be considered an event space $F$:
- The trivial event $\Omega$ and the empty event $\emptyset$ are in $F$
- The event space $F$ is closed under (countable) union, i.e., if $A, B \in F$, then $A \cup B \in F$
- The event space $F$ is closed under complement, i.e., if $A \in F$, then $\Omega \setminus A \in F$
Given an event space $F$, the probability measure $P$ must satisfy certain axioms.
- (non-negativity) For all $A \in F$, $P(A) \ge 0$.
- (trivial event) $P(\Omega) = 1$.
- (additivity) For all $A, B \in F$, if $A \cap B = \emptyset$, then $P(A \cup B) = P(A) + P(B)$.
Random Variables
The most important fact about random variables is that they are not variables. They are actually functions that map outcomes (in the outcome space) to real values.
In a sense, random variables allow us to abstract away from the formal notion of event space, as we can define random variables that capture the appropriate events.
Consider the event space of odd or even in a dice throw. We could have defined a random variable $X$ that takes on value $1$ if the outcome is odd and $0$ otherwise. These types of binary random variables are very common in practice, and are known as indicator variables, taking their name from their use to indicate whether a certain event has happened.
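To make the function-that-maps-outcomes view concrete, here is a minimal sketch of an indicator variable for "odd" on a fair six-sided die (the outcome space and uniform probabilities are assumptions for illustration):

```python
outcomes = [1, 2, 3, 4, 5, 6]  # outcome space of one die throw

def X(omega):
    # Indicator random variable: maps an outcome to 1 if odd, 0 if even.
    return 1 if omega % 2 == 1 else 0

# For a fair die each outcome has probability 1/6, so
# P(X = 1) = (number of odd outcomes) / 6.
p_odd = sum(X(omega) for omega in outcomes) / len(outcomes)
print(p_odd)  # 0.5
```

Note that `X` really is a function on outcomes, not a "variable": the randomness lives in which outcome $\omega$ occurs.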
So why did we introduce event space? That is because when one studies probability theory (more rigorously) using measure theory, the distinction between outcome space and event space will be very important. In any case, it is good to keep in mind that event space is not always simply the power set of the outcome space.
We will talk mostly about probability with respect to random variables. Random variables allow us to provide a more uniform treatment of probability theory. For notation, the probability of a random variable $X$ taking on the value $a$ will be denoted by either $P(X = a)$ or $p_X(a)$.
We will also denote the range of a random variable $X$ by $Val(X)$.
Distributions, Joint Distributions, and Marginal Distributions
The distribution of a random variable formally refers to the probability of the random variable taking on certain values. For notation, we will use $P(X)$ to denote the distribution of the random variable $X$.
We can also speak about the distribution of more than one variable at a time. We call these distributions joint distributions, as the probability is determined jointly by all the variables involved. We will denote the probability of $X$ taking value $a$ and $Y$ taking value $b$ by either the long hand $P(X = a, Y = b)$, or the short hand $p_{X,Y}(a, b)$. We refer to their joint distribution by $P(X, Y)$.
Given a joint distribution, say over random variables $X$ and $Y$, we can talk about the marginal distribution of $X$ or that of $Y$. The marginal distribution refers to the probability distribution of a random variable on its own. To find the marginal distribution of a random variable, we sum out all the other random variables from the distribution, e.g.,
$$P(X = a) = \sum_{b \in Val(Y)} P(X = a, Y = b)$$
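Summing out a variable is mechanical once the joint distribution is written as a table. A minimal sketch, using a made-up joint table over two binary random variables $X$ and $Y$:

```python
# Made-up joint distribution P(X, Y) over binary X and Y, as a table
# mapping (a, b) -> P(X = a, Y = b). Probabilities must sum to 1.
joint = {
    (0, 0): 0.3, (0, 1): 0.2,
    (1, 0): 0.1, (1, 1): 0.4,
}

# Marginal of X: sum out Y, i.e. P(X = a) = sum_b P(X = a, Y = b).
marginal_X = {}
for (a, b), p in joint.items():
    marginal_X[a] = marginal_X.get(a, 0.0) + p

print(marginal_X)  # {0: 0.5, 1: 0.5}
```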
Conditional Distributions
Conditional distributions are one of the key tools in probability theory for reasoning about uncertainty. They specify the distribution of a random variable when the value of another random variable is known (or more generally, when some event is known to be true).
Formally, the conditional probability of $X = a$ given $Y = b$ is defined as
$$P(X = a \mid Y = b) = \frac{P(X = a, Y = b)}{P(Y = b)}$$
Note that this is not defined when the probability of $Y = b$ is $0$.
The idea of conditional probability extends naturally to the case when the distribution of a random variable is conditioned on several variables, namely
$$P(X = a \mid Y = b, Z = c) = \frac{P(X = a, Y = b, Z = c)}{P(Y = b, Z = c)}$$
As for notation, we write $P(X \mid Y = b)$ to denote the distribution of the random variable $X$ when $Y = b$. We may also write $P(X \mid Y)$ to denote a set of distributions of $X$, one for each of the different values that $Y$ can take.
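The defining ratio $P(X = a \mid Y = b) = P(X = a, Y = b) / P(Y = b)$ can be computed directly from a joint table. A sketch, reusing a made-up joint distribution over binary $X$ and $Y$:

```python
# Made-up joint table P(X, Y) for illustration.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def conditional_X_given(b):
    # P(X = a | Y = b) = P(X = a, Y = b) / P(Y = b)
    p_y = sum(p for (a, y), p in joint.items() if y == b)  # P(Y = b)
    return {a: p / p_y for (a, y), p in joint.items() if y == b}

cond = conditional_X_given(1)
print(cond)  # the distribution P(X | Y = 1); its values sum to 1
```

Conditioning just restricts attention to the rows with $Y = b$ and renormalizes, which is why $P(Y = b) = 0$ leaves the result undefined.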
Independence
In probability theory, independence means that the distribution of a random variable does not change on learning the value of another random variable.
Mathematically, a random variable $X$ is independent of $Y$ when
$$P(X) = P(X \mid Y)$$
It is easy to verify that if $X$ is independent of $Y$, then $Y$ is also independent of $X$. As a notation, we write $X \perp Y$ if $X$ and $Y$ are independent.
An equivalent mathematical statement about the independence of random variables $X$ and $Y$ is
$$P(X, Y) = P(X)P(Y)$$
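The factorization criterion $P(X, Y) = P(X)P(Y)$ can be checked numerically. A minimal sketch with a made-up joint table for two independent fair coin flips:

```python
import itertools

# Joint table for two independent fair coins: every pair has probability 1/4.
joint = {(a, b): 0.25 for a, b in itertools.product([0, 1], repeat=2)}

# Marginals obtained by summing out the other variable.
pX = {a: sum(joint[(a, b)] for b in [0, 1]) for a in [0, 1]}
pY = {b: sum(joint[(a, b)] for a in [0, 1]) for b in [0, 1]}

# Independence holds iff P(X=a, Y=b) = P(X=a) P(Y=b) for every cell.
independent = all(
    abs(joint[(a, b)] - pX[a] * pY[b]) < 1e-12 for a, b in joint
)
print(independent)  # True
```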
Sometimes we also talk about conditional independence, meaning that if we know the value of a random variable (or more generally, a set of random variables), then some other random variables will be independent of each other. Formally, we say $X$ and $Y$ are conditionally independent given $Z$ if
$$P(X \mid Z) = P(X \mid Y, Z)$$
or, equivalently,
$$P(X, Y \mid Z) = P(X \mid Z)P(Y \mid Z)$$
Chain Rule and Bayes Rule
The Chain Rule is often used to evaluate the joint probability of some random variables, and is especially useful when there is (conditional) independence across variables:
$$P(X_1, X_2, \ldots, X_n) = P(X_1)\,P(X_2 \mid X_1) \cdots P(X_n \mid X_1, X_2, \ldots, X_{n-1})$$
The Bayes Rule allows us to compute the conditional probability $P(X \mid Y)$ from $P(Y \mid X)$, in a sense inverting the conditions:
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
Extending the Bayes Rule to the case of multiple random variables, conditioning everything on an additional variable $Z$:
$$P(X \mid Y, Z) = \frac{P(Y \mid X, Z)\,P(X \mid Z)}{P(Y \mid Z)}$$
Using the definition of marginal distribution and the chain rule, we get the Law of Total Probability:
$$P(X) = \sum_{b \in Val(Y)} P(X \mid Y = b)\,P(Y = b)$$
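The two rules are commonly used together. A sketch of the classic diagnostic-test calculation; all the numbers below are made up for illustration ($D$ indicates having a disease, $T$ a positive test):

```python
# Assumed example numbers -- not real medical data.
p_d = 0.01           # prior P(D = 1)
p_t_given_d = 0.95   # true-positive rate  P(T = 1 | D = 1)
p_t_given_nd = 0.05  # false-positive rate P(T = 1 | D = 0)

# Law of Total Probability: sum over the values D can take.
p_t = p_t_given_d * p_d + p_t_given_nd * (1 - p_d)

# Bayes Rule: P(D = 1 | T = 1) = P(T = 1 | D = 1) P(D = 1) / P(T = 1)
p_d_given_t = p_t_given_d * p_d / p_t
print(round(p_d_given_t, 3))  # ~0.161: a positive test is still mostly a false alarm
```

The counterintuitively small posterior comes from the low prior: most positives are false positives from the much larger healthy population.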
Probability Distributions
In a broad sense, there are two classes of distributions that require seemingly different treatments (these can be unified using measure theory): discrete distributions and continuous distributions.
Discrete Distribution: Probability Mass Function
By a discrete distribution, we mean that the random variable of the underlying distribution can take on only finitely or countably many different values (or that the outcome space is finite or countable).
To define a discrete distribution, we can simply enumerate the probability of the random variable taking on each of the possible values. This enumeration is known as the probability mass function, as it divides up a unit mass (the total probability) and places it on the different values a random variable can take. This can be extended analogously to joint distributions and conditional distributions.
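The enumeration view can be sketched by building the PMF of the sum of two fair dice from equally likely outcomes (the dice setup is an assumed example; exact `Fraction` arithmetic keeps the unit mass exact):

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes and pile mass onto each sum.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    s = d1 + d2
    pmf[s] = pmf.get(s, Fraction(0)) + Fraction(1, 36)

# The mass function divides up exactly one unit of mass.
assert sum(pmf.values()) == 1
print(pmf[7])  # 1/6 -- seven is the most likely sum
```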
Continuous Distribution: Probability Density Function
By a continuous distribution, we mean that the random variable of the underlying distribution can take on uncountably many different values (or that the outcome space is uncountable, such as an interval of the real line).
To define a continuous distribution, we will make use of the probability density function (PDF). A probability density function, $p$, is a non-negative, integrable function such that
$$\int_{Val(X)} p(x)\,dx = 1$$
The probability of a random variable $X$ distributed according to a PDF $p$ falling in an interval is computed as follows:
$$P(a \le X \le b) = \int_a^b p(x)\,dx$$
Note that this, in particular, implies that the probability of a continuously distributed random variable taking on any given single value is zero.
To extend the definition of continuous distributions to joint distributions, the probability density function is extended to take multiple arguments, namely, $p(x_1, x_2, \ldots, x_n)$.
To extend the definition of conditional distributions to continuous random variables, we run into the problem that the probability of a continuous random variable taking on a single value is $0$, so the definition of conditional probability is not well defined, since the denominator equals $0$. To define the conditional distribution of a continuous variable, let $p(x, y)$ be the joint density of $X$ and $Y$. We can show that the PDF, $p(y \mid x)$, underlying the conditional distribution is given by
$$p(y \mid x) = \frac{p(x, y)}{p(x)}$$
Sometimes we will also speak about the cumulative distribution function (CDF). It is a function that gives the probability of a random variable being smaller than some value. A cumulative distribution function $F$ is related to the underlying probability density function $p$ as follows:
$$F(b) = P(X \le b) = \int_{-\infty}^{b} p(x)\,dx$$
and hence $F(x) = \int p(x)\,dx$ (in the sense of the indefinite integral).
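The integral relationship can be checked numerically. A sketch using the standard normal density (chosen here as an assumed example) and a simple midpoint Riemann sum for $P(a \le X \le b)$:

```python
import math

def p(x):
    # PDF of the standard normal distribution (mean 0, variance 1).
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def prob(a, b, n=100_000):
    # Approximate P(a <= X <= b) = integral of p over [a, b]
    # with a midpoint Riemann sum of n slices.
    h = (b - a) / n
    return sum(p(a + (i + 0.5) * h) for i in range(n)) * h

print(round(prob(-1.0, 1.0), 4))  # ~0.6827: mass within one standard deviation
```

Note also that any single value carries no mass: `prob(c, c)` is exactly $0$ for every $c$, matching the remark above.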
Expectations and Variance
Expectations
One of the most common operations we perform on a random variable is to compute its expectation, also known as its mean or expected value. The expectation of a random variable, denoted by $E[X]$, is given by
$$E[X] = \sum_{a \in Val(X)} a\,P(X = a) \quad \text{or} \quad E[X] = \int_{Val(X)} x\,p(x)\,dx$$
for a discrete and a continuous random variable, respectively.
When working with indicator variables, a useful identity is the following:
$$E[X] = P(X = 1) \quad \text{for an indicator variable } X$$
When working with sums of random variables, one of the most important rules is the linearity of expectations. Let $X_1, X_2, \ldots, X_n$ be (possibly dependent) random variables:
$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i]$$
The linearity of expectations is very powerful because there are no restrictions on whether the random variables are independent or not.
When we work with products of random variables, there is very little we can say in general. When the random variables are independent, however, the expectation factorizes. Let $X$ and $Y$ be independent random variables; then
$$E[XY] = E[X]\,E[Y]$$
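Both rules can be verified exactly on small discrete distributions. A sketch using a fair die and a fair coin as an assumed example, with exact `Fraction` arithmetic:

```python
from fractions import Fraction
from itertools import product

# X: fair six-sided die; Y: fair coin with values {0, 1}. Assumed independent.
pX = {a: Fraction(1, 6) for a in range(1, 7)}
pY = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def E(dist):
    # Expectation of a discrete distribution given as {value: probability}.
    return sum(a * p for a, p in dist.items())

# Under independence the joint is P(X=a, Y=b) = P(X=a) P(Y=b).
E_sum = sum((a + b) * pX[a] * pY[b] for a, b in product(pX, pY))
E_prod = sum(a * b * pX[a] * pY[b] for a, b in product(pX, pY))

assert E_sum == E(pX) + E(pY)   # linearity: E[X + Y] = E[X] + E[Y]
assert E_prod == E(pX) * E(pY)  # independence: E[XY] = E[X] E[Y]
```

Linearity would hold even for dependent variables; the product identity is the one that genuinely needs independence.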
Variance
The variance of a distribution is a measure of the spread of the distribution. It is defined as follows:
$$Var(X) = E\left[(X - E[X])^2\right]$$
The variance of a random variable is often denoted by $\sigma^2$. It is written as a square because we often want to work with $\sigma$, known as the standard deviation. The variance and the standard deviation are related by $\sigma = \sqrt{Var(X)}$.
To find the variance of a random variable $X$, it is often easier to compute the following instead:
$$Var(X) = E[X^2] - (E[X])^2$$
Note that, unlike expectation, variance is not a linear function of a random variable $X$. In fact, we can verify that the variance of $aX + b$ is
$$Var(aX + b) = a^2\,Var(X)$$
If random variables $X$ and $Y$ are independent, then
$$Var(X + Y) = Var(X) + Var(Y)$$
Sometimes we also talk about the covariance of two random variables. This is a measure of how closely related two random variables are. Its definition is as follows:
$$Cov(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$$
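Variance and covariance are just expectations of particular functions, so they can be computed from a joint table too. A sketch on a made-up, perfectly correlated pair ($X = Y$, each a fair coin):

```python
from fractions import Fraction

# Made-up joint table where X and Y always agree: P(X=0,Y=0) = P(X=1,Y=1) = 1/2.
joint = {(0, 0): Fraction(1, 2), (1, 1): Fraction(1, 2)}

EX = sum(a * p for (a, b), p in joint.items())
EY = sum(b * p for (a, b), p in joint.items())

# Var(X) = E[(X - E[X])^2];  Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
var_X = sum((a - EX) ** 2 * p for (a, b), p in joint.items())
cov_XY = sum((a - EX) * (b - EY) * p for (a, b), p in joint.items())

print(var_X, cov_XY)  # 1/4 1/4 -- X determines Y, so the covariance equals the variance
```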
Important Probability Distributions
Uniform distribution
The uniform distribution on an interval $[a, b]$ makes all sub-intervals of the same length equally probable. The corresponding probability density function is
$$p(x) = \begin{cases} \frac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
Bernoulli
The Bernoulli distribution is one of the most basic distributions. A random variable distributed according to the Bernoulli distribution can take on two possible values, $0$ and $1$. It can be specified by a single parameter $p$, and by convention we take $p$ to be $P(X = 1)$. It is often used to indicate whether an experiment is successful or not.
Sometimes it is useful to write the probability distribution of a Bernoulli random variable $X$ as follows:
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
Binomial
The binomial distribution with parameters $n$ and $p$ is the discrete probability distribution of the number of successes in a sequence of $n$ independent experiments, each of which succeeds with probability $p$. When the random variable $X$ follows the binomial distribution with parameters $n$ and $p$, we write $X \sim B(n, p)$. The probability of getting exactly $k$ successes in $n$ experiments is given by the probability mass function:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
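This PMF is a direct translation into code; a minimal sketch using the standard-library `math.comb` for the binomial coefficient (the parameters $n = 10$, $p = 0.5$ are an assumed example):

```python
from math import comb

def binom_pmf(k, n, p):
    # P(X = k) for X ~ B(n, p): choose which k of the n experiments succeed,
    # then multiply the success and failure probabilities.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: 10 fair coin flips.
print(binom_pmf(5, 10, 0.5))  # 252/1024 = 0.24609375

# Sanity check: the PMF sums to 1 over k = 0..n.
assert abs(sum(binom_pmf(k, 10, 0.5) for k in range(11)) - 1) < 1e-12
```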
Poisson
The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. A Poisson random variable $X$ with rate parameter $\lambda > 0$ has probability mass function
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$$
The mean value of a Poisson random variable is $\lambda$, and its variance is also $\lambda$.
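The mean-equals-variance property can be checked numerically; a sketch with an assumed rate $\lambda = 4$, truncating the infinite sum at a point where the remaining tail mass is negligible:

```python
from math import exp, factorial

lam = 4.0  # assumed example rate

def pois_pmf(k):
    # P(X = k) = lambda^k e^{-lambda} / k!
    return lam**k * exp(-lam) / factorial(k)

# Truncate at k = 100; for lambda = 4 the tail beyond this is negligible.
mean = sum(k * pois_pmf(k) for k in range(100))
var = sum((k - mean) ** 2 * pois_pmf(k) for k in range(100))
print(round(mean, 6), round(var, 6))  # both approximately 4.0
```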
Gaussian
The Gaussian distribution, also known as the normal distribution, is one of the most versatile distributions in probability theory, and appears in a wide variety of contexts. It can be used to approximate the binomial distribution when the number of experiments is large, or the Poisson distribution when the average arrival rate is high. It is also related to the Law of Large Numbers. For many problems, we will also often assume that the noise in the system is Gaussian distributed.
The Gaussian distribution is determined by two parameters: the mean $\mu$ and the variance $\sigma^2$. The probability density function is given by
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
We will sometimes work with multivariate Gaussian distributions. An $n$-dimensional multivariate Gaussian distribution is parametrized by $(\mu, \Sigma)$, where $\mu$ is now a vector of means in $\mathbb{R}^n$, and $\Sigma$ is the covariance matrix in $\mathbb{R}^{n \times n}$; in other words, $\Sigma_{ii} = Var(X_i)$ and $\Sigma_{ij} = Cov(X_i, X_j)$. The probability density function is now defined over vectors of inputs:
$$p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
When the covariances $\Sigma_{ij}$ (for $i \ne j$) are zero, the determinant $|\Sigma|$ is simply the product of the variances, and the inverse $\Sigma^{-1}$ can be found by taking the reciprocals of the diagonal entries of $\Sigma$.
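In the diagonal-covariance case the density factorizes into a product of univariate normals. A sketch for an assumed 2-dimensional example, checking the factorization numerically:

```python
import math

def normal_pdf(x, mu, var):
    # Univariate Gaussian density with mean mu and variance var.
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def diag_gauss_pdf(x, mu, variances):
    # Multivariate Gaussian with diagonal covariance: the determinant is the
    # product of the variances and the quadratic form splits coordinate-wise.
    n = len(x)
    det = math.prod(variances)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, variances))
    return math.exp(-quad / 2) / math.sqrt((2 * math.pi) ** n * det)

# Assumed example point and parameters.
x, mu, var = [0.5, -1.0], [0.0, 0.0], [1.0, 2.0]
product_form = normal_pdf(x[0], mu[0], var[0]) * normal_pdf(x[1], mu[1], var[1])
assert abs(diag_gauss_pdf(x, mu, var) - product_form) < 1e-12
```

This is exactly the statement that the components of a Gaussian with diagonal covariance are independent.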