Probability Review

Sample Spaces & Events

VIDEO: Watch this video for a quick tutorial of the concepts below.

Throughout, let \(A, B\) be events in sample space \(S\), i.e. \(A \subseteq S, B \subseteq S\). The following are some “special” events:

  • \(\emptyset\) = the empty / null set containing no outcomes

  • \(A^c\) = “\(A\) complement” = all outcomes in \(S\) that are not in \(A\)

  • \(A \cap B\) = “intersection of \(A\) and \(B\)” = all outcomes in both \(A\) and \(B\)
    \(A\) and \(B\) are mutually exclusive or disjoint if \(A \cap B = \emptyset\)

  • \(A \cup B\) = “union of \(A\) and \(B\)” = all outcomes in \(A\) or \(B\) or both



Probability, Conditional Probability & Independence

VIDEO: Watch this video for a quick tutorial of the probability concepts below.

Recall the axioms & rules of probability.


Probability

Let \(A, B\) be events in sample space \(S\). A probability function \(P\) must satisfy the following axioms:

  1. \(P(\emptyset) = 0\)

  2. \(P(S) = 1\)

  3. If \(A \cap B = \emptyset\), then \(P(A \cup B) = P(A) + P(B)\).

The following properties follow from these axioms:

  1. \(P(A^c) = 1 - P(A)\)

  2. If \(A \subseteq B\), then \(P(A) \le P(B)\)

  3. \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
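
For a concrete check of these properties, here is a minimal R sketch using a fair six-sided die, with made-up events \(A\) = "the roll is even" and \(B\) = "the roll is at least 4" (the events and numbers are purely illustrative):

S <- 1:6                          # sample space of a fair die
A <- S %% 2 == 0                  # TRUE for outcomes in A (even rolls)
B <- S >= 4                       # TRUE for outcomes in B (rolls of 4, 5, 6)

mean(A)                           # P(A) = 3/6, since outcomes are equally likely
1 - mean(A)                       # P(A^c) = 1 - P(A)
mean(A | B)                       # P(A U B) = 4/6
mean(A) + mean(B) - mean(A & B)   # matches P(A) + P(B) - P(A n B)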


Sometimes we know that a certain event, say \(B\), has occurred. This might or might not change the plausibility of another event \(A\). The concepts of conditional probability and independence are summarized here.


Conditional Probability

The conditional probability of \(A\) given \(B\), \(P(A | B)\), is the probability of observing \(A\) given the knowledge that we’ve observed \(B\). In calculating \(P(A | B)\), we calculate the likelihood of \(A\) relative to the shrunken sample space of \(B\):

\[P(A|B) = \frac{P(A\cap B)}{P(B)}\]

This concept is captured in the following image.
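
For intuition, here is a small R sketch that conditions on \(B\) by restricting attention to \(B\)'s outcomes (the die and events below are made up for illustration):

S <- 1:6            # sample space of a fair die
A <- S %% 2 == 0    # A = "the roll is even"
B <- S >= 4         # B = "the roll is at least 4"

mean(A & B) / mean(B)   # P(A | B) = (2/6) / (3/6) = 2/3
mean(A[B])              # same answer: the fraction of B's outcomes that are also in A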


Independence

Events \(A\) and \(B\) are independent if the occurrence of \(B\) doesn’t change our information about \(A\), i.e. \[P(A|B) = P(A)\] Equivalently, \(A\) and \(B\) are independent if \[P(A \cap B) = P(A)P(B)\]

In general, let \(A_1, A_2, ..., A_k\) be \(k\) mutually independent events. That is, every sub-collection of the events is independent (a condition stronger than each pair being independent). Then \[P\left( \cap_{i=1}^k A_i \right) := P(A_1 \cap A_2 \cap \cdots \cap A_k) = P(A_1)P(A_2)\cdots P(A_k) := \prod_{i=1}^k P(A_i) \; .\]
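
For example, if three fair coins are flipped independently, the probability that all three land heads is \(0.5 \times 0.5 \times 0.5 = 0.125\). A one-line check in R (a toy example):

prod(rep(0.5, 3))   # P(all 3 independent flips land heads) = 0.125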




Random Variables

Let \(X, Y\) be random variables (RVs), i.e. numerical functions of a sample space. There are 2 types of RVs:

  • discrete
    Example: \(X\) = number of heads in 10 coin flips. Then \(X \in \{0,1,2,...,10\}\).
  • continuous
    Example: \(X\) = speed of a car. Then \(X \in [0,\infty)\).



Marginal, Joint, & Conditional pdfs

The marginal model of \(X\), i.e. the behavior of \(X\) ignoring all other variables, is captured by its probability density function (pdf) \[f(x)\] where \(f(x)\) describes the possible values of \(X\) and their relative likelihoods.

Properties of a marginal pdf

\(X\) discrete:

  • \(f(x) = P(\{X=x\})\)
  • \(0 \le f(x) \le 1\)
  • \(\sum_{\text{all } x} f(x) = 1\)

\(X\) continuous:

  • \(f(x) \ge 0\) (not a probability)
  • \(\int_{-\infty}^\infty f(x) dx = 1\)


For example, consider a discrete marginal pdf (left) and continuous marginal pdf (right):



The simultaneous behavior of \(X\) and \(Y\) is described by their joint pdf \[f(x,y)\] where \(f(x,y)\) satisfies the following properties.


Properties of a joint pdf

\(X, Y\) discrete:

  • \(f(x,y) = P(\{X=x\} \cap \{Y=y\}) = P(X=x, Y=y) \;\; \text{(shorthand)}\)
  • \(0 \le f(x,y) \le 1\)
  • \(\sum_{\text{all } x} \sum_{\text{all } y} f(x,y) = 1\)

\(X, Y\) continuous:

  • \(f(x,y) \ge 0\) (not a probability)
  • \(\int_{-\infty}^\infty \int_{-\infty}^\infty f(x,y) dx dy = 1\)


For example, consider a discrete joint pdf (left) and continuous joint pdf (right):


Of course, we can use information about the joint behavior of \(X\) and \(Y\) to learn about their marginal behaviors.


Connections between the joint and marginal pdfs

\(X, Y\) discrete: \[\begin{split} f(x) & = \sum_{\text{all } y} f(x,y) \\ f(y) & = \sum_{\text{all } x} f(x,y) \\ \end{split}\]

\(X, Y\) continuous: \[\begin{split} f(x) & = \int_{-\infty}^\infty f(x,y) dy \\ f(y) & = \int_{-\infty}^\infty f(x,y) dx \\ \end{split}\]
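
To make the discrete case concrete, here is a hypothetical joint pmf stored as a matrix in R, with the marginal pmfs obtained by summing over rows or columns (all numbers are made up):

joint <- matrix(c(0.10, 0.20,
                  0.30, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(x = c("0", "1"), y = c("0", "1")))

sum(joint)        # 1, as required of a joint pmf
rowSums(joint)    # marginal pmf f(x): sum of f(x,y) over all y
colSums(joint)    # marginal pmf f(y): sum of f(x,y) over all x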



In some cases, we know the value of \(Y\) but not the value of \(X\). We can utilize this information if we know the conditional behavior of \(X\) when \(Y=y\): \[f(x|y)\]


Properties of a conditional pdf

\(X,Y\) discrete:

  • \(f(x|y) = P(\{X=x\} | \{Y=y\})\)
  • \(0 \le f(x|y) \le 1\)
  • \(\sum_{\text{all } x} f(x|y) = 1\)

\(X,Y\) continuous:

  • \(f(x|y) \ge 0\) (not a probability)
  • \(\int_{-\infty}^\infty f(x|y) dx = 1\)



Recall that we can use the marginal probability (\(P(B)\)) and joint probability (\(P(A \cap B)\)) to calculate the conditional probability (\(P(A|B)\)): \[P(A|B) = \frac{P(A \cap B)}{P(B)}\] Similarly, we can calculate the conditional pdf from the marginal and joint pdfs: \[f(x|y) = \frac{f(x,y)}{f(y)}\]


Bayes’ Rule & Law of Total Probability extended to pdfs

Bayes’ Rule can be extended to RVs:

\[f(x|y) = \frac{f(x,y)}{f(y)} = \frac{f(y|x)f(x)}{f(y)}\]

where, by the Law of Total Probability,

\[f(y) = \begin{cases} \sum_{\text{all } x} f(y|x)f(x) & X \text{ discrete} \\ \int f(y|x)f(x)dx & X \text{ continuous} \\ \end{cases}\]
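
As a small discrete illustration in R (all numbers below are made up): suppose \(X\) can take 3 values with prior pdf \(f(x)\), and we observe \(y\) = 3 successes in 10 trials where each trial succeeds with probability \(x\).

x          <- c(0.2, 0.5, 0.8)                  # possible values of X
prior      <- c(0.25, 0.50, 0.25)               # f(x)
likelihood <- dbinom(3, size = 10, prob = x)    # f(y | x) for the observed y

marginal   <- sum(likelihood * prior)           # f(y), by the Law of Total Probability
posterior  <- likelihood * prior / marginal     # f(x | y), by Bayes' Rule
posterior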


Further, recall that for events \(A\) and \(B\), we say that \(A\) and \(B\) are independent if the knowledge of \(B\) doesn’t change the plausibility of \(A\): \[A,B \text{ independent } \Rightarrow \begin{split} P(A|B) & = P(A) \\ P(A \cap B) & = P(A)P(B) \end{split}\] We can similarly define the independence of 2 random variables.


Independent RVs

RVs \(X\) and \(Y\) are independent if and only if \[\begin{split} f(x|y) & = f(x) \\ f(x,y) & = f(x)f(y) \\ \end{split}\]
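
A quick R sketch of this idea (with made-up marginal pmfs): if the joint pmf is the product of the marginals, then each conditional pmf \(f(x|y)\) reproduces the marginal \(f(x)\).

fx    <- c(0.3, 0.7)                    # marginal pmf of X
fy    <- c(0.4, 0.6)                    # marginal pmf of Y
joint <- outer(fx, fy)                  # f(x,y) = f(x) f(y) under independence

rowSums(joint)                          # recovers f(x)
sweep(joint, 2, colSums(joint), "/")    # each column of f(x|y) equals f(x)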



Properties of RVs

VIDEO: Watch this video for a quick tutorial of expected value & variance.

Important features of an RV \(X\) include its expected value, mode, and variability.



Expected Value / Mean

The expected value of \(X\), \(E(X)\), measures the typical or mean value of \(X\). It is calculated by a weighted average of all possible values of \(X\): \[\begin{align} X \text{ discrete: } && E(X) & = \sum_{\text{all } x} x \; f(x) \\ X \text{ continuous: } && E(X) & = \int_{\text{all } x} x \; f(x) dx \\\end{align}\]

Note: The units of \(E(X)\) are the same as the units of \(X\).
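
For instance (a hedged sketch with a made-up discrete pmf, and a \(Unif(0,10)\) model for the continuous case):

x  <- c(1, 2, 3)
fx <- c(0.2, 0.5, 0.3)
sum(x * fx)       # E(X) = 2.1 for this discrete pmf

# continuous case: X ~ Unif(0,10), so E(X) should be 5
integrate(function(x) x * dunif(x, min = 0, max = 10), lower = 0, upper = 10)$value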



Mode

The mode of \(X\), \(Mode(X)\), is the value at which the pdf is maximized, i.e. the most plausible value of \(X\). \[\begin{align} Mode(X) & = \text{argmax}_{x} \; f(x) \\ \end{align}\]

Note: The units of \(Mode(X)\) are the same as the units of \(X\).
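
For the same made-up discrete pmf as above, the mode is simply the value of \(x\) at which \(f(x)\) is largest:

x  <- c(1, 2, 3)
fx <- c(0.2, 0.5, 0.3)
x[which.max(fx)]   # Mode(X) = 2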



Variance & Standard Deviation

The variance of \(X\), \(Var(X)\), measures the degree of the variability in \(X\). Specifically, it’s the weighted average of squared deviations from the mean, \((X - E(X))^2\): \[\begin{align} X \text{ discrete: } && Var(X) & = \sum_{\text{all } x} (x-E(X))^2 \; f(x) \\ X \text{ continuous: } && Var(X) & = \int_{\text{all } x} (x-E(X))^2 \; f(x) dx \\\end{align}\]

Note: If \(X\) has units \(u\), \(Var(X)\) has units \(u^2\). Thus, to work on the raw units scale, we typically work with the standard deviation of \(X\), \(SD(X)\), instead of \(Var(X)\): \[SD(X) = \sqrt{Var(X)}\]
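
Continuing the same made-up discrete pmf, a short R check of these formulas:

x  <- c(1, 2, 3)
fx <- c(0.2, 0.5, 0.3)
EX <- sum(x * fx)              # E(X) = 2.1

sum((x - EX)^2 * fx)           # Var(X) = 0.49
sqrt(sum((x - EX)^2 * fx))     # SD(X)  = 0.7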



Integration Technique

VIDEO: Watch this video for a quick tutorial of the “integration technique” - a handy tool for calculating expected values & variance (among other things).



Named Models

23.5.1 Binomial / Bernoulli

VIDEO: Watch this video for a quick tutorial of the Binomial model.

Let discrete RV \(X\) be the number of successes in \(n\) trials where:

  • the trials are independent;
  • each trial has an equal probability \(p\) of success

Then \(X\) is Binomial with parameters \(n\) and \(p\):
\[X \sim Bin(n,p)\] with pdf \[f(x) = P(X=x) = \left(\begin{array}{c} n \\ x \end{array} \right) p^x (1-p)^{n-x} \;\; \text{ for } x \in \{0,1,2,...,n\}\]

Properties: \[\begin{split}E(X) & = np \\ Var(X) & = np(1-p) \\ \end{split}\]

NOTE:
In the special case in which \(n=1\), i.e. we only observe 1 trial, \[X \sim Bern(p)\] That is, the \(Bern(p)\) and \(Bin(1,p)\) models are equivalent.

Consider some plots of Binomial RVs for different parameters \(n\) and \(p\):

In R: Suppose \(X \sim Bin(n,p)\)

# Calculate f(x)
dbinom(x, size = n, prob = p)

# Calculate P(X <= x)    
pbinom(x, size = n, prob = p)

# Plot f(x) for x from 0 to n
library(ggplot2)
x <- c(0:n)
binom_data <- data.frame(x = x, fx = dbinom(x, size = n, prob = p))
ggplot(binom_data, aes(x = x, y = fx)) + 
    geom_point()
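
For example, if \(X \sim Bin(10, 0.5)\), then dbinom(6, size = 10, prob = 0.5) returns \(P(X = 6) \approx 0.205\).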




23.5.2 Discrete Uniform

Suppose discrete RV \(X\) is equally likely to be any value in the discrete set \(S = \{s_1,s_2,...,s_n\}\). Then \(X\) has a discrete uniform model on \(S\) with pdf

\[f(x) = P(X=x) = \frac{1}{n} \;\; \text{ for } x \in \{s_1,s_2,...,s_n\}\]
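
In R: there is no built-in pdf function for a general discrete uniform model, but it is easy to work with directly. A sketch with a hypothetical set \(S\):

S <- c(2, 4, 7, 9)                     # hypothetical values s_1, ..., s_n
n <- length(S)

rep(1/n, n)                            # f(x) = 1/n for each x in S
sample(S, size = 5, replace = TRUE)    # simulate 5 draws of X
mean(S)                                # E(X) = sum of s_i * (1/n)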

23.5.3 Geometric

Let discrete RV \(X\) be the number of trials until the 1st success where:

  • the trials are independent;
  • each trial has an equal probability \(p\) of success

Then \(X\) is Geometric with parameter \(p \in (0,1]\):
\[X \sim Geo(p)\] with pdf \[f(x) = P(X=x) = (1-p)^{x-1}p \;\; \text{ for } x \in \{1,2,...\}\]

Properties: \[\begin{split} E(X) & = \frac{1}{p} \\ \text{Mode}(X) & = 1 \\ Var(X) & = \frac{1-p}{p^2} \\ \end{split}\]

Notes:

  • \(f(x) = (\text{probability of $x-1$ failures}) * (\text{probability of 1 success})\)
  • There are other parameterizations of the Geometric model, e.g. counting the number of failures before the 1st success rather than the total number of trials (this is the convention R uses). They are used in different ways but produce the same answers.

Consider some plots of Geometric RVs for different parameters \(p\):

In R: Suppose \(X \sim Geo(p)\)

# Calculate f(x)
# (R's dgeom() counts the x - 1 failures before the 1st success, hence "x - 1")
dgeom(x - 1, prob = p)

# Calculate P(X <= x)    
pgeom(x - 1, prob = p)

# Plot f(x) for x from 1 to m (you pick m)
library(ggplot2)
x <- c(1:m)
geo_data <- data.frame(x = x, fx = dgeom(x-1, prob = p))
ggplot(geo_data, aes(x = x, y = fx)) + 
    geom_point()

23.5.4 Poisson

VIDEO: Watch this video for a quick tutorial of the Poisson model.

Let discrete RV \(X \in \{0,1,2,...\}\) be the number of events in a given time period. The outcome of \(X\) depends upon parameter \(\lambda>0\), the rate at which the events occur (i.e. the average number of events per time interval). In this setting, we can often model \(X\) by a Poisson model: \[X \sim Pois(\lambda)\] with pdf \[f(x) = P(X=x) = \frac{\lambda^x e^{-\lambda}}{x!} \;\; \text{ for } x \in \{0,1,2,...\}\]

Properties: \[\begin{split}E(X) & = \lambda \\ Var(X) & = \lambda \\ \end{split}\] Consider some plots of Poisson RVs for different parameters \(\lambda\):

In R: Suppose \(X \sim Pois(a)\)

# Calculate f(x)
dpois(x, lambda = a)

# Calculate P(X <= x)    
ppois(x, lambda = a)

# Plot f(x) for x from 0 to m (you choose m)
library(ggplot2)
x <- c(0:m)
pois_data <- data.frame(x = x, fx=dpois(x, lambda = a))
ggplot(pois_data, aes(x = x, y = fx)) + 
    geom_point()

23.5.5 Continuous Uniform

If continuous RV \(X\) is uniformly distributed across the interval \([a,b]\), then

\[X \sim Unif(a,b)\]

with pdf

\[f(x) = \frac{1}{b-a} \;\; \text{ for } x \in [a,b]\]

Properties: \[\begin{split}E(X) & = \frac{a+b}{2} \\ Var(X) & = \frac{(b-a)^2}{12} \\ \end{split}\]

Consider the Uniform model for different parameters \(a\) and \(b\):

In R: Suppose \(X \sim Unif(a,b)\)

# Calculate f(x)
dunif(x, min = a, max = b)

# Calculate P(X <= x)    
punif(x, min = a, max = b)

# Plot f(x) for x from a-1 to b+1 (you pick the range)
library(ggplot2)
ggplot(NULL, aes(x = c(a-1, b+1))) + 
    stat_function(fun = dunif, args = list(min = a, max = b))




23.5.6 Beta

VIDEO: Watch this video for a quick tutorial of the Beta model.

The Beta model can be used to model a continuous RV \(X\) that’s restricted to values in the interval [0,1]. It’s defined by two shape parameters, \(\alpha\) and \(\beta\):

\[X \sim Beta(\alpha,\beta)\]

with pdf

\[f(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1} \;\; \text{ for } x \in [0,1]\]

NOTE: For positive integers \(n\), the gamma function simplifies to \(\Gamma(n) = (n-1)!\). Otherwise, \(\Gamma(z) = \int_0^\infty x^{z-1}e^{-x}dx\).

Properties: \[\begin{split} E(X) & = \frac{\alpha}{\alpha + \beta} \\ Mode(X) & = \frac{\alpha-1}{\alpha+\beta-2} \;\; \text{ (if $\alpha, \beta>1$)} \\ Var(X) & = \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha+\beta+1)}\\ \end{split}\]

Consider the Beta model for different parameters \(\alpha\) and \(\beta\):

In R: Suppose \(X \sim Beta(a,b)\)

# Calculate f(x)
dbeta(x, shape1 = a, shape2 = b)

# Calculate P(X <= x)    
pbeta(x, shape1 = a, shape2 = b)

# Plot f(x) for x from 0 to 1 (the support of the Beta model)
library(ggplot2)
ggplot(NULL, aes(x = c(0, 1))) + 
    stat_function(fun = dbeta, args = list(shape1 = a, shape2 = b))



23.5.7 Exponential

Let \(X\) be the waiting time until a given event occurs. For example, \(X\) might be the time until a lightbulb burns out or the time until a bus arrives. The outcome of \(X\) depends upon parameter \(\lambda>0\), the rate at which events occur (i.e. the number of events per time period). In this scenario, \(X\) can often be modeled by an Exponential model: \[X \sim \text{Exp}(\lambda)\]

with pdf \[f(x) = \lambda e^{-\lambda x} \;\; \text{ for } x \in [0,\infty)\]

Properties: \[\begin{split} E(X) & = \frac{1}{\lambda} \\ Var(X) & = \frac{1}{\lambda^2} \\ \end{split}\]

Consider the Exponential model for different parameters \(\lambda\):

In R: Suppose \(X \sim Exp(a)\)

# Calculate f(x)
dexp(x, rate = a)

# Calculate P(X <= x)    
pexp(x, rate = a)

# Plot of f(x) for x from 0 to m (you pick the range)
library(ggplot2)
ggplot(NULL, aes(x = c(0, m))) + 
    stat_function(fun = dexp, args = list(rate = a))



23.5.8 Gamma

VIDEO: Watch this video for a quick tutorial of the Gamma model.

The Gamma model is often appropriate for modeling positive RVs \(X\) (\(X>0\)) with right skew. For example, let \(X\) be the waiting time until a given event occurs \(s\) times. The outcome of \(X\) depends upon both the shape parameter \(s>0\) (the number of events we’re waiting for) and the rate parameter \(r>0\) (the rate at which the events occur). In this setting, \(X\) can often be modeled by a Gamma model: \[X \sim \text{Gamma}(s,r)\]

with pdf \[f(x) = \frac{r^s}{\Gamma(s)}*x^{s-1}e^{-r x} \hspace{.2in} \text{ for } x \in [0,\infty)\] NOTE: For positive integers \(s \in \{1,2,...\}\), the gamma function simplifies to \(\Gamma(s) = (s-1)!\). Otherwise, \(\Gamma(s) = \int_0^\infty x^{s-1}e^{-x}dx\).

Properties: \[\begin{split} E(X) & = \frac{s}{r} \\ Mode(X) & = \frac{s-1}{r} \;\; \text{ (if $s>1$)} \\ Var(X) & = \frac{s}{r^2} \\ \end{split}\]

Consider the Gamma model for different parameters \((s,r)\):

In R: Suppose \(X \sim Gamma(s,r)\)

# Calculate f(x)
dgamma(x, shape = s, rate = r)

# Calculate P(X <= x)    
pgamma(x, shape = s, rate = r)

# Plot of f(x) for x from 0 to m (you pick the range)
library(ggplot2)
ggplot(NULL, aes(x = c(0, m))) + 
    stat_function(fun = dgamma, args = list(shape = s, rate = r))




23.5.9 Normal

Let \(X\) be an RV with a bell-shaped model centered at \(\mu\) and with variance \(\sigma^2\). Then \(X\) is often well-approximated by a Normal model: \[X \sim \text{Normal}(\mu,\sigma^2)\] with pdf \[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}*\exp\left\lbrace - \frac{1}{2}\left(\frac{x-\mu}{\sigma} \right)^2 \right\rbrace \hspace{.2in} \text{ for } x \in (-\infty,\infty)\]

Properties: \[\begin{split} E(X) & = \mu \\ Var(X) & = \sigma^2 \\ \end{split}\]

Consider the Normal model for different parameters \((\mu,\sigma^2)\):

In R: Suppose \(X \sim N(m,s^2)\)

# Calculate f(x)
dnorm(x, mean = m, sd = s)

# Calculate P(X <= x)    
pnorm(x, mean = m, sd = s)

# Plot f(x) for x from a to b (you pick the range)
library(ggplot2)
ggplot(NULL, aes(x = c(a, b))) + 
    stat_function(fun = dnorm, args = list(mean = m, sd = s))