Probability spaces

Naive probability theory is either discrete or continuous. Consider a six-sided die: we can talk about the expected value using the discrete definition \E{X} = \sum_x x ⋅ \Pr{X = x} and find 3.5. Similarly, consider a wheel spinner with values in \delim[{0,1}): we can talk about the expected value using the continuous definition \E{X} = \int_0^1 x ⋅ p(x) \d x and find 0.5. But now we do something interesting: we remove the 1 face of the die and in its place we put the spinner. Now when we roll the die, we get either one of the values 2, 3, 4, 5, 6, or a number in \delim[{0,1}). If we roll this die a number of times and take the average, we find it is about 3.4; with a bit of reasoning we can deduce the exact value \frac{41}{12}. The outcomes are neither purely discrete nor purely continuous, so how do we rigorously define what we intuitively mean (no pun intended) by expected value?
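Example. A quick Monte Carlo sketch of the mixed die (plain Python; the helper name roll_mixed_die is made up for illustration). The sample mean hovers around \frac{41}{12} ≈ 3.417, matching the split into a discrete contribution \frac{2+3+4+5+6}{6} and a continuous contribution \frac 1 6 ⋅ \frac 1 2.

```python
import random

def roll_mixed_die():
    """Roll the modified die: faces 2..6, or the [0, 1) spinner in place of face 1."""
    face = random.randint(1, 6)
    if face == 1:
        return random.random()   # the spinner: uniform on [0, 1)
    return face

# Monte Carlo estimate of the expected value.
samples = [roll_mixed_die() for _ in range(1_000_000)]
print("estimate:", sum(samples) / len(samples))

# Exact value: (2+3+4+5+6)/6 from the discrete faces plus (1/6)*(1/2) from the spinner.
print("exact:   ", (2 + 3 + 4 + 5 + 6) / 6 + (1 / 6) * (1 / 2))   # 3.41666... = 41/12
```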

Another limitation of naive probability theory concerns distributions such as the Dirac delta function. These are ill-defined in the usual setting of real-valued functions and Riemann integrals, yet they make intuitive sense and are useful in practice. How do we make them rigorous?
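Example. One informal way to see what the delta is supposed to do: integrate a test function against ever narrower bumps of unit area, and the result approaches the value of the test function at 0. A minimal numerical sketch (the names delta_approx and integrate are ad hoc, and a plain midpoint Riemann sum stands in for a proper integral):

```python
import math

def delta_approx(x, eps):
    """A triangular bump of half-width eps, height 1/eps and total area 1."""
    return max(0.0, (1 - abs(x) / eps) / eps)

def integrate(f, a, b, n=100_000):
    """Plain midpoint Riemann sum of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

f = math.cos   # any continuous test function; f(0) = 1
for eps in (1.0, 0.1, 0.01):
    val = integrate(lambda x: f(x) * delta_approx(x, eps), -2, 2)
    print(f"eps = {eps:>4}: integral ≈ {val:.6f}")   # approaches f(0) = 1 as eps → 0
```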

Note. I use the operator notation A_B^C as shorthand for A_{C ∈ B}, meaning to apply the operator A with bound variable C ranging over the set B. For example, \sum_{[0,n)}^i x_i denotes the sum over x_0, \dots, x_{n-1}.

Measure theory

σ-algebra

Definition. Given a set Ω, a σ-algebra on Ω is a set of subsets Σ ⊆ \powerset{Ω} such that it contains Ω itself,

\tag{1} Ω ∈ Σ

is closed under complements, for any S ∈ Σ

\tag{2} \p{Ω \setminus S} ∈ Σ

and is closed under countable unions, for any I ∈ \powerset{Σ} such that \card{I} ≤ \aleph_0

\tag{3} \p{\Union_{S ∈ I} S} ∈ Σ

Note. Within a particular σ-algebra, Ω acts as a universe and I will use \comp E to denote the complement with respect to Ω

\comp E ≜ Ω \setminus E

From the definition it follows that ∅ ∈ Σ, because ∅ = \comp Ω, and that Σ is also closed under countable intersections: given I ∈ \powerset Σ such that \card{I} ≤ \aleph_0

\Intersection_{S ∈ I} S = \comp{\Union_{S ∈ I} \comp S} ∈ Σ
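Example. In the finite case countable unions reduce to finite unions, so the axioms can be checked by brute force. A Python sketch that builds the σ-algebra generated by a partition of Ω = \set{0, \dots, 5} (a standard fact: it consists of all unions of blocks) and verifies the closure properties; the helper name unions_of_blocks is made up.

```python
from itertools import chain, combinations

# Ω and a partition of it; the generated σ-algebra consists of all unions of blocks.
omega = frozenset(range(6))
blocks = [frozenset({0}), frozenset({1, 2}), frozenset({3, 4, 5})]

def unions_of_blocks(blocks):
    """All unions of sub-collections of the partition blocks (finite case)."""
    collections = chain.from_iterable(combinations(blocks, r) for r in range(len(blocks) + 1))
    return {frozenset().union(*c) for c in collections}

sigma = unions_of_blocks(blocks)

# σ-algebra axioms; in the finite case finite unions are all there is to check.
assert omega in sigma                                         # (1) contains Ω
assert all(omega - s in sigma for s in sigma)                 # (2) closed under complement
assert all(s | t in sigma for s in sigma for t in sigma)      # (3) closed under unions
assert all(s & t in sigma for s in sigma for t in sigma)      # derived: closed under intersections
print(sorted(map(sorted, sigma)))
```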

Measure space

Definition. A pair (Ω, Σ) is a measurable space iff Σ is a σ-algebra over Ω.

Note. In the probability theory that follows below, the set Ω will represent the outcome space and Σ the event space.

Definition. Given a measurable space (Ω, Σ), a function μ: Σ → [0, ∞] is a measure on (Ω, Σ) iff

  1. \∀_Σ^E μ(E) \ge 0,
  2. μ(∅) = 0, and
  3. \∀_{\powerset{Σ}}^S \ \card S ≤ \aleph_0 ∧ \p{\∀_S^A \∀_{S \setminus \set{A}}^B A ∩ B = ∅} → μ\p{\Union_S^E E} = \sum_S^E μ(E).

Note. While measure theory allows μ to range over [0, ∞], in probability theory the measure will be a probability measure, restricted to [0, 1] with μ(Ω) = 1; see below. For now it is kept generic.

Definition. A triple (Ω, Σ, μ) is a measure space iff

  1. Ω is a set.
  2. Σ is a σ-algebra on set Ω.
  3. μ is a measure on (Ω, Σ).
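Example. A minimal finite sketch of a measure space, assuming the simplest ingredients: Ω = \set{0, \dots, 5}, Σ = \powerset Ω and μ a weighted counting measure (the names weight, powerset and mu are made up). The measure axioms are checked directly; on a finite space countable additivity reduces to finite additivity.

```python
from itertools import chain, combinations

# A small measure space: Ω = {0,...,5}, Σ = the power set, μ = weighted counting measure.
omega = frozenset(range(6))
weight = {x: 1.0 for x in omega}   # any non-negative weights would do

def powerset(s):
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def mu(event):
    """μ(E) = Σ_{x ∈ E} weight(x)."""
    return sum(weight[x] for x in event)

sigma = powerset(omega)            # the power set is always a σ-algebra on Ω

assert mu(frozenset()) == 0                                   # μ(∅) = 0
assert all(mu(e) >= 0 for e in sigma)                         # non-negativity
a, b = frozenset({0, 1}), frozenset({3, 4, 5})                # a disjoint pair of events
assert a & b == frozenset() and mu(a | b) == mu(a) + mu(b)    # additivity
print("μ(Ω) =", mu(omega))
```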

Theorem. Given a measure space (Ω, Σ, μ) then

  1. If S, T ∈ Σ and S ⊂ T then μ(S) ≤ μ(T).
  2. Given countable I ∈ \powerset Σ then μ\p{\Union_{S ∈ I} S} ≤ \sum_{S ∈ I} μ(S).
  3. Given an ascending chain S_1 ⊂ S_2 ⊂ S_3 ⊂ ⋯ in Σ then μ\p{\Union_i S_i} = \lim_{i \to ∞} μ(S_i).
  4. Given a descending chain S_1 ⊃ S_2 ⊃ S_3 ⊃ ⋯ in Σ with μ(S_1) < ∞ then μ\p{\Intersection_i S_i} = \lim_{i \to ∞} μ(S_i).
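Example. Properties 3 and 4 illustrated with the length of an interval as the measure (this is the Lebesgue measure of the next section, taken on faith here): S_i = [0, 1 - \frac 1 i] increases to [0, 1) and μ(S_i) increases to 1, while T_i = (0, \frac 1 i) decreases to ∅ and μ(T_i) decreases to 0. A trivial numerical sanity check:

```python
# Ascending: S_i = [0, 1 - 1/i] increases to [0, 1), and μ(S_i) = 1 - 1/i increases to 1.
print([round(1 - 1 / i, 3) for i in range(1, 9)], "→ μ(∪ S_i) = 1")

# Descending: T_i = (0, 1/i) decreases to ∅, and μ(T_i) = 1/i decreases to 0.
# Note that μ(T_1) = 1 < ∞, which is what makes property 4 applicable.
print([round(1 / i, 3) for i in range(1, 9)], "→ μ(∩ T_i) = 0")
```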

Definition. Given a measure space (Ω, Σ, μ), a subset S ∈ Σ is a μ-null set iff μ(S) = 0. A subset S ∈ Σ is a μ-full measure set iff \comp S is a μ-null set.

Definition. Given a measure space (Ω, Σ, μ), a property P of Ω holds μ-almost everywhere iff there exists a μ-null set S such that P(x) for all x ∈ \comp S.

Note. In probability theory, this is also known as μ-almost surely and implies that the probability that the property holds is 1.

Definition. A measure space (Ω, Σ, μ) is a complete measure space iff for every μ-null set S we have \powerset S ⊆ Σ.

Definition. Given a measure space (Ω, Σ, μ), the completion is the smallest extension such that the extended measure space is complete.

Lebesgue measure

Lebesgue's decomposition theorem

μ = μ_{\text{continuous}} + μ_{\text{discrete}} + μ_{\text{singular}}
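Example. The mixed die from the introduction is a concrete instance: its distribution is the sum of a discrete part (point masses of \frac 1 6 at the faces 2 through 6) and a continuous part (density \frac 1 6 on \delim[{0,1})), with no singular part. The expected value splits accordingly; an exact computation with Python fractions:

```python
from fractions import Fraction

# Discrete part: point masses of 1/6 at the faces 2..6.
discrete = sum(Fraction(1, 6) * x for x in [2, 3, 4, 5, 6])      # Σ x · Pr{X = x}

# Continuous part: density 1/6 on [0, 1), contributing ∫_0^1 x · (1/6) dx = 1/12.
continuous = Fraction(1, 6) * Fraction(1, 2)

print(discrete + continuous)   # 41/12
```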

Measurable function

To do. Define measurable function.

Pushforward measure

Definition. Given two measurable spaces (Ω_1, Σ_1) and (Ω_2, Σ_2), a measure μ: Σ_1 → \R ∪ \set{+∞} and a measurable function f: Ω_1 → Ω_2, the pushforward measure f_* μ: Σ_2 → \R ∪ \set{+∞} is defined by:

f_* μ \p{B} ≜ μ\p{f^{-1}\p{B}}

Theorem. The pushforward measure is a measure on (Ω_2, Σ_2).
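Example. A finite sketch of a pushforward: a fair die on Ω_1 = \set{1, \dots, 6} pushed forward along the parity map f to Ω_2 = \set{\text{odd}, \text{even}} (the helper pushforward and the labels are made up). Measuring an event in Ω_2 is just measuring its preimage in Ω_1.

```python
from fractions import Fraction

# A fair die on Ω₁ = {1,...,6} and the parity map f: Ω₁ → Ω₂ = {"odd", "even"}.
mu = {w: Fraction(1, 6) for w in range(1, 7)}
f = lambda w: "even" if w % 2 == 0 else "odd"

def pushforward(mu, f, event):
    """(f_*μ)(B) = μ(f⁻¹(B)): measure the preimage of the event B ⊆ Ω₂."""
    return sum(p for w, p in mu.items() if f(w) in event)

print(pushforward(mu, f, {"even"}))           # 1/2
print(pushforward(mu, f, {"odd", "even"}))    # 1, so f_*μ is again a probability measure
```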

Lebesgue integration

To do. Define the Lebesgue integral \int_Ω ⋅ \d \Pr{ω}.

Definition. For a measurable function f: Ω → \R, the following notations denote the same integral

\int_Ω f \d \operatorname{Pr} = \int_Ω f\p{x} \d \Pr{x}
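Example. Pending the to-do above, the guiding idea of the Lebesgue integral is to slice the range of f into levels and weigh each level set by its measure, rather than slicing the domain as the Riemann integral does. A sketch for f(x) = x^2 on \delim[{0,1}) with the interval-length measure, where each level set is an interval whose measure is its length (the function name is made up):

```python
import math

def lebesgue_integral_x_squared(n):
    """∫ s_n dμ for the simple function s_n = ⌊n·f⌋/n approximating f(x) = x² on [0, 1)."""
    total = 0.0
    for k in range(n):
        level = k / n                                            # value of s_n on this level set
        # {x ∈ [0,1) : k/n ≤ x² < (k+1)/n} = [√(k/n), √((k+1)/n)), measured by its length.
        measure = math.sqrt(min((k + 1) / n, 1)) - math.sqrt(k / n)
        total += level * measure
    return total

for n in (10, 100, 10_000):
    print(n, lebesgue_integral_x_squared(n))   # approaches ∫_0^1 x² dx = 1/3 from below
```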

Radon-Nikodym theorem

Theorem. Given a measurable space (Ω, Σ) with two σ-finite measures μ and ν such that \forall_Σ^A\ μ(A) = 0 → ν(A) = 0 (i.e. ν ≪ μ, ν is absolutely continuous with respect to μ), then there exists a measurable function f: Ω → \R ∪ \set{+∞} such that

\forall_Σ^A\ ν(A) = \int_A f(ω) \d μ(ω)

furthermore this function is unique up to a μ-null set.

Definition. Denote the function f from the above theorem as the Radon–Nikodym derivative \frac{\d ν}{\d μ}.
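Example. On a finite space with Σ = \powerset Ω the Radon–Nikodym derivative is simply the pointwise ratio of point masses, provided μ charges every point (so that ν ≪ μ holds). A sketch with a fair die μ and a loaded die ν (the loaded weights are made up and chosen to sum to 1), verifying ν(A) = \int_A \frac{\d ν}{\d μ} \d μ on a sample event:

```python
from fractions import Fraction

# Two probability measures on Ω = {1,...,6} with Σ = the power set:
# μ is a fair die, ν a loaded die.  ν ≪ μ because μ gives every point positive mass.
omega = range(1, 7)
mu = {w: Fraction(1, 6) for w in omega}
nu = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 6),
      4: Fraction(1, 6), 5: Fraction(1, 4), 6: Fraction(1, 4)}

# On a finite space the Radon–Nikodym derivative is the pointwise ratio of masses.
dnu_dmu = {w: nu[w] / mu[w] for w in omega}

# Verify ν(A) = ∫_A (dν/dμ) dμ = Σ_{ω ∈ A} f(ω) · μ({ω}) on a sample event A.
A = {2, 3, 5}
print(sum(nu[w] for w in A), sum(dnu_dmu[w] * mu[w] for w in A))   # both 1/2
```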

Probability theory

Probability measure

Definition. A measure \operatorname{Pr} on (Ω, Σ) is a probability measure iff \Pr{Ω} = 1.

Theorem. From this it follows that \operatorname{Pr}: Σ → [0, 1], since every E ∈ Σ satisfies E ⊆ Ω and hence \Pr E ≤ \Pr Ω = 1 by monotonicity.

Probability space

Definition. A measure space (Ω, Σ, \operatorname{Pr}) is a probability space iff \Pr{Ω} = 1.

Note. The set Ω is known as the sample space. The members of Ω are known as outcomes. The set Σ is known as the event space and members of Σ are known as events. If an event is a singleton (i.e. it contains a single outcome) it is known as an elementary event.

Note. From the above definitions the Kolmogorov axioms are apparent.

To do. Elementary theorems from https://ermongroup.github.io/cs228-notes/preliminaries/probabilityreview/.

From here on, assume we are given a probability space (Ω, Σ, \operatorname{Pr}).

Note. Given an event e ∈ Σ and a number n ∈ (0, \infty), the odds of e are “\frac{n ⋅ \Pr{e}}{1 - \Pr{e}} to n” and the odds against e are “\frac{n⋅\p{1 - \Pr{e}}}{\Pr{e}} to n”. If n is left out it is assumed to be 1. The logit or log-odds of e is \log \frac{\Pr{e}}{1 - \Pr{e}}. The log-probability of e is \log \Pr e.
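Example. These quantities for a made-up event with \Pr e = 0.75, using natural logarithms (matching the default base chosen below):

```python
import math

p = 0.75                        # a made-up event probability Pr{e}

odds_for = p / (1 - p)          # 3.0, i.e. "3 to 1"
odds_against = (1 - p) / p      # 0.333..., i.e. "1 to 3"
logit = math.log(p / (1 - p))   # log-odds ≈ 1.0986
log_p = math.log(p)             # log-probability ≈ -0.2877

print(f"odds for e:     {odds_for:g} to 1")
print(f"odds against e: 1 to {1 / odds_against:g}")
print(f"logit:          {logit:.4f}")
print(f"log Pr:         {log_p:.4f}")
```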

To do. https://en.wikipedia.org/wiki/Odds_ratio https://en.wikipedia.org/wiki/Risk_ratio Values from here https://en.wikipedia.org/wiki/Odds_ratio#Numerical_example, https://en.wikipedia.org/wiki/Odds_ratio#See_also https://en.wikipedia.org/wiki/Category:Summary_statistics_for_contingency_tables

Definition. Given a base b > 1, the information content \operatorname I : Σ → \R ∪ \set{+∞} is defined by

\operatorname I \p e ≜ - \log_b \Pr e

Note. The information content is also called self-information, surprisal or Shannon information.

Note. For base b=2 the units of \operatorname I are called bits or shannons, for b=\operatorname e they are called nats and for b=10 hartleys. These are collectively units of information. From here on, if the base is not specified it is \operatorname e.
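Example. Information content of a few events under different bases (the helper information is made up; the probabilities are those of a fair coin flip and one face of a fair die):

```python
import math

def information(p, base=math.e):
    """I(e) = -log_b Pr{e}: +∞ for impossible events, 0 for sure events."""
    if p == 0:
        return math.inf
    return 0.0 if p == 1 else -math.log(p, base)

print(information(1 / 2, base=2))   # a fair coin flip: 1 bit
print(information(1 / 6, base=2))   # one face of a fair die: ≈ 2.585 bits
print(information(1 / 6))           # the same event: ≈ 1.792 nats
print(information(1))               # a sure event carries no information
```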

Conditional probability

Definition. Given A, B ∈ Σ with \Pr B > 0, the conditional probability of A given B is

\Prc AB ≜ \frac{\Pr{A ∩ B}}{\Pr B}

To do. Likelihood and such.

Theorem (Bayes' rule).

\Prc AB = \Prc BA ⋅ \frac{\Pr{A}}{\Pr B}

Proof. Expand the definition of conditional probability on both sides; since \Pr{A ∩ B} = \Pr{B ∩ A} and the factor \Pr A cancels, both sides agree:

\frac{\Pr{A ∩ B}}{\Pr B} = \frac{\Pr{B ∩ A}}{\Pr A} ⋅ \frac{\Pr{A}}{\Pr B}
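Example. Bayes' rule with made-up numbers: a rare hypothesis B with \Pr B = 0.01 and evidence A with \Prc AB = 0.95 and \Prc{A}{\comp B} = 0.05. The sketch also uses the law of total probability \Pr A = \Prc AB \Pr B + \Prc{A}{\comp B} \Pr{\comp B}, which is not stated above but follows directly from additivity.

```python
# Made-up numbers: a rare hypothesis B and evidence A that is much more likely under B.
pr_b = 0.01               # Pr{B}
pr_a_given_b = 0.95       # Pr{A | B}
pr_a_given_not_b = 0.05   # Pr{A | ∁B}

# Law of total probability, then Bayes' rule.
pr_a = pr_a_given_b * pr_b + pr_a_given_not_b * (1 - pr_b)
pr_b_given_a = pr_a_given_b * pr_b / pr_a

print(f"Pr{{A}}     = {pr_a:.4f}")          # 0.0590
print(f"Pr{{B | A}} = {pr_b_given_a:.4f}")  # ≈ 0.1610: strong evidence, B still unlikely
```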

Independence

Definition. Given A, B ∈ Σ, A and B are independent iff

\Pr{A ∩ B} = \Pr A ⋅ \Pr B
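Example. A brute-force check on the two-fair-dice space Ω = \set{1,\dots,6}^2 with the uniform measure (the helpers pr, a, b, c are made up): “first die is even” and “second die shows 6” are independent, while “first die is even” and “the sum is at least 10” are not.

```python
from fractions import Fraction
from itertools import product

# Two fair dice: Ω = {1,...,6}², every outcome has probability 1/36.
omega = list(product(range(1, 7), repeat=2))
pr = lambda event: Fraction(sum(1 for w in omega if event(w)), len(omega))

a = lambda w: w[0] % 2 == 0        # first die is even
b = lambda w: w[1] == 6            # second die shows 6
c = lambda w: w[0] + w[1] >= 10    # the sum is at least 10

print(pr(lambda w: a(w) and b(w)) == pr(a) * pr(b))   # True: independent
print(pr(lambda w: a(w) and c(w)) == pr(a) * pr(c))   # False: dependent
print(pr(lambda w: a(w) and c(w)), pr(a) * pr(c))     # 1/9 versus 1/12
```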
