Giry Monad = Probability Monad
==============================

Measurable Space
----------------

The category is of measurable spaces and measurable functions:

* An object is a measurable space: A set of points X with a σ-algebra.
  I also use X to refer to the whole structure, and σX to refer to the
  σ-algebra.

* A morphism f: A → B is a measurable function: For each S: σB,
  invIm f S : σA.  So it is OK to assert invIm f : σB → σA.

An important measurable space is the unit interval UI with its standard
measurable subsets:

* UI = [0,1]

* σUI = say, Borel σ-algebra (smallest σ-algebra having all open
  subsets).  Could also be larger, e.g., Carathéodory's definition of
  Lebesgue-measurable sets.  Doesn't really matter.

UI will be the codomain of a lot of functions later, since we are doing
probability!

Measure & Integral
------------------

A measure is a function μ : σX → ℝ≥0,+∞ that is σ-additive and μ {} = 0.

We are more interested in: A probability measure is a function
μ : σX → [0,1] that is σ-additive, μ X = 1.  (Then μ {} = 0 is
derivable: μ X + μ {} = μ X.)

Recall that X becomes the sample space, σX the event space. :)

We can integrate a measurable function f: X → ℝ,±∞ (conditions apply)
over the whole space X wrt a measure μ, traditionally written
∫_{X} f dμ, but I will also write it as

    int[X] μ f

(the “[X]” is an optional type annotation).  We care about this because
it gives the weighted average a.k.a. expected value of f if μ is a
probability measure.  (Recall that f is a.k.a. a random variable!)

(Traditionally there is also ∫_{E} f dμ for arbitrary E: σX.  I don't
need it often in these notes.  When it is needed, it's
∫_{X} (f × char E) dμ or any other way of restricting f to E.)

For a simple function f (linear combination of characteristic functions
that use measurable sets), e.g.,

    f = λx. Σi:0,..k. a i × char (E i) x

the integral is given by

    int μ f = Σi:0,..k. a i × μ (E i)

For a non-negative measurable function f: X → ℝ≥0,+∞, the following two
theorems combine to enable a limiting process via a sequence of simple
functions.

Theorem (Simple Limit): There exists a sequence ϕ of simple functions
X → ℝ≥0 s.t.

* (pointwise) lim n. ϕ n = f
* (pointwise) ϕ n ≤ ϕ (n+1) (so ≤ the limit f, too; and lim = sup)
* ϕ n is as bounded as f
* (if μ is σ-finite, i.e., can break X into countably many finite-μ
  parts) ϕ n vanishes outside a set of finite measure

(Here is one way: cap the y-axis at n, subdivide into n×2^n divisions
(so each [i, i+1] has 2^n divisions), use that to pixelate f,
under-approximating.)

Theorem (Monotone Convergence, MC): Let ϕ be a sequence of measurable
functions X → ℝ≥0,+∞ s.t.

* (pointwise, a.e.) lim n. ϕ n = f
* (pointwise) ϕ n ≤ f

Then int μ f = lim n. int μ (ϕ n)

Most integrals in these notes compute probabilities, so we focus on
functions to codomain ℝ≥0,+∞ or even [0,1]; but integrals can also
compute expected values of random variables, where it is useful to
allow negative numbers, e.g., the expected value of a gamble.  Signed
integrals are defined by subtracting two non-negative integrals:

    int μ f = int μ (f ⊔ const 0) - int μ (-f ⊔ const 0)

provided that it is not ∞ - ∞.

Below are a few more handy properties of integration.  Proofs are by
SL, MC, and sign-extension.

Theorem (Change of Variable, CV): Let k: X → Z, f: Z → ℝ,±∞, μ a
measure on X.

    int[Z] (μ ∘ invIm k) f = int[X] μ (f ∘ k)

Theorem: (λE:σX. int μ (f × char E)) is a measure.  (If f is signed, we
get a signed measure; not used in these notes.)

Intuition: Such a measure is a weighted measure, the weight function
being f.  (This is relative to μ, if we think of μ as an unweighted,
uniform measure.)
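For concreteness, here is a minimal Haskell sketch of this reweighting
on a finite space.  The die, the weighted-list encoding of a measure,
and all the names below are my own illustrative choices, not anything
from the notes: `int` is the finite sum that the simple-function
formula reduces to, and `reweight w` builds λE. int μ (w × char E) with
w playing the role of the weight function f.

```haskell
type Point   = Int
type Measure = [(Point, Double)]             -- mass of each singleton
type Event   = Point -> Bool                 -- a set E, given as char E

-- int μ f = Σx. f x × μ {x}  (what the simple-function formula becomes here)
int :: Measure -> (Point -> Double) -> Double
int mu f = sum [ f x * m | (x, m) <- mu ]

measureOf :: Measure -> Event -> Double      -- μ E
measureOf mu e = int mu (\x -> if e x then 1 else 0)

-- the weighted measure λE. int μ (w × char E) from the theorem above
reweight :: (Point -> Double) -> Measure -> Measure
reweight w mu = [ (x, w x * m) | (x, m) <- mu ]

main :: IO ()
main = do
  let die = [ (i, 1/6) | i <- [1..6] ]       -- uniform μ on {1,..,6}
      w x = fromIntegral x                   -- a weight function
  -- int μ (w × char E) with E = the even faces ...
  print (int die (\x -> w x * (if even x then 1 else 0)))
  -- ... equals the reweighted measure applied to the same E
  print (measureOf (reweight w die) even)
```

Both prints give 2.0: the w-weighted mass of the even faces, whether we
integrate w × char E against μ or evaluate the reweighted measure at E.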
And so we have two ways of expressing weighted integration:

Theorem (Weighted Integration): Let w, f: X → ℝ,±∞.

    int (λE:σX. int μ (w × char E)) f = int μ (w × f)

Corollary (Importance Sampling): Let g: X → ℝ,±∞, i: X → ℝ>0.

    int μ g = int (λE:σX. int μ (i × char E)) (g / i)

Proof: Set w := i, f := g/i.

In applications of importance sampling, the integral is computed by
Monte Carlo integration, usually with i: X → (0,1] a probability
density function, and so we sample X with that bias and average g/i.
If i is chosen wisely (so g/i doesn't vary wildly), this can give good
accuracy with a modest sample size.

Giry Monad
----------

The Giry monad is the functor Pr and monad structure defined below.
(Proof of functoriality and monadness in later sections.)

### Object map

Pr X = {μ : σX → [0,1] | μ is σ-additive, μ X = 1}

So, the set of probability measures of X.

σ(Pr X) = σPr X = smallest σ-algebra having invIm (eval E) i for every
E:σX, i:σUI.  eval is defined as:

    eval : σX → Pr X → UI
    eval = λE. λμ. μ E

IOW we want every eval E : Pr X → UI to be a measurable function, and
we make σ(Pr X) just big enough for that.  There are multiple
motivations:

* The example X = {head, tail}, so Pr X = Bernoulli distributions,
  perhaps ≅ [0,1] via eval {head}, which perhaps should be made
  measurable.  This is worked out in a later section.

* The definition of join below integrates an eval E.

* We would like functions of the form (λμ: Pr A. int[A] μ f) to be
  measurable.  (join is such a function.)  In the simplest case,
  f = char E, (λμ: Pr A. int μ (char E)) = eval E.

Understanding σPr X (which defines the structure of Pr X, e.g., what
its measurable functions are) can be hard for arbitrary X, or even for
many common cases such as X=ℝ.  But we always know two things: it makes
eval E measurable, and it is the smallest for that, so there is an
induction principle.  Below we state and prove two useful consequences,
one concerning A → Pr B, the other Pr A → UI (or generally Pr A → an
interval).

For a function in Set(A, Pr B) to be measurable, induction on σPr B is
involved.  The lemma below saves the repetitive part; it also
characterizes all such functions.

Lemma (Measure-Valued Function, MVF): Let f: Set(A, Pr B).
f: A → Pr B iff (∀E:σB. flip f E : A → UI).

We note that flip f E has several useful forms:
flip f E = (λa:A. f a E) = eval E ∘ f.

Remark: Recall that morphisms like f: A → Pr B are Kleisli arrows; they
are also a general form of probability transition matrices (“Markov
kernels”) for Markov chains.  The lemma is connected to how Markov
kernels are defined (though traditionally it is uncurry f that is a
Markov kernel): κ: Set(A × σB, [0,1]) is a Markov kernel iff:

* ∀a:A. κ(a, -): Pr B
* ∀E:σB. κ(-, E): A → [0,1].

(So, as per the lemma.)

Proof: “⇒” is easy by composing two measurable functions.  For “⇐”, we
prove (∀E: σPr B. invIm f E : σA) by induction on E:

* invIm f {} : σA
  Because invIm preserves {}, which is in σA.

* ∀E:σPr B. invIm f E : σA ⇒ invIm f (Pr B - E) : σA
  Because invIm preserves complement, which σA is closed under.

* ∀E:ℕ→σPr B. (∀n. invIm f (E n) : σA) ⇒ invIm f (⋃E) : σA
  Because invIm preserves countable (even arbitrary) union, which σA is
  closed under.

* ∀S:σB, i:σUI. invIm f (invIm (eval S) i) : σA
  Because of the premise, and flip f S = eval S ∘ f.
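To see the lemma's two views side by side, here is a small finite
Haskell sketch.  The Weather space, the weighted-list encoding of
measures, and the kernel k are all hypothetical illustrations of mine:
k is a measure-valued function, and flip k E = eval E ∘ k is a plain
[0,1]-valued function on the domain.

```haskell
data Weather = Sunny | Rainy deriving (Eq, Show)

type Meas b = [(b, Double)]                 -- finite probability measure

-- k a is a probability measure on Weather for every a (first view)
k :: Weather -> Meas Weather
k Sunny = [(Sunny, 0.8), (Rainy, 0.2)]
k Rainy = [(Sunny, 0.4), (Rainy, 0.6)]

evalE :: (b -> Bool) -> Meas b -> Double    -- eval E μ = μ E
evalE e mu = sum [ m | (b, m) <- mu, e b ]

-- flip k E = eval E ∘ k : A -> [0,1]       (second view)
flipK :: (Weather -> Bool) -> Weather -> Double
flipK e a = evalE e (k a)

main :: IO ()
main = print (map (flipK (== Rainy)) [Sunny, Rainy])   -- [0.2,0.6]
```

In uncurried form, flipK and k together are exactly the two conditions
a Markov kernel has to satisfy in the remark above.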
For Pr A → interval, the lemma below constructs a lot more members than
eval E.  Here is a motivation:

    eval E = (λμ. int μ (char E))

If this is a measurable function, perhaps we can generalize from char E
to arbitrary f.

Lemma (Function of Measures from Integration, FMI): Let I be a closed
interval subspace of ℝ,±∞, f: A→I.  (E.g., I can be UI, ℝ≥0,+∞, ℝ,±∞.)

    (λμ: Pr A. int μ f) : Pr A → I

(In case f is a signed function, on condition that the integral makes
sense for all μ.)

Proof: The output interval is still I because μ is bounded in [0,1].
Measurability progresses from characteristic, then simple, to general:

* If f = char E, where E:σA:

      λμ: Pr A. int μ (char E)
      = λμ: Pr A. μ E
      = eval E

  is measurable by construction of σ(Pr A).

* If f is a simple function, i.e., a linear combination of
  characteristic functions: measurability is closed under linear
  combination.

* MC-extend to non-negative f: By SL, use an increasing sequence ϕ of
  simple functions to converge to f:

      λμ: Pr A. int μ f
      = λμ: Pr A. sup n. int μ (ϕ n)
      = sup n. λμ: Pr A. int μ (ϕ n)      sup of funcs = pointwise extend

  and measurability is closed under sup.

* Extend to signed f: the subtraction of two measurable functions is
  measurable.

Remark: The proof uses closure of measurable functions under linear
combinations and sup, seeded by eval E.  The closure corresponds to
applying σ-algebra operations to measurable sets, and the seeding
corresponds to invIm (eval E) i.  So the proof parallels the
construction of σ(Pr A).

### Morphism map

Pr f : Pr X → Pr Y
Pr f = λμ. μ ∘ invIm f

This is well motivated by considering examples of changes of variables,
programming to the type, or borrowing from the composite functor
Hom(Powerset(-), UI).

Proof that this is a measurable function: By MVF, let S:σY, then

    λμ. (Pr f μ) S
    = λμ. (μ ∘ invIm f) S
    = λμ. μ (invIm f S)
    = eval (invIm f S)        invIm f S : σX
    : Pr X → UI

Corollary (Change of Variable but with Functor): Let f: A→B, μ: Pr A,
g: B→ℝ,±∞.

    int[B] (Pr f μ) g = int[B] (μ ∘ invIm f) g = int[A] μ (g ∘ f)

### unit

unit: ∀X. X → Pr X
unit[X] x = λs:σX. char s x = λs:σX. if x∈s then 1 else 0

A.k.a. Dirac measure.  It stands for a refreshing desert of determinism
in the oasis of stochasticism. >:)

(Exercise: Check that unit x is a probability measure.)

Proof that unit is a measurable function: By MVF, let S:σX, then

    λx:X. unit x S
    = λx:X. char S x
    = char S                  simple functions are measurable
    : X → UI

Here are some useful and insightful properties:

Lemma (Dirac 1-Point): Let P: X → Bool be a measurable predicate (true
on a measurable set, false on another; exactly a measurable function to
Bool).  Under the measure unit x, (a.e. P) iff P x.

Proof: X - {x} is a set of zero measure so it doesn't matter. >:)

Theorem (Sifting): int (unit x) f = f x

Proof A: by MC and SL.
Proof B: By Dirac 1-Point, a.e. f = const (f x) >:)

The following theorem shows how Dirac measures arise when mapping
distributions from a discrete space to an arbitrary space.  WLOG we use
ℕ for the discrete space:

Theorem (Discrete Distribution to Impulse Train): Let k: ℕ → Y, μ: Pr ℕ.

    Pr k μ
    = Σn:ℕ. const (μ {n}) × unit (k n)
    = λE:σY. Σn:ℕ. μ {n} × unit (k n) E

Proof: First note that μ is a countable sum because every E: σℕ is a
countable disjoint union of singletons.

    μ = λE. Σn:E. μ {n}
      = λE. Σn:ℕ. μ {n} × char E n

    Pr k μ
    = λE. μ (invIm k E)
    = λE. Σn. μ {n} × char (invIm k E) n
    = λE. Σn. μ {n} × char E (k n)
    = λE. Σn. μ {n} × unit (k n) E
    = Σn. const (μ {n}) × unit (k n)      pointwise lift + and ×

Remark: We don't need k to be injective!  Even though in practice k is
usually made injective for less confusion.
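Here is a finite Haskell sketch of this theorem (the weighted-list
encoding, the particular μ, and the names are mine; a finite index set
stands in for ℕ): pushing a discrete measure forward along a
deliberately non-injective k gives the same numbers as the impulse-train
sum Σn. μ {n} × unit (k n) E.

```haskell
type Meas a = [(a, Double)]                  -- finite measure as singleton masses

unit :: a -> Meas a                          -- Dirac measure
unit x = [(x, 1)]

measureOf :: Meas a -> (a -> Bool) -> Double -- μ E, with E as a predicate
measureOf mu e = sum [ m | (x, m) <- mu, e x ]

push :: (Int -> b) -> Meas Int -> (b -> Bool) -> Double   -- Pr k μ = λE. μ (invIm k E)
push k mu e = measureOf mu (e . k)

train :: (Int -> b) -> Meas Int -> (b -> Bool) -> Double  -- λE. Σn. μ {n} × unit (k n) E
train k mu e = sum [ m * measureOf (unit (k n)) e | (n, m) <- mu ]

main :: IO ()
main = do
  let mu = [(0, 0.5), (1, 0.25), (2, 0.25)]  -- a discrete μ on {0,1,2}
      k n = n `mod` 2                        -- deliberately not injective
      isZero = (== (0 :: Int))
  print (push  k mu isZero)                  -- 0.75
  print (train k mu isZero)                  -- 0.75
```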
And generalizing from sifting to countable sums of Dirac measures:

Theorem (Impulse Train): Let a: ℕ → ℝ, k: ℕ → Y, f: Y → a closed
interval.

    int[Y] (λE. Σn. a n × unit (k n) E) f = Σn. a n × f (k n)

Proof A: Define the discrete measure μ s.t. μ {n} = a n.  If a is a
probability weight sequence (non-negative, sums to 1), then μ is a
probability measure, and we can use the above theorem:

    int[Y] (λE. Σn. a n × unit (k n) E) f
    = int[Y] (Pr k μ) f
    = int[ℕ] μ (f ∘ k)                    change of variable
    = Σn. a n × f (k n)

Proof B: If a is more arbitrary, we can still prove that integration is
σ-additive and linear in measures (by SL and MC), then use sifting
termwise.

### join

join: ∀X. Pr(Pr X) → Pr X
join[X] ν = λE:σX. int[Pr X] ν (eval E)

Straightforward to verify that join ν is a probability measure: bounded
in [0,1]; join ν X = 1; σ-additive because eval is σ-additive and
integration preserves countable sums.

Proof that join is a measurable function: By MVF, it suffices to prove,
for all S:σ(Pr X), flip join S : Pr(Pr X) → UI, thusly:

    flip join S = λν: Pr(Pr X). int ν (eval S)

which is a measurable function by FMI.

What it does by example: I have a naughty coin, its probability of head
is probabilistic, too!  (What a nightmare!)  Or you can imagine the
saner scenario (but just as complicated) that first I get a coin
probabilistically, then toss it.  It's not too bad, we can manage:

* With probability 1/3, probability of head is 1/2.
* With probability 2/3, probability of head is 1/5.

So the total black-box probability of head is:

    1/3 × 1/2 + 2/3 × 1/5

the expected value of the inner head probabilities, wrt the outer
probability measure.  Let's do it formally:

The measurable space of tossing: X = {head, tail}, σX = Powerset X.

The two inner probability measures:

    μ2, μ5 : Pr X
    μ2{head} = 1/2 = eval {head} μ2
    μ5{head} = 1/5 = eval {head} μ5

The outer probability measure:

    ν : Pr(Pr X)
    ν = λE : σ(Pr X) . (1/3) × unit μ2 E + (2/3) × unit μ5 E

The expected value of “probability of head” is the expected value of
this function:

    eval {head} : Pr X → [0,1]

wrt the outer probability measure ν:

    int ν (eval {head}) = join ν {head}

Let's calculate it:

    int ν (eval {head})
    = (1/3) × eval {head} μ2 + (2/3) × eval {head} μ5      impulse train
    = 1/3 × 1/2 + 2/3 × 1/5

The following theorem confirms that an expected value across a total
distribution is a double integral: expected value (across the outer
distribution) of expected values (across inner distributions).

Theorem: Let I be a closed interval subspace of ℝ,±∞, f: X → I,
ν: Pr (Pr X).

    int[X] (join[X] ν) f = int[Pr X] ν (λμ: Pr X. int[X] μ f)

Proof: By SL and MC on non-negative f, then sign-extend:

* f is simple, e.g., f = λx. Σi:0,..n. a i × char (E i) x

      int (join ν) f
      = Σi:0,..n. a i × join ν (E i)
      = Σi:0,..n. a i × int ν (λμ. μ (E i))
      = int ν (λμ. Σi:0,..n. a i × μ (E i))
      = int ν (λμ. int[X] μ f)

* f is non-negative: Let ϕ be a sequence of simple functions s.t.
  f = lim n. ϕ n, ϕ n ≤ f.

      int (join ν) f
      = lim n. int (join ν) (ϕ n)
      = lim n. int ν (λμ. int μ (ϕ n))      works for simple

  Note: (λμ. int μ (ϕ n)) ≤ (λμ. int μ f).  Apply MC twice:

      = int ν (λμ. int μ (lim n. ϕ n))
      = int ν (λμ. int μ f)

* f is signed: Integration commutes with (pointwise) subtraction.
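Here is the same naughty-coin computation as a runnable Haskell sketch
(the weighted-list encoding of measures is my own finite stand-in for
Pr, and the payoff f is made up): it evaluates join ν on {head} and
then checks the double-integral theorem numerically.

```haskell
data Face = Head | Tail deriving (Eq, Show)

type Meas a = [(a, Double)]                  -- finite probability measure

int :: Meas a -> (a -> Double) -> Double     -- finite version of int μ f
int mu f = sum [ m * f x | (x, m) <- mu ]

evalE :: (a -> Bool) -> Meas a -> Double     -- eval E μ = μ E
evalE e mu = int mu (\x -> if e x then 1 else 0)

mu2, mu5 :: Meas Face                        -- the two inner coins
mu2 = [(Head, 1/2), (Tail, 1/2)]
mu5 = [(Head, 1/5), (Tail, 4/5)]

nu :: Meas (Meas Face)                       -- the outer measure ν
nu = [(mu2, 1/3), (mu5, 2/3)]

joinNu :: (Face -> Bool) -> Double           -- join ν = λE. int ν (eval E)
joinNu e = int nu (evalE e)

main :: IO ()
main = do
  print (joinNu (== Head))                   -- 1/3 × 1/2 + 2/3 × 1/5 = 0.3
  let f Head = 10                            -- some payoff f : X → ℝ
      f Tail = 0
  print (int nu (\mu -> int mu f))           -- int ν (λμ. int μ f)
  print (joinNu (== Head) * f Head + joinNu (== Tail) * f Tail)   -- int (join ν) f
```

The last two prints agree (3.0), as the theorem promises; X is finite,
so int (join ν) f is just the sum over the two singletons.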
### bind (>>=)

Although not entirely necessary, it is convenient to also know:

    (>>=) : Pr X → (X → Pr Y) → Pr Y
    μ >>= k = λE:σY. int[X] μ (λx:X. k x E)
            = λE:σY. int[X] μ (flip k E)

Motivation: k x is the conditional probability measure describing what
happens to randomly picking from Y, given that x:X has been picked.
Integrating this over X wrt μ gives the total probability, where μ
controls picking from X.  For discrete X, it is the very familiar and
intuitive

    λE:σY. Σx:X. Prob(x is picked) × Prob(E | x is picked)

We could re-do the naughty-coin example using this.

    X = {the 1/2 coin, the 1/5 coin}, Y = {head, tail}
    μ {the 1/2 coin} = 1/3
    μ {the 1/5 coin} = 2/3
    k (the 1/2 coin) = μ2
    k (the 1/5 coin) = μ5

Then the total probability of head is (μ >>= k) {head}.

We now prove that join (Pr k μ) gives the same formula:

    join (Pr k μ)
    = λE:σY. int (Pr k μ) (eval E)
    = λE:σY. int μ (eval E ∘ k)           change of variable
    = λE:σY. int μ (λx:X. k x E)
    = μ >>= k

Relating this to the example, indeed

    Pr k μ = (1/3) × unit μ2 + (2/3) × unit μ5

as per mapping a discrete distribution to an impulse train.

By now we know the drill that every time we construct a probability
measure, we find out what happens when integrating wrt it:

Theorem: Let μ: Pr X, k: X → Pr Y, f: Y → ℝ,±∞.

    int[Y] (μ >>= k) f = int[X] μ (λx. int[Y] (k x) f)

Proof:

    int (μ >>= k) f
    = int (join (Pr k μ)) f
    = int (Pr k μ) (λν. int ν f)          integrate wrt a join
    = int μ ((λν. int ν f) ∘ k)           change of variable
    = int μ (λx. int (k x) f)

So it is a double integral again.

### Kleisli Arrow (>=>)

The formulas for (>=>) of Pr look irregular and ugly, but read on,
there is a nicely regular result at the end!

    (>=>) : (A → Pr B) → (B → Pr C) → (A → Pr C)
    f >=> g = join[C] ∘ Pr g ∘ f
            = λa:A. λE:σC. int[B] (f a) (flip g E)

Well OK, it looks a bit more symmetric in uncurried Markov kernel form:

    λ(a,E). int f(a,-) g(-,E)

What's nice is when we integrate wrt an ((f >=> g) a):

Theorem: Let h: C → ℝ,±∞.

    (λa:A. int ((f >=> g) a) h) = (λa:A. int (f a) (λb:B. int (g b) h))

So, just levels of the form “λx. int (foo x) ...”, which really looks
like chaining now!

Proof:

    λa. int ((f >=> g) a) h
    = λa. int (join (Pr g (f a))) h
    = λa. int (Pr g (f a)) (λμ. int μ h)  int wrt join
    = λa. int (f a) ((λμ. int μ h) ∘ g)   change of variable
    = λa. int (f a) (λb. int (g b) h)

Proofs of Satisfaction of Functor Laws and Monad Laws
-----------------------------------------------------

### Functor Laws

Reuse the Hom(Powerset(-), UI) results or do it again:

    Pr id[X] μ = μ ∘ invIm id[X] = μ ∘ id[σX] = μ

    (Pr g ∘ Pr f) μ
    = Pr g (μ ∘ invIm f)
    = μ ∘ invIm f ∘ invIm g
    = μ ∘ invIm (g ∘ f)
    = Pr (g ∘ f) μ

### unit is natural

Let f: X → Y.  WTP: Pr f ∘ unit[X] = unit[Y] ∘ f

    Pr f (unit x)
    = λE. unit x (invIm f E)
    = λE. unit x {z | f z ∈ E}
    = λE. if f x ∈ E then 1 else 0
    = unit (f x)

### join is natural

Let f: X → Y.  WTP: Pr f ∘ join[X] = join[Y] ∘ Pr (Pr f)

    Pr f (join ν)
    = λE. join ν (invIm f E)
    = λE. int ν (eval (invIm f E))
    = λE. int ν (λμ. μ (invIm f E))

    join (Pr (Pr f) ν)
    = λE. int (Pr (Pr f) ν) (eval E)
    = λE. int ν (eval E ∘ Pr f)           change of variable
    = λE. int ν (λμ. eval E (μ ∘ invIm f))
    = λE. int ν (λμ. μ (invIm f E))

### unit is like identity to join

Two commuting triangles.

WTP: join[X] ∘ unit[Pr X] = id[Pr X]

    join (unit μ)
    = λE. int (unit μ) (eval E)
    = λE. eval E μ                        sifting
    = μ

WTP: join[X] ∘ Pr unit[X] = id[Pr X]

    join (Pr unit μ)
    = λE. int (Pr unit μ) (eval E)
    = λE. int μ (eval E ∘ unit)           change of variable
    = λE. int μ (λx. unit x E)
    = λE. int μ (char E)
    = λE. μ E
    = μ

### join is like associative

WTP: join[X] ∘ join[Pr X] = join[X] ∘ Pr join[X]

    join (join ν)
    = λE. int (join ν) (eval E)
    = λE. int ν (λμ. int μ (eval E))

    join (Pr join ν)
    = λE. int (Pr join ν) (eval E)
    = λE. int ν (eval E ∘ join)           change of variable
    = λE. int ν (λμ. int μ (eval E))
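As a sanity check, the three monad laws just proved can also be tested
numerically on a finite discrete stand-in for Pr (weighted lists again;
all encodings and names below, such as unitM, mapM', joinM, agree, are
mine): unitM, mapM', and joinM play the roles of unit, Pr f, and join,
and agree compares two measures point by point.

```haskell
type Meas a = [(a, Double)]                  -- finite stand-in for Pr

unitM :: a -> Meas a                         -- plays the role of unit (Dirac)
unitM x = [(x, 1)]

mapM' :: (a -> b) -> Meas a -> Meas b        -- plays the role of Pr f
mapM' f mu = [ (f x, m) | (x, m) <- mu ]

joinM :: Meas (Meas a) -> Meas a             -- plays the role of join
joinM nu = [ (x, w * m) | (inner, w) <- nu, (x, m) <- inner ]

agree :: Eq a => [a] -> Meas a -> Meas a -> Bool  -- same mass on every point
agree points mu nu = and [ abs (ev x mu - ev x nu) < 1e-9 | x <- points ]
  where ev x rho = sum [ m | (y, m) <- rho, y == x ]

main :: IO ()
main = do
  let mu  = [(1, 0.25), (2, 0.75)] :: Meas Int
      nu  = [(mu, 0.5), ([(1, 1)], 0.5)]               -- in Pr (Pr X)
      tau = [(nu, 0.4), ([(unitM 2, 1)], 0.6)]         -- in Pr (Pr (Pr X))
      pts = [1, 2]
  print (agree pts (joinM (unitM mu)) mu)              -- join ∘ unit    = id
  print (agree pts (joinM (mapM' unitM mu)) mu)        -- join ∘ Pr unit = id
  print (agree pts (joinM (joinM tau))                 -- join ∘ join
                   (joinM (mapM' joinM tau)))          --   = join ∘ Pr join
```

All three prints are True; of course this only exercises the finite
stand-in, not the measure-theoretic proofs above.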
Example of Why We Want the eval's Measurable
--------------------------------------------

An example where Pr X and σ(Pr X) can be easily visualized is
X = {head, tail}.  μ: Pr X is completely determined by μ {head}: UI.
We have a bijection:

    eval {head} : Set(Pr X, UI)
    eval {head} = λμ. μ {head}

    Bm : Set(UI, Pr X)
    Bm = λr. λE. r × char E head + (1-r) × char E tail

(Bm stands for Bernoulli measure.)

It looks like we should make them measurable functions (morphisms in
our category) too, so that the bijection is an isomorphism.

If we are to require eval {head} to be a measurable function, it stands
to reason to extend from {head} to all measurable subsets, so that the
principle is not fine-tuned to this example.

This turns out to be just right.  Let's show that this makes Bm a
measurable function.  If σ(Pr X) makes all eval's measurable, and is
the smallest possible, then MVF holds; so Bm is measurable because for
all E: σX,

    flip Bm E
    = λr. r × char E head + (1-r) × char E tail
    = λr. r × some constant + (1-r) × another constant

is measurable.

Then invIm Bm and invIm (eval {head}) also form a bijection between
σ(Pr X) and σUI.  The argument is partly specific to measure theory
(actually any theory that has a similar invIm requirement, e.g., this
works for topology too), partly general in set theory:

* Specific to measure theory: The two domains σ(Pr X) and σUI actually
  work, from the definition of measurable functions.

* General set theory: If f ∘ g = id, then invIm g ∘ invIm f = id.
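Finally, here is a small Haskell rendering of the bijection (the finite
encoding and the lowercase names evalHead, bm, flipBm are mine):
evalHead and bm are mutually inverse, and flip Bm E is an affine
function of r, which is the calculation behind Bm's measurability
above.

```haskell
data Face = Head | Tail deriving (Eq, Show)

type Meas a = [(a, Double)]                  -- finite measure as singleton masses

evalHead :: Meas Face -> Double              -- eval {head}
evalHead mu = sum [ m | (Head, m) <- mu ]

bm :: Double -> Meas Face                    -- Bm r = r·(unit head) + (1-r)·(unit tail)
bm r = [(Head, r), (Tail, 1 - r)]

flipBm :: (Face -> Bool) -> Double -> Double -- flip Bm E : UI → UI, affine in r
flipBm e r = sum [ m | (x, m) <- bm r, e x ]

main :: IO ()
main = do
  print (evalHead (bm 0.3))                  -- UI → Pr X → UI round trip: 0.3
  print (bm (evalHead (bm 0.3)))             -- Pr X → UI → Pr X round trip
  print (map (flipBm (== Tail)) [0, 0.5, 1]) -- [1.0, 0.5, 0.0]
```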