Giry Monad = Probability Monad
==============================

Measurable Space
----------------

The category is of measurable spaces and measurable functions:

* An object is a measurable space: A set of points X with a σ-algebra.
  I also use X to refer to the whole structure, and σX to refer to the
  σ-algebra.

* A morphism f: A → B is a measurable function: For each S: σB,
  invIm f S : σA.  So it is OK to assert invIm f : σB → σA.

An important measurable space is the unit interval UI with its standard
measurable subsets:

* UI = [0,1]

* σUI = say, Borel σ-algebra (smallest σ-algebra having all open
  subsets).  Could also be larger, e.g., Carathéodory's definition of
  Lebesgue-measurable sets.  Doesn't really matter.

UI will be the codomain of a lot of functions later, since we are doing
probability!

Measure & Integral
------------------

A measure is a function μ : σX → ℝ≥0,+∞ that is σ-additive and μ {} = 0.

We are more interested in: A probability measure is a function
μ : σX → [0,1] that is σ-additive, μ X = 1.  (Then μ {} = 0 is
derivable: μ X + μ {} = μ X.)

Recall that X becomes the sample space, σX the event space. :)

We can integrate a measurable function f: X → ℝ,±∞ (conditions apply)
over the whole space X wrt a measure μ, traditionally written
∫_{X} f dμ, but I will also write it as

    int[X] μ f

(the “[X]” is an optional type annotation).  We care about this because
it gives the weighted average a.k.a. expected value of f if μ is a
probability measure.  (Recall that f is a.k.a. a random variable!)

(Traditionally there is also ∫_{E} f dμ for arbitrary E: σX.  I don't
need it often in these notes.  When it is needed, it's
∫_{X} (f × char E) dμ or any other way of restricting f to E.)

For a simple function f (linear combination of characteristic functions
that use measurable sets), e.g.,

    f = λx. Σi:0,..k. a i × char (E i) x

the integral is given by

    int μ f = Σi:0,..k. a i × μ (E i)

For a non-negative measurable function f: X → ℝ≥0,+∞, the following two
theorems combine to enable a limiting process via a sequence of simple
functions.

Theorem (Simple Limit): There exists a sequence ϕ of simple functions
X → ℝ≥0 s.t.

* (pointwise) lim n. ϕ n = f
* (pointwise) ϕ n ≤ ϕ (n+1) (so ≤ the limit f, too; and lim = sup)
* ϕ n is as bounded as f
* (if μ is σ-finite, i.e., can break X into countably many finite-μ
  parts) ϕ n vanishes outside a set of finite measure

(Here is one way: cap the y-axis at n, subdivide into n×2^n divisions
(so each [i, i+1] has 2^n divisions), use that to pixelate f,
under-approximating.)

Theorem (Monotone Convergence, MC): Let ϕ be a sequence of measurable
functions X → ℝ≥0,+∞ s.t.

* (pointwise, a.e.) lim n. ϕ n = f
* (pointwise) ϕ n ≤ f

Then int μ f = lim n. int μ (ϕ n)

Most integrals in these notes compute probabilities, so we focus on
functions to codomain ℝ≥0,+∞ or even [0,1]; but integrals can also
compute expected values of random variables, where it is useful to
allow negative numbers, e.g., the expected value of a gamble.  Signed
integrals are defined by subtracting two non-negative integrals:

    int μ f = int μ (f ⊔ const 0) - int μ (-f ⊔ const 0)

provided that it is not ∞ - ∞.

Below are a few more handy properties of integration.  Proofs are by
SL, MC, and sign-extension.

Theorem (Change of Variable, CV): Let k: X → Z, f: Z → ℝ,±∞, μ a
measure on X.

    int[Z] (μ ∘ invIm k) f = int[X] μ (f ∘ k)

Theorem: (λE:σX. int μ (f × char E)) is a measure.  (If f is signed, we
get a signed measure; not used in these notes.)

Intuition: Such a measure is a weighted measure, the weight function
being f.  (This is relative to μ, if we think of μ as an unweighted,
uniform measure.)
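For concreteness, here is a minimal Haskell sketch of this reweighting
on a finite space.  The die, the weighted-list encoding of a measure,
and all the names below are my own illustrative choices, not anything
from the notes: `int` is the finite sum that the simple-function
formula reduces to, and `reweight w` builds λE. int μ (w × char E) with
w playing the role of the weight function f.

```haskell
type Point   = Int
type Measure = [(Point, Double)]             -- mass of each singleton
type Event   = Point -> Bool                 -- a set E, given as char E

-- int μ f = Σx. f x × μ {x}  (what the simple-function formula becomes here)
int :: Measure -> (Point -> Double) -> Double
int mu f = sum [ f x * m | (x, m) <- mu ]

measureOf :: Measure -> Event -> Double      -- μ E
measureOf mu e = int mu (\x -> if e x then 1 else 0)

-- the weighted measure λE. int μ (w × char E) from the theorem above
reweight :: (Point -> Double) -> Measure -> Measure
reweight w mu = [ (x, w x * m) | (x, m) <- mu ]

main :: IO ()
main = do
  let die = [ (i, 1/6) | i <- [1..6] ]       -- uniform μ on {1,..,6}
      w x = fromIntegral x                   -- a weight function
  -- int μ (w × char E) with E = the even faces ...
  print (int die (\x -> w x * (if even x then 1 else 0)))
  -- ... equals the reweighted measure applied to the same E
  print (measureOf (reweight w die) even)
```

Both prints give 2.0: the w-weighted mass of the even faces, whether we
integrate w × char E against μ or evaluate the reweighted measure at E.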
And so we have two ways of expressing weighted integration:

Theorem (Weighted Integration): Let w, f: X → ℝ,±∞.

    int (λE:σX. int μ (w × char E)) f = int μ (w × f)

Corollary (Importance Sampling): Let g: X → ℝ,±∞, i: X → ℝ>0.

    int μ g = int (λE:σX. int μ (i × char E)) (g / i)

Proof: Set w := i, f := g/i.

In applications of importance sampling, the integral is computed by
Monte Carlo integration, usually with i: X → (0,1] a probability
density function, and so we sample X with that bias and average g/i.
If i is chosen wisely (so g/i doesn't vary wildly), this can give good
accuracy with a modest sample size.

Giry Monad
----------

The Giry monad is the functor Pr and monad structure defined below.
(Proof of functoriality and monadness in later sections.)

### Object map

Pr X = {μ : σX → [0,1] | μ is σ-additive, μ X = 1}

So, the set of probability measures of X.

σ(Pr X) = σPr X = smallest σ-algebra having invIm (eval E) i for every
E:σX, i:σUI.  eval is defined as:

    eval : σX → Pr X → UI
    eval = λE. λμ. μ E

IOW we want every eval E : Pr X → UI to be a measurable function, and
we make σ(Pr X) just big enough for that.  There are multiple
motivations:

* The example X = {head, tail}, so Pr X = Bernoulli distributions,
  perhaps ≅ [0,1] via eval {head}, which perhaps should be made
  measurable.  This is worked out in a later section.

* The definition of join below integrates an eval E.

* We would like functions of the form (λμ: Pr A. int[A] μ f) to be
  measurable.  (join is such a function.)  In the simplest case,
  f = char E, (λμ: Pr A. int μ (char E)) = eval E.

Understanding σPr X (which defines the structure of Pr X, e.g., what
its measurable functions are) can be hard for arbitrary X, or even for
many common cases such as X=ℝ.  But we always know two things: it makes
eval E measurable, and it is the smallest for that, so there is an
induction principle.  Below we state and prove two useful consequences,
one concerning A → Pr B, the other Pr A → UI (or generally Pr A → an
interval).

For a function in Set(A, Pr B) to be measurable, induction on σPr B is
involved.  The lemma below saves the repetitive part; it also
characterizes all such functions.

Lemma (Measure-Valued Function, MVF): Let f: Set(A, Pr B).
f: A → Pr B iff (∀E:σB. flip f E : A → UI).

We note that flip f E has several useful forms:
flip f E = (λa:A. f a E) = eval E ∘ f.

Remark: Recall that morphisms like f: A → Pr B are Kleisli arrows; they
are also a general form of probability transition matrices (“Markov
kernels”) for Markov chains.  The lemma is connected to how Markov
kernels are defined (though traditionally it is uncurry f that is a
Markov kernel): κ: Set(A × σB, [0,1]) is a Markov kernel iff:

* ∀a:A. κ(a, -): Pr B
* ∀E:σB. κ(-, E): A → [0,1].

(So, as per the lemma.)

Proof: “⇒” is easy by composing two measurable functions.  For “⇐”, we
prove (∀E: σPr B. invIm f E : σA) by induction on E:

* invIm f {} : σA
  Because invIm preserves {}, which is in σA.

* ∀E:σPr B. invIm f E : σA ⇒ invIm f (Pr B - E) : σA
  Because invIm preserves complement, which σA is closed under.

* ∀E:ℕ→σPr B. (∀n. invIm f (E n) : σA) ⇒ invIm f (⋃E) : σA
  Because invIm preserves countable (even arbitrary) union, which σA is
  closed under.

* ∀S:σB, i:σUI. invIm f (invIm (eval S) i) : σA
  Because of the premise, and flip f S = eval S ∘ f.
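To see the lemma's two views side by side, here is a small finite
Haskell sketch.  The Weather space, the weighted-list encoding of
measures, and the kernel k are all hypothetical illustrations of mine:
k is a measure-valued function, and flip k E = eval E ∘ k is a plain
[0,1]-valued function on the domain.

```haskell
data Weather = Sunny | Rainy deriving (Eq, Show)

type Meas b = [(b, Double)]                 -- finite probability measure

-- k a is a probability measure on Weather for every a (first view)
k :: Weather -> Meas Weather
k Sunny = [(Sunny, 0.8), (Rainy, 0.2)]
k Rainy = [(Sunny, 0.4), (Rainy, 0.6)]

evalE :: (b -> Bool) -> Meas b -> Double    -- eval E μ = μ E
evalE e mu = sum [ m | (b, m) <- mu, e b ]

-- flip k E = eval E ∘ k : A -> [0,1]       (second view)
flipK :: (Weather -> Bool) -> Weather -> Double
flipK e a = evalE e (k a)

main :: IO ()
main = print (map (flipK (== Rainy)) [Sunny, Rainy])   -- [0.2,0.6]
```

In uncurried form, flipK and k together are exactly the two conditions
a Markov kernel has to satisfy in the remark above.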
For Pr A → interval, the lemma below constructs a lot more members than
eval E.  Here is a motivation:

    eval E = (λμ. int μ (char E))

If this is a measurable function, perhaps we can generalize from char E
to arbitrary f.

Lemma (Function of Measures from Integration, FMI): Let I be a closed
interval subspace of ℝ,±∞, f: A→I.  (E.g., I can be UI, ℝ≥0,+∞, ℝ,±∞.)

    (λμ: Pr A. int μ f) : Pr A → I

(In case f is a signed function, on condition that the integral makes
sense for all μ.)

Proof: The output interval is still I because μ is bounded in [0,1].
Measurability progresses from characteristic, then simple, to general:

* If f = char E, where E:σA:

      λμ: Pr A. int μ (char E)
      = λμ: Pr A. μ E
      = eval E

  is measurable by construction of σ(Pr A).

* If f is a simple function, i.e., a linear combination of
  characteristic functions: measurability is closed under linear
  combination.

* MC-extend to non-negative f: By SL, use an increasing sequence ϕ of
  simple functions to converge to f:

      λμ: Pr A. int μ f
      = λμ: Pr A. sup n. int μ (ϕ n)
      = sup n. λμ: Pr A. int μ (ϕ n)      sup of funcs = pointwise extend

  and measurability is closed under sup.

* Extend to signed f: the subtraction of two measurable functions is
  measurable.

Remark: The proof uses closure of measurable functions under linear
combinations and sup, seeded by eval E.  The closure corresponds to
applying σ-algebra operations to measurable sets, and the seeding
corresponds to invIm (eval E) i.  So the proof parallels the
construction of σ(Pr A).

### Morphism map

Pr f : Pr X → Pr Y
Pr f = λμ. μ ∘ invIm f

This is well motivated by considering examples of changes of variables,
programming to the type, or borrowing from the composite functor
Hom(Powerset(-), UI).

Proof that this is a measurable function: By MVF, let S:σY, then

    λμ. (Pr f μ) S
    = λμ. (μ ∘ invIm f) S
    = λμ. μ (invIm f S)
    = eval (invIm f S)        invIm f S : σX
    : Pr X → UI

Corollary (Change of Variable but with Functor): Let f: A→B, μ: Pr A,
g: B→ℝ,±∞.

    int[B] (Pr f μ) g = int[B] (μ ∘ invIm f) g = int[A] μ (g ∘ f)

### unit

unit: ∀X. X → Pr X
unit[X] x = λs:σX. char s x = λs:σX. if x∈s then 1 else 0

A.k.a. Dirac measure.  It stands for a refreshing desert of determinism
in the oasis of stochasticism. >:)

(Exercise: Check that unit x is a probability measure.)

Proof that unit is a measurable function: By MVF, let S:σX, then

    λx:X. unit x S
    = λx:X. char S x
    = char S                  simple functions are measurable
    : X → UI

Here are some useful and insightful properties:

Lemma (Dirac 1-Point): Let P: X → Bool be a measurable predicate (true
on a measurable set, false on another; exactly a measurable function to
Bool).  Under the measure unit x, (a.e. P) iff P x.

Proof: X - {x} is a set of zero measure so it doesn't matter. >:)

Theorem (Sifting): int (unit x) f = f x

Proof A: by MC and SL.
Proof B: By Dirac 1-Point, a.e. f = const (f x) >:)

The following theorem shows how Dirac measures arise when mapping
distributions from a discrete space to an arbitrary space.  WLOG we use
ℕ for the discrete space:

Theorem (Discrete Distribution to Impulse Train): Let k: ℕ → Y, μ: Pr ℕ.

    Pr k μ
    = Σn:ℕ. const (μ {n}) × unit (k n)
    = λE:σY. Σn:ℕ. μ {n} × unit (k n) E

Proof: First note that μ is a countable sum because every E: σℕ is a
countable disjoint union of singletons.

    μ = λE. Σn:E. μ {n}
      = λE. Σn:ℕ. μ {n} × char E n

    Pr k μ
    = λE. μ (invIm k E)
    = λE. Σn. μ {n} × char (invIm k E) n
    = λE. Σn. μ {n} × char E (k n)
    = λE. Σn. μ {n} × unit (k n) E
    = Σn. const (μ {n}) × unit (k n)      pointwise lift + and ×

Remark: We don't need k to be injective!  Even though in practice k is
usually made injective for less confusion.
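Here is a finite Haskell sketch of this theorem (the weighted-list
encoding, the particular μ, and the names are mine; a finite index set
stands in for ℕ): pushing a discrete measure forward along a
deliberately non-injective k gives the same numbers as the impulse-train
sum Σn. μ {n} × unit (k n) E.

```haskell
type Meas a = [(a, Double)]                  -- finite measure as singleton masses

unit :: a -> Meas a                          -- Dirac measure
unit x = [(x, 1)]

measureOf :: Meas a -> (a -> Bool) -> Double -- μ E, with E as a predicate
measureOf mu e = sum [ m | (x, m) <- mu, e x ]

push :: (Int -> b) -> Meas Int -> (b -> Bool) -> Double   -- Pr k μ = λE. μ (invIm k E)
push k mu e = measureOf mu (e . k)

train :: (Int -> b) -> Meas Int -> (b -> Bool) -> Double  -- λE. Σn. μ {n} × unit (k n) E
train k mu e = sum [ m * measureOf (unit (k n)) e | (n, m) <- mu ]

main :: IO ()
main = do
  let mu = [(0, 0.5), (1, 0.25), (2, 0.25)]  -- a discrete μ on {0,1,2}
      k n = n `mod` 2                        -- deliberately not injective
      isZero = (== (0 :: Int))
  print (push  k mu isZero)                  -- 0.75
  print (train k mu isZero)                  -- 0.75
```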
And generalizing from sifting to countable sums of Dirac measures:

Theorem (Impulse Train): Let a: ℕ → ℝ, k: ℕ → Y, f: Y → a closed
interval.

    int[Y] (λE. Σn. a n × unit (k n) E) f = Σn. a n × f (k n)

Proof A: Define the discrete measure μ s.t. μ {n} = a n.  If a is a
probability weight sequence (non-negative, sums to 1), then μ is a
probability measure, and we can use the above theorem:

    int[Y] (λE. Σn. a n × unit (k n) E) f
    = int[Y] (Pr k μ) f
    = int[ℕ] μ (f ∘ k)                    change of variable
    = Σn. a n × f (k n)

Proof B: If a is more arbitrary, we can still prove that integration is
σ-additive and linear in measures (by SL and MC), then use sifting
termwise.

### join

join: ∀X. Pr(Pr X) → Pr X
join[X] ν = λE:σX. int[Pr X] ν (eval E)

Straightforward to verify that join ν is a probability measure: bounded
in [0,1]; join ν X = 1; σ-additive because eval is σ-additive and
integration preserves countable sums.

Proof that join is a measurable function: By MVF, it suffices to prove,
for all S:σ(Pr X), flip join S : Pr(Pr X) → UI, thusly:

    flip join S = λν: Pr(Pr X). int ν (eval S)

which is a measurable function by FMI.

What it does by example: I have a naughty coin, its probability of head
is probabilistic, too!  (What a nightmare!)  Or you can imagine the
saner scenario (but just as complicated) that first I get a coin
probabilistically, then toss it.  It's not too bad, we can manage:

* With probability 1/3, probability of head is 1/2.
* With probability 2/3, probability of head is 1/5.

So the total black-box probability of head is:

    1/3 × 1/2 + 2/3 × 1/5

the expected value of the inner head probabilities, wrt the outer
probability measure.  Let's do it formally:

The measurable space of tossing: X = {head, tail}, σX = Powerset X.

The two inner probability measures:

    μ2, μ5 : Pr X
    μ2{head} = 1/2 = eval {head} μ2
    μ5{head} = 1/5 = eval {head} μ5

The outer probability measure:

    ν : Pr(Pr X)
    ν = λE : σ(Pr X) . (1/3) × unit μ2 E + (2/3) × unit μ5 E

The expected value of “probability of head” is the expected value of
this function:

    eval {head} : Pr X → [0,1]

wrt the outer probability measure ν:

    int ν (eval {head}) = join ν {head}

Let's calculate it:

    int ν (eval {head})
    = (1/3) × eval {head} μ2 + (2/3) × eval {head} μ5      impulse train
    = 1/3 × 1/2 + 2/3 × 1/5

The following theorem confirms that an expected value across a total
distribution is a double integral: expected value (across the outer
distribution) of expected values (across inner distributions).

Theorem: Let I be a closed interval subspace of ℝ,±∞, f: X → I,
ν: Pr (Pr X).

    int[X] (join[X] ν) f = int[Pr X] ν (λμ: Pr X. int[X] μ f)

Proof: By SL and MC on non-negative f, then sign-extend:

* f is simple, e.g., f = λx. Σi:0,..n. a i × char (E i) x

      int (join ν) f
      = Σi:0,..n. a i × join ν (E i)
      = Σi:0,..n. a i × int ν (λμ. μ (E i))
      = int ν (λμ. Σi:0,..n. a i × μ (E i))
      = int ν (λμ. int[X] μ f)

* f is non-negative: Let ϕ be a sequence of simple functions s.t.
  f = lim n. ϕ n, ϕ n ≤ f.

      int (join ν) f
      = lim n. int (join ν) (ϕ n)
      = lim n. int ν (λμ. int μ (ϕ n))      works for simple

  Note: (λμ. int μ (ϕ n)) ≤ (λμ. int μ f).  Apply MC twice:

      = int ν (λμ. int μ (lim n. ϕ n))
      = int ν (λμ. int μ f)

* f is signed: Integration commutes with (pointwise) subtraction.
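Here is the same naughty-coin computation as a runnable Haskell sketch
(the weighted-list encoding of measures is my own finite stand-in for
Pr, and the payoff f is made up): it evaluates join ν on {head} and
then checks the double-integral theorem numerically.

```haskell
data Face = Head | Tail deriving (Eq, Show)

type Meas a = [(a, Double)]                  -- finite probability measure

int :: Meas a -> (a -> Double) -> Double     -- finite version of int μ f
int mu f = sum [ m * f x | (x, m) <- mu ]

evalE :: (a -> Bool) -> Meas a -> Double     -- eval E μ = μ E
evalE e mu = int mu (\x -> if e x then 1 else 0)

mu2, mu5 :: Meas Face                        -- the two inner coins
mu2 = [(Head, 1/2), (Tail, 1/2)]
mu5 = [(Head, 1/5), (Tail, 4/5)]

nu :: Meas (Meas Face)                       -- the outer measure ν
nu = [(mu2, 1/3), (mu5, 2/3)]

joinNu :: (Face -> Bool) -> Double           -- join ν = λE. int ν (eval E)
joinNu e = int nu (evalE e)

main :: IO ()
main = do
  print (joinNu (== Head))                   -- 1/3 × 1/2 + 2/3 × 1/5 = 0.3
  let f Head = 10                            -- some payoff f : X → ℝ
      f Tail = 0
  print (int nu (\mu -> int mu f))           -- int ν (λμ. int μ f)
  print (joinNu (== Head) * f Head + joinNu (== Tail) * f Tail)   -- int (join ν) f
```

The last two prints agree (3.0), as the theorem promises; X is finite,
so int (join ν) f is just the sum over the two singletons.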
### bind (>>=)

Although not entirely necessary, it is convenient to also know:

    (>>=) : Pr X → (X → Pr Y) → Pr Y
    μ >>= k = λE:σY. int[X] μ (λx:X. k x E)
            = λE:σY. int[X] μ (flip k E)

Motivation: k x is the conditional probability measure describing what
happens to randomly picking from Y, given that x:X has been picked.
Integrating this over X wrt μ gives the total probability, where μ
controls picking from X.  For discrete X, it is the very familiar and
intuitive

    λE:σY. Σx:X. Prob(x is picked) × Prob(E | x is picked)

We could re-do the naughty-coin example using this.

    X = {the 1/2 coin, the 1/5 coin}, Y = {head, tail}
    μ {the 1/2 coin} = 1/3
    μ {the 1/5 coin} = 2/3
    k (the 1/2 coin) = μ2
    k (the 1/5 coin) = μ5

Then the total probability of head is (μ >>= k) {head}.

We now prove that join (Pr k μ) gives the same formula:

    join (Pr k μ)
    = λE:σY. int (Pr k μ) (eval E)
    = λE:σY. int μ (eval E ∘ k)           change of variable
    = λE:σY. int μ (λx:X. k x E)
    = μ >>= k

Relating this to the example, indeed

    Pr k μ = (1/3) × unit μ2 + (2/3) × unit μ5

as per mapping a discrete distribution to an impulse train.

By now we know the drill that every time we construct a probability
measure, we find out what happens when integrating wrt it:

Theorem: Let μ: Pr X, k: X → Pr Y, f: Y → ℝ,±∞.

    int[Y] (μ >>= k) f = int[X] μ (λx. int[Y] (k x) f)

Proof:

    int (μ >>= k) f
    = int (join (Pr k μ)) f
    = int (Pr k μ) (λν. int ν f)          integrate wrt a join
    = int μ ((λν. int ν f) ∘ k)           change of variable
    = int μ (λx. int (k x) f)

So it is a double integral again.

### Kleisli Arrow (>=>)

The formulas for (>=>) of Pr look irregular and ugly, but read on,
there is a nicely regular result at the end!

    (>=>) : (A → Pr B) → (B → Pr C) → (A → Pr C)
    f >=> g = join[C] ∘ Pr g ∘ f
            = λa:A. λE:σC. int[B] (f a) (flip g E)

Well OK, it looks a bit more symmetric in uncurried Markov kernel form:

    λ(a,E). int f(a,-) g(-,E)

What's nice is when we integrate wrt an ((f >=> g) a):

Theorem: Let h: C → ℝ,±∞.

    (λa:A. int ((f >=> g) a) h) = (λa:A. int (f a) (λb:B. int (g b) h))

So, just levels of the form “λx. int (foo x) ...”, which really looks
like chaining now!

Proof:

    λa. int ((f >=> g) a) h
    = λa. int (join (Pr g (f a))) h
    = λa. int (Pr g (f a)) (λμ. int μ h)  int wrt join
    = λa. int (f a) ((λμ. int μ h) ∘ g)   change of variable
    = λa. int (f a) (λb. int (g b) h)

Proofs of Satisfaction of Functor Laws and Monad Laws
-----------------------------------------------------

### Functor Laws

Reuse the Hom(Powerset(-), UI) results or do it again:

    Pr id[X] μ = μ ∘ invIm id[X] = μ ∘ id[σX] = μ

    (Pr g ∘ Pr f) μ
    = Pr g (μ ∘ invIm f)
    = μ ∘ invIm f ∘ invIm g
    = μ ∘ invIm (g ∘ f)
    = Pr (g ∘ f) μ

### unit is natural

Let f: X → Y.  WTP: Pr f ∘ unit[X] = unit[Y] ∘ f

    Pr f (unit x)
    = λE. unit x (invIm f E)
    = λE. unit x {z | f z ∈ E}
    = λE. if f x ∈ E then 1 else 0
    = unit (f x)

### join is natural

Let f: X → Y.  WTP: Pr f ∘ join[X] = join[Y] ∘ Pr (Pr f)

    Pr f (join ν)
    = λE. join ν (invIm f E)
    = λE. int ν (eval (invIm f E))
    = λE. int ν (λμ. μ (invIm f E))

    join (Pr (Pr f) ν)
    = λE. int (Pr (Pr f) ν) (eval E)
    = λE. int ν (eval E ∘ Pr f)           change of variable
    = λE. int ν (λμ. eval E (μ ∘ invIm f))
    = λE. int ν (λμ. μ (invIm f E))

### unit is like identity to join

Two commuting triangles.

WTP: join[X] ∘ unit[Pr X] = id[Pr X]

    join (unit μ)
    = λE. int (unit μ) (eval E)
    = λE. eval E μ                        sifting
    = μ

WTP: join[X] ∘ Pr unit[X] = id[Pr X]

    join (Pr unit μ)
    = λE. int (Pr unit μ) (eval E)
    = λE. int μ (eval E ∘ unit)           change of variable
    = λE. int μ (λx. unit x E)
    = λE. int μ (char E)
    = λE. μ E
    = μ

### join is like associative

WTP: join[X] ∘ join[Pr X] = join[X] ∘ Pr join[X]

    join (join ν)
    = λE. int (join ν) (eval E)
    = λE. int ν (λμ. int μ (eval E))

    join (Pr join ν)
    = λE. int (Pr join ν) (eval E)
    = λE. int ν (eval E ∘ join)           change of variable
    = λE. int ν (λμ. int μ (eval E))
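As a sanity check, the three monad laws just proved can also be tested
numerically on a finite discrete stand-in for Pr (weighted lists again;
all encodings and names below, such as unitM, mapM', joinM, agree, are
mine): unitM, mapM', and joinM play the roles of unit, Pr f, and join,
and agree compares two measures point by point.

```haskell
type Meas a = [(a, Double)]                  -- finite stand-in for Pr

unitM :: a -> Meas a                         -- plays the role of unit (Dirac)
unitM x = [(x, 1)]

mapM' :: (a -> b) -> Meas a -> Meas b        -- plays the role of Pr f
mapM' f mu = [ (f x, m) | (x, m) <- mu ]

joinM :: Meas (Meas a) -> Meas a             -- plays the role of join
joinM nu = [ (x, w * m) | (inner, w) <- nu, (x, m) <- inner ]

agree :: Eq a => [a] -> Meas a -> Meas a -> Bool  -- same mass on every point
agree points mu nu = and [ abs (ev x mu - ev x nu) < 1e-9 | x <- points ]
  where ev x rho = sum [ m | (y, m) <- rho, y == x ]

main :: IO ()
main = do
  let mu  = [(1, 0.25), (2, 0.75)] :: Meas Int
      nu  = [(mu, 0.5), ([(1, 1)], 0.5)]               -- in Pr (Pr X)
      tau = [(nu, 0.4), ([(unitM 2, 1)], 0.6)]         -- in Pr (Pr (Pr X))
      pts = [1, 2]
  print (agree pts (joinM (unitM mu)) mu)              -- join ∘ unit    = id
  print (agree pts (joinM (mapM' unitM mu)) mu)        -- join ∘ Pr unit = id
  print (agree pts (joinM (joinM tau))                 -- join ∘ join
                   (joinM (mapM' joinM tau)))          --   = join ∘ Pr join
```

All three prints are True; of course this only exercises the finite
stand-in, not the measure-theoretic proofs above.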
Example of Why We Want the eval's Measurable
--------------------------------------------

An example where Pr X and σ(Pr X) can be easily visualized is
X = {head, tail}.  μ: Pr X is completely determined by μ {head}: UI.
We have a bijection:

    eval {head} : Set(Pr X, UI)
    eval {head} = λμ. μ {head}

    Bm : Set(UI, Pr X)
    Bm = λr. λE. r × char E head + (1-r) × char E tail

(Bm stands for Bernoulli measure.)

It looks like we should make them measurable functions (morphisms in
our category) too, so that the bijection is an isomorphism.

If we are to require eval {head} to be a measurable function, it stands
to reason to extend from {head} to all measurable subsets, so that the
principle is not fine-tuned to this example.

This turns out to be just right.  Let's show that this makes Bm a
measurable function.  If σ(Pr X) makes all eval's measurable, and is
the smallest possible, then MVF holds; so Bm is measurable because for
all E: σX,

    flip Bm E
    = λr. r × char E head + (1-r) × char E tail
    = λr. r × some constant + (1-r) × another constant

is measurable.

Then invIm Bm and invIm (eval {head}) also form a bijection between
σ(Pr X) and σUI.  The argument is partly specific to measure theory
(actually any theory that has a similar invIm requirement, e.g., this
works for topology too), partly general in set theory:

* Specific to measure theory: The two domains σ(Pr X) and σUI actually
  work, from the definition of measurable functions.

* General set theory: If f ∘ g = id, then invIm g ∘ invIm f = id.
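Finally, here is a small Haskell rendering of the bijection (the finite
encoding and the lowercase names evalHead, bm, flipBm are mine):
evalHead and bm are mutually inverse, and flip Bm E is an affine
function of r, which is the calculation behind Bm's measurability
above.

```haskell
data Face = Head | Tail deriving (Eq, Show)

type Meas a = [(a, Double)]                  -- finite measure as singleton masses

evalHead :: Meas Face -> Double              -- eval {head}
evalHead mu = sum [ m | (Head, m) <- mu ]

bm :: Double -> Meas Face                    -- Bm r = r·(unit head) + (1-r)·(unit tail)
bm r = [(Head, r), (Tail, 1 - r)]

flipBm :: (Face -> Bool) -> Double -> Double -- flip Bm E : UI → UI, affine in r
flipBm e r = sum [ m | (x, m) <- bm r, e x ]

main :: IO ()
main = do
  print (evalHead (bm 0.3))                  -- UI → Pr X → UI round trip: 0.3
  print (bm (evalHead (bm 0.3)))             -- Pr X → UI → Pr X round trip
  print (map (flipBm (== Tail)) [0, 0.5, 1]) -- [1.0, 0.5, 0.0]
```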