Representation - Bayesian Networks

Course Note Probabilistic Graphical Models (by Daphne Koller)

Jihong Ju on November 11, 2018

Bayesian network

An example

student-grade

  • Causal reasoning
  • Evidential reasoning
  • Inter-causal reasoning (“explaining away”)

Active trails

4-local-structures

Active path X-Y Z unobserved Z observed
(a) cascade
(b) cascade
(c) common parent
(d) v-structure

A trail $X_1, \dots, X_k$ is active given $Z$ if:

  • for any v-struction $X_{i-1} \rightarrow X_i \leftarrow X_{i+1}$, that $X_i$ or one of its descendents $\in Z$
  • no other $X_j \in Z$ not at that middle position of a v-structure

Independence and Factorization

A joint probability distribution P factor over a directed acylic graph G if it can be decomposed into a product of factors, as specified by G.

Factorization of a distribution P implies independies that hold in P.

X, Y independent

$(X \perp Y \vert Z)$

d-separation $\text{d-sep}_{G}{(X,Y\vert Z)}$

X and Y are d-separated in G given Z if there is no active trail in G between X and Y given Z.

If P factorizes over G, and $\text{d-sep}_{G}{(X,Y\vert Z)}$, then P satisfies $(X \perp Y \vert Z)$

General property: Any node is d-separated from its non-descendatns given its parents.

If P factors over G, then in P any variable is independent of its non-descendants given its parents.

I-maps

If P satistifies I(G), we say that G is an I-map of P.

If P factorizes over G, then G is an I-map for P. If G is an I-map for P, then P factorizes over G.

Dynamic Bayesian Network

Dynamic Bayesian Networks (DBN) are compact representation for encoding structured distributions over arbitrarily long temporal trajectories.

Markov assumption

Assuming $ (X_{t+1} \perp X_{0:t-1} \vert X_t) $, it becomes

Could be extended to semi-markov assumption to model for example velocities of robots.

Time invariance

Template transition model

Could be extended with additional variables for abrupt factors like sensor failure, etc.

Dynamic Bayesian Network

A dynamic Bayesian network over $X_1, \dots, X_n$ is defined by:

  • For $t = 0$, $ P(X_0) $
  • For $t > 0$, $ P(X_t \vert X_{t-1}) = P(X’ \vert X) = \displaystyle \prod_{i=1}^{n} P(X’ \vert \text{Parent}_{X’_i}) $

An example: Online SLAM model has unrolled (ground) model:

  • For $t = 0$, $ P(x_0) $
  • For $t > 0$, $ P(x_t, z_t \vert x_{t-1}, u_{t}, m) = P(x’, z’ \vert x, u, m) = P(x’ \vert x, u) P(z’ \vert x’, m) $

Online SLAM model is a DBN

Hidden Markov Model

To be more specific, the previous online SLAM model is a Hidden Markov Model (HMM). Hidden Markov model in its simplest format can be desribed with the following graph:

HMM example

where $S_t$ is the hidden random variable at time step $t$ and $O_t$ is the observation at time step $t$.

The transition model represents the probability of the hidden state transiting from one to another $ P(S_t \vert S_{t-1}) = P(S’ \vert S)$.

A simple example of the transition model with four possible states is:

Transition model

The outgoing edges for each node should sum up to 1, i.e., each row of the transition matrix sum up to 1.

The observation model discribes the probability of observing $O_t$ at time $t$ given the hidden state $S_t$.

Hidden Markov Model could compactly encode temporal sequences by making the assumption that the sequences are generated by a Markov process. For this reason, it is widely used in robot localization, speech recognition, text annotation, etc.

Plate Model

The plate model define a template for an infinite set of Bayesian networks. This helps generalize the same model to similari but different scenarios. For example, two universities might have varying number of students and varying number of courses. But they could share the same plate model to encode student’s grade, intelligence, course difficulty, etc.

Plate models

For a template variable $A(U_1,\dots,U_k)$ with parents $B_1(U_1),\dots,B_m(U_m)$:

Plate dependency model

For each parent node i, we mush have $U_i \in {U_1, \dots, U_k}$.

For example $G(s, c)$ cannot depend on $I(s, c, u)$.

Ground model

A template CPD $P(A \vert B_1, \dots, B_m) $ is shared by any instantiation of the model, $u_1, \dots, u_k$.

For example $P(G \vert I, D)$ can be applied on any instantiation of the student intelligence and course difficulty.