Time Series

Dynamic Bayesian networks are a general class of state-space time series models, encompassing as special cases models such as Kalman filters and ARIMA (see below).

Definition

A Bayesian network is a directed acyclic graph G = (V, E) where each node x ∈ V has an associated conditional probability distribution such that the joint probability over V is

\Pr{V} = \prod_{x ∈ V} \Prc{x}{π_x}

where π_x denotes the parents of node x.
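As a concrete illustration (the network, names, and probabilities below are made up, not from any source), the factorization can be evaluated directly:

```python
import numpy as np

# Tiny illustrative Bayesian network: rain -> wet <- sprinkler.
p_rain = np.array([0.8, 0.2])                 # Pr(rain = 0/1)
p_sprinkler = np.array([0.6, 0.4])            # Pr(sprinkler = 0/1)
# Pr(wet | rain, sprinkler), indexed [rain, sprinkler, wet]
p_wet = np.array([[[1.0, 0.0], [0.1, 0.9]],
                  [[0.2, 0.8], [0.01, 0.99]]])

def joint(rain, sprinkler, wet):
    # Pr(V) = prod_{x in V} Pr(x | parents(x))
    return p_rain[rain] * p_sprinkler[sprinkler] * p_wet[rain, sprinkler, wet]

# The joint sums to one over all assignments.
total = sum(joint(r, s, w) for r in (0, 1) for s in (0, 1) for w in (0, 1))
print(joint(1, 0, 1), total)
```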

A dynamic Bayesian network is a pair (B_0, B_t) where B_0 is a Bayesian network representing the initial distribution (i.e. at time 0) and B_t is the transition network on the same nodes V, but with a different set of edges E_t that may contain cycles and self-loops (an edge y → x in B_t means that x at time t depends on y at time t − 1). Denote by τ_x the parents of x in the transition network.

Now replicate V for each time t so we have V_0, V_1, …, all containing the same nodes. Similarly write x_0, x_1, … for the copies of a node x ∈ V.

\Pr{V_0} = \prod_{x_0 ∈ V_0} \Prc{x_0}{π_{x_0}}

\Prc{V_t}{V_{t-1}} = \prod_{x_t ∈ V_t} \Prc{x_t}{π_{x_t}, τ_{x_{t-1}}}

Unrolling over time:

\Pr{V_{0:T}} = \Pr{V_0} ⋅ \prod_{t ∈ 1:T} \Prc{V_t}{V_{t-1}} = \prod_{x_0 ∈ V_0} \Prc{x_0}{π_{x_0}} ⋅ \prod_{t ∈ 1:T} \prod_{x_t ∈ V_t} \Prc{x_t}{π_{x_t}, τ_{x_{t-1}}}
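A minimal sketch of evaluating this unrolled factorization for a toy DBN with one variable x that has a self-loop and one intra-slice child y (all probabilities made up for illustration):

```python
import numpy as np

# Toy DBN: x has a self-loop in the transition network, y has parent x
# within each time slice.
p_x0 = np.array([0.5, 0.5])                    # initial Pr(x_0)
p_y_given_x = np.array([[0.9, 0.1],            # Pr(y_t | x_t)
                        [0.2, 0.8]])
p_x_given_xprev = np.array([[0.7, 0.3],        # Pr(x_t | x_{t-1})
                            [0.4, 0.6]])

def log_joint(xs, ys):
    """log Pr(V_{0:T}) = log Pr(V_0) + sum_t log Pr(V_t | V_{t-1})."""
    logp = np.log(p_x0[xs[0]]) + np.log(p_y_given_x[xs[0], ys[0]])
    for t in range(1, len(xs)):
        logp += np.log(p_x_given_xprev[xs[t - 1], xs[t]])   # self-loop term
        logp += np.log(p_y_given_x[xs[t], ys[t]])            # intra-slice term
    return logp

print(log_joint([0, 0, 1, 1], [0, 0, 1, 0]))
```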

Alternative. A special case is when x_t only depends on x_{t−1} and on other nodes at time t, so temporal dependence propagates only through each variable's own previous value. In this case E_t contains only self-loops, i.e. τ_x = {x}.

This leads to

\Prc{V_t}{V_{t-1}} = \prod_{x_t ∈ V_t} \Prc{x_t}{π_{x_t}, x_{t-1}}

Note. This model is Markovian in that temporal relations only connect t and t + 1. If longer lags are required we can add the lagged values as extra state variables. (Or modify the model, but extending the state seems more meaningful.)
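For example (an illustration, not spelled out above), a lag-2 dependence x_{t+1} = f(x_t, x_{t-1}) becomes Markovian after augmenting the state:

\tilde x_t = (x_t, x_{t-1}), \qquad \tilde x_{t+1} = (f(x_t, x_{t-1}), x_t)

so that \tilde x_{t+1} depends only on \tilde x_t.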

Partial observation. Only a subset of V is observable.

Learning structure

It is possible to learn the structure of the graph from the data.

For learning the base structure we can use all the available data for each variable, ignoring the temporal information. This is equivalent to learning a BN. For learning the transition network we use the temporal information, in particular the data for all variables in two consecutive time slices, X_t and X_{t+1}. Given the base structure, we can then learn the dependencies between the variables at time t and t + 1. [source]
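A minimal sketch of this two-slice idea: pair up consecutive slices and, for each variable at time t + 1, pick parents among the variables at time t. Scoring with a linear-Gaussian model, BIC, and greedy forward selection is an assumption made here for illustration, not the method prescribed by the source.

```python
import numpy as np

def bic_gaussian(y, X):
    """BIC of a linear-Gaussian regression of y on the columns of X (plus intercept)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X]) if X.shape[1] else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * Xd.shape[1] * np.log(n)

def learn_transition_parents(data):
    """data: array of shape (T, d). Returns {child: parents at the previous slice}."""
    prev, curr = data[:-1], data[1:]
    d = data.shape[1]
    parents = {}
    for child in range(d):
        chosen = []
        best = bic_gaussian(curr[:, child], prev[:, chosen])
        improved = True
        while improved:
            improved = False
            for cand in range(d):
                if cand in chosen:
                    continue
                score = bic_gaussian(curr[:, child], prev[:, chosen + [cand]])
                if score > best:
                    best, chosen = score, chosen + [cand]
                    improved = True
        parents[child] = set(chosen)
    return parents

# Example: variable 1 is driven by variable 0 at the previous step.
rng = np.random.default_rng(0)
x = np.zeros((500, 2))
for t in range(1, 500):
    x[t, 0] = 0.9 * x[t - 1, 0] + rng.normal(scale=0.1)
    x[t, 1] = 0.8 * x[t - 1, 0] + rng.normal(scale=0.1)
print(learn_transition_parents(x))  # typically recovers {0: {0}, 1: {0}}
```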

Learning distribution parameters: CMA-ES on a cost function

In contrast to the expectation-maximization (EM) method, the model is not optimized for distribution fit but for a specified cost function. This makes the learning less sensitive to modelling errors.

One possible cost function is the negative log-likelihood, which recovers the usual distribution fit.
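A sketch of this approach, assuming the third-party `cma` package and using the negative log-likelihood of a scalar linear-Gaussian transition model as the cost function; any other cost (forecast error, downstream utility, …) plugs in the same way:

```python
import numpy as np
import cma  # pip install cma; this dependency is an assumption, not from the source

# Fit x_t = a * x_{t-1} + noise(0, sigma^2) with CMA-ES on a cost function.
rng = np.random.default_rng(1)
true_a, true_sigma = 0.85, 0.3
xs = np.zeros(400)
for t in range(1, len(xs)):
    xs[t] = true_a * xs[t - 1] + rng.normal(scale=true_sigma)

def cost(params):
    a, log_sigma = params
    sigma = np.exp(log_sigma)               # parametrize sigma > 0
    resid = xs[1:] - a * xs[:-1]
    # negative log-likelihood of the Gaussian transition model
    return 0.5 * np.sum(resid**2 / sigma**2 + np.log(2 * np.pi * sigma**2))

res = cma.fmin(cost, [0.0, 0.0], 0.5, options={"verbose": -9})
a_hat, sigma_hat = res[0][0], np.exp(res[0][1])
print(a_hat, sigma_hat)  # should be close to 0.85 and 0.3
```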

Special case: Kalman filters

\vec x_{t+1} = A ⋅ \vec x_t + \vec c_t + \vec w_t

\vec z_{t} = H ⋅ \vec x_t + \vec v_t

With noise vectors \vec w_t ∼ \mathcal{N}(0, Q_t) and \vec v_t ∼ \mathcal{N}(0, R_t).

\Prc{\vec x_{t+1}}{\vec x_t} \sim \mathcal{N}(A ⋅ \vec x_t + \vec c_t, Q_t)

\Prc{\vec z_t}{\vec x_t} \sim \mathcal{N}(H ⋅ \vec x_t, R_t)
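A minimal Kalman filter sketch matching these equations (the control term \vec c_t is dropped for brevity; the matrices and data below are illustrative):

```python
import numpy as np

def kalman_filter(zs, A, H, Q, R, x0, P0):
    x, P = x0, P0
    estimates = []
    for z in zs:
        # Predict: x_{t+1|t} = A x_t,  P_{t+1|t} = A P A^T + Q
        x = A @ x
        P = A @ P @ A.T + Q
        # Update with measurement z_t = H x_t + v_t
        S = H @ P @ H.T + R                 # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ (z - H @ x)
        P = (np.eye(len(x)) - K @ H) @ P
        estimates.append(x.copy())
    return np.array(estimates)

# Example: noisy observations of a scalar random walk.
rng = np.random.default_rng(2)
truth = np.cumsum(rng.normal(scale=0.1, size=100))
zs = (truth + rng.normal(scale=0.5, size=100)).reshape(-1, 1)
est = kalman_filter(zs, A=np.eye(1), H=np.eye(1), Q=np.eye(1) * 0.01,
                    R=np.eye(1) * 0.25, x0=np.zeros(1), P0=np.eye(1))
```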

Special case: ARIMA

From https://multithreaded.stitchfix.com/blog/2016/04/21/forget-arima/:

Y_t = \mu_t + x_t \beta + S_t + e_t

\mu_{t+1} = \mu_t + v_t
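A sketch simulating just the local-level component (the regression term x_t β and seasonal term S_t are left out); note that this component is exactly the Kalman filter model above with A = H = 1:

```python
import numpy as np

# Local level: mu_{t+1} = mu_t + v_t,  Y_t = mu_t + e_t.
rng = np.random.default_rng(3)
T, sigma_v, sigma_e = 200, 0.05, 0.3
mu = np.zeros(T)
for t in range(1, T):
    mu[t] = mu[t - 1] + rng.normal(scale=sigma_v)
y = mu + rng.normal(scale=sigma_e, size=T)

# The Kalman filter sketch from the previous section recovers mu_t from y:
# kalman_filter(y.reshape(-1, 1), A=np.eye(1), H=np.eye(1),
#               Q=np.eye(1) * sigma_v**2, R=np.eye(1) * sigma_e**2,
#               x0=np.zeros(1), P0=np.eye(1))
```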

References

To do. How does this relate to https://en.wikipedia.org/wiki/Bayesian_structural_time_series?

https://courses.cs.washington.edu/courses/cse515/09sp/slides/varel.pdf

https://ethz.ch/content/dam/ethz/special-interest/mtec/chair-of-entrepreneurial-risks-dam/documents/dissertation/master%20thesis/Master_Thesis_%20Morzywolek.pdf

https://github.com/jsyoon0823/TimeGAN

http://isomorphisms.sdf.org/maxdama.pdf
