Time Series

Dynamic Bayesian networks are a general class of state-space time series models, encompassing as special cases models such as Kalman filters and the structural models behind ARIMA (both treated below).

Definition

A Bayesian network is a directed acyclic graph $G = (V, E)$ where each node $x ∈ V$ has an associated conditional probability distribution, such that the joint probability over $V$ is

$$ \Pr{V} = \prod_{x ∈ V} \Prc{x}{π_x} $$

where $π_x$ denotes the parents of node $x$.
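
As a concrete illustration, here is a minimal Python sketch of this factorization; the two-node network, the node names and the CPDs are made up for the example:

```python
import math

# Hypothetical two-node network: rain -> wet_grass.
# Each entry maps a node to (parents, conditional probability function).
network = {
    "rain":      ([],       lambda v, parents: 0.2 if v else 0.8),
    "wet_grass": (["rain"], lambda v, parents:
                  (0.9 if v else 0.1) if parents["rain"] else (0.05 if v else 0.95)),
}

def log_joint(assignment):
    """log Pr(V) = sum over x of log Pr(x | parents(x))."""
    total = 0.0
    for node, (parents, cpd) in network.items():
        parent_values = {p: assignment[p] for p in parents}
        total += math.log(cpd(assignment[node], parent_values))
    return total

print(math.exp(log_joint({"rain": True, "wet_grass": True})))  # 0.2 * 0.9 = 0.18
```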

A dynamic Bayesian network is a pair $(B_0, B_t)$ where $B_0$ is a Bayesian network representing the initial distribution (i.e. at time $0$) and $B_t$ is the transition network: a graph on the same nodes $V$ but with a different edge set $E_t$ that may contain cycles and self-loops, with edges interpreted as going from one time step to the next. Denote by $τ_x$ the parents of $x$ in the transition network.

Now replicate $V$ for each time $t$, giving slices $V_0, V_1, …$ that all contain the same nodes. Similarly, each node $x ∈ V$ has copies $x_0, x_1, …$

$$ \Pr{V_0} = \prod_{x_0 ∈ V_0} \Prc{x_0}{π_{x_0}} $$

$$ \Prc{V_t}{V_{t-1}} = \prod_{x_t ∈ V_t} \Prc{x_t}{π_{x_t}, τ_{x_{t-1}}} $$

Unrolling over time and applying the chain rule:

$$ \Pr{V_{0:T}} = \Pr{V_0} ⋅ \prod_{t ∈ 1:T} \Prc{V_t}{V_{t-1}} = \prod_{x_0 ∈ V_0} \Prc{x_0}{π_{x_0}} ⋅ \prod_{t ∈ 1:T} \prod_{x_t ∈ V_t} \Prc{x_t}{π_{x_t}, τ_{x_{t-1}}} $$
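
A minimal sketch of evaluating this unrolled joint, assuming a made-up one-node network whose only transition edge is a self-loop:

```python
import math

# Hypothetical one-node DBN: node "x" with a self-loop in the transition network.
def initial_logp(x0):
    return math.log(0.5)                      # Pr(x_0 = True) = Pr(x_0 = False) = 0.5

def transition_logp(x_t, x_prev):
    stay = 0.9                                # probability of keeping the previous value
    return math.log(stay if x_t == x_prev else 1.0 - stay)

def log_joint(trajectory):
    """log Pr(V_{0:T}) = log Pr(V_0) + sum over t of log Pr(V_t | V_{t-1})."""
    total = initial_logp(trajectory[0])
    for t in range(1, len(trajectory)):
        total += transition_logp(trajectory[t], trajectory[t - 1])
    return total

print(log_joint([True, True, False, False]))
```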

Alternative. A special case is when $x_t$ only depends on $x_{t-1}$ and on other nodes at time $t$, so history propagates only through the node itself. In this case the transition edge set $E_t$ contains only self-loops.

This leads to

$$ \Prc{V_t}{V_{t-1}} = \prod_{x_t ∈ V_t} \Prc{x_t}{π_{x_t}, x_{t-1}} $$

Note. This model is Markovian in that temporal relations only exist between $t$ and $t+1$. If longer lags are required, we can add them as extra state variables, as in the example below. (Or modify the model, but extra state seems more meaningful.)
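
For example, to let $x_t$ depend on $x_{t-2}$, introduce an auxiliary state variable $y$ that holds the previous value of $x$:

$$ y_t := x_{t-1} \quad⇒\quad \Prc{x_t}{x_{t-1}, x_{t-2}} = \Prc{x_t}{x_{t-1}, y_{t-1}} $$

so the second-order dependence becomes an ordinary one-step dependence on the augmented state $(x, y)$.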

Partial observation. Only a subset of $V$ is observable.

Learning structure

It is possible to learn the structure of the graph from the data.

For learning the base structure, we can use all the available data for each variable, ignoring the temporal information; this is equivalent to learning an ordinary BN. For learning the transition network we use the temporal information, in particular the data for all variables in two consecutive time slices, $X_t$ and $X_{t+1}$. Given the base structure, we can then learn the dependencies between the variables at time $t$ and $t+1$. [source]
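
A minimal sketch of preparing the data for these two steps, assuming the observations sit in a pandas DataFrame with one row per time step and one column per variable (the variable names and column suffixes are made up for the example):

```python
import pandas as pd

# Made-up observations: one row per time step, one column per variable.
data = pd.DataFrame({
    "a": [0, 1, 1, 0, 1, 1, 0],
    "b": [1, 1, 0, 0, 1, 0, 0],
})

# Base structure: ignore time, treat every row as an i.i.d. sample.
base_samples = data

# Transition structure: pair each time slice with the next one, giving
# columns a_t, b_t, a_t+1, b_t+1 for a structure learner restricted to
# edges pointing into the *_t+1 columns.
transition_samples = pd.concat(
    [
        data.iloc[:-1].reset_index(drop=True).add_suffix("_t"),
        data.iloc[1:].reset_index(drop=True).add_suffix("_t+1"),
    ],
    axis=1,
)
print(transition_samples.head())
```

Either table can then be fed to any score-based BN structure learner.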

Learning distribution parameters: CMA-ES on a cost function

In contrast with the expectation maximization (EM) method, the model is not optimized for distribution fit, but for a specified cost function. This makes the learning less sensitive to modelling errors.

One possible cost function is the negative log-likelihood, which recovers maximum likelihood estimation.
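
A minimal sketch using the `cma` package (an assumption; any black-box optimizer would do), fitting the two parameters of a Gaussian random-walk transition $x_{t+1} ∼ \mathcal{N}(x_t + d, σ^2)$ with the negative log-likelihood as the cost function:

```python
import numpy as np
import cma  # pycma package, assumed installed (pip install cma)

# Simulate a Gaussian random walk as stand-in data.
rng = np.random.default_rng(0)
true_drift, true_sigma = 0.3, 0.5
x = np.cumsum(rng.normal(true_drift, true_sigma, size=500))

def cost(params):
    """Negative log-likelihood of the observed increments under N(drift, sigma^2)."""
    drift, log_sigma = params
    sigma = np.exp(log_sigma)                 # parameterize sigma > 0
    d = np.diff(x) - drift
    return float(np.sum(0.5 * (d / sigma) ** 2 + np.log(sigma)))

es = cma.CMAEvolutionStrategy([0.0, 0.0], 0.5, {"verbose": -9})
es.optimize(cost)
drift_hat, log_sigma_hat = es.result.xbest
print("drift ≈", drift_hat, "sigma ≈", np.exp(log_sigma_hat))
```

The same loop works unchanged if the cost is a task-specific error (e.g. forecast error on a validation window) rather than the likelihood, which is the point of this approach.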

Special case: Kalman filters

$$ \vec x_{t+1} = A ⋅ \vec x_t + \vec c_t + \vec w_t $$

$$ \vec z_{t} = H ⋅ \vec x_t + \vec v_t $$

With noise vectors $\vec w_t ∼ \mathcal{N}(0, Q_t)$ and $\vec v_t ∼ \mathcal{N}(0, R_t)$.

$$ \Prc{\vec x_{t+1}}{\vec x_t} \sim \mathcal{N}(A ⋅ \vec x_t + \vec c_t, Q_t) $$

$$ \Prc{\vec z_t}{\vec x_t} \sim \mathcal{N}(H ⋅ \vec x_t, R_t) $$
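
A minimal numpy sketch of one predict/update cycle for this model; the matrices below are illustrative placeholders, not taken from the text:

```python
import numpy as np

def kalman_step(x, P, z, A, c, H, Q, R):
    """One Kalman filter cycle: predict with the transition model, then update with measurement z."""
    # Predict: x_{t+1|t} = A x_t + c,  P_{t+1|t} = A P A^T + Q
    x_pred = A @ x + c
    P_pred = A @ P @ A.T + Q
    # Update: fold in the measurement z_t = H x_t + v_t
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Illustrative 2D constant-velocity model, observed through position only.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
c = np.zeros(2)
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
x, P = np.zeros(2), np.eye(2)
x, P = kalman_step(x, P, np.array([1.2]), A, c, H, Q, R)
print(x)
```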

Special case: ARIMA

https://multithreaded.stitchfix.com/blog/2016/04/21/forget-arima/

$$ Y_t = \mu_t + x_t \beta + S_t + e_t $$

$$ \mu_{t+1} = \mu_t + v_t $$

where $\mu_t$ is a local level following a random walk, $x_t \beta$ a regression component, $S_t$ a seasonal component, and $e_t$, $v_t$ are noise terms.
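
A minimal simulation sketch of this structural model with the regression and seasonal components dropped ($x_t \beta = S_t = 0$); the noise scales are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
sigma_level, sigma_obs = 0.1, 0.5            # made-up noise scales

mu = np.zeros(T)                             # local level mu_t
y = np.zeros(T)                              # observations Y_t
for t in range(T):
    y[t] = mu[t] + rng.normal(0.0, sigma_obs)              # Y_t = mu_t + e_t
    if t + 1 < T:
        mu[t + 1] = mu[t] + rng.normal(0.0, sigma_level)   # mu_{t+1} = mu_t + v_t

print(y[:5])
```

The level $\mu_t$ is the hidden state of a one-dimensional linear Gaussian state-space model with $A = H = 1$, which is how this fits the DBN framework above.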

References

To do. How does this relate to https://en.wikipedia.org/wiki/Bayesian_structural_time_series?

https://courses.cs.washington.edu/courses/cse515/09sp/slides/varel.pdf

https://ethz.ch/content/dam/ethz/special-interest/mtec/chair-of-entrepreneurial-risks-dam/documents/dissertation/master%20thesis/Master_Thesis_%20Morzywolek.pdf

https://github.com/jsyoon0823/TimeGAN

http://isomorphisms.sdf.org/maxdama.pdf
