|
Any Bayesian data analysis relies upon a generative
model, i.e. a probabilistic description of the mechanisms by which observed
data are generated. This toolbox does not invert any generative model: it has been developed to deal with a
certain class of generative models, which is described below.
Stochastic nonlinear dynamic causal models
The most general class of generative models this toolbox
can invert is called stochastic nonlinear dynamic causal models (sDCM). It is defined by a joint probability
distribution over the following set of variables:
·
y: the pXnt data time-series
·
x: the nXnt hidden states
time-series
·
x0: the nX1 initial conditions of the system
·
u: the nuXnt inputs time-series
·
θ: the nθX1 evolution parameters
·
φ: the nφX1 observation parameters
·
α: the state noise precision
·
σ: the measurement noise precision
These variables are assumed to obey the above nonlinear
state-space model:
xt+1
= f(xt,θ,ut) + ηt
yt = g(xt,φ)
+ εt
where g and f are the so-called observation and evolution functions, respectively.
In this formulation, ε and η are random
variables, namely measurement and state noise, respectively. They are
assumed to follow normal distributions of the form: ε~N(0,σ-1Qy)
and η~N(0,α-1Qx),
where Qx and Qy
are known (time dependent) covariance matrices. Note that toolbox actually
uses inverse covariance
(i.e. precision) matrices.
Additional prior assumptions (maybe noninformative)
about the statistical behaviour of x0, θ, φ, σ
and α are necessary to then finalize the
specification of the generative model. Without loss of generality, we
assume:
-
Gaussian priors for x0, θ and φ.
-
Gamma priors for σ and α.
Fixing any of these parameters simply means using infinitely
precise priors. This is done by zeroing the adequate prior variances.
|
|

|
|
|
Graphical representation of the sDCM class of generative models
|
A variational Bayesian
approach is then used to invert this generative model, under two approximations:
·
a mean-field separability
assumption between the hidden states, the initial conditions, the
evolution/observation parameters and the precision hyperparameters.
·
A Laplace approximation on all variables except precision hyperparameters
(which have Gamma posterior densities).
Thus, the variational Bayesian
updates boil down to an iterated regularized Gauss-Newton optimization
scheme. The second-order moments of the approximate posterior densities are
then simply related to the curvature of the local cost functions. Note that
the variational Bayesian update of the hidden
states is very similar in form to a Kalman
filter/smoother. This critically reduces the computational complexity of
the scheme. The so-called free energy is then used for approximate Bayesian
model comparison.
What about deterministic systems?
Deterministic DCM is simply a particular case of the sDCM class of generative models above. Deterministic
systems can be modelled by a priori assuming an infinite state noise
precision (α→∞). The states time-series xt then become an implicit function of the
evolution parameters θ. The data is thus determined solely by the
initial conditions x0 and the evolution/observation parameters
(θ and φ). This means that the identification of the such
deterministic systems is treated like a static model (see appropriate
section above) of the form:
y =
h(θ,φ,x0) + ε
where the effective observation function h is such that:
dht/dφ
= dgt/dφ
dht/dθ
= (dgt/dxt) (dxt/dθ) = (dgt/dxt)
( dft/dθ + (dft/dxt-1)
(dxt-1/dθ) )
dht/dx0
= (dgt/dxt) (dft/dxt-1)
(dft-1/dxt-2) … (df1/dx0)
where the two last lines express the
recursive dependency of the effective observation function h (at time t),
with respect to the evolution parameters θ and initial conditions x0.
Given the evolution parameters posterior density p(θ|y), an estimate of the states time series can
also be obtained:
E[xt|y] = xt(E[θ|y])
V[xt|y] = (dxt/dθ)T
V[θ|y] (dxt/dθ)
where the Laplace approximation has
been used.
Importantly, the VB treatment of deterministic DCMs does not rely upon a mean-field factorization of
the joint posterior density p(θ,φ|y)
of evolution and observation parameters. This can be useful to detect non-identifiability issues between these two sets of
parameters.
NB: by default, the toolbox’s inversion of the above sDCM generative model is initialized by inverting the
deterministic variant of the dynamical model.
Dealing with delays
Delayed dynamical systems are generally non-Markovian, in their native form. The toolbox can deal
with certain forms of delays by embedding the state space x into an
augmented state-space X, where Xt = (xt,xt-1,xt-2,…,xt-D)T. Here, D is the
maximum delay considered. This means that the dimension of the embedded
state space is now Dn, which can be considerably
high.
Alternatively, one may construct a ‘corrected’ evolution
function, using the following first-order Taylor expansion around zero-delays.
Let us assume that the system obeys the following ODE (in continuous time):
dx/dt = f(x(t-D))
Taylor-expand the evolution function around D=0:
f(x(t-D))
= f(x(t)) – D (df/dx) (dx/dt) + O(D2)
Thus:
dx/dt = ( I+D (df/dx) )-1
f(x(t)) + O(D2)
which can then be used to derive a
‘delay-corrected’ evolution function in discrete time. For example, an
Euler discretization scheme yields:
xt+1
= xt + Δt
( I+D (df/dx) )-1 f(xt,θ,ut) + ηt
where Δt
is the time discretization step and ηt lumps Δt
O(D2) error terms and any other perturbations.
Note that in this formulation, delays can now be
estimated as part of the evolution parameters (D є θ).
Extending the sDCM generative
model to serially correlated state noise
The toolbox can deal with autoregressive state noise,
using another form of embedding, whereby the AR(1)
state noise now acts as hidden states, passed through an effective
observation function that accounts for the original evolution function. In
essence, this scheme directly estimates the (stochastic) input to the
system, treating the hidden states as dummy variables, whose response to
noise is deterministic but hysteretic.
Alternatively, one can construct an augmented state
space X, where Xt = (xt,zt)T.
Here, zt is the autoregressive
component of the state noise:
Xt+1 = [ f(xt,θ,ut)
+ zt
, A zt ]T + ηt
yt = g(xt,φ)
+ εt
where the A matrix is now part of an
augmented evolution parameter vector (A є θ). The toolbox’s
default solution of the AR(1) state noise is basically
identical to the simple case where A = I, without additional (non
autoregressive) state noise ηt on
the original system’s states dynamics xt
(the corresponding block of the augmented state noise covariance matrix Qx is effectively zero).
Note that an autoregressive process of any order can be
obtained by augmenting the state space with retarded state noise (zt,zt-1,…,zt-D)T,
where D is the order of the autoregressive process. This is basically using
the above delay embedding trick.
Hierarchical sDCM generative
models
One may be willing to model a hierarchical sDCM process, for which the generative model could be:
xt+1(k) = f(k) (xt(k),θ,ut) + ηt(k)
ut(k) = g(k)(xt(k),φ)
+ εt(k)
xt+1(k-1)
= f(k-1) (xt(k-1),θ,ut(k)) + ηt(k-1)
ut(k-1) = g(k-1)(xt(k-1),φ)
+ εt(k-1)
…
xt+1(1) = f(1)(xt(1),θ,ut(2)) + ηt(1)
ut(1) = g(1) (xt(1),φ) + εt(1)
where there are k levels in the
hierarchy and the original (and only) observations are yt
= ut(1).
The toolbox does not deal with this sort of hierarchical
sDCM generative model. However, one can invert a
limiting case of this hierarchical sDCM
generative model, for which the variance of the noise processes εt(k), εt(k-1),
…, εt(2) tends to zero (Qy(k),
Qy(k-1), …, Qy(2)→0). In this case, the
above hierarchical model collapses down to an augmented non-hierarchical sDCM model of the form:
xt+1(k) = f(k) (xt(k),θ,ut) + ηt(k)
xt+1(k-1)
= f(k-1) (xt(k-1),θ,
g(k)(xt(k),φ))
+ ηt(k-1)
…
xt+1(1) = f(1)(xt(1),θ,
g(2)(xt(2),φ))
+ ηt(1)
yt
= g(1) (xt(1),φ)
+ εt(1)
Again, this is now formally equivalent to the dynamical
embedding procedure that was referred above for dealing with delays and
serially correlated state noise, where the augmented state space is now Xt = (xt(k),xt(k-1), …,xt(1))T
.
Getting closer to continuous dynamical systems
The sDCM evolution equation is
inherently discrete in time. If one uses this as an approximation to a
continuous dynamical system, then the goodness of the approximation
strongly depends on the sampling frequency. The toolbox allows one to
specify a ‘microtime resolution’, which is used
to recursively apply the evolution function between two time samples. This
can be useful to minimize the time discretization
errors introduced when approximating the original continuous dynamical
system. For example, let us assume that the system obeys the following ODE:
dx/dt = f(x)
We wish to derive a prediction from x(t)
to the next time sample, x(t+Δt).
One simple solution to this is to use any time discretization and apply recursively it on a smaller
time grid. For example, if one uses an Euler scheme:
x(t+δ) = F(x(t))
where F(x(t)) = x(t) + δ f(x(t))
and δ is a small integration step (δ<<Δt).
Then the recursive application of the Euler evolution function F yields:
x(t+Δt)
= F • F • ... • F(x(t))
dx(t+Δt)/dx(t) = (dF/dx) (dF/dx) ... (dF/dx)
where the function F and its gradient
is evaluated recursively (Δt/δ times)
on the ‘microtime’ resolution grid defined by
δ.
In the limit (Δt/δ>>1),
this recursive Euler scheme tends to a matrix exponential flow, i.e.: F(x(t)) = x(t) + (dF/dx)-1
( exp[ Δt (dF/dx)
] – I ) f(x(t)). Note that this procedure does not allow state noise to
enter anywhere else than at the times where data are sampled. This means
that the impact of state noise will be somehow underestimated.
Dealing with binary observations
The toolbox can deal with binary data, by treating them
as binomial (Bernouilli) samples, whose sufficient
statistics (expectation) is given by the observation function. This means
that the observation equation is basically replaced by the following
binomial log-likelihood function:
log P(yt|xt,φ) = yt log(g(xt,φ))
+ (1-yt) log(1-g(xt,φ))
where g(xt,φ)
= P(yt=1|xt,φ) = E[yt|xt,φ], by definition.
Note that here, the measurement
noise precision σ and covariance components Qy
are irrelevant.
It turns out that this does not affect much the variational Bayesian inversion scheme under the Laplace
approximation. This is because the posterior density on
evolution/observation parameters, hidden states and initial conditions is
still asymptotically Gaussian.
Note also that when the dimension of the data is high
enough (pnt>>1), the empirical
distribution of the ‘residuals’ εt
= yt - g(xt,φ) will tend to a Normal density.
This means that one can then approximate the above likelihood with a
Gaussian density with mean g(xt,φ)
and (unknown) variance σ-1, which falls back into the sDCM class of generative models (this result is
actually used during the initialization of the algorithm).
Inverting static (hierarchical) models
Static models are simplifications of the above sDCM class of generative models, where the dimension of
the hidden states and the evolution parameters tend to zero (nt=1, n→0 and nθ→0).
This reduces the model to a nonlinear observation equation:
y = g(φ)
+ ε
with fixed priors on the observation
parameters.
A simple (2 levels) hierarchical extension of this
static model (whereby the priors are also learned) can easily be derived by
actually removing the evolution/observation parameters, but retaining the
initial conditions and the first hidden states, with, e.g., identity
evolution function:
x = x0
+ η
y = g(x) +
ε
where x, x0, σ and
α are estimated.
As many hierarchical levels as necessary (e.g., t>1
levels) can be obtained by collapsing the higher levels onto the next to
last level. For example, the following hierarchical (linear) generative
model
x1 = A1 x0 + η1
…
xt-1
= At-1 xt-2 + ηt-1
xt =
At xt-1 + ηt
yt =
g(xt,φ) + εt
where At are known matrices, can be rewritten
as:
xt = At At-1 … A1
x0 + ηt + At
ηt-1 + At At-1 ηt-1 +
At At-1 … A1 η1
yt = g(xt,φ)
+ εt
which is just another particular case
of the above generative model, with linear evolution function f(x0)
= At At-1 … A1 x0 and
appropriate state covariance matrix, e.g., Qx
= (I + At + At At-1 + At At-1
… A1)2 if one originally assumed i.i.d. state noise.
Alternatively, one can also retain the ‘time-series’
structure of the sDCM generative model, and use
infinite measurement noise covariance components (Qy→∞)
up to the next to last ‘time’, to arrive at the same effective generative
model (whereby only data at the last level will be considered for
inference).
Note that probabilistic PCA, ICA and factor analyses
obtain from retaining both the states and the observation parameters, which
then encode the unknown so-called ‘mixing matrix’ Φ:
x = f(x0)
+ η
y = Φx + ε
where the observation function has
been restricted to the bilinear mapping g(x,φ)=
Φx, and Φ is simply the matrix whose
elements are composed of the vector of (observation) parameters φ.
These data decomposition methods then effectively correspond to different
priors and evolution functions (e.g. diagonal prior covariance component
for factor analysis, sigmoidal evolution function
for non-Gaussian sources in ICA, etc…).
|