This vignette covers the whole process, from Bayesian network structure learning to parameter estimation and data simulation.
The package `abn`
is a collection of functions for
modelling of additive Bayesian networks. It contains routines to score
Bayesian Networks based on Bayesian (default) or information-theoretic
formulation of generalized linear models. Depending on the type of data,
the package supports a possible mixture of continuous, discrete, and
count data. The following table shows which distribution types are
supported by each method of estimation:
Distribution type | method = "bayes" | method = "mle" |
---|---|---|
Gaussian | ✅ | ✅ |
Binomial | ✅ | ✅ |
Poisson | ✅ | ✅ |
Multinomial | ❌ | ✅ |
Structure learning of additive Bayesian networks with `abn` is a
three-step process. Based on a set of model specifications (data,
maximal number of possible parent nodes, restricted or enforced arcs,
etc.), `abn` calculates in a first step the score of the data given the
model (`buildScoreCache()`). This list of scores is then used in a
second step to estimate the most probable Bayesian network structure
(“structure learning”), and in a third step the parameters of that
structure are estimated (`fitAbn()`). Four structure-learning algorithms
have been implemented in `abn`: hill climbing, “exact search”, simulated
annealing, and tabu search. Beyond parameter estimation, the package
provides routines to simulate data from the fitted additive Bayesian
network model.
The following example shows how to find the best fitting graphical structure using an exact search algorithm.
The artificial data set `ex1.dag.data` comes with `abn` and contains
10000 observations of 10 variables. The variables are a mixture of
continuous (`gaussian`), binary (`binomial`), and count (`poisson`)
data. The data set was simulated from a known network structure.
`abn` requires a named list giving the distribution type of each
node in the data set.
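As a minimal sketch, the data set can be loaded and the distribution list defined as shown here. The object names `mydat` and `mydists` match those used in the call to `buildScoreCache()` further down. The node names `b1`, `p1`, `g1`, and so on are assumed to be the column names of `ex1.dag.data`, so it is worth checking them against `names(mydat)`.

```r
library(abn)

# Example data shipped with abn
mydat <- ex1.dag.data

# One distribution per node, keyed by column name.
# Assumption: columns are named b* (binary), p* (count), g* (continuous).
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", b4 = "binomial", b5 = "binomial",
                g3 = "gaussian")

# The list should cover exactly the columns of the data set
stopifnot(setequal(names(mydists), names(mydat)))
```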
The `max.parents` argument, supplied here through the variable
`max.par`, sets the maximum number of parent nodes for each node in the
data set. It can be set to a single value for all nodes or to a list
with the node names as keys and the maximum number of parent nodes as
values. This parameter is crucial for speeding up the model estimation
in `abn` because it limits the number of possible parent combinations.
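The sketch below shows both ways of specifying the limit and how much it shrinks the number of parent combinations per node. The value 4 is an assumption chosen for illustration, and the node names in the list form are assumed to match the columns of `ex1.dag.data`.

```r
# Assumed value for illustration: at most four parents for every node
max.par <- 4

# Alternative form: a named list with one limit per node
# (node names are assumed, not taken from the text above)
max.par.list <- list(b1 = 4, p1 = 4, g1 = 4, b2 = 4, p2 = 4,
                     b3 = 4, g2 = 4, b4 = 4, b5 = 4, g3 = 4)

# With 10 variables, every node has 9 candidate parents
n.candidates <- 9

# Parent sets per node with the limit: all subsets of size 0 to max.par
n.limited <- sum(choose(n.candidates, 0:max.par))

# Parent sets per node without any limit: all subsets
n.all <- 2^n.candidates

c(limited = n.limited, unlimited = n.all)
```

With 9 candidate parents, a limit of 4 leaves 256 parent sets per node instead of 512. The saving grows quickly as the number of variables increases.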
The score cache is a list of scores for each possible parent combination for each node in the data set. It is used to learn the structure of the Bayesian network in the next step.
mycache <- buildScoreCache(data.df = mydat,
                           data.dists = mydists,
                           method = "bayes", # the default method is "bayes"
                           max.parents = max.par)
At a minimum, `buildScoreCache()` requires the data set and the
distribution list. By default, the function uses the Bayesian score,
which is based on the posterior probability of the model given the data.
To use the log-likelihood score, Akaike Information Criterion (AIC), or
Bayesian Information Criterion (BIC) instead, set the `method` argument
to `"mle"`.
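A score cache based on maximum likelihood could be built along these lines. This is a sketch only: the node names are assumed to match the columns of `ex1.dag.data`, the name `mycache.mle` is hypothetical, and `max.parents = 2` is chosen merely to keep the run short.

```r
library(abn)

mydat <- ex1.dag.data
# Assumed node names; check against names(mydat)
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", b4 = "binomial", b5 = "binomial",
                g3 = "gaussian")

# Same call as for the Bayesian score, but with method = "mle"
mycache.mle <- buildScoreCache(data.df = mydat,
                               data.dists = mydists,
                               method = "mle",
                               max.parents = 2) # small limit for a quick run

# Inspect which components, including scores, the cache holds
names(mycache.mle)
```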
The function `buildScoreCache()` also accepts specifications of banned
and retained arcs, which restrict or enforce the presence of certain
arcs in the network structure. This is useful when prior knowledge about
the network structure is available, e.g. when expert knowledge or
previous analyses indicate that certain arcs must be present or absent.
The `max.parents` argument sets the maximum number of parent nodes for
each node in the data set. Together with the `dag.banned` and
`dag.retained` arguments, it restricts the model search space and can
speed up the model estimation in `abn`.
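One way to pass such constraints is as square 0/1 matrices with node names on both dimensions, where rows are read as child nodes and columns as parent nodes. The sketch below rests on several assumptions: the node names, the two arcs (picked arbitrarily for illustration, not from any real prior knowledge), the object names `ban`, `retain`, and `mycache.restricted`, and the small `max.parents` value.

```r
library(abn)

mydat <- ex1.dag.data
# Assumed node names; check against names(mydat)
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", b4 = "binomial", b5 = "binomial",
                g3 = "gaussian")

# Empty constraint matrices: rows = child nodes, columns = parent nodes
nodes <- names(mydat)
ban <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
retain <- ban

# Arbitrary illustrative constraints
ban["b1", "b2"] <- 1    # forbid the arc from b2 to b1
retain["g1", "p1"] <- 1 # enforce the arc from p1 to g1

mycache.restricted <- buildScoreCache(data.df = mydat,
                                      data.dists = mydists,
                                      method = "bayes",
                                      dag.banned = ban,
                                      dag.retained = retain,
                                      max.parents = 2) # small limit for speed
```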
The next step is to find the best fitting graphical structure of the
Bayesian network. In this example, we use the exact search algorithm to
find the most probable Bayesian network structure given the score cache
from the previous step. We supply the score cache, an `abnCache` object
from the previous step, to the structure learning function.
The `mostProbable()` function returns an object of class `abnLearned`,
which contains the most probable Bayesian network structure and the
score of the model given the data.
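Put together, the exact search might look like the following sketch. The node names and `max.par <- 4` are assumptions, and `mp.dag` is a name chosen here for the result. Building the cache for all 10 variables can take a little while.

```r
library(abn)

mydat <- ex1.dag.data
# Assumed node names; check against names(mydat)
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", b4 = "binomial", b5 = "binomial",
                g3 = "gaussian")
max.par <- 4 # assumed value

# Step 1: score every allowed parent combination for every node
mycache <- buildScoreCache(data.df = mydat,
                           data.dists = mydists,
                           method = "bayes",
                           max.parents = max.par)

# Step 2: exact search for the most probable structure
mp.dag <- mostProbable(score.cache = mycache)

# Adjacency matrix of the learned structure
mp.dag$dag
```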
The parameters of the network can be estimated using the `fitAbn()`
function. It returns an object of class `abnFit`, which contains the
estimated parameters of the network.
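For completeness, here is a sketch of all three steps ending with the parameter estimation. As before, the node names, `max.par <- 4`, and the object names `mp.dag` and `myres` are assumptions made for illustration.

```r
library(abn)

mydat <- ex1.dag.data
# Assumed node names; check against names(mydat)
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", b4 = "binomial", b5 = "binomial",
                g3 = "gaussian")
max.par <- 4 # assumed value

# Step 1: build the score cache
mycache <- buildScoreCache(data.df = mydat,
                           data.dists = mydists,
                           method = "bayes",
                           max.parents = max.par)

# Step 2: learn the structure with the exact search
mp.dag <- mostProbable(score.cache = mycache)

# Step 3: estimate the parameters of the learned structure
myres <- fitAbn(object = mp.dag)

# Show the fitted model
print(myres)
```

Passing the learned object directly to `fitAbn()` lets the function reuse the data and distribution list stored alongside the score cache, so they do not have to be supplied again.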