Title: | Modelling Multivariate Data with Additive Bayesian Networks |
---|---|
Description: | The 'abn' R package facilitates Bayesian network analysis, a probabilistic graphical model that derives from empirical data a directed acyclic graph (DAG). This DAG describes the dependency structure between random variables. The R package 'abn' provides routines to help determine optimal Bayesian network models for a given data set. These models are used to identify statistical dependencies in messy, complex data. Their additive formulation is equivalent to multivariate generalised linear modelling, including mixed models with independent and identically distributed (iid) random effects. The core functionality of the 'abn' package revolves around model selection, also known as structure discovery. It supports both exact and heuristic structure learning algorithms and does not restrict the data distribution of parent-child combinations, providing flexibility in model creation and analysis. The 'abn' package uses Laplace approximations for metric estimation and includes wrappers to the 'INLA' package. It also employs 'JAGS' for data simulation purposes. For more resources and information, visit the 'abn' website. |
Authors: | Matteo Delucchi [aut, cre] , Reinhard Furrer [aut] , Gilles Kratzer [aut] , Fraser Iain Lewis [aut] , Jonas I. Liechti [ctb] , Marta Pittavino [ctb] , Kalina Cherneva [ctb] |
Maintainer: | Matteo Delucchi <[email protected]> |
License: | GPL (>= 3) |
Version: | 3.1.3 |
Built: | 2024-11-23 06:23:01 UTC |
Source: | https://github.com/furrer-lab/abn |
abnFit
Print AIC of objects of class abnFit
## S3 method for class 'abnFit'
AIC(object, digits = 3L, verbose = TRUE, ...)
object
Object of class abnFit.
digits
number of digits of the results.
verbose
print additional output.
...
additional parameters. Not used at the moment.
prints the AIC of the fitted model.
abnFit
Print BIC of objects of class abnFit
## S3 method for class 'abnFit'
BIC(object, digits = 3L, verbose = TRUE, ...)
object
Object of class abnFit.
digits
number of digits of the results.
verbose
print additional output.
...
additional parameters. Not used at the moment.
prints the BIC of the fitted model.
buildScoreCache
Allows the user to set restrictions in buildScoreCache for both the Bayesian and the MLE approach. Control function similar to fit.control.
build.control(method = "bayes", max.mode.error = 10, mean = 0, prec = 0.001,
              loggam.shape = 1, loggam.inv.scale = 5e-05, max.iters = 100,
              epsabs = 1e-07, error.verbose = FALSE, trace = 0L,
              epsabs.inner = 1e-06, max.iters.inner = 100,
              finite.step.size = 1e-07, hessian.params = c(1e-04, 0.01),
              max.iters.hessian = 10, max.hessian.error = 0.5,
              factor.brent = 100, maxiters.hessian.brent = 100,
              num.intervals.brent = 100, n.grid = 250, ncores = 1,
              cluster.type = "FORK", max.irls = 100, tol = 1e-08,
              tolPwrss = 1e-07, check.rankX = "message+drop.cols",
              check.scaleX = "message+rescale", check.conv.grad = "message",
              check.conv.singular = "message", check.conv.hess = "message",
              xtol_abs = 1e-06, ftol_abs = 1e-06, trace.mblogit = FALSE,
              catcov.mblogit = "free", epsilon = 1e-06, seed = 9062019L)
method
a character that takes one of two values: "bayes" or "mle". Overrides the method argument of buildScoreCache.
max.mode.error
if the estimated modes from INLA differ by a factor of max.mode.error/100 or more from those computed using the internal optimization routine, the internal results are used instead.
mean
the prior mean for all the Gaussian additive terms for each node. Also passed as the corresponding INLA prior argument.
prec
the prior precision (1/variance) for all the Gaussian additive terms for each node. Also passed as the corresponding INLA prior argument.
loggam.shape
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. Also passed as the corresponding INLA prior argument.
loggam.inv.scale
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. Also passed as the corresponding INLA prior argument.
max.iters
total number of iterations allowed when estimating the modes in the Laplace approximation. Passed to the internal optimization routine.
epsabs
absolute error when estimating the modes in the Laplace approximation for models with no random effects. Passed to the internal optimization routine.
error.verbose
logical, additional output in the case of errors occurring in the optimization. Passed to the internal optimization routine.
trace
non-negative integer. If positive, tracing information on the progress of the "L-BFGS-B" optimization is produced. Higher values may produce more tracing information (there are six levels of tracing; to understand exactly what these do, see the source code). Passed to stats::optim.
epsabs.inner
absolute error in the maximization step in the (nested) Laplace approximation for each random effect term. Passed to the internal optimization routine.
max.iters.inner
total number of iterations in the maximization step in the nested Laplace approximation. Passed to the internal optimization routine.
finite.step.size
suggested step length used in the finite-difference estimation of the derivatives for the (outer) Laplace approximation when estimating modes. Passed to the internal optimization routine.
hessian.params
a numeric vector giving parameters for the adaptive algorithm that determines the optimal step size in the finite-difference estimation of the Hessian. The first entry is the initial guess, the second entry the absolute error. Passed to the internal optimization routine.
max.iters.hessian
integer, maximum number of iterations to use when determining an optimal finite-difference approximation (Nelder-Mead). Passed to the internal optimization routine.
max.hessian.error
if the estimated log marginal likelihood when using an adaptive 5-point finite-difference rule for the Hessian differs by more than max.hessian.error, the Brent-Dekker root-bracketing method is used instead.
factor.brent
if using the Brent-Dekker root-bracketing method, defines the outermost interval end points relative to the best estimate of the step size h, as h/factor.brent and h*factor.brent.
maxiters.hessian.brent
maximum number of iterations allowed in the Brent-Dekker method. Passed to the internal optimization routine.
num.intervals.brent
the number of initial different bracket segments to try in the Brent-Dekker method. Passed to the internal optimization routine.
n.grid
recompute the density on an equally spaced grid with n.grid points.
ncores
the number of cores to parallelize over, see 'Details'. If > 0, the number of CPU cores to be used; -1 uses all available cores but one. Only for method = "mle".
cluster.type
the type of cluster to be used, see parallel::makeCluster.
max.irls
total number of iterations for estimating network scores using an Iteratively Reweighted Least Squares (IRLS) algorithm.
tol
real number giving the minimal tolerance expected to terminate the IRLS algorithm used to estimate a network score.
tolPwrss
numeric scalar passed to lme4::glmerControl; tolerance for declaring convergence in the penalized iteratively weighted residual sum-of-squares step.
check.rankX
character passed to lme4::glmerControl.
check.scaleX
character passed to lme4::glmerControl.
check.conv.grad
character passed to lme4::glmerControl.
check.conv.singular
character passed to lme4::glmerControl.
check.conv.hess
character passed to lme4::glmerControl.
xtol_abs
defaults to 1e-6; stop on small change of parameter value. Only for method = "mle".
ftol_abs
defaults to 1e-6; stop on small change in deviance. Similar to xtol_abs.
trace.mblogit
logical indicating if output should be produced for each iteration. Directly passed to mclogit::mblogit.
catcov.mblogit
defaults to "free", meaning that there are no restrictions on the covariances of random effects between the logit equations. Set to "diagonal" if random effects pertaining to different categories are uncorrelated, or to "single" if random-effect variances pertaining to all categories are identical.
epsilon
defaults to 1e-6; positive convergence tolerance.
seed
a non-negative integer which sets the seed in set.seed.
Parallelization over all children is possible via the function foreach of the package doParallel. ncores=0 or ncores=1 uses single-threaded foreach; ncores=-1 uses all available cores but one.
A named list according to the provided arguments.
Other buildScoreCache:
buildScoreCache()
ctrlmle <- abn::build.control(method = "mle", ncores = 0,
                              cluster.type = "PSOCK", max.irls = 100,
                              tol = 10^-11, tolPwrss = 1e-7,
                              check.rankX = "message+drop.cols",
                              check.scaleX = "message+rescale",
                              check.conv.grad = "message",
                              check.conv.singular = "message",
                              check.conv.hess = "message",
                              xtol_abs = 1e-6, ftol_abs = 1e-6,
                              trace.mblogit = FALSE, catcov.mblogit = "free",
                              epsilon = 1e-6, seed = 9062019L)
ctrlbayes <- abn::build.control(method = "bayes", max.mode.error = 10,
                                mean = 0, prec = 0.001, loggam.shape = 1,
                                loggam.inv.scale = 5e-05, max.iters = 100,
                                epsabs = 1e-07, error.verbose = FALSE,
                                epsabs.inner = 1e-06, max.iters.inner = 100,
                                finite.step.size = 1e-07,
                                hessian.params = c(1e-04, 0.01),
                                max.iters.hessian = 10, max.hessian.error = 0.5,
                                factor.brent = 100,
                                maxiters.hessian.brent = 100,
                                num.intervals.brent = 100,
                                tol = 10^-8, seed = 9062019L)
Simple check on the control parameters
check.valid.fitControls(control, method = "bayes", verbose = FALSE)
control
list of control arguments with new parameter values, as supplied to fit.control or build.control.
method
"bayes" or "mle" strategy, from the method argument of the calling function.
verbose
when TRUE additional information is printed. Defaults to FALSE.
A list of all control arguments for the chosen method, updated with the new values.
abnFit
Print coefficients of objects of class abnFit
## S3 method for class 'abnFit'
coef(object, digits = 3L, verbose = TRUE, ...)
object
Object of class abnFit.
digits
number of digits of the results.
verbose
print additional output.
...
additional parameters. Not used at the moment.
prints the coefficients of the fitted model.
Function that returns multiple graph metrics to compare two DAGs or essential graphs; in statistical classification these metrics are known as a confusion matrix or error matrix.
compareDag(ref, test, node.names = NULL, checkDAG = TRUE)
ref
a matrix or a formula statement (see details for format) defining the reference network structure, a directed acyclic graph (DAG). Note that row names must be set or given in node.names.
test
a matrix or a formula statement (see details for format) defining the test network structure, a directed acyclic graph (DAG). Note that row names must be set or given in node.names.
node.names
a vector of names if the DAGs are given via formula, see details.
checkDAG
should the structures be tested for being DAGs (defaults to TRUE).
This R function returns standard Directed Acyclic Graph comparison metrics. In statistical classification, these metrics are known as a confusion matrix or error matrix.
These metrics allow visualization of the difference between DAGs. When comparing the true structure to a learned structure, or two learned structures, they let the user estimate the performance of the learning algorithm. To compute the metrics, a contingency table is computed from a weighted difference of the adjacency matrices of the two graphs.
The ref or test can be provided using a formula statement (similar to GLM input). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3. The formula statement has to start with ~. In this example, node1 has two parents (parent1 and parent2), while node2 and node3 share the same parent3. The parent names have to exactly match those given in node.names. : is the separator between either children or parents, | separates children (left side) and parents (right side), + separates terms, and . replaces all the variables in node.names. A formula-based call is sketched after this paragraph.
To test for essential graphs (or graphs in general), the test for DAGs needs to be switched off with checkDAG=FALSE. The function compareEG() is a wrapper to compareDag(, checkDAG=FALSE).
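A minimal sketch of the formula interface (hypothetical nodes a, b, c; node.names must be supplied with formula input):
## b has parent a and c has parents a and b (ref); in test, c lacks parent a:
unlist(compareDag(ref = ~b|a + c|a:b, test = ~b|a + c|b,
                  node.names = c("a", "b", "c")))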
TP
True Positive
TN
True Negative
FP
False Positive
FN
False Negative
CP
Condition Positive (ref)
CN
Condition Negative (ref)
PCP
Predicted Condition Positive (test)
PCN
Predicted Condition Negative (test)
True Positive Rate
TP/CP
False Positive Rate
FP/CN
Accuracy
(TP + TN)/(CP + CN)
G-measure
sqrt(PPV * TPR)
F1-Score
2*TP/(2*TP + FP + FN)
Positive Predictive Value
TP/PCP
False Omission Rate
FN/PCN
Hamming-Distance
Number of changes needed to match the matrices.
Sammut, Claude, and Geoffrey I. Webb. (2017). Encyclopedia of machine learning and data mining. Springer.
test.m <- matrix(data = c(0,1,0, 0,0,0, 1,0,0), nrow = 3, ncol = 3)
ref.m <- matrix(data = c(0,0,0, 1,0,0, 1,0,0), nrow = 3, ncol = 3)
colnames(test.m) <- rownames(test.m) <- colnames(ref.m) <- rownames(ref.m) <- c("a", "b", "c")
unlist(compareDag(ref = ref.m, test = test.m))
Function that returns multiple graph metrics to compare two DAGs or essential graphs; in statistical classification these metrics are known as a confusion matrix or error matrix.
compareEG(ref, test)
ref
a matrix or a formula statement (see details for format) defining the reference network structure, a directed acyclic graph (DAG). Note that row names must be set.
test
a matrix or a formula statement (see details for format) defining the test network structure, a directed acyclic graph (DAG). Note that row names must be set.
This R function returns standard Directed Acyclic Graph comparison metrics. In statistical classification, these metrics are known as a confusion matrix or error matrix.
These metrics allow visualization of the difference between DAGs. When comparing the true structure to a learned structure, or two learned structures, they let the user estimate the performance of the learning algorithm. To compute the metrics, a contingency table is computed from a weighted difference of the adjacency matrices of the two graphs.
The ref or test can be provided using a formula statement (similar to GLM input). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3. The formula statement has to start with ~. In this example, node1 has two parents (parent1 and parent2), while node2 and node3 share the same parent3. The parent names have to exactly match the node names. : is the separator between either children or parents, | separates children (left side) and parents (right side), + separates terms, and . replaces all the variables.
To test for essential graphs (or graphs in general), the test for DAGs needs to be switched off with checkDAG=FALSE. The function compareEG() is a wrapper to compareDag(, checkDAG=FALSE); a small sketch follows.
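A minimal sketch (hypothetical matrices): an undirected edge of an essential graph is encoded by symmetric 1s in the adjacency matrix, and compareEG() skips the acyclicity check that compareDag() would apply:
eg.ref <- matrix(c(0,1,1, 1,0,0, 0,0,0), nrow = 3, ncol = 3)   # a - b undirected, a -> c
eg.test <- matrix(c(0,1,0, 1,0,0, 0,0,0), nrow = 3, ncol = 3)  # a - b undirected only
colnames(eg.ref) <- rownames(eg.ref) <- c("a", "b", "c")
colnames(eg.test) <- rownames(eg.test) <- c("a", "b", "c")
unlist(compareEG(ref = eg.ref, test = eg.test))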
TP
True Positive
TN
True Negative
FP
False Positive
FN
False Negative
CP
Condition Positive (ref)
CN
Condition Negative (ref)
PCP
Predicted Condition Positive (test)
PCN
Predicted Condition Negative (test)
True Positive Rate
TP/CP
False Positive Rate
FP/CN
Accuracy
(TP + TN)/(CP + CN)
G-measure
sqrt(PPV * TPR)
F1-Score
2*TP/(2*TP + FP + FN)
Positive Predictive Value
TP/PCP
False Omission Rate
FN/PCN
Hamming-Distance
Number of changes needed to match the matrices.
Sammut, Claude, and Geoffrey I. Webb. (2017). Encyclopedia of machine learning and data mining. Springer.
test.m <- matrix(data = c(0,1,0, 0,0,0, 1,0,0), nrow = 3, ncol = 3)
ref.m <- matrix(data = c(0,0,0, 1,0,0, 1,0,0), nrow = 3, ncol = 3)
colnames(test.m) <- rownames(test.m) <- colnames(ref.m) <- rownames(ref.m) <- c("a", "b", "c")
unlist(compareDag(ref = ref.m, test = test.m))
This function discretizes a data frame of possibly continuous random variables through rules for discretization. The discretization algorithms are unsupervised and univariate. See details for the complete list of discretization rules (the number of states of each random variable can also be returned).
discretization(data.df = NULL, data.dists = NULL,
               discretization.method = "sturges", nb.states = FALSE)
data.df
a data frame containing the data to discretize; binary and multinomial variables must be declared as factors, others as numeric vectors. The data frame must be named.
data.dists
a named list giving the distribution for each node in the network.
discretization.method
a character vector giving the discretization method to use; see details. If a number is provided, the variable is discretized by equal binning.
nb.states
logical variable to select the output. If set to TRUE, a list containing the discretized data frame and the table of the number of states of each discretized variable is returned.
fd
Freedman-Diaconis rule, where IQR() stands for the interquartile range. The number of bins is given by k = ceiling( (max(x) - min(x)) / (2 * IQR(x) * n^(-1/3)) ). The Freedman-Diaconis rule is known to be less sensitive to outliers than Scott's rule.
doane
Doane's rule. The number of bins is given by k = 1 + log2(n) + log2(1 + |g1| / sigma_g1), where g1 is the estimated skewness and sigma_g1 = sqrt( 6(n-2) / ((n+1)(n+3)) ). This is a modification of Sturges' formula, which attempts to improve its performance with non-normal data.
sqrt
The number of bins is given by k = sqrt(n).
cencov
Cencov's rule. The number of bins is given by k = n^(1/3).
rice
Rice's rule. The number of bins is given by k = 2 * n^(1/3).
terrell-scott
Terrell-Scott's rule. The number of bins is given by k = (2n)^(1/3). The Cencov, Rice, and Terrell-Scott rules are known to overestimate k compared to other rules, due to their simplicity.
sturges
Sturges' rule. The number of bins is given by k = 1 + log2(n).
scott
Scott's rule. The number of bins is given by k = (max(x) - min(x)) / (3.49 * sd(x) * n^(-1/3)).
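As a plain base-R sketch (not part of 'abn'), the bin counts suggested by some of the rules above for a sample of size n:
x <- rnorm(1000)
n <- length(x)
ceiling(1 + log2(n))                               # sturges
ceiling(sqrt(n))                                   # sqrt
ceiling(2 * n^(1/3))                               # rice
ceiling(diff(range(x)) / (2 * IQR(x) * n^(-1/3)))  # fd (bin width 2*IQR*n^(-1/3))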
The discretized data frame, or a list containing the discretized data frame and the table of counts for each bin of the discretized data frame.
Garcia, S., et al. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25.4, 734-750.
Cebeci, Z. and Yildiz, F. (2017). Unsupervised Discretization of Continuous Variables in a Chicken Egg Quality Traits Dataset. Turkish Journal of Agriculture-Food Science and Technology, 5.4, 315-320.
## Generate random variable
rv <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list("gaussian")
names(dist) <- c("rv")

## Compute the entropy through discretization
entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
                                         discretization.method = "sturges",
                                         nb.states = FALSE))
This function empirically estimates the Shannon entropy from a table of counts using the observed frequencies.
entropyData(freqs.table)
freqs.table
a table of counts.
The general concept of entropy is defined for probability distributions. The entropyData() function estimates empirical entropy from data. The probability is estimated from data using frequency tables. The estimates are then plugged into the definition of the entropy to return the so-called empirical entropy. A well-known problem of empirical entropy is that the estimates are biased due to sampling noise. The bias decreases as the sample size increases.
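A minimal base-R sketch of the plug-in estimator (standard definition, not abn's internal implementation):
empirical_entropy <- function(freqs.table) {
  p <- freqs.table / sum(freqs.table)  # observed frequencies -> probabilities
  p <- p[p > 0]                        # 0 * log(0) is treated as 0
  -sum(p * log(p))                     # natural-log scale
}
empirical_entropy(table(rbinom(n = 100, size = 1, prob = 0.3)))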
Shannon's entropy estimate on the natural logarithm scale.
Cover, Thomas M, and Joy A Thomas. (2012). "Elements of Information Theory". John Wiley & Sons.
## Generate random variable
rv <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list("gaussian")
names(dist) <- c("rv")

## Compute the entropy through discretization
entropyData(freqs.table = discretization(data.df = rv, data.dists = dist,
                                         discretization.method = "sturges",
                                         nb.states = FALSE))
Constructs different versions of the essential graph from a given DAG. Minimal PDAG: the only directed edges are those that participate in a v-structure. Completed PDAG: every directed edge corresponds to a compelled edge, and every undirected edge corresponds to a reversible edge.
essentialGraph(dag, node.names = NULL, PDAG = "minimal")
dag
a matrix or a formula statement (see 'Details' for format) defining the network structure, a directed acyclic graph (DAG).
node.names
a vector of names if the DAG is given via formula, see 'Details'.
PDAG
a character value that can be: minimal or complete, see 'Details'.
This function returns an essential graph from a DAG, also known as an acyclic partially directed graph (PDAG). This can be useful if the learning procedure is defined up to a Markov equivalence class. In a minimal PDAG, the only directed edges are those that participate in a v-structure, whereas in the completed PDAG every directed edge corresponds to a compelled edge and every undirected edge corresponds to a reversible edge.
The dag can be provided using a formula statement (similar to glm). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3. The formula statement has to start with ~. In this example, node1 has two parents (parent1 and parent2), while node2 and node3 share the same parent3. The parent names have to exactly match those given in node.names. : is the separator between either children or parents, | separates children (left side) and parents (right side), + separates terms, and . replaces all the variables in node.names.
A matrix giving the PDAG.
West, D. B. (2001). Introduction to Graph Theory. Vol. 2. Upper Saddle River: Prentice Hall.
Chickering, D. M. (2013). A Transformational Characterization of Equivalent Bayesian Network Structures. arXiv:1302.4938.
dag <- matrix(c(0,0,0, 1,0,0, 1,1,0), nrow = 3, ncol = 3)
dist <- list(a = "gaussian", b = "gaussian", c = "gaussian")
colnames(dag) <- rownames(dag) <- names(dist)
essentialGraph(dag)
See also the C implementation ?abn::expit_cpp().
expit(x)
x
a numeric vector.
numeric vector of same length as x.
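A minimal base-R sketch of the two transforms (standard definitions; abn ships C++ implementations expit_cpp() and logit_cpp()):
expit_r <- function(x) 1 / (1 + exp(-x))   # inverse of logit
logit_r <- function(p) log(p / (1 - p))
expit_r(0)           # 0.5
logit_r(expit_r(2))  # 2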
Transform x either via the logit or the expit function.
expit_cpp(x)
x
a numeric vector
a numeric vector
abnFit
Print family of objects of class abnFit
## S3 method for class 'abnFit'
family(object, ...)
object
Object of class abnFit.
...
additional parameters. Not used at the moment.
prints the distributions for each variable of the fitted model.
fitAbn
Allows the user to set restrictions in fitAbn for both the Bayesian and the MLE approach. Control function similar to build.control.
fit.control(method = "bayes", max.mode.error = 10, mean = 0, prec = 0.001,
            loggam.shape = 1, loggam.inv.scale = 5e-05, max.iters = 100,
            epsabs = 1e-07, error.verbose = FALSE, trace = 0L,
            epsabs.inner = 1e-06, max.iters.inner = 100,
            finite.step.size = 1e-07, hessian.params = c(1e-04, 0.01),
            max.iters.hessian = 10, max.hessian.error = 1e-04,
            factor.brent = 100, maxiters.hessian.brent = 10,
            num.intervals.brent = 100, min.pdf = 0.001, n.grid = 250,
            std.area = TRUE,
            marginal.quantiles = c(0.025, 0.25, 0.5, 0.75, 0.975),
            max.grid.iter = 1000, marginal.node = NULL,
            marginal.param = NULL, variate.vec = NULL, ncores = 1,
            cluster.type = "FORK", max.irls = 100, tol = 1e-11,
            tolPwrss = 1e-07, check.rankX = "message+drop.cols",
            check.scaleX = "message+rescale", check.conv.grad = "message",
            check.conv.singular = "message", check.conv.hess = "message",
            xtol_abs = 1e-06, ftol_abs = 1e-06, trace.mblogit = FALSE,
            catcov.mblogit = "free", epsilon = 1e-06, seed = 9062019L)
method
a character that takes one of two values: "bayes" or "mle". Overrides the method argument of fitAbn.
max.mode.error
if the estimated modes from INLA differ by a factor of max.mode.error/100 or more from those computed using the internal optimization routine, the internal results are used instead.
mean
the prior mean for all the Gaussian additive terms for each node. Also passed as the corresponding INLA prior argument.
prec
the prior precision (1/variance) for all the Gaussian additive terms for each node. Also passed as the corresponding INLA prior argument.
loggam.shape
the shape parameter in the Gamma distribution prior for the precision in a Gaussian node. Also passed as the corresponding INLA prior argument.
loggam.inv.scale
the inverse scale parameter in the Gamma distribution prior for the precision in a Gaussian node. Also passed as the corresponding INLA prior argument.
max.iters
total number of iterations allowed when estimating the modes in the Laplace approximation. Passed to the internal optimization routine.
epsabs
absolute error when estimating the modes in the Laplace approximation for models with no random effects. Passed to the internal optimization routine.
error.verbose
logical, additional output in the case of errors occurring in the optimization. Passed to the internal optimization routine.
trace
non-negative integer. If positive, tracing information on the progress of the "L-BFGS-B" optimization is produced. Higher values may produce more tracing information (there are six levels of tracing; to understand exactly what these do, see the source code). Passed to stats::optim.
epsabs.inner
absolute error in the maximization step in the (nested) Laplace approximation for each random effect term. Passed to the internal optimization routine.
max.iters.inner
total number of iterations in the maximization step in the nested Laplace approximation. Passed to the internal optimization routine.
finite.step.size
suggested step length used in the finite-difference estimation of the derivatives for the (outer) Laplace approximation when estimating modes. Passed to the internal optimization routine.
hessian.params
a numeric vector giving parameters for the adaptive algorithm that determines the optimal step size in the finite-difference estimation of the Hessian. The first entry is the initial guess, the second entry the absolute error. Passed to the internal optimization routine.
max.iters.hessian
integer, maximum number of iterations to use when determining an optimal finite-difference approximation (Nelder-Mead). Passed to the internal optimization routine.
max.hessian.error
if the estimated log marginal likelihood when using an adaptive 5-point finite-difference rule for the Hessian differs by more than max.hessian.error, the Brent-Dekker root-bracketing method is used instead.
factor.brent
if using the Brent-Dekker root-bracketing method, defines the outermost interval end points relative to the best estimate of the step size h, as h/factor.brent and h*factor.brent.
maxiters.hessian.brent
maximum number of iterations allowed in the Brent-Dekker method. Passed to the internal optimization routine.
num.intervals.brent
the number of initial different bracket segments to try in the Brent-Dekker method. Passed to the internal optimization routine.
min.pdf
the value of the posterior density function below which the estimation is stopped; only used when computing marginals, see details.
n.grid
recompute the density on an equally spaced grid with n.grid points.
std.area
logical; should the area under the estimated posterior density be standardized to exactly one? Useful for error checking.
marginal.quantiles
vector giving the quantiles at which to compute the posterior marginal distribution.
max.grid.iter
gives the maximum number of grid points used when estimating a posterior density without an explicitly specified grid; used to avoid excessively long computation.
marginal.node
used in conjunction with marginal.param to allow a bespoke estimate of a single marginal density; gives the node of interest.
marginal.param
used in conjunction with marginal.node; gives the parameter of interest within marginal.node.
variate.vec
a vector containing the values at which to evaluate the posterior marginal density; must be supplied if marginal.node is used.
ncores
the number of cores to parallelize over, see 'Details'. If > 0, the number of CPU cores to be used; -1 uses all available cores but one. Only for method = "mle".
cluster.type
the type of cluster to be used, see parallel::makeCluster.
max.irls
total number of iterations for estimating network scores using an Iteratively Reweighted Least Squares (IRLS) algorithm.
tol
real number giving the minimal tolerance expected to terminate the IRLS algorithm used to estimate a network score.
tolPwrss
numeric scalar passed to lme4::glmerControl; tolerance for declaring convergence in the penalized iteratively weighted residual sum-of-squares step.
check.rankX
character passed to lme4::glmerControl.
check.scaleX
character passed to lme4::glmerControl.
check.conv.grad
character passed to lme4::glmerControl.
check.conv.singular
character passed to lme4::glmerControl.
check.conv.hess
character passed to lme4::glmerControl.
xtol_abs
defaults to 1e-6; stop on small change of parameter value. Only for method = "mle".
ftol_abs
defaults to 1e-6; stop on small change in deviance. Similar to xtol_abs.
trace.mblogit
logical indicating if output should be produced for each iteration. Directly passed to mclogit::mblogit.
catcov.mblogit
defaults to "free", meaning that there are no restrictions on the covariances of random effects between the logit equations. Set to "diagonal" if random effects pertaining to different categories are uncorrelated, or to "single" if random-effect variances pertaining to all categories are identical.
epsilon
defaults to 1e-6; positive convergence tolerance.
seed
a non-negative integer which sets the seed in set.seed.
Parallelization over all children is possible via the function foreach of the package doParallel. ncores=0 or ncores=1 uses single-threaded foreach; ncores=-1 uses all available cores but one.
A list of control parameters for the fitAbn function.
Other fitAbn:
fitAbn()
ctrlmle <- abn::fit.control(method = "mle", max.irls = 100, tol = 10^-11,
                            tolPwrss = 1e-7, xtol_abs = 1e-6, ftol_abs = 1e-6,
                            epsilon = 1e-6, ncores = 2,
                            cluster.type = "PSOCK", seed = 9062019L)
ctrlbayes <- abn::fit.control(method = "bayes", mean = 0, prec = 0.001,
                              loggam.shape = 1, loggam.inv.scale = 5e-05,
                              max.mode.error = 10, max.iters = 100,
                              epsabs = 1e-07, error.verbose = FALSE,
                              epsabs.inner = 1e-06, max.iters.inner = 100,
                              finite.step.size = 1e-07,
                              hessian.params = c(1e-04, 0.01),
                              max.iters.hessian = 10,
                              max.hessian.error = 1e-04, factor.brent = 100,
                              maxiters.hessian.brent = 10,
                              num.intervals.brent = 100, min.pdf = 0.001,
                              n.grid = 100, std.area = TRUE,
                              marginal.quantiles = c(0.025, 0.25, 0.5, 0.75, 0.975),
                              max.grid.iter = 1000, marginal.node = NULL,
                              marginal.param = NULL, variate.vec = NULL,
                              ncores = 1, cluster.type = NULL,
                              seed = 9062019L)
Extract Standard Deviations from all Gaussian Nodes
getMSEfromModes(modes, dists)
modes
list of modes.
dists
list of distributions.
A named numeric vector. Names correspond to the node names; values are the standard deviations.
This function returns standard metrics for DAG description: a list that contains the number of nodes, the number of arcs, the average Markov blanket size, the average neighborhood set size, the average parent set size, and the average children set size.
infoDag(object, node.names = NULL)
object
an object of class abnDag, or a matrix or formula statement defining the DAG, see details.
node.names
a vector of names if the DAG is given via formula, see details.
This function returns a named list with the following entries: the number of nodes, the number of arcs, the average Markov blanket size, the average neighborhood set size, the average parent set size, and the average children set size.
The dag can be provided using a formula statement (similar to glm). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3. The formula statement has to start with ~. In this example, node1 has two parents (parent1 and parent2), while node2 and node3 share the same parent3. The parent names have to exactly match those given in node.names. : is the separator between either children or parents, | separates children (left side) and parents (right side), + separates terms, and . replaces all the variables in node.names.
A named list that contains the following entries: the number of nodes, the number of arcs, the average Markov blanket size, the average neighborhood set size, the average parent set size, and the average children set size.
West, D. B. (2001). Introduction to graph theory. Vol. 2. Upper Saddle River: Prentice Hall.
## Creating a dag:
dag <- matrix(c(0,0,0,0, 1,0,0,0, 1,1,0,1, 0,1,0,0), nrow = 4, ncol = 4)
dist <- list(a = "gaussian", b = "gaussian", c = "gaussian", d = "gaussian")
colnames(dag) <- rownames(dag) <- names(dist)
infoDag(dag)
plot(createAbnDag(dag = dag, data.dists = dist))
A flexible implementation of multiple proxies for strength measures, useful for visualizing the edge connections in a Bayesian network learned from observational data.
linkStrength(dag, data.df = NULL, data.dists = NULL,
             method = c("mi.raw", "mi.raw.pc", "mi.corr", "ls", "ls.pc", "stat.dist"),
             discretization.method = "doane")
dag
a matrix or a formula statement (see details for format) defining the network structure, a directed acyclic graph (DAG). Note that row names must be set or given in data.dists.
data.df
a data frame containing the data used for learning each node; binary variables must be declared as factors.
data.dists
a named list giving the distribution for each node in the network, see 'Details'.
method
the method to be used. See 'Details'.
discretization.method
a character vector giving the discretization method to use. See discretization.
This function returns multiple proxies for estimating the connection strength of the edges of a possibly discretized Bayesian network's data set. The returned connection strength measures are: the Raw Mutual Information (mi.raw), the Percentage Mutual Information (mi.raw.pc), the Raw Mutual Information computed via correlation (mi.corr), the Link Strength (ls), the Percentage Link Strength (ls.pc) and the Statistical Distance (stat.dist).
The general concept of entropy is defined for probability distributions. The probability is estimated from data using frequency tables. The estimates are then plugged into the definition of the entropy to return the so-called empirical entropy. A well-known problem of empirical entropy is that the estimates are biased due to sampling noise; the bias decreases as the sample size increases. The mutual information estimate is computed from the observed frequencies through a plug-in estimator based on entropy. Consider an arc going from node X to node Y, and denote the remaining set of parents of Y as Z.
The mutual information is defined as I(X, Y) = H(X) + H(Y) - H(X, Y), where H() is the entropy.
The Percentage Mutual information is defined as PI(X,Y) = I(X,Y)/H(Y|Z).
The Mutual Information computed via correlation is defined as MI(X,Y) = -0.5 log(1-cor(X,Y)^2).
The link strength is defined as LS(X->Y) = H(Y|Z)-H(Y|X,Z).
The percentage link strength is defined as PLS(X->Y) = LS(X->Y) / H(Y|Z).
The statistical distance is defined as SD(X,Y) = 1- MI(X,Y) / max(H(X),H(Y)).
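As a plain base-R sketch of the correlation-based estimate above (mi.corr), for two jointly Gaussian variables:
set.seed(1)
x <- rnorm(1000)
y <- 0.8 * x + rnorm(1000, sd = 0.6)
-0.5 * log(1 - cor(x, y)^2)   # MI(X, Y) via correlation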
The function returns a named matrix with the requested metric.
Boerlage, B. (1992). Link strength in Bayesian networks. Diss. University of British Columbia.
Ebert-Uphoff, I. (2009). Tutorial on how to measure link strengths in discrete Bayesian networks.
# Gaussian
N <- 1000
mydists <- list(a = "gaussian", b = "gaussian", c = "gaussian")
a <- rnorm(n = N, mean = 0, sd = 1)
b <- 1 + 2 * rnorm(n = N, mean = 5, sd = 1)
c <- 2 + 1 * a + 2 * b + rnorm(n = N, mean = 2, sd = 1)
mydf <- data.frame("a" = a, "b" = b, "c" = c)
mycache.mle <- buildScoreCache(data.df = mydf, data.dists = mydists,
                               method = "mle", max.parents = 2)
mydag.mp <- mostProbable(score.cache = mycache.mle, verbose = FALSE)
linkstr <- linkStrength(dag = mydag.mp$dag, data.df = mydf,
                        data.dists = mydists, method = "ls",
                        discretization.method = "sturges")
See also the C implementation ?abn::logit_cpp().
logit(x)
x
numeric vector with values between 0 and 1.
numeric vector of same length as x.
Transform x either via the logit or the expit function.
logit_cpp(x)
x
a numeric vector
a numeric vector
abnFit
Print logLik of objects of class abnFit
## S3 method for class 'abnFit'
logLik(object, digits = 3L, verbose = TRUE, ...)
object
Object of class abnFit.
digits
number of digits of the results.
verbose
print additional output.
...
additional parameters. Not used at the moment.
prints the logLik of the fitted model.
This function computes the Markov blanket of a set of nodes given a DAG (Directed Acyclic Graph).
mb(dag, node, data.dists = NULL)
dag
a matrix or a formula statement (see details for format) defining the network structure, a directed acyclic graph (DAG).
node
a character vector of the nodes for which the Markov Blanket should be returned.
data.dists
a named list giving the distribution for each node in the network, see details.
This function returns the Markov Blanket of a set of nodes given a DAG.
The dag can be provided using a formula statement (similar to glm). A typical formula is ~ node1|parent1:parent2 + node2:node3|parent3. The formula statement has to start with ~. In this example, node1 has two parents (parent1 and parent2), while node2 and node3 share the same parent3. The parent names have to exactly match the node names. : is the separator between either children or parents, | separates children (left side) and parents (right side), + separates terms, and . replaces all the variables.
character vector of node names from the Markov blanket.
## Defining distribution and dag
dist <- list(a = "gaussian", b = "gaussian", c = "gaussian",
             d = "gaussian", e = "binomial", f = "binomial")
dag <- matrix(c(0,1,1,0,1,0, 0,0,1,1,0,1, 0,0,0,0,0,0,
                0,0,0,0,0,0, 0,0,0,0,0,1, 0,0,0,0,0,0),
              nrow = 6L, ncol = 6L, byrow = TRUE)
colnames(dag) <- rownames(dag) <- names(dist)
mb(dag, node = "b")
mb(dag, node = c("b", "e"))
mb(~a|b:c:e + b|c:d:f + e|f, node = "e", data.dists = dist)
This function empirically estimates the Mutual Information from a table of counts using the observed frequencies.
miData(freqs.table, method = c("mi.raw", "mi.raw.pc"))
freqs.table
a table of counts.
method
a character determining if the Mutual Information should be normalized.
The mutual information estimate is computed from the observed frequencies through a plug-in estimator based on entropy: I(X, Y) = H(X) + H(Y) - H(X, Y), where H() is the entropy computed with entropyData.
Mutual information estimate.
Cover, Thomas M, and Joy A Thomas. (2012). "Elements of Information Theory". John Wiley & Sons.
## Generate random variables
Y <- rnorm(n = 100, mean = 0, sd = 2)
X <- rnorm(n = 100, mean = 5, sd = 2)
dist <- list(Y = "gaussian", X = "gaussian")
miData(discretization(data.df = cbind(X, Y), data.dists = dist,
                      discretization.method = "fd", nb.states = FALSE),
       method = "mi.raw")
Convert modes to fitAbn.mle$coefs structure
modes2coefs(modes)
modes
list of modes.
list of matrix arrays.
Find the most probable DAG structure using the exact order-based approach of Koivisto and Sood (2004).
mostProbable(score.cache, score = "bic", prior.choice = 1, verbose = TRUE, ...)
score.cache
object of class abnCache, as returned by buildScoreCache.
score
which score should be used to score the network. Possible choices are "bic", "aic", "mdl" or "mlik".
prior.choice
an integer, 1 or 2, where 1 is a uniform structural prior and 2 uses a weighted prior, see details.
verbose
if TRUE then provides some additional output.
...
further arguments passed to or from other methods.
The procedure runs the exact order-based structure discovery approach of Koivisto and Sood (2004) to find the most probable posterior network (DAG). The local score is the node cache, as created using buildScoreCache (or manually provided, as long as the same format is used). Note that the scope of this search is given by the options used when creating the local score, for example, by restricting the number of parents or via the ban or retain constraints given there.
This routine can take a long time to complete and is highly sensitive to the number of nodes in the network. It is recommended to use this on a reduced data set to get an idea as to the computational practicality of this approach. In particular, memory usage can quickly increase to beyond what may be available. For additive models, problems comprising up to 20 nodes are feasible on most machines. Memory requirements can increase considerably after this, but then so does the run time, making this less practical. It is recommended that some form of over-modeling adjustment is performed on the resulting DAG (unless dealing with vast numbers of observations), for example, using parametric bootstrapping, which is straightforward to implement in MCMC engines such as JAGS or WinBUGS. See the case studies at https://r-bayesian-networks.org/ or the files provided in the package directories inst/bootstrapping_example and inst/old_vignette for details.
The parameter prior.choice determines the prior used within each node for a given choice of parent combination. In Koivisto and Sood (2004), p.554, a form of prior is used which assumes that the prior probabilities for parent combinations comprising the same number of parents are all equal. Specifically, the prior probability for a parent set G with cardinality |G| is proportional to 1/[n-1 choose |G|], where there are n total nodes. Note that this favors parent combinations with either very low or very high cardinality, which may not be appropriate. This prior is used when prior.choice=2. When prior.choice=1, an uninformative prior is used where parent combinations of all cardinalities are equally likely. The latter is equivalent to the structural prior used in the heuristic searches, e.g., searchHillclimber or searchHeuristic. A small numeric sketch of this prior follows.
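A small base-R sketch of the structural prior used when prior.choice=2, for a hypothetical network of n = 8 nodes:
n <- 8
prior_ks <- function(cardG) 1 / choose(n - 1, cardG)  # proportional prior weight
prior_ks(0:7)  # largest for very small or very large parent sets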
Note that the network score (log marginal likelihood) of the most probable DAG is not returned, as it can easily be computed using fitAbn; see the examples below.
An object of class abnMostprobable, which is a list containing: a matrix giving the DAG definition of the most probable posterior structure, the cache of pre-computed scores, and the score used for selection.
Koivisto, M. V. (2004). Exact Structure Discovery in Bayesian Networks, Journal of Machine Learning Research, vol 5, 549-573.
## Not run:
##############################
## Example 1
##############################
## This data comes with 'abn' see ?ex1.dag.data
mydat <- ex1.dag.data[1:5000, c(1:7, 10)]

## Setup distribution list for each node:
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", g3 = "gaussian")

## Parent limits, for speed purposes quite specific here:
max_par <- list("b1" = 0, "p1" = 0, "g1" = 1, "b2" = 1,
                "p2" = 2, "b3" = 3, "g2" = 3, "g3" = 2)

## Now build cache (no constraints in ban nor retain)
mycache <- buildScoreCache(data.df = mydat, data.dists = mydists,
                           max.parents = max_par)

## Find the globally best DAG:
mp_dag <- mostProbable(score.cache = mycache)
myres <- fitAbn(object = mp_dag, create.graph = TRUE)
plot(myres) # plot the best model

## Fit the known true DAG (up to variables 'b4' and 'b5'):
true_dag <- matrix(data = 0, ncol = 8, nrow = 8)
colnames(true_dag) <- rownames(true_dag) <- names(mydists)
true_dag["p2", c("b1", "p1")] <- 1
true_dag["b3", c("b1", "g1", "b2")] <- 1
true_dag["g2", c("p1", "g1", "b2")] <- 1
true_dag["g3", c("g1", "b2")] <- 1
fitAbn(dag = true_dag, data.df = mydat, data.dists = mydists)$mlik

#################################################################
## Example 2 - models with random effects
#################################################################
## This data comes with abn see ?ex3.dag.data
mydat <- ex3.dag.data[, c(1:4, 14)]
mydists <- list(b1 = "binomial", b2 = "binomial",
                b3 = "binomial", b4 = "binomial")

## This takes a few seconds and requires INLA:
mycache_mixed <- buildScoreCache(data.df = mydat, data.dists = mydists,
                                 group.var = "group", max.parents = 2)

## Find the most probable DAG:
mp_dag <- mostProbable(score.cache = mycache_mixed)
## and get goodness of fit:
fitAbn(object = mp_dag, group.var = "group")$mlik

## End(Not run)
abnFit
Print number of observations of objects of class abnFit
## S3 method for class 'abnFit'
nobs(object, ...)
object
Object of class abnFit.
...
additional parameters. Not used at the moment.
prints the number of observations of the fitted model.
Probability to odds
odds(x)
x
numeric vector of probabilities with values between 0 and 1.
numeric vector of same length as x.
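A minimal base-R sketch of the transform (standard definition):
odds_r <- function(p) p / (1 - p)
odds_r(c(0.25, 0.5, 0.75))   # 1/3, 1, 3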
Compute the odds ratio from a contingency table or a matrix.
or(x)
x
a 2x2 table or matrix.
A real value.
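A minimal base-R sketch of the cross-product odds ratio (standard definition; a hypothetical 2x2 table with rows = exposure and columns = outcome):
x <- matrix(c(20, 10, 5, 15), nrow = 2)
(x[1, 1] * x[2, 2]) / (x[1, 2] * x[2, 1])   # odds ratio = 6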
abnDag
Plots DAG from an object of class abnDag
## S3 method for class 'abnDag'
plot(x, ...)
x
Object of class abnDag.
...
additional parameters. Not used at the moment.
Rgraphviz::plot
mydag <- createAbnDag(dag = ~a + b|a, data.df = data.frame("a" = 1, "b" = 1),
                      data.dists = list(a = "binomial", b = "gaussian"))
plot(mydag)
abnFit
Plot objects of class abnFit
## S3 method for class 'abnFit'
plot(x, ...)
x
Object of class abnFit.
...
additional parameters. Not used at the moment.
a plot of the fitted model.
abnHeuristic
Plot objects of class abnHeuristic
## S3 method for class 'abnHeuristic'
plot(x, ...)
x
Object of class abnHeuristic.
...
additional parameters. Not used at the moment.
plot of the scores of the heuristic search.
abnHillClimber
Plot objects of class abnHillClimber
## S3 method for class 'abnHillClimber'
plot(x, ...)
x
Object of class abnHillClimber.
...
additional parameters. Not used at the moment.
plot of the consensus DAG.
abnMostprobable
Plot objects of class abnMostprobable
## S3 method for class 'abnMostprobable'
plot(x, ...)
x
Object of class abnMostprobable.
...
additional parameters. Not used at the moment.
plot of the most probable consensus DAG.
abnCache
Print objects of class abnCache
## S3 method for class 'abnCache'
print(x, digits = 3, ...)
x
Object of class abnCache.
digits
number of digits of the results.
...
additional parameters. Not used at the moment.
summary statement of the class of abnCache.
## Subset of the built-in dataset, see ?ex0.dag.data
mydat <- ex0.dag.data[, c("b1","b2","g1","g2","b3","g3")] ## take a subset of cols

## Setup distribution list for each node
mydists <- list(b1 = "binomial", b2 = "binomial", g1 = "gaussian",
                g2 = "gaussian", b3 = "binomial", g3 = "gaussian")

## Structural constraints:
## ban arc from b2 to b1; always retain arc from g2 to g1

## Parent limits
max.par <- list("b1" = 2, "b2" = 2, "g1" = 2, "g2" = 2, "b3" = 2, "g3" = 2)

## Now build the cache of pre-computed scores according to the structural constraints
if (requireNamespace("INLA", quietly = TRUE)) {
  # Run only if INLA is available
  res.c <- buildScoreCache(data.df = mydat, data.dists = mydists,
                           dag.banned = ~b1|b2, dag.retained = ~g1|g2,
                           max.parents = max.par)
  print(res.c)
}
abnDag
Print objects of class abnDag
## S3 method for class 'abnDag'
print(x, digits = 3L, ...)
x
Object of class abnDag.
digits
number of digits of the adjacency matrix.
...
additional parameters. Not used at the moment.
outputs the adjacency matrix and a statement of the class of x.
mydag <- createAbnDag(dag = ~a + b|a, data.df = data.frame("a" = 1, "b" = 1))
print(mydag)
abnFit
Print objects of class abnFit
## S3 method for class 'abnFit'
print(x, digits = 3L, ...)
x
Object of class abnFit.
digits
number of digits of the results.
...
additional parameters. Not used at the moment.
prints the parameters of the fitted model.
abnHeuristic
Print objects of class abnHeuristic
## S3 method for class 'abnHeuristic'
print(x, digits = 2L, ...)
x
Object of class abnHeuristic.
digits
number of digits of the results.
...
additional parameters. Not used at the moment.
prints the best score found and the distribution of the scores.
abnHillClimber
Print objects of class abnHillClimber
## S3 method for class 'abnHillClimber'
print(x, digits = 3L, ...)
x
Object of class abnHillClimber.
digits
number of digits of the results.
...
additional parameters. Not used at the moment.
prints the consensus DAG and the class of the object.
abnMostprobable
Print objects of class abnMostprobable
## S3 method for class 'abnMostprobable'
print(x, digits = 3L, ...)
x
Object of class abnMostprobable.
digits
number of digits of the results.
...
additional parameters. Not used at the moment.
prints the most probable consensus DAG.
This function computes the score's contribution of each observation to the total network score.
scoreContribution(object = NULL, dag = NULL, data.df = NULL,
                  data.dists = NULL, verbose = FALSE)
object
an object of class abnLearned, as produced by mostProbable, searchHeuristic, or searchHillClimber.
dag
a matrix or a formula statement (see details) defining the network structure, a directed acyclic graph (DAG); see details for format. Note that colnames and rownames must be set.
data.df
a data frame containing the data used for learning the network; binary variables must be declared as factors, and no missing values are allowed in any variable.
data.dists
a named list giving the distribution for each node in the network, see details.
verbose
if TRUE, extra output is printed.
This function computes the score contribution of each observation to the total network score. It is available only in the mle setting. To do so, the glm and predict functions are used. This function is an attempt to provide diagnostics for an ABN analysis; a small sketch of the per-observation contribution follows.
A named list that contains the score contributions: maximum likelihood, aic, bic, mdl and the diagonal values of the hat matrix.
## Not run:
## Use a subset of a built-in simulated data set
mydat <- ex1.dag.data[, c("b1", "g1", "p1")]

## setup distribution list for each node
mydists <- list(b1 = "binomial", g1 = "gaussian", p1 = "poisson")

## now build cache
mycache <- buildScoreCache(data.df = mydat, data.dists = mydists,
                           max.parents = 2, method = "mle")

## Find the globally best DAG
mp.dag <- mostProbable(score.cache = mycache, score = "bic", verbose = FALSE)
out <- scoreContribution(object = mp.dag)

## Observation contributions per network node
boxplot(out$bic)
## End(Not run)
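The hat-matrix diagonals in the returned list can be used to flag high-leverage observations; a minimal follow-on sketch (the element name out$hat and the column name "g1" are assumptions for illustration, so inspect names(out) on your own object first):

## Not run:
## Flag high-leverage observations for one node.
## NOTE: 'hat' and the column name "g1" are hypothetical names;
## run names(out) to see what is actually returned.
hat.g1 <- out$hat[, "g1"]
which(hat.g1 > 2 * mean(hat.g1))  # common 2x-mean leverage rule of thumb
## End(Not run)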
A flexible implementation of multiple greedy search algorithms to find high-scoring network structures (DAGs)
searchHeuristic(score.cache, score = "mlik", num.searches = 1, seed = 42L,
                start.dag = NULL, max.steps = 100, algo = "hc",
                tabu.memory = 10, temperature = 0.9, verbose = FALSE, ...)
score.cache | output from buildScoreCache(). |
score | which score should be used to score the network. Possible choices are mlik, aic, bic or mdl. |
num.searches | a positive integer giving the number of different searches to run, see details. |
seed | a non-negative integer which sets the seed. |
start.dag | a DAG given as a matrix, see details for format, which can be used to explicitly provide a starting point for the structural search. |
max.steps | a constant giving the number of search steps per search, see details. |
algo | which heuristic algorithm should be used. Possible choices are: hc, tabu or sa. |
tabu.memory | a non-negative integer setting the length of the tabu list. |
temperature | a real number giving the update in temperature for the sa (simulated annealing) algorithm. |
verbose | if TRUE then provides some additional output. |
... | further arguments passed to or from other methods. |
This function is a flexible implementation of multiple greedy heuristic algorithms, particularly well adapted to the abn framework. It targets multi-random-restart heuristic searches: the user can set the number of searches with num.searches and the maximum number of steps within each search with max.steps. The optimization algorithm within each search is relatively rudimentary.

The function searchHeuristic differs from searchHillClimber in that it is written fully in R, whereas searchHillClimber is written in C and is thus expected to be faster. At each hill-climbing step, searchHillClimber considers all the information in the pre-computed score cache, while searchHeuristic performs a local stepwise optimization: it chooses a structural move (or edge move) and computes the resulting change in score. In this respect it is closer to the MCMCMC algorithm of Madigan and York (1995) and Giudici and Castelo (2003) with a single edge move.
If the user selects start.dag = "random", a random valid DAG is generated as the starting point. The routine used favours low-density structures.
The function implements three algorithms, selected with the parameter algo: hc, tabu or sa.

If algo = "hc": The hill-climber algorithm (hc) is a single-move algorithm. At each hill-climbing step within a search, an attempt is made to add an arc; the new score is computed and compared to the previous network's score.

If algo = "tabu": The same algorithm as hc is used, but a list of banned moves is maintained, whose length is controlled by the parameter tabu.memory. The idea is that the classical hill-climber is inefficient when it has to cross low-probability regions to escape a local maximum and reach a higher-scoring peak; forcing the algorithm to choose moves it has not recently used helps it escape local maxima.

If algo = "sa": This variant of the heuristic search is based on simulated annealing, as described by Metropolis et al. (1953). Some accepted moves may decrease the network score. The acceptance rate can be monitored with the parameter temperature.
An object of class abnHeuristic (which extends the class abnLearned) containing a list with entries:
a list of DAGs
a vector giving the network score of the locally optimal network for each search
a vector giving the evolution of the network score across all the random restarts
a number giving the network score of the locally optimal network
the pre-computed cache of scores
a numeric giving the number of random restarts
a numeric giving the maximal number of optimization steps within each search
a character indicating the algorithm used
Heckerman, D., Geiger, D. and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197-243.
Madigan, D. and York, J. (1995). Bayesian graphical models for discrete data. International Statistical Review, 63:215-232.
Giudici, P. and Castelo, R. (2003). Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50:127-158.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087-1092.
## Not run:
##############################################
## example: use built-in simulated data set
##############################################
mydat <- ex1.dag.data ## this data comes with abn, see ?ex1.dag.data

## setup distribution list for each node
mydists <- list(b1 = "binomial", p1 = "poisson", g1 = "gaussian",
                b2 = "binomial", p2 = "poisson", b3 = "binomial",
                g2 = "gaussian", b4 = "binomial", b5 = "binomial",
                g3 = "gaussian")

mycache <- buildScoreCache(data.df = mydat, data.dists = mydists,
                           max.parents = 2)

## Now perform 10 greedy searches
heur.res <- searchHeuristic(score.cache = mycache, data.dists = mydists,
                            start.dag = "random", num.searches = 10,
                            max.steps = 50)

## Plot (one) dag
plotAbn(heur.res$dags[[1]], data.dists = mydists)
## End(Not run)
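The example above uses the default hill-climber; under the same cache, the tabu and simulated-annealing variants described in the details section are selected via algo. A minimal sketch reusing mycache and mydists from the example above (the 'scores' element name in the result is an assumption; inspect str(heur.res)):

## Not run:
## Tabu search: ban the last 10 moves to help escape local maxima
heur.tabu <- searchHeuristic(score.cache = mycache, data.dists = mydists,
                             num.searches = 10, max.steps = 50,
                             algo = "tabu", tabu.memory = 10)

## Simulated annealing: occasionally accept score-decreasing moves
heur.sa <- searchHeuristic(score.cache = mycache, data.dists = mydists,
                           num.searches = 10, max.steps = 50,
                           algo = "sa", temperature = 0.9)

## Compare the best locally optimal score found by each strategy
## (the 'scores' element name is an assumption; see str(heur.res))
sapply(list(hc = heur.res, tabu = heur.tabu, sa = heur.sa),
       function(x) max(unlist(x$scores)))
## End(Not run)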
Find high-scoring network (DAG) structures using a random-restart greedy hill-climber heuristic search.
searchHillClimber(score.cache, score = "mlik", num.searches = 1, seed = 42,
                  start.dag = NULL, support.threshold = 0.5, timing.on = TRUE,
                  dag.retained = NULL, verbose = FALSE, ...)
score.cache | output from buildScoreCache(). |
score | character giving which network score should be used to select the structure. Currently only mlik is supported. |
num.searches | number of times to run the search. |
seed | non-negative integer which sets the seed in the GSL random number generator. |
start.dag | a DAG given as a matrix, see details for format, which can be used to provide a starting point for the structural search explicitly. |
support.threshold | the proportion of search results - each locally optimal DAG - in which each arc must appear to be a part of the consensus network. |
timing.on | extra output in terms of duration computation. |
dag.retained | a DAG given as a matrix, see details for format. This is necessary if the score.cache was created using an explicit retain matrix, and the same retain matrix should be used here. dag.retained is used by the algorithm which generates the initial random DAG to ensure that the necessary arcs are retained. |
verbose | extra output. |
... | further arguments passed to or from other methods. |
The procedure runs a greedy hill-climbing search similar, but not identical, to the method initially presented in Heckerman et al. (1995) (Machine Learning, 20, 197-243). Each search begins with a randomly chosen DAG structure, constructed in such a way as to attempt to choose a DAG uniformly from the vast landscape of possible structures. The algorithm used is as follows: given a node cache (from buildScoreCache), a random combination is chosen for each node from the set of all allowed local parent combinations. These are then combined into a full DAG, which is checked for cycles, where this check iterates over the nodes in a random order. If all nodes have at least one child (i.e., at least one cycle is present), then the first node examined has all its children removed, and the check for cycles is repeated. If this has removed the only cycle present, this DAG is used as the starting point for the search; if not, a second node is chosen (randomly) and the process is repeated until a DAG is obtained.
The actual hill-climbing algorithm itself differs slightly from that presented in Heckerman et al., as a full cache of all possible local combinations is available. At each hill-climbing step, everything in the node cache is considered, in other words, all possible single swaps between members of the cache currently present in the DAG and those in the full cache. The single swap that provides the greatest increase in goodness of fit is chosen. A single swap here is the removal or addition of any one node-parent combination present in the cache while avoiding a cycle.

This means that, as well as all single arc changes (addition or removal), multiple arc changes are also considered at each step; note, however, that arc reversal in this scheme takes two steps (as it requires first the removal of a parent arc from one node and then the addition of a parent arc to a different node). The original algorithm perturbed the current DAG by only a single arc at each step but also included arc reversal. The current implementation may not be any more efficient than the original but is arguably more natural given a pre-computed cache of local scores.
A start DAG may be provided, in which case num.searches must equal 1; this option really just provides a local search around a previously identified optimal DAG.
This function is designed for two different purposes: i) interactive visualization; and ii) longer batch processing. The former provides easy visual "eyeballing" of data in terms of its majority consensus network (or similar threshold), a graphical structure comprising all arcs which feature in a given proportion (support.threshold) of the locally optimal DAGs identified during the run. The general hope is that this structure will stabilize (become fixed) relatively quickly, at least for problems with smaller numbers of nodes.
A list with entries:
a vector giving the network score of the initial network from which each search commenced
a vector giving the network score of each locally optimal network
a list of matrices, the initial DAGs
a list of matrices, the locally optimal DAGs
a matrix holding a binary graph, not necessarily a DAG!
the support threshold used to create the consensus matrix
Lewis, F. I. and McCormick, B. J. J. (2012). Revealing the complexity of health determinants in resource-poor settings. American Journal of Epidemiology. DOI:10.1093/aje/kws183.
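No worked example accompanies this page; the following minimal sketch follows the same pattern as the searchHeuristic example above (default Bayesian cache, so the mlik score applies):

## Not run:
## 10 random-restart hill-climber searches over a pre-computed cache,
## keeping arcs that appear in at least half of the locally optimal DAGs.
mydat <- ex1.dag.data[, c("b1", "g1", "p1")]
mydists <- list(b1 = "binomial", g1 = "gaussian", p1 = "poisson")
mycache <- buildScoreCache(data.df = mydat, data.dists = mydists,
                           max.parents = 2)
hc.res <- searchHillClimber(score.cache = mycache, num.searches = 10,
                            support.threshold = 0.5, verbose = FALSE)
## End(Not run)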
Simulate data from a fitted additive Bayesian network.
simulateAbn(object = NULL, run.simulation = TRUE, bugsfile = NULL,
            n.chains = 10L, n.adapt = 1000L, n.thin = 100L, n.iter = 10000L,
            seed = 42L, verbose = FALSE, debug = FALSE)
object | an object of class abnFit, i.e. the output of fitAbn(). |
run.simulation | call JAGS to simulate data (default is TRUE). |
bugsfile | a path to a valid file, or NULL (the default). |
n.chains | number of parallel chains for the model. |
n.adapt | number of iterations for adaptation. If n.adapt = 0, no adaptation takes place. |
n.thin | thinning interval for monitors. |
n.iter | number of iterations to monitor. |
seed | seed for the random number generator; by default set to 42. |
verbose | if TRUE, prints additional output. |
debug | if TRUE, prints the BUGS file content to stdout and does not run the simulation. |
A data.frame containing the simulated data.
df <- FCV[, c(12:15)]
mydists <- list(Outdoor = "binomial", Sex = "multinomial",
                GroupSize = "poisson", Age = "gaussian")

## buildScoreCache() -> mostProbable() -> fitAbn()
suppressWarnings({
  mycache.mle <- buildScoreCache(data.df = df, data.dists = mydists,
                                 method = "mle", adj.vars = NULL,
                                 cor.vars = NULL, dag.banned = NULL,
                                 dag.retained = NULL, max.parents = 1,
                                 which.nodes = NULL, defn.res = NULL)
}) # ignore non-convergence warnings
mp.dag.mle <- mostProbable(score.cache = mycache.mle, verbose = FALSE)
myres.mle <- fitAbn(object = mp.dag.mle, method = "mle")

myres.sim <- simulateAbn(object = myres.mle, run.simulation = TRUE,
                         bugsfile = NULL, verbose = FALSE)
str(myres.sim)

## compare marginal distributions of simulated vs. observed data
prop.table(table(myres.sim$Outdoor))
prop.table(table(df$Outdoor))
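To inspect the generated BUGS model without actually running JAGS, the same call can be issued with debug = TRUE (per the argument description above, this prints the model file to stdout and skips the simulation):

## Not run:
## Print the generated BUGS model file instead of simulating
simulateAbn(object = myres.mle, debug = TRUE)
## End(Not run)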
Provided with node names, returns an abnDag. Arc density refers to the chance of a node being connected to a node before it.
simulateDag(node.name, data.dists = NULL, edge.density = 0.5, verbose = FALSE)
node.name | a character vector giving the names of the nodes; its length determines the size of the simulated DAG. |
data.dists | a named list corresponding to node.name, giving the distribution for each node. |
edge.density | a number in [0, 1] giving the probability that a node is connected to a preceding node. |
verbose | print more information on the run. |
This function generates DAGs by sampling triangular matrices and then reordering rows and columns randomly. The network density (edge.density) is used column-wise as the binomial sampling probability. The matrix is then named using the user-provided names.
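The triangular construction guarantees acyclicity: a strictly lower-triangular adjacency matrix always admits a topological ordering, and jointly permuting rows and columns preserves that property. A minimal sketch of the idea (illustrative only, not the package's internal code):

## Sketch: sample a strictly lower-triangular adjacency matrix, then
## jointly permute rows and columns; the result is always a DAG.
set.seed(42)
n <- 4
m <- matrix(0, n, n)
m[lower.tri(m)] <- rbinom(sum(lower.tri(m)), size = 1, prob = 0.5)
perm <- sample(n)                    # random relabelling of the nodes
adj <- m[perm, perm]                 # still acyclic after permutation
dimnames(adj) <- list(letters[1:n], letters[1:n])
adj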
An object of class abnDag consisting of a named matrix, a named list giving the distribution for each node, and an empty element for the data.
simdag <- simulateDag(node.name = c("a", "b", "c", "d"),
                      edge.density = 0.5,
                      data.dists = list(a = "gaussian", b = "binomial",
                                        c = "poisson", d = "multinomial"))

## Example using the Ozone entries:
dist <- list(Ozone = "gaussian", Solar.R = "gaussian", Wind = "gaussian",
             Temp = "gaussian", Month = "gaussian", Day = "gaussian")
out <- simulateDag(node.name = names(dist), data.dists = dist,
                   edge.density = 0.8)
plot(out)
Computes skewness of a distribution
skewness(x)
x | a numeric vector. |
a numeric value giving the skewness of x.
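For reference, one common definition of sample skewness is the third standardised central moment; the sketch below is illustrative only, since the exact estimator (and any bias correction) used internally by abn is not stated on this page:

## Illustrative sample skewness: third central moment over the cube of
## the standard deviation; abn's estimator may differ in bias correction.
skew_sketch <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3/2)
}
skew_sketch(rexp(1000))  # positive for a right-skewed distribution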
abnDag
Prints summary statistics from an object of class abnDag
## S3 method for class 'abnDag'
summary(object, ...)
object | an object of class abnDag. |
... | additional parameters. Not used at the moment. |
List with summary statistics of the DAG.
mydag <- createAbnDag(dag = ~a + b|a, data.df = data.frame("a" = 1, "b" = 1))
summary(mydag)
abnFit
Print summary of objects of class abnFit
## S3 method for class 'abnFit'
summary(object, digits = 3L, ...)
object | Object of class abnFit. |
digits | number of digits of the results. |
... | additional parameters. Not used at the moment. |
prints summary statistics of the fitted model.
abnMostprobable
Print summary of objects of class abnMostprobable
## S3 method for class 'abnMostprobable'
summary(object, ...)
object | Object of class abnMostprobable. |
... | additional parameters. Not used at the moment. |
prints the most probable consensus DAG and the number of observations used to calculate it.
Given a matrix defining a DAG, create a text file suitable for plotting with Graphviz.
toGraphviz(dag, data.df = NULL, data.dists = NULL, group.var = NULL,
           outfile = NULL, directed = TRUE, verbose = FALSE)
dag | a matrix defining a DAG. |
data.df | a data frame containing the data used for learning the network. |
data.dists | a list with named arguments matching the names of the data frame, which gives the distribution family for each variable; see details. |
group.var | only applicable for mixed models; gives the column name in data.df containing the grouping variable. |
outfile | a character string giving the filename which will contain the graphviz graph. |
directed | logical; if TRUE, a directed acyclic graph is produced, otherwise an undirected graph. |
verbose | if TRUE, more output is printed. If TRUE and outfile = NULL, the '.dot' file is printed to the console. |
Graphviz (https://www.graphviz.org) is visualisation software developed by AT&T and freely available. This function creates a text representation of the DAG, or of the undirected graph, so it can be plotted using Graphviz. The R package Rgraphviz (available through the Bioconductor project, https://www.bioconductor.org/) interfaces R with a working installation of Graphviz.

Binary nodes appear as squares, Gaussian nodes as ovals and Poisson nodes as diamonds in the resulting Graphviz network diagram. There are many other possible shapes for nodes and numerous other visual enhancements; see the online Graphviz documentation.

Bespoke refinements can be added by editing the raw outfile produced. For full manual editing, particularly of the layout, or for adding annotations, one easy solution is to convert a postscript-format graph (produced in Graphviz using the -Tps switch) into a vector format using a tool such as pstoedit (http://www.pstoedit.net/), and then edit it using a vector drawing tool like xfig. This can then be re-saved as postscript or pdf, thus retaining full vector quality.
Nothing is returned, but a file outfile is written.
Fraser Iain Lewis
Marta Pittavino
## On a typical Linux system the following code constructs a nice
## looking pdf file 'graph.pdf'.
## Not run:
## Subset of a built-in dataset
mydat <- ex0.dag.data[, c("b1", "b2", "b3", "g1", "b4", "p2", "p4")]

## setup distribution list for each node
mydists <- list(b1 = "binomial", b2 = "binomial", b3 = "binomial",
                g1 = "gaussian", b4 = "binomial", p2 = "poisson",
                p4 = "poisson")

## specify DAG model
mydag <- matrix(c(0,1,0,0,1,0,0,
                  0,0,0,0,0,0,0,
                  0,1,0,0,1,0,0,
                  1,0,0,0,0,0,1,
                  0,0,0,0,0,0,0,
                  0,0,0,1,0,0,0,
                  0,0,0,0,1,0,0), byrow = TRUE, ncol = 7)
colnames(mydag) <- rownames(mydag) <- names(mydat)

## create file for processing with graphviz
outfile <- paste(tempdir(), "graph.dot", sep = "/")
toGraphviz(dag = mydag, data.df = mydat, data.dists = mydists,
           outfile = outfile)

## and then process using graphviz tools, e.g. on Linux
if (Sys.info()[["sysname"]] == "Linux" && interactive()) {
  system(paste("dot -Tpdf -o graph.pdf", outfile))
  system("evince graph.pdf")
}

## Example using data with a group variable where b1 <- b2
mydag <- matrix(c(0,1,
                  0,0), byrow = TRUE, ncol = 2)
colnames(mydag) <- rownames(mydag) <- names(ex3.dag.data[, c(1,2)])

## specify distributions
mydists <- list(b1 = "binomial", b2 = "binomial")

## create file for processing with graphviz
outfile <- paste0(tempdir(), "/graph.dot")
toGraphviz(dag = mydag, data.df = ex3.dag.data[, c(1,2,14)],
           data.dists = mydists, group.var = "group",
           outfile = outfile, directed = FALSE)

## and then process using graphviz tools, e.g. on Linux:
if (Sys.info()[["sysname"]] == "Linux" && interactive()) {
  pdffile <- paste0(tempdir(), "/graph.pdf")
  system(paste("dot -Tpdf -o ", pdffile, outfile))
  system(paste("evince ", pdffile, " &"))  ## or some other viewer
}
## End(Not run)
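As an alternative to shelling out to the dot binary, the generated file can also be rendered from within R; a sketch assuming Rgraphviz is installed (its agread() reads DOT-format files, but check your installed version's help for the exact interface):

## Not run:
## Render the .dot file written above from within R via Rgraphviz
## (assumes Rgraphviz and a working Graphviz installation).
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  g <- Rgraphviz::agread(outfile)   # parse the DOT file
  plot(g)
}
## End(Not run)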