The Branches File
When investigating a research question with the Vivarium framework, it usually becomes necessary to vary aspects of a model specification in order to evaluate the uncertainty of model outputs or to explore different scenarios based on model parameters. Without any extra tooling this would require manually manipulating the model specification file and re-running for each desired change, which would quickly get out of hand. The branch configuration helps us do this in a convenient way. This section will detail the common ways simulations are varied and the different aspects of a branch configuration that help us do this.
Uncertainty
Quantifying the uncertainty of results is a core tenet of IHME, and simulation science is no different. We are primarily concerned with two kinds of uncertainty in our models: parameter uncertainty and stochastic uncertainty. The branch configuration helps us explore both by varying the input draw of the parameter data and the seed of the simulation’s random number generator.
Parameter Uncertainty
Our simulations primarily rely on results from the Global Burden of Disease (GBD). GBD results are produced with uncertainty represented as draws. Once we have a model we trust, we typically want to capture our uncertainty in the input data by running the simulation model for several different input draws.
Note
A draw is a term from Bayesian statistics that has a specific meaning in the context of the GBD. The implementation details vary, but the idea is this: for some quantity or measure of interest, a draw is one member of a full set of results such that, taken together, the set of draws describes at least some of the uncertainty surrounding the quantity that arises from the modeling process, data uncertainty, etc. Generally, GBD results are produced in sets of 1000 draws.
To do this, we can use the input_draw_count key in a branch configuration. This key takes an integer that specifies the number of different input draws to run simulations for.
input_draw_count: 10
Note
Instead of, or in addition to, specifying an input_draw_count, a list of draws can be specified using the input_draws key. If input_draw_count is also specified, the two values must agree, i.e., the length of the input_draws list must be the same as input_draw_count.
When we use this branch configuration along with the original model specification, we’ll launch 10 simulations in parallel, each using a different set of input parameters represented by the draw number.
psimulate run /path/to/model_specification.yaml /path/to/parameter_uncertainty_branches.yaml
Note
psimulate randomly selects the input draws it uses from the range [0, 999]. The selection happens without replacement, so specifying an input_draw_count of 10 guarantees you 10 unique input draws.
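Sampling without replacement can be sketched in a few lines of Python. This is only an illustration of the behavior described in the note, not psimulate's actual implementation; the function name sample_input_draws and the constant GBD_DRAW_COUNT are hypothetical.

```python
import random

GBD_DRAW_COUNT = 1000  # draws are numbered 0..999, per the note above


def sample_input_draws(input_draw_count, rng=None):
    """Pick `input_draw_count` distinct draw numbers from [0, 999].

    Hypothetical helper for illustration; not part of psimulate.
    """
    if rng is None:
        rng = random.Random()
    # random.sample selects without replacement, so no draw repeats.
    return rng.sample(range(GBD_DRAW_COUNT), input_draw_count)


draws = sample_input_draws(10, random.Random(0))
print(sorted(draws))
```

Because the selection is without replacement, asking for more than 1000 draws would raise a ValueError in this sketch.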
Stochastic Uncertainty
Vivarium simulations are probabilistic in nature. They use Monte Carlo sampling techniques to make decisions about who gets sick, who goes to the hospital, who dies, etc. This use of randomness means we have to consider the impact of stochastic uncertainty on our models’ outputs.
There are two ways to handle stochastic uncertainty. The first is to increase the size of the population you’re simulating. This will wash out outlier cases that might heavily skew your results. This works fine up to a point, but simulation run time scales directly with the size of the population you’re simulating. Alternatively, you can run multiple simulations with different random seeds and aggregate your results across those simulations. This second approach takes advantage of parallel computing to keep run times under control.
Note
Random seeds are a convenient way to scale up a simulation’s population in parallel. For example, running a simulation with one million simulants and a single random seed is equivalent to running the same simulation with ten thousand people and 100 random seeds. Because simulations specified with different seeds will be run in parallel, the latter run strategy is often preferable.
To run our simulation for multiple random seeds, we use the random_seed_count key in a branch configuration. This key specifies an integer that represents the number of different random seeds to use, each generated randomly and run in a separate simulation.
random_seed_count: 100
When we use this branch configuration along with the original model specification, we’ll launch 100 simulations in parallel, each using a different random seed.
psimulate run /path/to/model_specification.yaml /path/to/stochastic_uncertainty_branches.yaml
Note
Instead of, or in addition to, specifying a random_seed_count, a list of seeds can be specified using the random_seeds key. If random_seed_count is also specified, the two values must agree, i.e., the length of the random_seeds list must be the same as random_seed_count. Note that random_seeds values must be integers in the range [0, 9999].
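The agreement rule between a count key and its companion list can be sketched as a small check. This is a hedged illustration of the rule as stated in the note, not psimulate's own validation code; the helper name check_count_and_list and its signature are hypothetical.

```python
def check_count_and_list(config, count_key, list_key, max_value):
    """Validate one count/list pair from a branch configuration mapping.

    Hypothetical helper for illustration. Returns the number of values
    that would be used, or None when neither key is present.
    """
    count = config.get(count_key)
    values = config.get(list_key)
    if values is None:
        return count
    # Every listed value must be an integer inside the allowed range.
    if any(not isinstance(v, int) or not 0 <= v <= max_value for v in values):
        raise ValueError(f"{list_key} values must be integers in [0, {max_value}]")
    # When both keys are given, the count must equal the list length.
    if count is not None and count != len(values):
        raise ValueError(
            f"{count_key} ({count}) does not match the length of {list_key} ({len(values)})"
        )
    return len(values)


config = {"random_seed_count": 3, "random_seeds": [11, 42, 9001]}
n_seeds = check_count_and_list(config, "random_seed_count", "random_seeds", 9999)
print(n_seeds)  # 3
```

A configuration such as random_seed_count: 2 with three listed seeds, or a seed of 10000, would raise a ValueError in this sketch.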
Combining Draws and Seeds
Since specifying either input draws or random seeds will result in multiple simulations being run, it is important to understand how branch configurations are parsed into simulations when both keys are specified. Specifying both an input_draw_count and a random_seed_count will result in a set of input draws and a set of random seeds being independently generated. Simulations will then be run for each unique combination of input draw and random seed (the Cartesian product of the two sets).
An example may make this clearer, so consider the following branch configuration.
input_draw_count: 100
random_seed_count: 10
It combines the two configuration keys we just learned about. Taken separately, the input_draw_count mapping would lead to 100 simulations on 100 draws of input data, while the random_seed_count mapping would lead to ten simulations with identical input data but a different seed for the random number generator. With both specified, the result is 1,000 total simulations, one for each member of the Cartesian product of those sets. That is, we would run ten simulations with the ten random seeds for each of the 100 input data draws.
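The pairing of draws and seeds can be sketched with itertools.product. The sampling below is an assumption made for illustration (in particular, that seeds are distinct and drawn from [0, 9999]); it is not psimulate's actual code.

```python
import itertools
import random

rng = random.Random(12345)
# Assumed for illustration: distinct draws from [0, 999], distinct seeds from [0, 9999].
input_draws = rng.sample(range(1000), 100)
random_seeds = rng.sample(range(10000), 10)

# One simulation per (input draw, random seed) pair: the Cartesian product.
jobs = list(itertools.product(input_draws, random_seeds))
print(len(jobs))  # 1000
```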
Configuration Variations
A major function of branch configurations is to enable easy manipulation of the configuration parameters of a model specification. These parameters generally govern interesting features of an intervention, such as its target coverage or efficacy.
Within a branch configuration, you can specify several variations of these parameters to generate different scenarios or examine the sensitivity of a model to changes in a specific parameter. In the following sections we will describe a number of ways you can construct different scenarios and explain how to compute the number of simulations that will be run for a particular branch configuration.
Note
The following examples that alter configuration parameters all lie under a branches key. This is the only other top level key (besides the input draw and random seed keys described above) that psimulate understands how to parse.
Single Parameter Variation
In order to illustrate the variation of a single parameter, let’s assume you have defined a model specification that includes the expansion of a dietary intervention of egg supplementation and that this intervention is parameterized by the proportion of the population that is recruited into the intervention program. We may want to run simulations on several different proportions. We can easily do this with the following branches file.
branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
The branches block specifies changes to values found in the configuration block of the original model specification YAML. The nested keys in the branches file must exactly match those in the original model specification. Here, the YAML list [0.1, 0.4, 0.8, 1.0] dictates the specific recruitment proportions to be simulated. Thus, you can expect four separate simulations to be run, one for each variation.
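One way to picture how a list of values turns into separate scenarios is to expand the nested block into one flat configuration per value. The sketch below uses plain Python dictionaries in place of parsed YAML; the helpers flatten and expand are hypothetical and only approximate what a branch parser might do.

```python
import itertools


def flatten(block, prefix=()):
    """Yield (key_path, value) pairs for every leaf of a nested mapping."""
    for key, value in block.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def expand(block):
    """Return one {key_path: value} dict per combination of listed values."""
    paths, options = [], []
    for path, value in flatten(block):
        paths.append(path)
        # A scalar is treated as a list with a single option.
        options.append(value if isinstance(value, list) else [value])
    return [dict(zip(paths, combo)) for combo in itertools.product(*options)]


branch = {"egg_intervention": {"recruitment": {"proportion": [0.1, 0.4, 0.8, 1.0]}}}
scenarios = expand(branch)
for scenario in scenarios:
    print(scenario)
print(len(scenarios))  # 4
```

Each printed dictionary corresponds to one scenario, keyed by the path to the overridden parameter.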
Warning
Varying the time step, start or end time, or the population size of a simulation will make profiling very difficult and runs the risk of breaking our output writing tools.
Interaction with Uncertainty
As touched upon in the section on combining draws and seeds, each of the top level keys in a branch configuration can independently produce a set of simulations to be run. To find the total number of simulations a branch configuration file will produce, we need to count the members of the Cartesian product of the sets produced by the top level keys. We’ll use a slight alteration of our intervention configuration as an example.
input_draw_count: 100
random_seed_count: 4
branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
This branch configuration will produce 1,600 simulations. First we consider the space of configuration parameters the simulation will be run for: one scenario for each of the four recruitment proportions. For each scenario, we will run a simulation for each combination of input draw and random seed specified by the input_draw_count and random_seed_count keys. So we’ll have:
(Number of input draws) * (Number of random seeds) * (Number of scenarios) = 100 * 4 * 4 = 1,600
simulations to run from this branch configuration.
Multi-parameter Variation
Branch configurations really shine when you want to vary a lot of aspects of your model.
Let’s add another parameter to create scenarios along a new dimension. Say, for instance, we were also interested in implementing the egg intervention by recruiting people only once they pass a certain age threshold. Provided components are available that can implement this, we could add a variety of starting ages to our branches file like so:
input_draw_count: 100
random_seed_count: 4
branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
This will result in scenarios encompassing every combination of recruitment proportion and starting age, 16 in all. Additionally, it will result in 400 simulations for each one of the scenarios, one for each combination of input draw and random seed. This means the total number of simulations is given by (Number of input draws) * (Number of random seeds) * (Number of recruitment proportions) * (Number of starting ages) = 100 * 4 * 4 * 4, giving a total of 6,400 simulations.
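The arithmetic for this example can be checked by enumerating the scenarios directly. This is a minimal sketch using the values from the branches file above; the variable names are chosen for illustration only.

```python
import itertools

input_draw_count = 100
random_seed_count = 4
proportions = [0.1, 0.4, 0.8, 1.0]
age_starts = [10.0, 25.0, 45.0, 65.0]

# Every (proportion, age_start) pair is its own scenario.
scenarios = list(itertools.product(proportions, age_starts))

# Each scenario is run once per (input draw, random seed) combination.
simulations_per_scenario = input_draw_count * random_seed_count
total = len(scenarios) * simulations_per_scenario
print(len(scenarios), simulations_per_scenario, total)  # 16 400 6400
```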
Multi-parameter Variation
We can also create scenarios with multiple top-level configurations. Now imagine we would like to study another dietary intervention, lentil supplementation, concurrently with the egg supplementation.
input_draw_count: 100
random_seed_count: 4
branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
    lentil_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
This will result in scenarios encompassing every combination of recruitment proportion and starting age for eggs combined with each combination of recruitment proportion and starting age for lentils, 256 scenarios in all. Additionally, it will result in 400 simulations for each one of the scenarios, one for each combination of input draw and random seed. This means the total number of simulations is given by (Number of input draws) * (Number of random seeds) * (Number of egg recruitment proportions) * (Number of egg starting ages) * (Number of lentil recruitment proportions) * (Number of lentil starting ages) = 100 * 4 * 4 * 4 * 4 * 4, giving a total of 102,400 simulations. As you can see, it is very easy to create a dangerously large number of simulations in this manner.
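Because the count grows multiplicatively, it can be useful to compute it before launching anything. The sketch below counts scenarios without materializing them and compares the total to a budget. The helper count_scenarios and the MAX_SIMULATIONS threshold are hypothetical; psimulate is not assumed to provide either.

```python
def count_scenarios(block):
    """Multiply the lengths of every list found in a nested mapping."""
    total = 1
    for value in block.values():
        if isinstance(value, dict):
            total *= count_scenarios(value)
        elif isinstance(value, list):
            total *= len(value)
        # Scalars contribute a factor of one.
    return total


proportions = [0.1, 0.4, 0.8, 1.0]
age_starts = [10.0, 25.0, 45.0, 65.0]
recruitment = {"recruitment": {"proportion": proportions, "age_start": age_starts}}
branch = {"egg_intervention": recruitment, "lentil_intervention": recruitment}

n_scenarios = count_scenarios(branch)
total = 100 * 4 * n_scenarios  # input draws * random seeds * scenarios
print(n_scenarios, total)  # 256 102400

MAX_SIMULATIONS = 20_000  # hypothetical budget, chosen only for this example
if total > MAX_SIMULATIONS:
    print(f"{total} simulations exceeds the budget of {MAX_SIMULATIONS}")
```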
Complex Configurations
Let’s look at a final example with a bit more going on. Note that in our last example branch configuration we ended up with a huge number of simulations, probably more than is reasonable to run. What if, instead of scaling up both interventions in conjunction across the scenarios, we only wanted to scale up egg supplementation while holding lentil supplementation constant, and then scale up lentil supplementation while holding egg supplementation constant?
input_draw_count: 100
random_seed_count: 4
branches:
  # Egg supplementation
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
    lentil_intervention:
      recruitment:
        proportion: 0.1
        age_start: 25.0
  # Lentil supplementation
  - egg_intervention:
      recruitment:
        proportion: 0.1
        age_start: 25.0
    lentil_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
The YAML list underneath the branches key denotes two different simulation scenario branches, each with its own set of configuration parameters. We resolve each of the list items under the branches key separately. The first block resolves to 16 egg supplementation scenarios. The second block resolves to 16 lentil supplementation scenarios. Thus the entire branches block resolves to 32 different sets of configuration parameters.
Following the same logic as in the previous section, we compute the total number of simulations to be run as
(Number of input draws) * (Number of random seeds) * (Number of scenarios) = 100 * 4 * 32 = 12,800.
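The same count can be reproduced by resolving each item under the branches key on its own and summing the results. This is a sketch with plain dictionaries standing in for the parsed YAML; the helper list_lengths is hypothetical.

```python
import math


def list_lengths(block):
    """Yield the length of every list-valued leaf in a nested mapping."""
    for value in block.values():
        if isinstance(value, dict):
            yield from list_lengths(value)
        elif isinstance(value, list):
            yield len(value)


varied = {"recruitment": {"proportion": [0.1, 0.4, 0.8, 1.0],
                          "age_start": [10.0, 25.0, 45.0, 65.0]}}
fixed = {"recruitment": {"proportion": 0.1, "age_start": 25.0}}

branches = [
    {"egg_intervention": varied, "lentil_intervention": fixed},  # egg scale-up
    {"egg_intervention": fixed, "lentil_intervention": varied},  # lentil scale-up
]

# Within a block the list lengths multiply; across blocks the counts add.
per_block = [math.prod(list_lengths(block)) for block in branches]
n_scenarios = sum(per_block)
total = 100 * 4 * n_scenarios  # input draws * random seeds * scenarios
print(per_block, n_scenarios, total)  # [16, 16] 32 12800
```

Splitting the scale-ups into two list items replaces the 256 scenarios of the fully crossed configuration with 32.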