The Branches File

When investigating a research question with the Vivarium framework, it usually becomes necessary to vary aspects of a model specification in order to evaluate the uncertainty of model outputs or to explore different scenarios based on model parameters. Without any extra tooling this would require manually manipulating the model specification file and re-running for each desired change, which would quickly get out of hand. The branch configuration helps us do this in a convenient way. This section will detail the common ways simulations are varied and the different aspects of a branch configuration that help us do this.

Uncertainty

Generating uncertainty for results is a core tenet of IHME and this is no different for simulation science. We are primarily concerned with two kinds of uncertainty in our model: parameter uncertainty and stochastic uncertainty. The branch configuration can help us explore both sources of uncertainty by varying both the input draw of the parameter data and the seed of the simulation’s random number generator.

Parameter Uncertainty

Our simulations primarily rely on results from the Global Burden of Disease (GBD). GBD results are produced with uncertainty represented as draws. Once we have a model we trust, we typically want to capture our uncertainty in the input data by running the simulation model for several different input draws.

Note

A draw is a statistical term related to Bayesian statistics that has a specific meaning in the context of the GBD. The implementation details vary, but the idea is this: for some quantity or measure of interest, a draw is one member of a full set of results. Taken together, the set of draws describes at least some of the uncertainty surrounding the quantity as a result of the modeling process, data uncertainty, etc. Generally, GBD results are produced in sets of 1000 draws.

To do this, we can use the input_draw_count key in a branch configuration. This key refers to an integer that represents the number of different input draws to generate simulations from.

parameter_uncertainty_branches.yaml

input_draw_count: 10

Note

Instead of, or in addition to, specifying an input_draw_count, a list of draws can be specified using the input_draws key. If input_draw_count is also specified, the two values must agree, i.e., the length of the input_draws list must be the same as input_draw_count.

When we use this branch configuration along with the original model specification, we’ll launch 10 simulations in parallel, each using a different set of input parameters represented by the draw number.

psimulate run /path/to/model_specification.yaml /path/to/parameter_uncertainty_branches.yaml

Note

psimulate randomly selects the input draws it uses from the range [0, 999]. The selection happens without replacement, so specifying an input_draw_count of 10 guarantees you 10 unique input draws.
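The selection described in this note can be illustrated with a short Python sketch. This is not psimulate's actual code; the function name choose_input_draws and its rng_seed argument are hypothetical, and the only behavior taken from the text above is that draws come from [0, 999] and are chosen without replacement.

```python
import random

GBD_DRAW_COUNT = 1000  # draws are numbered 0 through 999


def choose_input_draws(input_draw_count, rng_seed=None):
    """Pick unique draw numbers from [0, 999] without replacement.

    Hypothetical helper for illustration only, not part of psimulate.
    """
    if not 1 <= input_draw_count <= GBD_DRAW_COUNT:
        raise ValueError("input_draw_count must be between 1 and 1000")
    rng = random.Random(rng_seed)
    # random.sample never returns the same element twice.
    return rng.sample(range(GBD_DRAW_COUNT), input_draw_count)


draws = choose_input_draws(10, rng_seed=0)
print(sorted(draws))
```

Because the sampling is without replacement, the returned list always contains exactly input_draw_count distinct draw numbers.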

Stochastic Uncertainty

Vivarium simulations are probabilistic in nature. They use Monte Carlo sampling techniques to make decisions about who gets sick, who goes to the hospital, who dies, etc. This usage of randomness means our models have to consider the impact of stochastic uncertainty on their outputs.

There are two ways to handle stochastic uncertainty. The first is to increase the size of the population you’re simulating. This will wash out outlier cases that might heavily skew your results. This works fine up to a point, but simulation run time scales directly with the size of the population you’re simulating. Alternatively, you can run multiple simulations with different random seeds and aggregate your results across those simulations. This second approach takes advantage of parallel computing to keep run times under control.

Note

Random seeds are a convenient way to scale up a simulation’s population in parallel. For example, running a simulation with one million simulants and a single random seed is equivalent to running the same simulation with ten thousand people and 100 random seeds. Because simulations specified with different seeds will be run in parallel, the latter run strategy is often preferable.
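To make the pooling idea concrete, here is a toy Monte Carlo sketch in Python. It is not a Vivarium simulation: simulate_deaths is a hypothetical stand-in in which every simulant dies with a fixed, made-up probability, and the population sizes are much smaller than in the note so that the example runs quickly.

```python
import random


def simulate_deaths(population_size, seed, mortality_risk=0.3):
    """Toy stand-in for one simulation: count the simulants who die.

    mortality_risk is an invented value used only for illustration.
    """
    rng = random.Random(seed)
    return sum(rng.random() < mortality_risk for _ in range(population_size))


# 100 independent "simulations" of 1,000 simulants each, one per random seed.
seeds = range(100)
population_per_seed = 1_000
deaths_by_seed = [simulate_deaths(population_per_seed, seed) for seed in seeds]

# Aggregating across seeds gives an estimate based on 100,000 simulants.
pooled_population = population_per_seed * len(seeds)
pooled_estimate = sum(deaths_by_seed) / pooled_population

print(pooled_population, round(pooled_estimate, 3))
```

Each of the 100 small runs could execute on a separate worker, yet the pooled estimate draws on as many simulants as a single run with a population of 100,000 would.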

To run our simulation for multiple random seeds, we use the random_seed_count key in a branch configuration. This key specifies an integer that represents the number of different random seeds to use, each generated randomly and run in a separate simulation.

stochastic_uncertainty_branches.yaml

random_seed_count: 100

When we use this branch configuration along with the original model specification, we’ll launch 100 simulations in parallel, each using a different random seed.

psimulate run /path/to/model_specification.yaml /path/to/stochastic_uncertainty_branches.yaml

Note

Instead of, or in addition to, specifying a random_seed_count, a list of seeds can be specified using the random_seeds key. If random_seed_count is also specified, the two values must agree, i.e., the length of the random_seeds list must be the same as random_seed_count. Note that random_seeds values must be integers in the range [0, 9999].

Combining Draws and Seeds

Since specifying either input draws or random seeds will result in multiple simulations being run, it is important to understand how branch configurations are parsed into simulations when both keys are specified. Specifying both an input_draw_count and a random_seed_count will result in a set of input draws and a set of random seeds being independently generated. Simulations will then be run for each unique combination of input draw and random seed (the Cartesian product of the two sets).

An example may make this clearer, so consider the following branch configuration.

combined_uncertainty_branches.yaml

input_draw_count: 100
random_seed_count: 10

It combines the two configuration keys we just learned about. Taken separately, the input_draw_count mapping would lead to 100 simulations on 100 draws of input data while the random_seed_count mapping would lead to ten simulations with identical input data but a different seed for the random number generation. With both specified, the result is 1,000 total simulations, one for each member of the Cartesian product of those sets. That is, we would run ten simulations with the ten random seeds for each of the 100 input data draws.
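The Cartesian product can be sketched in a few lines of Python using itertools.product. The draw and seed values below are made up and kept small so the full list of combinations is easy to read; this illustrates the counting rule and is not psimulate's implementation.

```python
import itertools

# Hypothetical values standing in for the randomly generated sets.
input_draws = [4, 8, 15]
random_seeds = [16, 23]

# One simulation per (input draw, random seed) pair.
simulations = list(itertools.product(input_draws, random_seeds))
for draw, seed in simulations:
    print(f"input_draw={draw}, random_seed={seed}")

# The same rule applied to the counts in the example above.
total = len(list(itertools.product(range(100), range(10))))
print(total)  # 1000
```

The number of simulations is always the product of the two set sizes, which is why adding even a few seeds multiplies the cost of every input draw.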

Specifying Specific Draws and Seeds

By default, Vivarium chooses draws and seeds randomly. However, you can specify the draws and/or seeds you want to use by providing a list of integers. For example, to run a simulation using input draws 4 and 8 and random seeds 15, 16, 23, and 42, you can use the following branch configuration:

specific_draws_and_seeds.yaml

input_draw_count: 2
random_seed_count: 4

input_draws: [4, 8]
random_seeds: [15, 16, 23, 42]

It is valid to specify both input_draws and random_seeds (as shown above) or only one of them.

Note

The length of input_draws, if provided, must match the value of input_draw_count. Similarly, the length of random_seeds, if provided, must match the value of random_seed_count.
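The agreement rule in this note can be expressed as a small check. The Python sketch below is illustrative only: check_key_pair is a hypothetical helper, the configuration is written as a plain dict mirroring the YAML above, and the only range enforced is the [0, 9999] bound on random_seeds mentioned earlier in this section.

```python
def check_key_pair(config, count_key, list_key, valid_range=None):
    """Return the explicit list for list_key after checking it against count_key.

    Hypothetical helper for illustration; psimulate's own validation may differ.
    """
    count = config.get(count_key)
    values = config.get(list_key)
    if values is None:
        return None  # no explicit list, so values would be chosen randomly
    if count is not None and count != len(values):
        raise ValueError(
            f"{count_key} is {count} but {list_key} has {len(values)} entries"
        )
    if valid_range is not None:
        low, high = valid_range
        bad = [v for v in values if not (isinstance(v, int) and low <= v <= high)]
        if bad:
            raise ValueError(f"{list_key} values out of range [{low}, {high}]: {bad}")
    return values


branch_config = {
    "input_draw_count": 2,
    "random_seed_count": 4,
    "input_draws": [4, 8],
    "random_seeds": [15, 16, 23, 42],
}

print(check_key_pair(branch_config, "input_draw_count", "input_draws"))
print(check_key_pair(branch_config, "random_seed_count", "random_seeds", (0, 9999)))
```

A configuration such as input_draw_count: 3 with input_draws: [4, 8] would raise a ValueError in this sketch, because the count and the list length disagree.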

Configuration Variations

A major function of branch configurations is to enable easy manipulation of the configuration parameters of a model specification. These parameters generally govern interesting features of an intervention, such as its target coverage or efficacy.

Within a branch configuration, you can specify several variations of these parameters to generate different scenarios or examine the sensitivity of a model to changes in a specific parameter. In the following sections we will describe a number of ways you can construct different scenarios and explain how to compute the number of simulations that will be run for a particular branch configuration.

Note

The following examples that alter configuration parameters all lie under a branches key. Besides the draw and seed keys described above (input_draw_count, input_draws, random_seed_count, and random_seeds), this is the only top level key that psimulate understands how to parse.

Single Parameter Variation

In order to illustrate the variation of a single parameter, let’s assume you have defined a model specification that includes the expansion of a dietary intervention of egg supplementation and that this intervention is parameterized by the proportion of the population that is recruited into the intervention program. We may want to run simulations on several different proportions. We can easily do this with the following branches file.

egg_intervention_branches.yaml

branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]

The branches block specifies changes to values found in the configuration block of the original model specification YAML. The nesting of keys in the branches file must exactly match the nesting of the corresponding block in the original model specification. Here, the YAML list [0.1, 0.4, 0.8, 1.0] dictates specific recruitment proportions to be simulated. Thus, you can expect four separate simulations to be run, one for each variation.
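One way to picture how a single variation changes the model specification is as an override merged into the original configuration block. The Python sketch below rests on several assumptions: the base configuration contents are invented, apply_overrides is a hypothetical helper rather than a Vivarium function, and rejecting keys absent from the base configuration is just one reading of the "must exactly match" requirement.

```python
import copy


def apply_overrides(base, overrides):
    """Return a copy of base with the values from overrides merged in.

    Hypothetical helper: unknown keys raise a KeyError so that the
    override must follow the nesting of the base configuration.
    """
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if key not in result:
            raise KeyError(f"{key!r} is not in the base configuration")
        if isinstance(value, dict) and isinstance(result[key], dict):
            result[key] = apply_overrides(result[key], value)
        else:
            result[key] = value
    return result


# Invented configuration block standing in for the model specification.
base_configuration = {
    "population": {"population_size": 10_000},
    "egg_intervention": {"recruitment": {"proportion": 0.0}},
}

# One resolved configuration per value in the branches list.
configurations = [
    apply_overrides(
        base_configuration,
        {"egg_intervention": {"recruitment": {"proportion": proportion}}},
    )
    for proportion in [0.1, 0.4, 0.8, 1.0]
]

for configuration in configurations:
    print(configuration["egg_intervention"]["recruitment"]["proportion"])
```

Everything that is not named in the branches block, such as the population size here, carries over unchanged into each of the four simulations.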

Warning

Varying the time step, start or end time, or the population size of a simulation will make profiling very difficult and runs the risk of breaking our output writing tools.

Interaction with Uncertainty

As touched upon in the section on combining draws and seeds, each of the top level keys in a branch configuration can independently produce a set of simulations to be run. To find the total set of simulations to be run from a branch configuration file, we need to count the Cartesian product of the top level keys. We’ll use a slight alteration of our intervention configuration as an example.

egg_intervention_with_parameter_uncertainty_branches.yaml

input_draw_count: 100
random_seed_count: 4

branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]

This branch configuration will produce 1,600 simulations. First we consider the space of configuration parameters the simulation will be run for: one scenario for each of the four recruitment proportions. For each scenario, we will run a simulation for each combination of input draw and random seed specified by the input_draw_count and random_seed_count keys. So we’ll have: (Number of input draws) * (Number of random seeds) * (Number of scenarios) = 100 * 4 * 4 = 1,600 simulations to run from this branch configuration.

Multi-parameter Variation

Branch configurations really shine when you want to vary a lot of aspects of your model.

Let’s add another parameter to create scenarios along a new dimension. Say, for instance, we were also interested in implementing the egg intervention by recruiting people only once they pass a certain age threshold. Provided components were available that can implement this, we could add a variety of starting ages to our branches file like so:

egg_intervention_with_ages_branches.yaml

input_draw_count: 100
random_seed_count: 4

branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]

This will result in scenarios encompassing every combination of recruitment proportion and starting age, 16 scenarios in all. Each scenario is then run once for every combination of input draw and random seed, i.e., 100 * 4 = 400 simulations per scenario. This means the total number of simulations is given by (Number of input draws) * (Number of random seeds) * (Number of recruitment proportions) * (Number of starting ages) = 100 * 4 * 4 * 4, giving a total of 6,400 simulations.
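The expansion of list-valued parameters into scenarios can be sketched in Python as follows. The helper names flatten and expand_scenarios are hypothetical and the block is written as a plain dict mirroring the YAML above; this illustrates the combination rule rather than psimulate's own parsing code.

```python
import itertools


def flatten(block, prefix=()):
    """Yield (key_path, values) pairs, wrapping scalar leaves in a list."""
    for key, value in block.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            values = value if isinstance(value, list) else [value]
            yield path, values


def expand_scenarios(block):
    """Return one {key_path: value} dict per combination of parameter values."""
    paths, value_lists = zip(*flatten(block))
    return [dict(zip(paths, combo)) for combo in itertools.product(*value_lists)]


branch_block = {
    "egg_intervention": {
        "recruitment": {
            "proportion": [0.1, 0.4, 0.8, 1.0],
            "age_start": [10.0, 25.0, 45.0, 65.0],
        }
    }
}

scenarios = expand_scenarios(branch_block)
print(len(scenarios))            # 16
print(100 * 4 * len(scenarios))  # 6400
```

Each additional list-valued parameter multiplies the number of scenarios by the length of its list, so the count grows quickly as dimensions are added.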

Multiple Top-level Configurations

We can also create scenarios with multiple top-level configurations. Now imagine we would like to study a second dietary intervention, lentil supplementation, concurrently with the egg supplementation.

egg_and_lentil_intervention_with_ages_branches.yaml

input_draw_count: 100
random_seed_count: 4

branches:
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
    lentil_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]

This will result in scenarios encompassing every combination of recruitment proportion and starting age for eggs combined with each combination of recruitment proportion and starting age for lentils, 256 scenarios in all. Each scenario is then run once for every combination of input draw and random seed, i.e., 400 simulations per scenario. This means the total number of simulations is given by (Number of input draws) * (Number of random seeds) * (Number of egg recruitment proportions) * (Number of egg starting ages) * (Number of lentil recruitment proportions) * (Number of lentil starting ages) = 100 * 4 * 4 * 4 * 4 * 4, giving a total of 102,400 simulations. As you can see, it is very easy to create a dangerously large number of simulations in this manner.

Complex Configurations

Let’s look at a final example with a bit more going on. Note that in our last example branch configuration we ended up with a huge number of simulations, probably more than it is reasonable to run. What if, instead of scaling up both interventions in conjunction across the scenarios, we only wanted to scale up egg supplementation while holding lentil supplementation constant, and then scale up lentil supplementation while holding egg supplementation constant?

better_egg_intervention_with_ages_branches.yaml

input_draw_count: 100
random_seed_count: 4

branches:
  # Egg supplementation
  - egg_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]
    lentil_intervention:
      recruitment:
        proportion: 0.1
        age_start: 25.0
  # Lentil supplementation
  - egg_intervention:
      recruitment:
        proportion: 0.1
        age_start: 25.0
    lentil_intervention:
      recruitment:
        proportion: [0.1, 0.4, 0.8, 1.0]
        age_start: [10.0, 25.0, 45.0, 65.0]

The YAML list underneath the branches key denotes two different simulation scenario branches, each with its own set of configuration parameters. We resolve each of the list items under the branches key separately. The first block resolves to 16 egg supplementation scenarios. The second block resolves to 16 lentil supplementation scenarios. Thus the entire branches block resolves to 32 different sets of configuration parameters.

Following the same logic as in the previous section, we compute the total number of simulations to be run as (Number of input draws) * (Number of random seeds) * (Number of scenarios) = 100 * 4 * 32 = 12,800.
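The counting rule used throughout this section can be written as a short Python sketch. The function names are hypothetical, the branch configuration is written as a plain dict mirroring the YAML above, and treating a missing input_draw_count or random_seed_count as 1 is an assumption made for the sake of the example.

```python
import math


def count_block(block):
    """Number of scenarios produced by one item of the branches list."""
    if isinstance(block, dict):
        return math.prod(count_block(value) for value in block.values())
    if isinstance(block, list):
        return len(block)
    return 1  # a scalar is held constant


def count_simulations(config):
    """Draws * seeds * scenarios, summing scenarios over the branches list."""
    scenarios = sum(count_block(block) for block in config.get("branches", [{}]))
    draws = config.get("input_draw_count", 1)  # assumed default
    seeds = config.get("random_seed_count", 1)  # assumed default
    return draws * seeds * scenarios


varied = {
    "recruitment": {
        "proportion": [0.1, 0.4, 0.8, 1.0],
        "age_start": [10.0, 25.0, 45.0, 65.0],
    }
}
constant = {"recruitment": {"proportion": 0.1, "age_start": 25.0}}

branch_config = {
    "input_draw_count": 100,
    "random_seed_count": 4,
    "branches": [
        {"egg_intervention": varied, "lentil_intervention": constant},
        {"egg_intervention": constant, "lentil_intervention": varied},
    ],
}

print(count_simulations(branch_config))  # 12800
```

Running a check like this before launching can help catch a branch configuration that resolves to far more simulations than intended.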