Dagger: Multi-Step Workflows
dagger runs multi-step Jobmon workflows on the cluster. A workflow is
described by a single YAML file that lists an ordered series of steps – each
step is a bash command, a parallel psimulate simulation, a pytest run, a
Python script, or a parameterized notebook. dagger builds one Jobmon
workflow from the file, runs the steps in order, and can resume a stopped
workflow from where it left off.
Use dagger when you have a pipeline (for example: run simulations, then
post-process the results, then render a report) that you want to launch and
resume as a single unit. For a single parallel simulation with no surrounding
steps, use psimulate directly (see Running simulations in parallel).
Quickstart
Write a workflow file, workflow.yaml:
workflow:
name: example_workflow
project: proj_simscience
queue: all.q
output_directory: /mnt/team/simulation_science/example_output
default_environment: my_conda_env
steps:
- name: run_sims
type: simulation
resources:
memory_gb: 3
runtime: "24:00:00"
args:
model_specification: /path/to/model_specification.yaml
branch_configuration: /path/to/branches.yaml
artifact_path: /path/to/artifact.hdf
- name: postprocess
type: python
resources:
memory_gb: 10
runtime: "01:00:00"
args:
path: /path/to/scripts/postprocess.py
keyword_args:
results_dir: /mnt/team/simulation_science/example_output
Launch it:
dagger run --config workflow.yaml
dagger runs run_sims to completion, then runs postprocess. If the
workflow stops partway through (failure, timeout, manual cancel), resume it:
dagger restart /mnt/team/simulation_science/example_output
See How dagger behaves for what restart does and how outputs are laid
out on disk.
Workflow YAML schema
Every workflow file has a single top-level workflow: key whose value is a
mapping of workflow-level fields plus a steps list.
workflow:
name: ...
project: ...
queue: ...
output_directory: ...
default_environment: ... # optional
max_attempts: ... # optional
steps:
- ... # one or more steps
Workflow-level fields
name, project, queue, and output_directory are each required
overall, but may be supplied either in the YAML file or via a CLI
override (the CLI value wins). steps must be provided in the file and must
not be empty.
Field |
Required |
Description |
|---|---|---|
|
yes |
Workflow name shown in Jobmon. CLI override: |
|
yes |
Cluster project to charge, e.g. |
|
yes |
Cluster queue to submit to, e.g. |
|
yes |
Top-level directory for all workflow outputs (relative or absolute).
CLI override: |
|
no |
Conda environment used for any step that does not set its own
|
|
no |
Maximum Jobmon attempts per task before it is marked failed. Defaults to
|
|
yes |
Ordered list of steps. Step |
Step common fields
Every step, regardless of type, accepts:
Field |
Required |
Description |
|---|---|---|
|
yes |
Unique name for the step within the workflow. |
|
yes |
Compute resources for the step’s tasks (see below). |
|
no |
Conda environment for this step. Overrides |
The resources block:
Key |
Required |
Description |
|---|---|---|
|
yes |
Memory request in GB. |
|
no |
Maximum runtime as |
|
no |
CPU cores to request. Default |
|
no |
Per-step project override. Falls back to the workflow |
|
no |
Per-step queue override. Falls back to the workflow |
|
no |
List of hardware types to target, e.g. |
|
no |
Whether to require an archive node. Default |
Step types
A step’s type is determined one of two ways:
a top-level
command:field makes it a bash step; oran explicit
type:field selects one ofsimulation,pytest,python, ornotebook.
All non-bash types take their type-specific options under an args: block.
bash
Runs a shell command. Provide command: at the top level of the step (no
args block). type: bash is implied and may be omitted.
- name: post_analysis
command: python scripts/analyze.py --input /results
environment: analysis_env
resources:
memory_gb: 20
runtime: "02:00:00"
cores: 2
simulation
Runs a parallel psimulate simulation as a workflow step.
|
Required |
Description |
|---|---|---|
|
yes |
Path to the model specification YAML. |
|
yes |
Path to the branch configuration YAML. |
|
no |
Path to the data artifact. Overrides any artifact path in the model specification or branch configuration. |
|
no |
Backup frequency in seconds. Defaults to |
|
no |
Per-simulation logging verbosity ( |
- name: model_sims
type: simulation
resources:
memory_gb: 3
runtime: "24:00:00"
args:
model_specification: /path/to/model.yaml
branch_configuration: /path/to/branches.yaml
artifact_path: /path/to/artifact.hdf
backup_freq: 1800
sim_verbosity: 1
pytest
Runs a pytest suite. Provide at least one of path or k. path may be
a single string or a list of strings.
|
Required |
Description |
|---|---|---|
|
one of path/k |
Test path(s) to run. A string or a list of strings. |
|
one of path/k |
|
|
no |
Pass |
- name: pre_tests
type: pytest
resources:
memory_gb: 8
runtime: "01:00:00"
cores: 4
args:
path:
- tests/unit
- tests/integration
runslow: true
python
Runs a Python script.
|
Required |
Description |
|---|---|---|
|
yes |
Path to the |
|
no |
List of positional arguments passed to the script. |
|
no |
Mapping of |
- name: postprocess
type: python
resources:
memory_gb: 8
runtime: "00:30:00"
args:
path: scripts/postprocess.py
positional_args:
- foo
- bar
keyword_args:
input_dir: /mnt/results/model_29
verbose: true
num_workers: 4
notebook
Executes a parameterized Jupyter notebook.
|
Required |
Description |
|---|---|---|
|
yes |
Path to the input |
|
yes |
Path to write the executed |
|
no |
Mapping of parameters injected into the notebook. |
|
no |
Working directory for execution. Defaults to the parent of |
- name: post_notebook
type: notebook
resources:
memory_gb: 20
runtime: "02:00:00"
args:
path: notebooks/results.ipynb
output_path: /mnt/results/run_29/executed/results.ipynb
parameters:
model_dir: /mnt/results/run_29
year: 2020
Running and restarting
# Launch a fresh workflow from a config file.
dagger run --config workflow.yaml
# Resume a stopped workflow from its output directory.
dagger restart /path/to/output_directory
dagger run accepts overrides for any workflow-level field
(--name/-n, --project/-P, --queue/-q, --output-directory/-o,
--default-environment/-e, --max-attempts/-m). dagger restart takes
the output directory as a positional argument and accepts --project/-P,
--queue/-q, and --max-attempts/-m overrides.
Run dagger run --help or dagger restart --help for the full option list
(including the Slack notification flags --slack-channel / --slack-tag /
--no-slack).
How dagger behaves
Steps run sequentially
Steps execute in the order listed, with a full barrier between them: every
task in step N must finish before any task in step N+1 starts. Parallelism
happens within a step – for example, a simulation step fans out into many
parallel draw/seed tasks – but sibling steps never overlap.
Note
This is the key difference from psimulate, which fans a single
simulation out across the cluster. If you list two simulation steps,
they run one after the other, not concurrently.
Output directory layout
There is a single workflow-level output_directory. Every run writes
configuration.yaml and .workflow_args there (used by dagger
restart). simulation steps additionally create a
<model_name>/<timestamp>/ subdirectory beneath it, where model_name is
derived from the artifact (or model specification) – for per-location
artifacts this is effectively the location name. The timestamp is recorded
in a .build_timestamp marker so that all simulation steps in the run share
it; that marker is only written when the workflow contains a simulation
step. Other step types write their worker logs and outputs directly under
output_directory.
output_directory/
├── configuration.yaml # the resolved workflow config (for restart)
├── .workflow_args # persisted Jobmon workflow id (for restart)
├── .build_timestamp # shared run timestamp (simulation steps only)
└── <model_name>/ # simulation steps only
└── <timestamp>/ # one simulation step's results
├── results/
├── sim_backups/
├── metadata/
└── logs/
Restart resumes the whole workflow
dagger restart <output_directory> reloads configuration.yaml and the
persisted .workflow_args written by the original run, then resumes the
entire Jobmon workflow, skipping tasks that already completed. Tasks that
failed or never ran are retried.
There is no per-step or per-task restart: restart always operates on the whole
workflow. A task that already succeeded is skipped, so restart cannot be
used to deliberately re-run a step that finished successfully.
Warning
Running a second dagger run into a populated directory overwrites
it. configuration.yaml and .workflow_args are rewritten on every
run, so a second run leaves the first one no longer restartable. And if the
workflow contains a simulation step, that second run reuses the persisted
.build_timestamp rather than creating a new timestamped run – so it
writes into the first run’s <model_name>/<timestamp>/ directory and
overwrites those results in place.
To guard against this, dagger run detects a previous run in the target
output_directory (via the persisted .workflow_args) and prompts for
confirmation before continuing; answering n aborts without changing
anything. Still, prefer a fresh output_directory for each new
workflow, and use dagger restart (not a second dagger run) to resume
an interrupted one.