Dagger: Multi-Step Workflows

dagger runs multi-step Jobmon workflows on the cluster. A workflow is described by a single YAML file that lists an ordered series of steps – each step is a bash command, a parallel psimulate simulation, a pytest run, a Python script, or a parameterized notebook. dagger builds one Jobmon workflow from the file, runs the steps in order, and can resume a stopped workflow from where it left off.

Use dagger when you have a pipeline (for example: run simulations, then post-process the results, then render a report) that you want to launch and resume as a single unit. For a single parallel simulation with no surrounding steps, use psimulate directly (see Running simulations in parallel).

Quickstart

Write a workflow file, workflow.yaml:

workflow:
  name: example_workflow
  project: proj_simscience
  queue: all.q
  output_directory: /mnt/team/simulation_science/example_output
  default_environment: my_conda_env
  steps:
    - name: run_sims
      type: simulation
      resources:
        memory_gb: 3
        runtime: "24:00:00"
      args:
        model_specification: /path/to/model_specification.yaml
        branch_configuration: /path/to/branches.yaml
        artifact_path: /path/to/artifact.hdf
    - name: postprocess
      type: python
      resources:
        memory_gb: 10
        runtime: "01:00:00"
      args:
        path: /path/to/scripts/postprocess.py
        keyword_args:
          results_dir: /mnt/team/simulation_science/example_output

Launch it:

dagger run --config workflow.yaml

dagger runs run_sims to completion, then runs postprocess. If the workflow stops partway through (failure, timeout, manual cancel), resume it:

dagger restart /mnt/team/simulation_science/example_output

See How dagger behaves for what restart does and how outputs are laid out on disk.

Workflow YAML schema

Every workflow file has a single top-level workflow: key whose value is a mapping of workflow-level fields plus a steps list.

workflow:
  name: ...
  project: ...
  queue: ...
  output_directory: ...
  default_environment: ...   # optional
  max_attempts: ...          # optional
  steps:
    - ...                    # one or more steps

Workflow-level fields

name, project, queue, and output_directory are each required overall, but may be supplied either in the YAML file or via a CLI override (the CLI value wins). steps must be provided in the file and must not be empty.

| Field | Required | Description |
|---|---|---|
| name | yes | Workflow name shown in Jobmon. CLI override: --name/-n. |
| project | yes | Cluster project to charge, e.g. proj_simscience. CLI override: --project/-P. |
| queue | yes | Cluster queue to submit to, e.g. all.q. CLI override: --queue/-q. |
| output_directory | yes | Top-level directory for all workflow outputs (relative or absolute). CLI override: --output-directory/-o. See Output directory layout. |
| default_environment | no | Conda environment used for any step that does not set its own environment. CLI override: --default-environment/-e. |
| max_attempts | no | Maximum Jobmon attempts per task before it is marked failed. Defaults to 2. CLI override: --max-attempts/-m. |
| steps | yes | Ordered list of steps. Step name values must be unique. |
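
The precedence rule for workflow-level fields (a CLI value wins over the file value, and the four required fields may come from either source) can be sketched as follows. This is a minimal illustration, not dagger's implementation: resolve_workflow_fields and its dict-based inputs are hypothetical names introduced here, and representing an unset CLI option as None is an assumption.

```python
REQUIRED_FIELDS = ("name", "project", "queue", "output_directory")


def resolve_workflow_fields(file_config: dict, cli_overrides: dict) -> dict:
    """Merge workflow-level fields, letting CLI values win over file values.

    Hypothetical helper for illustration; not part of dagger's API.
    """
    resolved = dict(file_config)
    # Assumption: a CLI option that was not passed is represented as None.
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})

    missing = [field for field in REQUIRED_FIELDS if not resolved.get(field)]
    if missing:
        raise ValueError(f"missing required workflow fields: {', '.join(missing)}")
    if not resolved.get("steps"):
        raise ValueError("steps must be provided in the file and must not be empty")
    # Documented default: two Jobmon attempts per task.
    resolved.setdefault("max_attempts", 2)
    return resolved
```

For example, a file that sets name and project combined with CLI values for queue, output_directory, and a different name resolves to the CLI name, the file project, and max_attempts of 2.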

Step common fields

Every step, regardless of type, accepts:

| Field | Required | Description |
|---|---|---|
| name | yes | Unique name for the step within the workflow. |
| resources | yes | Compute resources for the step’s tasks (see below). |
| environment | no | Conda environment for this step. Overrides default_environment. |

The resources block:

| Key | Required | Description |
|---|---|---|
| memory_gb | yes | Memory request in GB. |
| runtime | no | Maximum runtime as hh:mm:ss. Default "01:00:00". Quote it so YAML does not parse it as a sexagesimal number. |
| cores | no | CPU cores to request. Default 1. |
| project | no | Per-step project override. Falls back to the workflow project. |
| queue | no | Per-step queue override. Falls back to the workflow queue. |
| hardware | no | List of hardware types to target, e.g. ["r650"]. |
| requires_archive_node | no | Whether to require an archive node. Default false. |
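
Why the quotes on runtime matter: under YAML 1.1 rules, which some parsers still follow, an unquoted value such as 24:00:00 can be read as a base-60 integer rather than a string. The sketch below reproduces that arithmetic in plain Python. sexagesimal_to_int is a hypothetical name used only for illustration, and no YAML library is involved.

```python
def sexagesimal_to_int(text: str) -> int:
    """Evaluate colon-separated digit groups as a base-60 integer (YAML 1.1 style)."""
    value = 0
    for part in text.split(":"):
        value = value * 60 + int(part)
    return value


# Unquoted 24:00:00 could arrive as this integer instead of the string "24:00:00".
print(sexagesimal_to_int("24:00:00"))  # 86400
```

Writing runtime: "24:00:00" with quotes keeps the value a string regardless of which YAML version the parser implements.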

Step types

A step’s type is determined in one of two ways:

  • a top-level command: field makes it a bash step; or

  • an explicit type: field selects one of simulation, pytest, python, or notebook.

All non-bash types take their type-specific options under an args: block.
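
The type-resolution rule can be sketched as a small function. This is an illustration only: resolve_step_type is a hypothetical name, and rejecting a step that combines command: with a non-bash type: is an assumption of this sketch rather than documented dagger behavior.

```python
EXPLICIT_TYPES = {"simulation", "pytest", "python", "notebook"}


def resolve_step_type(step: dict) -> str:
    """Return the step type implied by a step mapping (hypothetical helper)."""
    declared = step.get("type")
    if "command" in step:
        # A top-level command makes it a bash step; type: bash is optional.
        # Assumption: any other declared type alongside command is an error.
        if declared not in (None, "bash"):
            raise ValueError(f"command given but type is {declared!r}")
        return "bash"
    if declared in EXPLICIT_TYPES:
        return declared
    raise ValueError(f"cannot determine step type for {step.get('name')!r}")
```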

bash

Runs a shell command. Provide command: at the top level of the step (no args block). type: bash is implied and may be omitted.

- name: post_analysis
  command: python scripts/analyze.py --input /results
  environment: analysis_env
  resources:
    memory_gb: 20
    runtime: "02:00:00"
    cores: 2

simulation

Runs a parallel psimulate simulation as a workflow step.

| args key | Required | Description |
|---|---|---|
| model_specification | yes | Path to the model specification YAML. |
| branch_configuration | yes | Path to the branch configuration YAML. |
| artifact_path | no | Path to the data artifact. Overrides any artifact path in the model specification or branch configuration. |
| backup_freq | no | Backup frequency in seconds. Defaults to 1800 (30 minutes). |
| sim_verbosity | no | Per-simulation logging verbosity (0, 1, or 2). |

- name: model_sims
  type: simulation
  resources:
    memory_gb: 3
    runtime: "24:00:00"
  args:
    model_specification: /path/to/model.yaml
    branch_configuration: /path/to/branches.yaml
    artifact_path: /path/to/artifact.hdf
    backup_freq: 1800
    sim_verbosity: 1

pytest

Runs a pytest suite. Provide at least one of path or k. path may be a single string or a list of strings.

| args key | Required | Description |
|---|---|---|
| path | one of path/k | Test path(s) to run. A string or a list of strings. |
| k | one of path/k | -k expression selecting tests by name. |
| runslow | no | Pass --runslow. Default false. |

- name: pre_tests
  type: pytest
  resources:
    memory_gb: 8
    runtime: "01:00:00"
    cores: 4
  args:
    path:
      - tests/unit
      - tests/integration
    runslow: true
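
To make the mapping from args to a pytest invocation concrete, here is a minimal sketch of how such a command line could be assembled. How dagger actually builds the command is not documented here, so build_pytest_command is a hypothetical helper and the argument ordering is an assumption.

```python
import shlex


def build_pytest_command(args: dict) -> str:
    """Turn a pytest step's args mapping into a shell command (illustrative)."""
    path = args.get("path")
    k = args.get("k")
    if not path and not k:
        raise ValueError("provide at least one of path or k")

    # path may be a single string or a list of strings.
    paths = [path] if isinstance(path, str) else list(path or [])
    parts = ["pytest", *paths]
    if k:
        parts += ["-k", k]
    if args.get("runslow", False):
        parts.append("--runslow")
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(parts)
```

Under these assumptions, the pre_tests step above would correspond to pytest tests/unit tests/integration --runslow.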

python

Runs a Python script.

| args key | Required | Description |
|---|---|---|
| path | yes | Path to the .py script to run. |
| positional_args | no | List of positional arguments passed to the script. |
| keyword_args | no | Mapping of --key value arguments passed to the script. |

- name: postprocess
  type: python
  resources:
    memory_gb: 8
    runtime: "00:30:00"
  args:
    path: scripts/postprocess.py
    positional_args:
      - foo
      - bar
    keyword_args:
      input_dir: /mnt/results/model_29
      verbose: true
      num_workers: 4
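
On the receiving side, the script has to accept whatever the step passes. A minimal sketch of a matching argument parser follows, assuming the step above is rendered as foo bar --input_dir /mnt/results/model_29 --verbose true --num_workers 4. The exact rendering (in particular how the boolean is spelled) is an assumption, so parse_bool accepts either case. The names build_parser, parse_bool, and labels are hypothetical.

```python
import argparse


def parse_bool(text: str) -> bool:
    """Interpret 'true'/'false' (any case) as a boolean."""
    lowered = text.lower()
    if lowered not in ("true", "false"):
        raise argparse.ArgumentTypeError(f"expected true or false, got {text!r}")
    return lowered == "true"


def build_parser() -> argparse.ArgumentParser:
    """Parser for a script receiving positional_args and keyword_args."""
    parser = argparse.ArgumentParser(description="Post-process simulation results.")
    parser.add_argument("labels", nargs="*", help="values from positional_args, in order")
    parser.add_argument("--input_dir", required=True)
    parser.add_argument("--verbose", type=parse_bool, default=False)
    parser.add_argument("--num_workers", type=int, default=1)
    return parser


# Parse the assumed rendering of the step's arguments.
args = build_parser().parse_args(
    ["foo", "bar", "--input_dir", "/mnt/results/model_29",
     "--verbose", "true", "--num_workers", "4"]
)
print(args.labels, args.input_dir, args.verbose, args.num_workers)
```

Keeping the option names identical to the keyword_args keys (underscores included) avoids a translation layer between the workflow file and the script.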

notebook

Executes a parameterized Jupyter notebook.

| args key | Required | Description |
|---|---|---|
| path | yes | Path to the input .ipynb. |
| output_path | yes | Path to write the executed .ipynb. |
| parameters | no | Mapping of parameters injected into the notebook. |
| cwd | no | Working directory for execution. Defaults to the parent of path. |

- name: post_notebook
  type: notebook
  resources:
    memory_gb: 20
    runtime: "02:00:00"
  args:
    path: notebooks/results.ipynb
    output_path: /mnt/results/run_29/executed/results.ipynb
    parameters:
      model_dir: /mnt/results/run_29
      year: 2020

Running and restarting

# Launch a fresh workflow from a config file.
dagger run --config workflow.yaml

# Resume a stopped workflow from its output directory.
dagger restart /path/to/output_directory

dagger run accepts a CLI override for every workflow-level field except steps (--name/-n, --project/-P, --queue/-q, --output-directory/-o, --default-environment/-e, --max-attempts/-m). dagger restart takes the output directory as a positional argument and accepts --project/-P, --queue/-q, and --max-attempts/-m overrides.

Run dagger run --help or dagger restart --help for the full option list (including the Slack notification flags --slack-channel / --slack-tag / --no-slack).

How dagger behaves

Steps run sequentially

Steps execute in the order listed, with a full barrier between them: every task in step N must finish before any task in step N+1 starts. Parallelism happens within a step – for example, a simulation step fans out into many parallel draw/seed tasks – but sibling steps never overlap.

Note

This is the key difference from psimulate, which fans a single simulation out across the cluster. If you list two simulation steps, they run one after the other, not concurrently.
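
The barrier semantics can be illustrated with a small stand-alone sketch that uses a thread pool in place of the cluster. It is not how dagger or Jobmon schedules work; run_workflow and the (name, tasks) step representation are hypothetical and exist only to show that tasks overlap within a step while steps never overlap.

```python
from concurrent.futures import ThreadPoolExecutor


def run_workflow(steps: list[tuple[str, list]]) -> list[str]:
    """Run steps in order; tasks within a step run concurrently.

    Each step is (name, [zero-argument callables]).
    Returns the step names in the order they completed.
    """
    finished = []
    for name, tasks in steps:
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(task) for task in tasks]
            # Waiting on every future is the barrier: the next step
            # cannot start until all tasks of this one have finished.
            for future in futures:
                future.result()  # re-raises a task failure, if any
        finished.append(name)
    return finished
```

In this sketch a failing task stops the loop before the next step starts, which mirrors the idea that step N+1 depends on all of step N.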

Output directory layout

There is a single workflow-level output_directory. Every run writes configuration.yaml and .workflow_args there (used by dagger restart). simulation steps additionally create a <model_name>/<timestamp>/ subdirectory beneath it, where model_name is derived from the artifact (or model specification) – for per-location artifacts this is effectively the location name. The timestamp is recorded in a .build_timestamp marker so that all simulation steps in the run share it; that marker is only written when the workflow contains a simulation step. Other step types write their worker logs and outputs directly under output_directory.

output_directory/
├── configuration.yaml          # the resolved workflow config (for restart)
├── .workflow_args              # persisted Jobmon workflow id (for restart)
├── .build_timestamp            # shared run timestamp (simulation steps only)
└── <model_name>/               # simulation steps only
    └── <timestamp>/            # one simulation step's results
        ├── results/
        ├── sim_backups/
        ├── metadata/
        └── logs/
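
Given this layout, a quick pre-flight check can tell whether a directory already holds a previous run. The sketch below only tests for the marker files named above; describe_output_directory is a hypothetical helper, not a dagger command, and it makes no attempt to validate the files' contents.

```python
from pathlib import Path

# Files that dagger restart reads, per the layout above.
RESTART_FILES = ("configuration.yaml", ".workflow_args")


def describe_output_directory(output_directory: str) -> dict:
    """Report which dagger marker files exist under output_directory."""
    root = Path(output_directory)
    return {
        # Both files must be present for a restart to have something to load.
        "restartable": all((root / name).is_file() for name in RESTART_FILES),
        # The marker is written only when the workflow has a simulation step.
        "has_simulation_step": (root / ".build_timestamp").is_file(),
    }
```

A directory for which restartable is true is one where a fresh dagger run would overwrite the previous run's restart state.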

Restart resumes the whole workflow

dagger restart <output_directory> reloads configuration.yaml and the persisted .workflow_args written by the original run, then resumes the entire Jobmon workflow, skipping tasks that already completed. Tasks that failed or never ran are retried.

There is no per-step or per-task restart: restart always operates on the whole workflow. A task that already succeeded is skipped, so restart cannot be used to deliberately re-run a step that finished successfully.
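
The selection rule on restart can be stated in a few lines. The status strings below are placeholders chosen for this sketch, not Jobmon's actual status codes, and tasks_to_resume is a hypothetical name.

```python
def tasks_to_resume(statuses: dict[str, str]) -> list[str]:
    """Return the tasks a restart would run again: everything not yet done.

    Assumption: statuses maps task name to one of the placeholder values
    "done", "failed", or "pending".
    """
    return [task for task, status in statuses.items() if status != "done"]
```

Because the filter keys only on completion, there is no input that makes a finished task run again, which is why a successful step cannot be re-run through restart.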

Warning

Running a second dagger run into a populated directory overwrites it. configuration.yaml and .workflow_args are rewritten on every run, so a second run leaves the first one no longer restartable. And if the workflow contains a simulation step, that second run reuses the persisted .build_timestamp rather than creating a new timestamped run – so it writes into the first run’s <model_name>/<timestamp>/ directory and overwrites those results in place.

To guard against this, dagger run detects a previous run in the target output_directory (via the persisted .workflow_args) and prompts for confirmation before continuing; answering n aborts without changing anything. Still, prefer a fresh output_directory for each new workflow, and use dagger restart (not a second dagger run) to resume an interrupted one.