Payu

A climate model workflow manager

Marshall Ward and Aidan Heerdegen

25th May 2021

What is Payu?

Payu is a climate model workflow manager

It is open source, the code is available on GitHub

https://github.com/payu-org/payu

and there is full documentation

https://payu.readthedocs.io/en/latest/

What is a climate model workflow manager?

That means it runs your model for you. In short:

  • Setup model run directory (work)
  • Run the model
  • Move outputs/restarts to archive directory
  • Clean up the run directory
  • Run again (if instructed to do so)

Features

  • Simple YAML based configuration
  • Full automatic version control of experimental configuration using git
  • Hash based versioning of all model inputs and executables
  • Supports many models: MOM5, MOM6, ACCESS-OM, ACCESS-ESM, MITgcm, CICE4, CICE5, Qgcm ...
  • Driver based architecture allows support for other models to be added

Etymology

  • Python on vAYU

    Vayu was Gadi's two times antecedent and though it has passed away...

    ...Payu lives on!

Motivation

Long ago, we managed many, many jobscripts.

Many kinds of scripts:

bash, tcsh, ksh, ...

Many different models:

MOM, MITgcm, Q-GCM, ACCESS, ...

Juggling and sharing scripts was becoming a problem. Many of steps were duplicated between different scripts for the various models. Payu is an attempt to generalise the task of running a model.

image

"Why not just do everything in Python?"

Prehistoric Payu

from payu import gold

def main():
    expt = gold(forcing='andrew')

    expt.setup()
    expt.run()
    expt.archive()
    if expt.counter < expt.max_counter:
        expt.resubmit()

if __name__ == '__main__':
    main()

Using Payu

Load the CMS conda environment containing payu:

module use /g/data3/hh5/public/modules
module load conda

Invoking payu

payu has multiple subcommands. Invoking with the -h option will print helpful usage message

$ payu -h
usage: payu [-h] [--version] {archive,build,collate,ghsetup,init,list,profile,push,run,setup,sweep} ...

positional arguments:
{archive,build,collate,ghsetup,init,list,profile,push,run,setup,sweep}

optional arguments:
-h, --help            show this help message and exit
--version             show program's version number and exit

Subcommands

init Initialise laboratory directories
setup Initialise work directory
sweep Remove ephemeral work directory and copy logs to archive
run Submit model to queue to run
collate Join tiled outputs (and potentially restarts). Not all models

Less used

list List all supported models
archive Clean up work directory and copy files to archive
ghsetup Setup GitHub repository
push Push local repo to GitHub

Never typically used

profile Profile model (not typically used)
build Build model executable (not currently supported)

Pass -h to subcommands for subcommand specific help and options

$ payu run -h
usage: payu run [-h] [--model MODEL_TYPE] [--config CONFIG_PATH] [--initial INIT_RUN] [--nruns N_RUNS] [--laboratory LAB_PATH] [--reproduce] [--force]

Run the model experiment

optional arguments:
-h, --help            show this help message and exit
--model MODEL_TYPE, -m MODEL_TYPE
                        Model type
--config CONFIG_PATH, -c CONFIG_PATH
                        Configuration file path
--initial INIT_RUN, -i INIT_RUN
                        Starting run counter
--nruns N_RUNS, -n N_RUNS
                        Number of successive experiments ro run
--laboratory LAB_PATH, --lab LAB_PATH, -l LAB_PATH
                        The laboratory, this will over-ride the value given in config.yaml
--reproduce, --repro, -r
                        Only run if manifests are correct
--force, -f           Force run to proceed, overwriting existing directories

Clone Experiment

  • The payu configuration file, config.yaml, the model configuration files and manifests are tracked directly by git
  • When an experiment is cloned only the files tracked by git are copied
  • Other files can be added and changes to those files will be automatically added to the experiment repo

Clone an existing experiment (usually in a subdirectory in $HOME):

cd $HOME
mkdir -p payu/mom
cd payu/mom
git clone https://github.com/payu-org/bowl1.git
cd bowl1

This is the "control directory" for bowl1

Your experiment

  • The newly cloned experiment consists of the config.yaml, the config files for this model (MOM) and, optionally, file manifests
bowl1/
   ├── config.yaml
   ├── data_table
   ├── diag_table
   ├── field_table
   ├── input.nml
   └── manifests
      ├── exe.yaml
      ├── input.yaml
      └── restart.yaml

Experiment Configuration

# PBS
queue: express
ncpus: 4
walltime: 0:10:00
mem: 1GB
jobname: bowl1

# Model
model: mom
input: /g/data/hh5/tmp/mom-test/input/bowl1
exe: /g/data/hh5/tmp/mom-test/bin/mom51_solo_default

# Config
collate: False

Run the experiment

  • This job is pre-configured, so can just run it!
cd bowl1
payu run
  • Model will run in work/ (an ephemeral directory created in the laboratory just for that model run)
  • Output moved to archive/ when run completes without error

Inspecting the output

mom.out Model output
mom.err Model error
bowl1.o${jobid} PBS (payu) output
bowl1.e${jobid} PBS (payu) error
archive/output000 Model output files
archive/restart000 Restart (pickup) files
manifests Manifest files for tracking executables and inputs

Cleaning up

To clear work/ and save PBS logs:

payu sweep

Or to completely delete the experiment:

payu sweep --hard

This wipes output, restarts, and logs! (archive/$expt)

Anatomy of an Experiment

Control vs Laboratory

Control path: ${HOME}/mom/bowl1

User-configured (text) input

Laboratory: /scratch/$PROJECT/$USER/$MODEL/

Executables, data input, output, etc.

You "control" the laboratory externally

Laboratory overview

archive Experiment output and restarts
bin Model executables
codebase Model Source code repository
input Static input files
work Ongoing (or failed) experiments

"payu init -m mom" will create these directories

Configuring your experiment

Config Description Default
model Model type (required!)
name Experiment name Expt directory
exe Executable Set by driver
input Model inputs -

Paths can be absolute or relative to the lab path

Scheduler configuration

Config Description Default
queue PBS Queue normal
project SU Account $PROJECT
jobname Queue job name name
walltime Time request (From PBS)
ncpus CPU request 1
mem RAM request Max node mem

qsub_flags for everything else

CPU requests

Normally ncpus will increase itself to match the node, but more control is available:

Config Description Default
platform
→nodesize Node CPUs 48
→nodemem Node RAM 192 (GB)

This is currently required to use the Broadwell nodes:

platform:
   nodesize: 28
   nodemem: 128

The work directory

Run the following to inspect (and test) the setup of your run:

payu setup

This will create your work directory in the laboratory and a symbolic link to it in your control directory.

Inside the work directory

Inspect the symbolic link to work and its contents:

work
├── config.yaml
├── data_table
├── diag_table
├── field_table
├── INPUT
│   ├── gotmturb.inp -> /g/data/hh5/tmp/mom-test/input/bowl1/gotmturb.inp
│   ├── grid_spec.nc -> /g/data/hh5/tmp/mom-test/input/bowl1/grid_spec.nc
│   ├── ocean_barotropic.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_barotropic.res.nc
│   ├── ocean_bih_friction.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_bih_friction.res.nc
│   ├── ocean_density.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_density.res.nc
│   ├── ocean_pot_temp.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_pot_temp.res.nc
│   ├── ocean_sbc.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_sbc.res.nc
│   ├── ocean_solo.res -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_solo.res
│   ├── ocean_temp_salt.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_temp_salt.res.nc
│   ├── ocean_thickness.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_thickness.res.nc
│   ├── ocean_tracer.res -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_tracer.res
│   ├── ocean_velocity_advection.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_velocity_advection.res.nc
│   └── ocean_velocity.res.nc -> /scratch/w97/aph502/mom/archive/bowl1/restart000/ocean_velocity.res.nc
├── input.nml
├── log
├── manifests
│   ├── exe.yaml
│   ├── input.yaml
│   └── restart.yaml
├── mom51_solo_default -> /g/data/hh5/tmp/mom-test/bin/mom51_solo_default
└── RESTART

Your config files are copied, and sometimes modified. Your input data is symlinked.

A simple configuration

The configuration file (config.yaml) uses the YAML format

model: mom6
name: om4_gm_test

queue: normal
jobname: mom6_om4
walltime: 20:00
ncpus: 960
mem: 1500GB

exe: mom6_intel17
input:
    - om4_grid
    - om4_atm

Most variables have "sensible" defaults

A more complex configuration

# PBS configuration
queue: normal
project: fp0
walltime: 02:30:00
jobname: om2_jra55
ncpus: 1153
mem: 2000GB

#platform:
#   nodesize: 28

laboratory: /scratch/fp0/mxw900/cosima
repeat: True

collate:
    walltime: 4:00:00
    mem: 30GB
    ncpus: 4
    queue: express
    flags: -n4 -z -m -r

# Model configuration
model: access
submodels:
    - name: coupler
      model: oasis
      input: oasis_025
      ncpus: 0

    - name: atmosphere
      model: matm
      exe: matm
      #input: jra55-0.8_025
      input: /scratch/v45/mxw900/cosima/nc64
      ncpus: 1

    - name: ocean
      model: mom
      exe: mom
      input:
          - mom
          # - iaf-sw21d
      ncpus: 960

    - name: ice
      model: cice
      exe: cice_nohalo
      input: cice
      ncpus: 192

calendar:
    runtime:
        years: 0
        months: 0
        days: 30
    start:
        year: 1
        month: 1
        days: 1

Feature overview

Multiple runs

  • To do multiple runs in sequence:
payu run -n 20
  • We save every output, and every 5th restart. To change the rate restart files are saved:

    restart_freq: 1
  • To run the model multiple times for each submission to the queue:

    runspersub: 5

will run the model 5 times during each queued job. Can help reduce overhead spent waiting in queue

Path control

Default paths can be set explicitly

shortpath Root ("scratch") path
laboratory Laboratory path
control Control path

e.g. to run under multiple project codes but keep all files in the same laboratory location specify shortpath

MPI support

MPI support is very explicit at the moment:

mpi:
    module: openmpi/2.1.1-debug
    modulepath: /home/157/mxw157/modules
    flags:
       - -mca orte_output_filename log
       - -mca pml yalla
    runcmd: map --profile mpiexec

It is not recommended to change these without understanding and good reason

Userscript support

  • Subcommands and scripts can be injected after key steps
userscripts:
   init: 'echo "some_data" > input.nml'
   setup: patch_inputs.py
   run: 'qsub postprocess.sh'

postscript: sync_output_to_gdata.sh
  • These will run after the prescribed section.
  • postscript runs when a run finishes, if the output is collated it runs after that completes.

Supported models

To see the supported models:

payu list

But expect some atrophy...

File Tracking

  • Track input files used for each model run
  • Completely automatic. User intervention not required
  • Reproducibly re-run previous experiment
  • Share experiments more easily as input files all specified
  • Flexibility with specifying path to input files
  • Identify all runs using specified file (possible future feature)

What is tracked?

Executables manifests/exe.yaml
Inputs manifests/inputs.yaml
Restarts manifests/restarts.yaml

How is it tracked?

  • Uses yamanifest
  • Creates a manifest file which uses YAML format
  • Each file (symlink) in work is a dictionary key in manifest file
  • Manifests files are tracked by git, so the unique hash for every tracked file is associated with each run using version control

Example

  • fullpath is the actual location of the file
  • The hashes uniquely identify file
format: yamanifest
version: 1.0
---
work/mom51_solo_default:
   fullpath: /g/data/hh5/tmp/mom-test/bin/mom51_solo_default
   hashes:
      binhash: 423d9cf92c887fe3145c727c4fbda4d0
      md5: 989580773079ede1109b292c2bb2ad17

Hierachy of hashes

  • yamanifest supports multiple hashes => hierarchy of hashes
  • Unique hashes (md5, sha128) take too long on large files
  • Fast hashing to check for file changes
  • Use unique hash check when necessary
  • Running payu setup for experiments with large number and size of input files can be useful: precalculate expensive hashes saves time when job runs on queue

Forking and sharing experiments

Creating a new experiment

Let's have some FUN and increase the timestep:

git clone bowl1 bowl2
cd bowl2

We are in a hurry, so let's make dt_ocean in input.nml very large:

&ocean_model_nml
    dt_ocean = 86400

Recording your progress

See changes to the run by utilising git:

git log

and responsible people always document their changes:

git commit -am "Testing a large timestep"

But if you're lazy then payu will commit upon completion.

Let's run it!

FAILURE

image

Your run crashed!!!

Inspecting failed jobs

Failed jobs retain output and error files (mom.out, mom.err in this case), and a work/ directory

From mom.err:

FATAL from PE    2: ==>Error: time step instability detected for baroclinic gravity waves in ocean_model_mod

forrtl: error (78): process killed (SIGTERM)

Errors are saved to archive/error_logs with PBS job IDs

(Note: Error logs can get big fast!)

GitHub integration

You can sync your experiment on GitHub:

payu ghsetup
payu push

Visit your experiment in GitHub!

NOTE: This will create SSH keys in your $HOME/.ssh directory.

Other GitHub features

You should set a description for your run:

description: A very fun experiment

You can also save jobs to an organization:

runlog:
   organization: mxw900-raijin

There are a few other features here, and someday they may be documented!

Currently "payu push" is manual, but we could make it automatic.

Coupled Models

Coupled configuration

Yes, Payu supports coupled models!

https://github.com/COSIMA/01deg_jra55_iaf

model: access-om2
input: /g/data/ik11/inputs/access-om2/input_08022019/common_01deg_jra55
submodels:
    - name: atmosphere
      model: yatm
      exe: /g/data/ik11/inputs/access-om2/bin/yatm_4198e150.exe
      input:
            - /g/data/ik11/inputs/access-om2/input_08022019/yatm_01deg
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/rsds/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/rlds/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/prra/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hr/prsn/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/psl/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/land/day/friver/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/tas/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/huss/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/uas/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/atmos/3hrPt/vas/gr/v20190429
            - /g/data/qv56/replicas/input4MIPs/CMIP6/OMIP/MRI/MRI-JRA55-do-1-4-0/landIce/day/licalvf/gr/v20190429
      ncpus: 1

    - name: ocean
      model: mom
      exe: /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_e837d05d_libaccessom2_4198e150.x
      input: /g/data/ik11/inputs/access-om2/input_08022019/mom_01deg
      ncpus: 4358

    - name: ice
      model: cice5
      exe: /g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_597e4561_libaccessom2_4198e150.exe
      input: /g/data/ik11/inputs/access-om2/input_20200422/cice_01deg
      ncpus: 799

Layout of a coupled model

Inputs, work and output are separated by the name:

libom2_1deg
|-- accessom2.nml
|-- archive->/scratch/fp0/mxw900/access-om2/archive/libom2_1deg
|-- atmosphere
|   |-- atm.nml
|   |-- checksums.txt
|   `-- forcing.json
|-- config.yaml
|-- ice
|   |-- cice_in.nml
|   |-- input_ice.nml
|   |-- input_ice_gfdl.nml
|   `-- input_ice_monin.nml
|-- namcouple
|-- ocean
|   |-- checksums.txt
|   |-- data_table
|   |-- diag_table
|   |-- field_table
|   `-- input.nml
|-- sync_output_to_gdata.sh
`-- sync_restarts_to_gdata.sh

Common files (accessom2.nml, namcouple) are in the top directory.

Work directories have a similar structure.

Troubleshooting

Model crashes

If you see this error in your PBS log:

payu: Model exited with error code 134; aborting."

then it means the model crashed and payu has halted execution.

Check your error logs to figure out the problem.

Various Python errors

Sometimes a missing file or misconfigured experiment will cause an error in Python:

Traceback (most recent call last):
  File "/home/157/mxw157/python/payu/bin/payu", line 8, in <module>
    cli.parse()
  File "/home/157/mxw157/python/payu/payu/cli.py", line 61, in parse
    run_cmd(**args)
  File "/home/157/mxw157/python/payu/payu/subcommands/setup_cmd.py", line 18, in runcmd
    expt.setup(force_archive=force_archive)
  File "/home/157/mxw157/python/payu/payu/experiment.py", line 353, in setup
    model.setup()
  File "/home/157/mxw157/python/payu/payu/models/mom.py", line 69, in setup
    super(Mom, self).setup()
  File "/home/157/mxw157/python/payu/payu/models/model.py", line 151, in setup
    shutil.copy(f_path, self.work_path)
  File "/apps/python/2.7.6/lib/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/apps/python/2.7.6/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: '/home/157/mxw157/mom/course/bowl2_course/input.nml'

Run payu setup and test that the experiment is set up properly.

Cloning issues

* Some config.yaml contain relative paths for inputs and executables, which can make sharing difficult.

* Either copy (or symlink) their files into your local lab, or modify your config.yaml to use absolute paths.

  • Manifests can largely overcome this issue, but there are subtleties in how to implement this

Summary

What does Payu do?

  • Common interface for many models
  • Manages your inputs, executables, outputs, and restarts
  • Tracks changes to your experiment
  • Provides hooks to configure and control your job
  • Enables sharing of configurations and inputs

Special Thanks

paul nic aidan

Paul Spence suffered through the earliest version of Payu, and helped mold it into a usable application.

Nic Hannah and Aidan Heerdegen have been top-level contributors and maintainers for a very long time.

Thank you!!

Future works

Work only stops when people stop asking for features.

Questions?

What would you like to see?

Even better, contribute!