Orchestration for HPC systems

This guide shows how to handle co-simulation orchestration on high-performance computing (HPC) systems. We will walk through using a specific tool, Merlin, that has been tested with HELICS co-simulations. Merlin is not the only tool with this capability, and it is not required for co-simulation orchestration; one advantage Merlin has is its ability to interface with HPC systems that use SLURM or Flux as their resource managers.

Definition of “Orchestration” in HELICS

We will define the term “orchestration” within HELICS as workflow definition and deployment in an HPC environment. This allows users to define a co-simulation workflow they would like to execute and to deploy that co-simulation either on their own machine or in an HPC environment.

Orchestration with Merlin

First, you will need to build and install Merlin. This guide will walk through a suggested co-simulation spec using Merlin to launch a HELICS co-simulation. This is not a comprehensive guide on how to use Merlin, but rather a guide to using Merlin for HELICS co-simulation orchestration. For a full guide on how to use Merlin, please refer to the Merlin tutorial.

Merlin is a distributed task queuing system designed to allow complex HPC workflows to scale to large numbers of simulations. It aims to make building, running, and processing large-scale HPC workflows manageable. It is not limited to HPC; it can also be set up on a single machine.

Merlin translates a command-line focused workflow into discrete tasks that it queues up and launches. This workflow is described in a specification, or spec for short. The spec is separated into multiple sections that describe how to execute the workflow, and the workflow itself is represented as a directed acyclic graph (DAG) that defines the order in which its steps execute.

Once the Merlin spec has been created, the main execution logic is contained in the study step. This step describes the command-line invocations needed to execute the applications or scripts in your workflow. The study step is made up of multiple run steps, which are represented as nodes in the DAG.

For a more in-depth explanation of how Merlin works, take a look at their documentation here.

Why Merlin

The biggest feature Merlin gives HELICS users is the ability to deploy co-simulations in an HPC environment. Merlin can interface with both the Flux and SLURM workload managers installed on HPC machines. It handles the request for resource allocation and takes care of distributing jobs and tasks amongst the nodes. Users do not need to know how to use SLURM or Flux because Merlin makes all resource allocation calls to the workload manager; the user only needs to provide the number of nodes required for their study.

Another benefit of using Merlin for HELICS co-simulations is its flexibility in managing complex workflows. pyhelics includes functionality to launch a HELICS co-simulation from a JSON file that defines its federates, but as of this writing it cannot analyze the results and launch subsequent co-simulations. For that kind of scenario, a user could use Merlin to set up a specification that includes an analysis step in the study block. The analysis step would determine whether another co-simulation is needed and what its input should be, and a subsequent step would launch the next co-simulation with the input generated by the analysis step.
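
As a rough sketch of that pattern, the analysis step's cmd could invoke a small Python script (a hypothetical analyze_results.py, not part of the pi-exchange example) that inspects the previous run's output and, if another run is warranted, writes a runner JSON in the same format used later in this guide:

# analyze_results.py (hypothetical sketch): decide whether another co-simulation
# is needed and, if so, emit a runner JSON for the next run.
import json
import sys

results_file = sys.argv[1]   # output produced by the previous co-simulation (assumed path)
next_config = sys.argv[2]    # where to write the next runner JSON

with open(results_file) as f:
    error = float(f.read().strip())  # assumed: previous run wrote a single error metric

if error > 1e-3:
    # Build a runner config in the same format as the generated pi-exchange JSON files.
    config = {
        "name": "pisender_refine",
        "federates": [
            {
                "directory": ".",
                "exec": "python3 -u pisender.py refine",
                "host": "localhost",
                "name": "pisender_refine",
            }
        ],
    }
    with open(next_config, "w") as f:
        json.dump(config, f, indent=2)
    print("next run required; wrote", next_config)
else:
    print("converged; no further co-simulation needed")

A later step in the study block could then depend on this analysis step and call helics run --path on the file it produces.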

Merlin Specification

A Merlin specification has multiple parts that control how a co-simulation runs. Below we describe how each part can be used in a HELICS co-simulation workflow. For the sake of simplicity we are using the pi-exchange Python example, which can be found here. The goal will be to have Merlin launch multiple pi-senders and pi-receivers.
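
For orientation, the sender side of the pi-exchange example looks roughly like the following (an abridged sketch of the pyhelics pattern, not the exact script from the linked example); the receiver mirrors it with a subscription in place of the publication.

# pisender.py (abridged sketch): a value federate that publishes an estimate of pi
import helics as h

fedinfo = h.helicsCreateFederateInfo()
h.helicsFederateInfoSetCoreTypeFromString(fedinfo, "zmq")
h.helicsFederateInfoSetCoreInitString(fedinfo, "--federates=1")

vfed = h.helicsCreateValueFederate("pisender0", fedinfo)
pub = h.helicsFederateRegisterGlobalTypePublication(vfed, "testA", "double", "")

h.helicsFederateEnterExecutingMode(vfed)
for t in range(1, 6):
    granted_time = h.helicsFederateRequestTime(vfed, t)  # advance co-simulation time
    h.helicsPublicationPublishDouble(pub, 3.141592653589793)

h.helicsFederateFinalize(vfed)
h.helicsFederateFree(vfed)
h.helicsCloseLibrary()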

Merlin workflow description and environment

A Merlin spec has a description block and an environment (env) block. The description block provides the name and a short description of the study.

description:
  name: Test helics
  description: Juggle helics data

The env block describes the environment in which the study will execute. This is where you can set variables, such as the number of federates you need in your co-simulation. In this example, N_SAMPLES will be used to control how many pi-senders and pi-receivers (total federates) we want in our co-simulation.

env:
  variables:
    OUTPUT_PATH: ./helics_juggle_output
    N_SAMPLES: 8

Merlin Step

The Merlin step is the input data generation step. It describes how to create the initial inputs for the co-simulation so that subsequent steps can use them to start the co-simulation. Below is how we might describe the Merlin step for our pi-exchange study.

merlin:
  samples:
    generate:
      cmd: |
        python3 $(SPECROOT)/make_samples.py $(N_SAMPLES) $(MERLIN_INFO)
        cp $(SPECROOT)/pireceiver.py $(MERLIN_INFO)
        cp $(SPECROOT)/pisender.py $(MERLIN_INFO)
    file: samples.csv
    column_labels: [FED]

NOTE: samples.csv is generated by make_samples.py. Each line in samples.csv is the name of one of the JSON files that is created.

A Python script called make_samples.py, located in the HELICS repository, generates the runner JSON configs that will be used to execute the co-simulations. N_SAMPLES is an environment variable set to 8, so in this example 8 pi-receivers and 8 pi-senders will be created and used in the co-simulations. make_samples.py also writes the name of each generated JSON file to a CSV file called samples.csv. The column_labels tag tells Merlin to label the column in samples.csv as FED, which means FED can be used as a variable in the study step. Below is an example of one of the JSON files that is created.

{
  "federates": [
    {
      "directory": ".",
      "exec": "python3 -u pisender.py 0",
      "host": "localhost",
      "name": "pisender0"
    }
  ],
  "name": "pisender0"
}

This JSON file is then used as the input file for the HELICS runner (helics run), which is executed in the study step in Merlin; we will go over that step next.
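
For reference, the following is a minimal sketch of what make_samples.py might look like (an illustrative approximation, not the actual script from the repository): it writes one runner JSON per federate and records the generated file names in samples.csv.

# make_samples.py (illustrative sketch): write one runner JSON per federate and
# record the generated file names in samples.csv for Merlin to consume.
import json
import os
import sys

n_samples = int(sys.argv[1])   # $(N_SAMPLES) from the env block
out_dir = sys.argv[2]          # $(MERLIN_INFO) workspace directory

rows = []
for i in range(n_samples):
    for role in ("pisender", "pireceiver"):
        name = f"{role}{i}"
        config = {
            "name": name,
            "federates": [
                {
                    "directory": ".",
                    "exec": f"python3 -u {role}.py {i}",
                    "host": "localhost",
                    "name": name,
                }
            ],
        }
        path = os.path.join(out_dir, f"{name}.json")
        with open(path, "w") as f:
            json.dump(config, f, indent=2)
        rows.append(path)

# samples.csv: one line per generated runner config (a single column, labeled FED in the spec)
with open("samples.csv", "w") as f:
    f.write("\n".join(rows) + "\n")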

Study Step

The study step is where Merlin executes all the steps specified in the block. Each step has a name and a run segment; the run segment tells Merlin what commands need to be executed.

- name: start_federates  # name of the step
  description: say Hello
  run:
    cmd: |
      helics run --path=$(FED)  # execute the HELICS runner once for each sample (row) in samples.csv
      echo "DONE"

In this snippet we ask Merlin to run the HELICS runner on the JSON files created in the Merlin step. Because FED takes its values from samples.csv, the command is executed once for each sample, i.e. once for each generated JSON file.

Full Spec

Below is the full Merlin spec that creates 8 pi-receivers and 8 pi-senders and executes them as a Merlin workflow.

description:
  name: Test helics
  description: Juggle helics data

env:
  variables:
    OUTPUT_PATH: ./helics_juggle_output
    N_SAMPLES: 8

merlin:
  samples:
    generate:
      cmd: |
        python3 $(SPECROOT)/make_samples.py $(N_SAMPLES) $(MERLIN_INFO)
        cp $(SPECROOT)/pireceiver.py $(MERLIN_INFO)
        cp $(SPECROOT)/pisender.py $(MERLIN_INFO)
    file: samples.csv
    column_labels: [FED]

study:
  - name: start_federates
    description: say Hello
    run:
      cmd: |
        spack load helics
        helics run --path=$(FED)
        echo "DONE"
  - name: cleanup
    description: Clean up
    run:
      cmd: rm $(SPECROOT)/samples.csv
      depends: [start_federates_*]

DAG of the spec

Finally, we can look at the DAG of the spec to visualize the steps in the Study.

Orchestration Example

An example of orchestrating multiple simulation runs (e.g. Monte Carlo co-simulation) is given in the Advanced Examples Section.