Clusters

This package provides basic support for running an experiment on an HPC cluster, using ClusterManagers.jl under the hood.

At the moment, we only support running on a SLURM cluster, but any PRs to support other clusters are welcome.

SLURM

Normally when running on SLURM, one creates a bash script to tell the scheduler about the resource requirements for a job. The following is an example:

#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:30:00
#SBATCH -o hpc/output/test_job_%j.out

The function Experimenter.Cluster.create_slurm_template provides an easy way to create one of these bash scripts with everything you need to run.

Example

Let us take the following end-to-end example. Say that we have an experiment script at my_experiment.jl (contents below), which now initialises the cluster:

using Experimenter

config = Dict{Symbol,Any}(
    :N => IterableVariable([Int(1e6), Int(2e6), Int(3e6)]),
    :seed => IterableVariable([1234, 4321, 3467, 134234, 121]),
    :sigma => 0.0001)
experiment = Experiment(
    name="Test Experiment",
    include_file="run.jl",
    function_name="run_trial",
    configuration=deepcopy(config)
)

db = open_db("experiments.db")

# Init the cluster
Experimenter.Cluster.init()

@execute experiment db DistributedMode
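As a rough guide to how much work this configuration generates, the sketch below counts the combinations by hand in plain Julia. It assumes that each IterableVariable is expanded into a full grid with the others, which this page does not state explicitly, and it does not call Experimenter at all; the names Ns, seeds and grid are only illustrative.

```julia
# Values copied from the configuration in my_experiment.jl
Ns = [Int(1e6), Int(2e6), Int(3e6)]
seeds = [1234, 4321, 3467, 134234, 121]

# Assumption: every N is paired with every seed; sigma stays constant.
grid = vec([Dict{Symbol,Any}(:N => N, :seed => seed, :sigma => 0.0001)
            for N in Ns, seed in seeds])

println("Number of configurations: ", length(grid))
```

Under that assumption there are 3 × 5 = 15 configurations, which is worth keeping in mind when choosing the number of tasks and the time limit in the SLURM script.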

Additionally, we have the file run.jl containing:

using Random
using Distributed
function run_trial(config::Dict{Symbol,Any}, trial_id)
    results = Dict{Symbol, Any}()
    sigma = config[:sigma]
    N = config[:N]
    seed = config[:seed]
    rng = Random.Xoshiro(seed)
    # Perform some calculation
    results[:distance] = sum(rand(rng) * sigma for _ in 1:N)
    results[:num_threads] = Threads.nthreads()
    results[:hostname] = gethostname()
    results[:pid] = Distributed.myid()
    # Must return a Dict{Symbol, Any}, with the data we want to save
    return results
end
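Before submitting anything to the scheduler, it can help to call run_trial once in an ordinary local Julia session. The following is a minimal sketch: the function body is repeated from run.jl so that the snippet is self-contained (in practice one would write include("run.jl") instead), and the small N and the placeholder trial id are illustrative values, not something Experimenter requires.

```julia
using Random
using Distributed

# Copy of run_trial from run.jl, repeated so this snippet runs on its own.
function run_trial(config::Dict{Symbol,Any}, trial_id)
    results = Dict{Symbol, Any}()
    rng = Random.Xoshiro(config[:seed])
    results[:distance] = sum(rand(rng) * config[:sigma] for _ in 1:config[:N])
    results[:num_threads] = Threads.nthreads()
    results[:hostname] = gethostname()
    results[:pid] = Distributed.myid()
    return results
end

# A single, deliberately small configuration (hypothetical values).
config = Dict{Symbol,Any}(:N => 1_000, :seed => 1234, :sigma => 0.0001)

# run_trial does not use the trial id, so any placeholder works here.
results = run_trial(config, "local-test")
println("distance = ", results[:distance], ", worker = ", results[:pid])
```

Run locally without extra workers, the reported worker id is that of the main process. On the cluster it should instead be the id of whichever worker executed the trial.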

We can now create a bash script to run our experiment. We create a template by running the following in the terminal (or by calling the same function from the REPL):

julia --project -e 'using Experimenter; Experimenter.Cluster.create_slurm_template("myrun.sh")'

We then modify the created myrun.sh file to the following:

#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=1024
#SBATCH --time=00:30:00
#SBATCH -o hpc/logs/job_%j.out

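# Julia options such as --threads must come before the script name;
# anything after it is passed to the script as an argument instead.
julia --project --threads=1 my_experiment.jl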

# Optional: Remove the files created by ClusterManagers.jl
rm -fr julia-*.out

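Once written, we submit this script to the cluster. The output directory (hpc/logs here) should already exist, since SLURM does not create missing directories for the log file: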

sbatch myrun.sh

We can then open a Julia REPL (once the job has finished) to see the results:

using Experimenter
db = open_db("experiments.db")
trials = get_trials_by_name(db, "Test Experiment")

for (i, t) in enumerate(trials)
    hostname = t.results[:hostname]
    id = t.results[:pid]
    println("Trial $i ran on $hostname on worker $id")
end
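To check how the trials were spread across the cluster, the same fields can be tallied per host and worker. The sketch below uses hand-written stand-in records in place of a real database query, so the hostnames and worker ids are hypothetical. With real data, trials would come from get_trials_by_name as above.

```julia
# Stand-in records that mimic the `t.results` access used above.
# Hostnames and worker ids are hypothetical.
trials = [
    (results = Dict{Symbol,Any}(:hostname => "node01", :pid => 2),),
    (results = Dict{Symbol,Any}(:hostname => "node01", :pid => 2),),
    (results = Dict{Symbol,Any}(:hostname => "node01", :pid => 3),),
    (results = Dict{Symbol,Any}(:hostname => "node02", :pid => 4),),
]

# Count how many trials each (hostname, worker id) pair executed.
counts = Dict{Tuple{String,Int},Int}()
for t in trials
    key = (t.results[:hostname], t.results[:pid])
    counts[key] = get(counts, key, 0) + 1
end

for key in sort(collect(keys(counts)))
    println("$(key[1]) worker $(key[2]): $(counts[key]) trial(s)")
end
```

A heavily uneven tally may suggest that some workers failed to start or that individual trials differ a lot in run time.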

Support for running on SLURM is based on this gist available on GitHub. This gist also provides information on how to adjust the SLURM script to allow for one GPU to be allocated to each worker.