SLURM
About

In order to spread the workload of scientific computations across our compute nodes, the resource manager SLURM is used.

From [SLURM website]:

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.

Different entities of Slurm

From [SLURM website]

SLURM manages the cluster in partitions, which are sets of compute nodes. Note that partitions may overlap, e.g. one compute node may be in two or more partitions. A node is a physical computer which provides consumable resources: CPUs and memory. A CPU does not necessarily have to be a physical processor but is more like a virtual CPU to run one single task on. A dual-core with hyper-threading technology, for instance, would show up as a node with 4 CPUs, consisting of two cores with the capability of running two threads on each core. Physical memory is given in MB.
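
The consumable resources of a single node can be inspected directly with scontrol; a minimal sketch, using the node leo from the examples further below (any node name works):

  scontrol show node leo

Among other things, the output lists the CPU count and the physical memory of the node.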

The following partitions exist in the current setup:

 * remeis: default partition, all machines, time limit: 14 days
 * erosita: only available for selected people involved in the project, time limit: infinite
 * power: only the newest machines and servers, higher priority than 'remeis' (e.g. if you submit via power you will get the power machines as soon as possible and do not compete with 'remeis' jobs but only with other jobs submitted to 'power'; see the example below), time limit: 1 day
 * messier: only the messier cluster, also higher priority than 'remeis', time limit: 7 days
 * debug: very high priority partition for software development, time limit: 1 hour
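
Submitting to a partition other than the default simply means naming it at submission time. A minimal sketch, assuming a job script job.slurm as in the examples further below:

  sbatch --partition=power job.slurm

Alternatively, the partition can be fixed in the script header via #SBATCH --partition=power.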


Quick user tutorial

This tutorial will give you a quick overview of the most important commands. The [SLURM website] provides more detailed information.

Get cluster status

In order to get an overview of the cluster, type

  sinfo

This command offers a variety of options for formatting the output. In order to get a detailed output while focusing on the nodes rather than the partitions, type

  sinfo -N -l
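
As an illustration of these formatting options, the -o flag of sinfo takes a format string; the field selection here is only a suggestion:

  sinfo -N -o "%N %P %c %m %T"

This prints, one node per line, the node name, its partition, its CPU count, its memory and its state.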

An overview of the available partitions can be shown with

  scontrol show partition

and the currently queued and running jobs can be displayed using

  squeue
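
Since squeue lists the jobs of all users by default, it is often handy to restrict the output to your own jobs:

  squeue -u $USER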


Executing a job in real-time

In order to allocate, for instance, 1 CPU and 100 MB of memory for real-time work, type

  salloc --mem-per-cpu=100 -n1 bash

Your bash is now connected to the compute nodes. In order to execute a script, use the srun command

  srun my_script.sh

Each srun command you execute now is interpreted as a job step. The currently running job steps of submitted jobs can be displayed using

  squeue -s
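
Once you are done with your real-time work, simply leave the spawned shell; this ends the allocation and frees the resources:

  exit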


You can also start a job by simply using the srun command and specifying your requirements. In the following case, srun will allocate 100 MB of memory and 1 CPU for 1 task, only for the duration of the execution.

  srun --mem-per-cpu=100 -n1 my_script.sh

If resources are available, your job will start immediately.


Submitting a job for later execution

The most convenient way is to submit a job script for later execution. The top part of the script contains scheduling information for SLURM; the more information you provide here, the better.

First of all, a job name is specified, followed by a maximum time. If your job exceeds this time, it will be killed. However, do not overestimate too much, because short jobs might start earlier. Acceptable time formats include "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds". The output file is set to test-job_<jobID>_<arrayIndex>.out and the partition to run the job on is "remeis". The sbatch script itself will not initiate any job but only allocate the resources. The ntasks and mem-per-cpu options advise the SLURM controller that the job steps run within the allocation will launch at most this number of tasks, and to provide sufficient resources.
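
For illustration, three equivalent ways of requesting a five-hour limit, using the "minutes", "hours:minutes:seconds" and "days-hours:minutes" formats respectively (the value is made up):

  #SBATCH --time 300
  #SBATCH --time 05:00:00
  #SBATCH --time 0-05:00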

The srun commands in the job script launch the job steps. The example below thus consists of two job steps. Each of the srun commands may have its own requirements concerning memory and may also spawn fewer tasks than given in the header of the script file. However, the values in the header may never be exceeded!

  #!/bin/bash
  #SBATCH --job-name my_first_job
  #SBATCH --time 05:00
  #SBATCH --output test-job_%A_%a.out
  #SBATCH --error test-job_%A_%a.err
  #SBATCH --partition=remeis
  #SBATCH --ntasks=4
  #SBATCH --mem-per-cpu=100
  srun -l my_script1.sh
  srun -l my_script2.sh

The -l parameter of srun will print the task number in front of each line of stdout/stderr. You can submit this script by saving it in a file, e.g. my_first_job.slurm, and submitting it using

  cepheus:~> sbatch my_first_job.slurm 
  Submitted batch job 144

You can check the estimated starting time of your job using

  squeue --start
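
More details on a single job, e.g. the job 144 submitted above, can be queried with

  scontrol show job 144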

Submitting a job array

In order to submit an array of jobs with the same requirements, you have to modify your script file. The script below is going to spawn 4 jobs, each consisting of one srun command. Note the presence of the new environment variable ${SLURM_ARRAY_TASK_ID}, which might be useful for your work. In this example we start an ISIS script with different input values. You can also simply use different scripts.

  #!/bin/bash                     
  #SBATCH --partition remeis
  #SBATCH --job-name important_job                                                                   
  #SBATCH --ntasks=1
  #SBATCH --time 00:05:00                                                         
  #SBATCH --output /home/dauser/tmp/jobscript_beta.%A_%a.out          
  #SBATCH --error /home/dauser/tmp/jobscript_beta.%A_%a.err          
  #SBATCH --array 0-3
  
  cd /home/user/script/
  
  COMMAND[0]="./sim_script.sl 0.00"                                       
  COMMAND[1]="./sim_script.sl 0.10"                                       
  COMMAND[2]="./sim_script.sl 0.20"                                       
  COMMAND[3]="./sim_script.sl 0.30"                                       
  
  srun /usr/bin/nice -n +19 ${COMMAND[$SLURM_ARRAY_TASK_ID]} 

As above, this code might be saved in a file, for example job.slurm, and executed using

  sbatch job.slurm

If you need a specific machine to run your job on, you can use

  #SBATCH --nodelist=leo,draco

SLURM will then allocate resources on the given nodes. However, if the nodes in 'nodelist' alone cannot fulfill the job requirements, SLURM will also allocate other machines.

If you have a job with high I/O and/or heavy traffic on the network, you can limit the number of jobs running simultaneously (to 2 in this example) by

  #SBATCH --array 0-3%2

If you would like to cancel jobs 1, 2 and 3 from job array 20 use

  scancel 20_[1-3]

If you want to cancel the whole array, scancel works as usual

  scancel 20

Note that there is also the option to modify the requirements of single jobs later using scontrol update job=101_1 ....
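
As a sketch of such an update, raising the time limit of task 1 of the array job 20 used in the scancel examples above (the new limit is made up):

  scontrol update job=20_1 TimeLimit=01:00:00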

If you have jobs which depend on the results of others, or if you want a more detailed description of job arrays, you can find it in the official SLURM manual: [[1]]
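
A minimal sketch of such a dependency, reusing the job ID 144 from the sbatch example above (the script name is made up):

  sbatch --dependency=afterok:144 my_second_job.slurm

Here the second job only starts once job 144 has completed successfully.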

Submitting a job array where each command needs to change into a different directory

In order to allow each command of the job array to change into an individual directory (as opposed to all into the same directory as above), modify the script as follows:

  #!/bin/bash                     
  #SBATCH --partition remeis
  #SBATCH --job-name important_job                                                                   
  #SBATCH --ntasks=1
  #SBATCH --time 00:05:00                                                         
  #SBATCH --output /home/dauser/tmp/jobscript_beta.%A_%a.out          
  #SBATCH --error /home/dauser/tmp/jobscript_beta.%A_%a.err          
  #SBATCH --array 0-3
  
  DIR[0]="/home/user/dir1"
  DIR[1]="/home/user/dir2"
  DIR[2]="/userdata/user/dir3"
  DIR[3]="/userdata/user/dir4"
  
  cd ${DIR[$SLURM_ARRAY_TASK_ID]}
  
  COMMAND[0]="./sim_script.sl 0.00"                                       
  COMMAND[1]="./sim_script.sl 0.10"                                       
  COMMAND[2]="./sim_script.sl 0.20"                                       
  COMMAND[3]="./sim_script.sl 0.30"                                       
  
  srun /usr/bin/nice -n +19 ${COMMAND[$SLURM_ARRAY_TASK_ID]} 

This also works with paths relative to the directory from which the SLURM script was submitted.
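
A minimal sketch with relative paths, assuming subdirectories run0 to run3 exist below the submission directory (the names are made up):

  DIR[0]="run0"
  DIR[1]="run1"
  DIR[2]="run2"
  DIR[3]="run3"
  
  cd ${DIR[$SLURM_ARRAY_TASK_ID]}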

Using the erosita partition (serpens)

If you are allowed to use the eRosita partition, contact a SLURM admin (e.g. [[2]]). Once your username is added to the list of privileged users, you just have to add

  #SBATCH --partition=erosita
  #SBATCH --account=erosita

to your jobfiles.


Other useful commands

 * sstat: real-time status information of your running jobs
 * sattach <jobid.stepid>: attach to the stdI/O of one of your running jobs (see the example below)
 * scancel [OPTIONS...] [job_id[_array_id][.step_id]] [job_id[_array_id][.step_id]...]: cancel the execution of one of your job arrays/jobs/job steps.
 * scontrol: administration tool; you can for example use this to modify the requirements of your jobs, show your jobs with scontrol show jobs, or update the time limit with scontrol update JobId= TimeLimit=2.

 * smap: graphically view information about Slurm jobs, partitions, and configuration parameters.
 * sview: graphical user interface for those who prefer clicking over typing. X server required.
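
For instance, to attach to the output of the first job step of the batch job 144 from the earlier example:

  sattach 144.0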


SLURM and MPI

About MPI

MPI (the Message Passing Interface) makes it possible to run parallel processes on CPUs of different hosts. To do so it uses TCP packets to communicate via the normal network connection. Some tasks can profit a lot from using more cores for computation. At Remeis, MPICH2 is used for the initialisation of MPI tasks, which is well supported within Slurm. The process manager is called pmi2 and is set as the default for srun. If an older MPI process manager is needed, for example for older MPI applications used in torque, it can be set with

  #SBATCH --mpi=

in the submission script.

  srun --mpi=list

provides a list of supported MPI process managers.

The implementation of MPI for SLang/ISIS is called SLMPI.

Best practice for MPI tasks

The usage of MPI might cause continuously high network traffic, especially on the host which holds the master process. Please consider this when deciding which nodes are used for the job. It's a good idea to provide servers (e.g. leo or lupus) via the --nodelist= option; one of them is then used to hold the master process, since nobody is sitting in front of a server trying to use a browser. Additional nodes are allocated automatically by Slurm if required to fulfill the --ntasks / -n option.

MPI jobs depend on all allocated nodes being up and running properly, so I'd like to use this opportunity to point out that shutting down or rebooting PCs on your own, without permission, can abort a whole MPI job.

Requirements and Tips

To use MPI, the application or function used obviously has to support MPI. Examples range from programs written in C using some MPI features and compiled with the mpicc compiler to common ISIS functions such as mpi_emcee or mpi_fit_pars.

Keep in mind that everything in the compiled programs/scripts which is not an MPI-compatible function is executed on each node on its own. For example, in ISIS with -n 20:

  fit_counts;

would fit the defined function to the dataset 20 times at once. That's not very helpful, so think about which tasks should be performed in the actual MPI process. Special care has to be taken if something has to be saved to a file. Consider:

  save_par("test.par");

with -n 20. This would save the current fit parameters to test.par in the working directory 20 times at exactly the same time. This might be helpful if the file is needed on the scratch disk of each node, but doing this on, for example, /userdata can cause serious trouble. The function mpi_master_only can be used to perform a user-defined task in an MPI job only once. The best way is to submit only jobs to Slurm that contain nothing but actual MPI functions. If ISIS models are used which output something to stdout or stderr while loading, these messages are also generated 20 times, since the model is loaded in each process individually.

Usage

If the job is a valid MPI process then the submission works exactly like for any other job:

  #!/bin/bash
  #SBATCH --job-name my_first_mpi_job
  #SBATCH ...
  #SBATCH --ntasks=20
  cd /my/working/dir
  srun /usr/bin/nice -n +15 ./my_mpi_script

It might be necessary to set a higher memory usage than for the corresponding non-MPI job, since some applications try to limit the network traffic by copying the required data to each node in the first place.

Also make sure that, if it is necessary to specify the number of child processes in the application itself, you set it to the same value as the --ntasks / -n option in the submission. An example would be the num_slaves qualifier of mpi_emcee.

Note that the srun command does not contain mpiexec or mpirun, which were used in older versions of MPI to launch the processes. The process manager pmi2 is built into Slurm and makes it possible for Slurm itself to initialize the network communication with the srun command only.
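
Should the process manager ever need to be selected explicitly for a single call, srun accepts it directly; a sketch reusing the script from above:

  srun --mpi=pmi2 -n 20 ./my_mpi_script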

Of course it's also possible to run the MPI process directly from the command line. As an example, let's have a look at the calculation of pi with the MPI program cpi. The program comes with the source code of MPICH2 and is compiled by the check rule. It's located in

  /data/system/software/mpich/mpich-3.2/examples

To run the calculation in 10 parallel processes directly from the command line use:

  [1:11]weber@lynx:/data/system/software/mpich/mpich-3.2/examples> srun -n 10 ./cpi
  Process 0 of 10 is on aquarius
  Process 1 of 10 is on ara
  Process 6 of 10 is on asterion
  Process 2 of 10 is on ara
  Process 8 of 10 is on asterion
  Process 7 of 10 is on asterion
  Process 3 of 10 is on aranea
  Process 5 of 10 is on aranea
  Process 4 of 10 is on aranea
  Process 9 of 10 is on cancer
  pi is approximately 3.1415926544231256, Error is 0.0000000008333325
  wall clock time = 0.010601

As we can see, Slurm launched 10 processes distributed to aquarius, ara, asterion, aranea and cancer. Keep in mind that running MPI interactively doesn't really make sense. The best way to go is to write a submission script as explained above and let Slurm handle the initialisation.