Torque

At Remeis, Torque has been replaced by Slurm, which should be used instead.

About Torque and Maui

To distribute the workload of scientific calculations to the available compute nodes (at the Remeis Cluster, a node is one computer), a resource manager can be used together with a scheduling system. This minimizes the submission effort and maximizes efficiency by selecting nodes automatically according to a node allocation policy. At the Remeis Cluster, Torque is used together with Maui for this purpose.

From [Resources, Inc.]:

"TORQUE Resource Manager provides control over batch jobs and distributed computing resources. It is an advanced open-source product based on the original PBS project* and incorporates the best of both community and professional development. It incorporates significant advances in the areas of scalability, reliability, and functionality and is currently in use at tens of thousands of leading government, academic, and commercial sites throughout the world. TORQUE may be freely used, modified, and distributed under the constraints of the included license."

"Maui Cluster Scheduler, the precursor to Moab Cluster Suite, is an open source job scheduler for clusters and supercomputers. It is an optimized, configurable tool capable of supporting an array of scheduling policies, dynamic priorities, extensive reservations, and fairshare capabilities. It is currently in use at hundreds of government, academic, and commercial sites throughout the world."

Prerequisites

To be able to use the submission system at the Remeis Cluster you need an account with password-less ssh connections enabled. That is, you have to be able to log in to any of the compute nodes by typing ssh hostname without being asked for your password. If this does not work for you, please read about SSH - Keys first.
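A quick way to check and, if necessary, set this up (only a sketch; the authoritative procedure is described on the SSH - Keys page, and ssh-copy-id may be unnecessary if your home directory is shared across the cluster):

 ssh-keygen -t rsa          # generate a key pair if you do not have one yet
 ssh-copy-id <hostname>     # install the public key on the node (if needed)
 ssh <hostname> hostname    # must print the node's name without a password prompt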

Jobscripts

To make use of the scheduling system, programs have to be submitted as a jobscript. [Resources, Inc.] nicely explains the principles of job submission as well as the structure and syntax of a jobscript. After writing the jobscript it can be submitted using the qsub command. If you want to start real parallel code, like MPI or OpenMP programs, you should write the jobscript yourself. If you just want to submit a batch job consisting of many executions of a serial program, the automated job submission described below can be used to generate jobscripts and submit them for you.

A jobscript example.torque.job could for example look like this:

 #!/bin/bash
 #
 #PBS -S /bin/bash -V
 #PBS -t 0-3%10000                   %comment: array indices of your commands, here 0-3 for 4 commands (%10000 limits how many run at once)
 #PBS -l nodes=1                     %comment: how many nodes per command you want to use
 #PBS -l walltime=48:00:00           %comment: the time after which torque kills the script, here 48h
 #PBS -N my_cool_fitting             %comment: name that you can see in the torque job-list
 #PBS -o /home/user/torque/cool_fitting.out    %comment: where to write the output
 #PBS -e /home/user/torque/cool_fitting.err    %comment: where to write the error messages
 export HOST=`hostname`
 export HOSTNAME=$HOST
 
 cd /path/where/your/script/is_located/
 
 COMMAND[0]="isis-script cool_fitting_1.sl"
 COMMAND[1]="isis-script cool_fitting_2.sl"
 COMMAND[2]="isis-script cool_fitting_3.sl"
 COMMAND[3]="isis-script cool_fitting_4.sl"
 
 /usr/bin/nice -n +15 ${COMMAND[$PBS_ARRAYID]}
 

The %comment annotations are for explanation only and must not be copied into the actual script!
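The finished jobscript is then submitted with qsub:

 qsub example.torque.job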

Queues

The following queues are available at Remeis (see the outputs of the qstat -q and qstat -Q commands):

 * batch: the standard queue. It uses all available hosts and assumes that your job does not have large memory requirements. Do NOT use this queue if you need more than 1 GB of RAM without specifying your memory requirements!
 * leo: large memory, parallel jobs, and so on.
 * loft: only to be used for the LOFT project.
 * devel: high-priority queue for software development; therefore the maximum available walltime is limited.

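A specific queue can be requested at submission time, for example with qsub's -q option or the --queue option of qsub_array described below (a minimal sketch):

 qsub -q leo jobscript.job
 qsub_array example.com --queue=leo
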
Automated job submission

Often a user wants to run a program on different data sets without having to write a separate jobscript for each execution. This section describes how that can be achieved easily by using the helper scripts qsub_array and qsub_batch.

qsub_array

The command qsub_array can be used to generate a jobscript for an array job containing all the commands found in a command file.

A command file example.com could for example look like this:

 echo Command 1: I am `whoami` on host `hostname` in directory `pwd`
 echo Command 2: I am `whoami` on host `hostname` in directory `pwd`
 echo Command 3: I am `whoami` on host `hostname` in directory `pwd`
 echo Command 4: I am `whoami` on host `hostname` in directory `pwd`
 

Generate jobscript and submit:

 qsub_array example.com
 

A jobscript example.com.job is created in the executing directory and the commands are distributed on the cluster. By default, commands are started on the fastest nodes with the lowest 1 min load average. Output to stdout is collected and redirected to an output file after job completion; the same applies to stderr. The default location of the output and error files is /tmp/pbs/user, so the output is stored on the local machine from which the job was submitted. The job is executed as the submitting user and starts in its home directory (IMPORTANT: Please use the scratch disk if your job needs significant disk space!).

If a job shall be resubmitted, this can be done either by invoking qsub_array on the command file again, or by submitting the generated jobscript example.com.job directly using qsub example.com.job. This way the user can add extended PBS options to the jobscript or change the output filenames using qsub's -o /path/to/stdoutfile option, which overrides the default setting.
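For example, to resubmit the generated jobscript with a different stdout file (the path is only an illustration):

 qsub -o /home/hans/example_rerun.out example.com.job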

The example job submitted by user Hans would create the output files /tmp/pbs/hans/example.com.out-0 to /tmp/pbs/hans/example.com.out-3 and the corresponding error files /tmp/pbs/hans/example.com.err-0 to /tmp/pbs/hans/example.com.err-3:

 hans@octans:~$ ls /tmp/pbs/hans
 example.com.err-0  example.com.err-1  example.com.err-2  example.com.err-3  example.com.out-0  example.com.out-1  example.com.out-2  example.com.out-3
 
 hans@octans:~$ cat /tmp/pbs/hans/*.out*
 Command 1: I am hans on host phoenix in directory /home/hans
 Command 2: I am hans on host grus in directory /home/hans
 Command 3: I am hans on host hydra in directory /home/hans
 Command 4: I am hans on host piscis in directory /home/hans
 

qsub_array provides some options for requesting specific resources. Some of them should be used for almost any job (e.g. --walltime), others should be used only if you really know what you are doing (e.g. --nodes). qsub_array --help shows the available options:

 Submit array of jobs given in a command file
 Usage:  qsub_array <comfile> [<option1=...> [<option2=...> [...]]]
 Options:
       --help:         Show this help
       -e/--exclude:   Exclude comma separated list of hosts
       -h/--hostlist:  Require nodes to be a subset of a comma separated list of hosts
       -l/--limit:     Maximum number of jobs running in parallel
       -m/--memory:    RAM needed by the program
       -N/--name:      Name of the job
       -n/--nice:      Nice level
       --nodes:        Number of nodes to be used by the program
       --ppn:          Number of processors to be used on each node
       -p/--pretend:   Generate jobscript but do not submit
       -q/--queue:     Queue to submit the job to
       -s/--single:    Allow to start only one job per node (blocks other jobs but helpful for PVM jobs)
       -v/--verbose:   More output to stdout for debugging
       -w/--walltime:  Maximum walltime for the program
       -a/--arch:      Specify the Architecture ( 32bit (x86) or 64bit (x86_64) )
 
 Syntax of command file:
       Each line from <comfile> is interpreted as a command.
 
 Examples:
       Acquire 2 GB RAM for a job running max. 1h 30min on nice level 18:
       qsub_array joblist.txt --memory=2000M --nice=18 --walltime=01:30:00

NOTE: Option arguments must be given in the --option=argument fashion! Memory should be given in MB, disk space in GB, though in principle the 'M' and 'G' endings are possible for all size options and should cause no problems.

At the moment the default walltime (the time after which your program is killed if it is still running) is set to 30 min. This is certainly not enough for most production runs, so you have to explicitly request more time using the --walltime=hh:mm:ss option. The walltime should be as close as possible to, but not less than, the real time needed by a program. Due to the heterogeneous nature of our cluster the execution time also depends on the execution hosts, so it should generally be overestimated a little. The given walltime always applies to all jobs of an array.

Your environment variables are exported to each job, so your job starts in an environment like the one you submitted it from, but with the variables $HOST and $HOSTNAME set to the name of the actual execution host.

qsub_batch

Most of the time it is most convenient to submit multiple jobs as one job array. But if the jobs have nothing to do with each other, one might consider submitting them separately. For this purpose the script qsub_batch exists, which does the same as qsub_array but generates and submits one jobscript for each command. qsub_array should generally be preferred, especially for large numbers of jobs (>100).
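Assuming qsub_batch accepts the same options as qsub_array (a sketch, not a verified invocation):

 qsub_batch example.com --memory=2000M --walltime=01:30:00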

Useful commands

 * qstat provides several options to show useful information about your jobs or job queues.
   * qstat shows your queued/running/completed jobs together with their job id
   * qstat -t lists all members of array jobs
   * qstat -n also lists the nodes the jobs are allocated to
   * qstat -q shows the queues and their status
 * qsub submits a jobscript. The requested resources specified on the command line override the values given in the jobscript.
   * qsub -o ~/myoutput.out jobscript.job submits the jobscript jobscript.job and redirects stdout to ~/myoutput.out
   * qsub -e ~/myoutput.err jobscript.job same as qsub -o but for output to stderr
 * qhold can be used to hold a job. If a user's job limit is already reached, the user can hold a previously submitted job such that a new job can be started immediately.
   * qhold <jobid> holds the corresponding job
 * qrls continues held jobs.
   * qrls <jobid> releases the hold of the corresponding job
 * qdel is used to delete queued as well as running jobs.
   * qdel <jobid> deletes the job with the corresponding job id
   * qdel all deletes all jobs of a user
 * qsig sends a signal to a job and can be used to kill zombie jobs which cannot be killed by qdel.
   * qsig -s 9 <jobid> sends a SIGKILL to the corresponding job
 * qalter enables a user to change a job's attributes after job submission. This is for example useful if a job does not start due to unsatisfiable resource requests (see the sketch below this list).

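A hedged sketch of increasing a queued job's requested walltime with qalter (the job id is a placeholder):

 qalter -l walltime=02:00:00 <jobid>
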
MPI Jobs

The MPI implementation MPICH2 is available for the execution of MPI programs. For the compilation of MPI programs the compiler wrappers mpicc, mpic++, mpif77 and mpif90 should be used for programs written in C, C++, Fortran 77 and Fortran 90, respectively, to ensure that the right MPI libraries are linked in. To start an MPI program you have to precede the program name with mpiexec (NOTE: DO NOT USE mpirun!):

  mpiexec program

This can either be done within a jobscript or within a command file which is passed to qsub_array.
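As a sketch, building a small MPI program and submitting it through qsub_array could look like this (file names and node counts are only examples):

 mpicc -o mpi_hello mpi_hello.c           # compile with the MPICH2 wrapper
 echo "mpiexec ./mpi_hello" > mpi.com     # command file containing one MPI command
 qsub_array mpi.com --nodes=2 --ppn=2     # request 2 nodes with 2 processors each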

Parallel Error Calculation with MPI and SLmpi

The routine mpi_fit_pars from the isisscripts calculates confidence intervals in parallel on different machines. See Parallel Error Calculation with MPI and SLmpi for more information and examples.

PVM Jobs

To use qsub_array or qsub_batch with programs that use PVM for parallel computing, the --pvm option has to be specified. PVM daemons are then started on the allocated nodes automatically before the program is run. This option automatically sets two other options:

 * --single: Ensure that only one PVM job is running on a node at the same time
 * --limit=1: Only one PVM job within an array of PVM jobs can run at the same time

If you really need to overcome these restrictions you will have to write a jobscript on your own.
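For example (a sketch; pvm_jobs.com stands for a command file containing your PVM program calls):

 qsub_array pvm_jobs.com --pvm --walltime=02:00:00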

Interactive Jobs

It is also possible to acquire multiple processors for an interactive session if enough processors are available to start the session immediately. The number of available processors can be obtained by having a look at the [Cluster Status Page]. An interactive session is started by invoking qsub with the -I parameter. The amount of resources needed for the session is given after the -l parameter. The supported types of resources can be seen in the [section of the job submission description]. If you, for example, want to start a simple interactive session with X-forwarding enabled:

  qsub -I -X

For parallel jobs you can of course acquire more nodes / processors per node. An interactive session using 2 nodes and an estimated walltime of 10h, for example, can be started with the following command:

  qsub -I -X -l nodes=2,walltime=10:00:00


After starting the interactive session, commands can be executed just like within a normal shell, but within the environment of a PBS jobscript. So you can use mpiexec just like within a jobscript, without having to specify the number of nodes on the command line.

Status of the Remeis Cluster

The [Cluster Status Page] contains some information about the current node usage.

This page only works from within the observatory network. From outside, you can get the same info in an in-shell browser via the command

  clusterStatus 

Just ssh to any of the observatory machines and execute the command.