The batch system Open Grid Scheduler/Grid Engine

The compute nodes can only be used via the batch system.

dirac uses the Open Grid Scheduler/Grid Engine (http://gridscheduler.sourceforge.net/), formerly known as Sun Grid Engine (SGE).

Programs are started by means of job scripts, which are submitted with the command qsub. The status of the queues and jobs can be checked with qstat. Users who prefer a graphical interface can perform all of these tasks with qmon.
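
For example, a minimal workflow on dirac-meister looks like this (job.sh stands for any job script):

qsub job.sh      # submit the job script
qstat            # show the status of your jobs and the queues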

The current and past state of the whole cluster and its parts is shown with Ganglia:

https://dirac-meister.basisit.de/ganglia/?c=dirac

 

Queues

All jobs enter a queue after submission.

There are queues with different features:

Queue       max.       CPU     Node    Remarks
name        run time   cores   count
---------------------------------------------------------------------------------------
all.q        24 h      1088     17     for tests on Interlagos nodes (64 cores/node)
magny       168 h       432      9     Magny Cours nodes (48 cores/node)
inter       168 h      1024     17     Interlagos nodes (64 cores/node)
internode   168 h      1024     17     Interlagos nodes (64 cores/node), exclusive node usage
max         672 h       128      2     Interlagos nodes (64 cores/node), by special request only
epyc        168 h       128      1     Ryzen EPYC node (64/128 cores/node)


The parameter vf= for the RAM request is mandatory in all job scripts. The queue may be given as a parameter to qsub; if it is missing, the Open Grid Scheduler/Grid Engine will choose one.
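
For example, a job script can be submitted to a specific queue with an explicit RAM request directly on the command line (queue name and memory value are only illustrative):

qsub -q magny -l vf=2G job.sh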

The number of active jobs per user, the RAM usage and the number of used CPU cores (aka slots) are limited.

Jobs in the queues inter, internode and magny may be terminated after 24 h for maintenance, upgrades or configuration changes.

The queue max is available on special request only. It is intended as an exception for long-running programs (four weeks maximum).

Jobs in the queue internode are started on an empty node only and remain undisturbed by other jobs, provided they occupy all CPUs of the node quickly after their start.

 

Job scripts

Jobs are described with job scripts. These are executable shell scripts in which the desired variables are set and the programs are called. The Open Grid Scheduler/Grid Engine reads its own parameters from special comment lines (starting with #$).

A small example for a job script named /home/dmf/Jobs/job.sh:

#! /bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -l vf=1G
/home/dmf/SRC/Thread/twod 4

This job may be queued on dirac-meister with the commands

cd /home/dmf/Jobs
qsub job.sh

into the default queue all.q. In this example, the program /home/dmf/SRC/Thread/twod with its argument 4 is to be run on one compute node. /home/dmf/Jobs/ will be the working directory, set with the option -cwd.

-l vf=1G specifies the maximum usage of 1 GB RAM for the job.

The intended maximum RAM usage has to be given with the statement vf=. It is not possible to reserve more RAM than the maximum available RAM. Without this parameter the job will not run at all.

Jobs will not start if they try to reserve resources that are not available in the queue.

The Grid Engine monitors the other jobs and the current load on the node and then decides whether a job is started; once it is running, the user has little further control. Our sample job therefore only starts if 1 GB of RAM and the required number of CPU cores are free (determined from the current CPU load on the node). In this example the Grid Engine will create the following files in /home/dmf/Jobs:

  • job.sh.oJob-ID: program output (stdout)
  • job.sh.eJob-ID: program error output (stderr)

A consecutive job ID is assigned to the job by the batch system. The batch parameters can also be specified directly on the qsub command line; an overview is provided on dirac-meister by the command man qsub.

For many jobs, the number of these files quickly becomes confusing. You can therefore redirect them to another directory by adding the following lines to the job script:

#$ -o directory_name
#$ -e directory_name

As the directory name, use a subdirectory of your home directory on dirac-meister or another directory accessible by all compute nodes. Do not use a directory local to the compute nodes, for example /tmp.

 

Troubleshooting starting issues

If jobs are stalled or in an error state, you can find out the reason with the commands

qalter -w v jobID

qstat -j jobID

With qalter you may try to change some parameters after submission. If you are lucky, the stalled job will start after your changes. All these options are available in the graphical tool qmon, too.
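
A typical use of qalter is to lower a resource request of a waiting job, for example its RAM request (the value is only illustrative):

qalter -l vf=512M jobID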

 

Start of parallelized programs (with processes or threads)

Parallel programs (for example, built with OpenMPI) are also started via the batch system. It reserves and initializes the parallel environment and the necessary resources.

The desired parallel environment must be selected before submitting the job by executing the appropriate initialization files. This sets up the environment so that the correct commands and libraries are found.

For openMPI 1.8.8 with RDMA and the Intel compilers, an initialization file may look like this:

#
# openMPI 1.8.8 with RDMA and the Intel compilers
#
setenv PATH /opt/openmpi-intel-1.8.8-rdma/bin/:${PATH}
setenv LD_LIBRARY_PATH /opt/openmpi-intel-1.8.8-rdma/lib64/:/opt/intel/lib/intel64/:${LD_LIBRARY_PATH}
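
The setenv commands above use the csh/tcsh syntax. In a bash shell, the equivalent commands are:

export PATH=/opt/openmpi-intel-1.8.8-rdma/bin/:${PATH}
export LD_LIBRARY_PATH=/opt/openmpi-intel-1.8.8-rdma/lib64/:/opt/intel/lib/intel64/:${LD_LIBRARY_PATH}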

These initialization commands must also be performed before compiling your programs. Only then will the desired compiler include files and libraries be used during compilation, and the matching parallel environment later at run time.

The example program mpi_hello.c:

// A simple MPI test
// Compile with:
// $ mpicc -o mpi_hello mpi_hello.c

// modified with some pointless computations in order to
// consume more computing time
// Dr. Andreas Tomiak, 14.04.2016
//
#include <stdio.h>
#include <mpi.h>

#include <stdlib.h>
#include <unistd.h>   /* for sleep() */
/* 1000000 elements */
#define MAX 1000000
/* 100 rounds */
#define RUNDEN 100

/* two arrays with values running from large to small */
int test_array1[MAX];
int test_array2[MAX];

/* fill in reverse order */
void init_test_array(int *array) {
   int i, j;
   for(i = MAX - 1, j = 0; i >= 0; i--, j++)
      array[j] = i;
}

static void *bubble1(void* val) {
   int i, temp, elemente=MAX;
   while(elemente--)
      for(i = 1; i <= elemente; i++)
         if(test_array1[i-1] > test_array1[i]) {
            temp=test_array1[i];
            test_array1[i]=test_array1[i-1];
            test_array1[i-1]=temp;
         }
   return NULL;
}

static void *bubble2(void* val) {
   int i, temp, elemente=MAX;
   while(elemente--)
      for(i = 1; i <= elemente; i++)
         if(test_array2[i-1] > test_array2[i]) {
            temp=test_array2[i];
            test_array2[i]=test_array2[i-1];
            test_array2[i-1]=temp;
         }
   return NULL;
}


int main(int argc, char *argv[])
{
  int rank, size;

  int cnt = RUNDEN;
  init_test_array(test_array1);
  init_test_array(test_array2);

  MPI_Init(&argc,&argv); /* starts MPI */
  MPI_Comm_rank(MPI_COMM_WORLD,&rank); /* get current process id */
  MPI_Comm_size(MPI_COMM_WORLD,&size); /* get number of processes */
  printf("Hello world from process %d of %d\n",rank,size);

  while(cnt--) {
    bubble1(NULL);
    bubble2(NULL);
    }

  sleep(10); 
  MPI_Finalize();
  return 0;
}

mpi_hello.c will be compiled as follows (the working directory is /home/dat/tests/ompi):

cd /home/dat/tests/ompi
setenv PATH /opt/openmpi-intel-1.8.8-rdma/bin/:${PATH}
setenv LD_LIBRARY_PATH /opt/openmpi-intel-1.8.8-rdma/lib64/:/opt/intel/lib/intel64/:${LD_LIBRARY_PATH}
mpicc -o mpi_hello mpi_hello.c

Choosing the correct values for $PATH and $LD_LIBRARY_PATH ensures that the right version of the compiler wrapper mpicc (openMPI with the Intel compiler) is used. Whether the binary can be started can be checked with the loader command

ldd mpi_hello

The output must then contain paths like /opt/openmpi-intel-1.8.8-rdma/lib64/libmpi.so.1, and all libraries must be resolved correctly.

Now we look at the corresponding job script mpi_hello.sh for the start of mpi_hello:

#! /bin/bash
## Uses the Intel compilers and openMPI 1.8.8 with RDMA (Infiniband)
## Dr. Andreas Tomiak, 01.06.2016
## start with:
## $ qsub -q inter mpi_hello.sh
# use bash as the shell for the job
#$ -S /bin/bash
# export all environment variables
#$ -V
# set the job name
#$ -N mpi_hello
# set the working directory
#$ -cwd
# merge stdout and stderr
#$ -j y
# memory to be used
#$ -l vf=10M
## request the parallel environment ompi-t-f (tight and fill_up)
## and the desired number of processes
#$ -pe ompi-t-f  64-640
# switch on resource reservation
#$ -R y
# define the time limits
# the maximum hard walltime is 16 minutes (afterwards the job is terminated with KILL)
#$ -l h_rt=00:16:00
# the maximum soft walltime is 15 minutes (afterwards SIGUSR2 is sent)
#$ -l s_rt=00:15:00
# set the paths for binaries and libraries
# must match PATH and LD_LIBRARY_PATH used during compilation!
# here: openMPI 1.8.8 with RDMA and the Intel compilers
export PATH=/opt/openmpi-intel-1.8.8-rdma/bin/:${PATH}
export LD_LIBRARY_PATH=/opt/openmpi-intel-1.8.8-rdma/lib64/:/opt/intel/lib/intel64/:${LD_LIBRARY_PATH}
# suppress spurious hwloc messages
export HWLOC_HIDE_ERRORS=1
# lift the stack size limit and show the current limits
# (note: the special environment of sge_execd may impose other limits)
ulimit -s unlimited
ulimit -a
# the command that starts the compute job; everything above is initialization and debugging output
# exclude the TCP BTL to force Infiniband RDMA
mpirun -v --mca btl ^tcp -np $NSLOTS /home/dat/tests/ompi/mpi_hello

  • -S /bin/bash defines bash as the used shell.
  • -cwd takes the current directory as working directory for the job.
  • -pe ompi-t-f 64-640 selects one of the possible parallel environments. Here 64 to 640 processes (cores) are requested.
  • The RAM request vf=10M applies to each process; in total, up to 6.4 GB of RAM is requested in our example.
  • The mpirun argument --mca btl ^tcp excludes TCP and thereby forces interprocess communication via Infiniband RDMA only.

The Open Grid Scheduler/Grid Engine tries to keep the processes on as few nodes as possible, provided the memory request permits it. If fewer than 640 cores are free, you get fewer; the number of slots actually granted is passed to the job in $NSLOTS, and that many instances are started. With a fixed request such as -pe ompi-t-f 640 the job would not start as long as fewer than 640 CPU cores are available for it.

The policy of job distribution on the computing nodes is defined by the chosen parallel environment (e.g. ompi-t-f).
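
For example, to distribute the processes round robin over the nodes instead, a different parallel environment can be chosen on the qsub command line (the slot count is only illustrative; command-line options take precedence over the directives embedded in the script):

qsub -q inter -pe ompi-t-r 128 mpi_hello.sh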

With the command

qsub -q inter mpi_hello.sh

the job is now submitted to the queue inter.


In addition to the job output files, the files

  • mpi_hello.poJob-ID for standard output
  • mpi_hello.peJob-ID for error output

will be created.

 

Available libraries for parallel applications

Currently three parallelisation libraries are available:

Library name  Compiler   Interprocess communication
----------------------------------------------------
OpenMPI 1.8.8 Gnu        4*10 GBit/s Infiniband RDMA
OpenMPI 1.8.8 Intel      1 GBit/s Ethernet
OpenMPI 1.8.8 Intel      4*10 GBit/s Infiniband RDMA

To start your processes, each of these libraries may be used with one of the following Open Grid Scheduler/Grid Engine parallel environments:

Name     Job control  Strategy for
         with GE      distribution on nodes
--------------------------------------------
ompi-t-p tight         $pe_slots
ompi-t-f tight         $fill_up
ompi-t-r tight         $round_robin
ompi-p   start only    $pe_slots
ompi-f   start only    $fill_up
ompi-r   start only    $round_robin

For parallel jobs without MPI (e.g. multi-threaded programs) there is the parallel environment smp. With it you should request the desired number of slots (the number of CPU cores you want) in order to avoid overloading the compute nodes.
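
A minimal sketch of such a job script (the slot count and the program name my_threaded_prog are only illustrative):

#! /bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -l vf=1G
# request 8 slots (CPU cores) on one node with the parallel environment smp
#$ -pe smp 8
# the program starts 8 threads itself, matching the requested slot count
./my_threaded_prog 8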

The paths and parameters in the job scripts must be adapted to the compiler and the parallel libraries used. Compiling the programs requires the matching commands and parameters. Differences between the settings used at compile time and at run time may cause unexpected side effects.

The parallel environments define two main properties: on the one hand the distribution of the processes over the nodes, on the other hand the coupling to the Open Grid Scheduler.

The default for process distribution on the nodes is the method $fill_up: a node is filled with processes until all of its slots (which usually correspond to the number of CPU cores) are occupied; then the next node is used. With $round_robin the processes are spread over the nodes: the first process goes to the first node, the second to the second node, the third to the third node, and so on, until all processes are distributed. The strategy $pe_slots uses one node only and thus prevents distribution to other nodes. This is especially important if you use the parallel environment smp for programs that parallelize on their own or use threads.

The other property defines the coupling to the scheduler. With tight coupling the scheduler monitors the child processes itself; otherwise it leaves them on their own after starting them. With loose coupling you must supply the parameters for mpirun and the necessary hostfile with the names of the compute nodes on your own.
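
With one of the loosely coupled parallel environments (for example ompi-f), the relevant part of a job script could look roughly like this sketch, which builds an Open MPI hostfile from the $PE_HOSTFILE list provided by the Grid Engine (the hostfile name is only illustrative):

# convert the Grid Engine host list into an Open MPI hostfile
awk '{print $1" slots="$2}' $PE_HOSTFILE > hosts.$JOB_ID
# start the processes on the granted hosts yourself
mpirun -np $NSLOTS --hostfile hosts.$JOB_ID /home/dat/tests/ompi/mpi_hello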

The type of application and the compiler used also restrict the choice of possible environments and queues. An example: GROMACS works with openMPI only, and everything has to be compiled with the GNU compiler.

 

Array jobs

You may use array jobs, if you want to run many instances of a program with different parameter sets:

#!/bin/bash
#$ -cwd
#$ -l vf=1G
./run_job < input-$SGE_TASK_ID > output-$SGE_TASK_ID

The job script is started multiple times, and the shell environment variable $SGE_TASK_ID serves as the task counter. If you start this example with

qsub -t 1-10 jobscript

it will run ten times, read the input files input-1 through input-10, and write the results into the files output-1 through output-10.
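
If the installed Grid Engine version supports the qsub option -tc, the number of simultaneously running tasks of the array can additionally be limited, for example to two:

qsub -t 1-10 -tc 2 jobscript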

 

Further information sources

 

N1 Grid Engine 6 User's Guide (PDF): http://www.helmholtz-berlin.de/media/media/angebote/it/anleitungen/817-6117.pdf

MPI- und Thread-Parallelisierung auf dirac (PDF): http://www.helmholtz-berlin.de/media/media/angebote/it/anleitungen/mpi_thread_parallelisierung.pdf

Using MPI: Portable Parallel Programming with the Message-Passing Interface, second edition, William Gropp, Ewing Lusk, Anthony Skjellum, MIT Press

MPI forum: http://www.mpi-forum.org

Message Passing Interface (Wikipedia): https://en.wikipedia.org/wiki/Message_Passing_Interface

MPI documents: http://www.mpi-forum.org/docs/docs.html

openMPI: https://www.open-mpi.org