[GE users] SGE unable to suspend MPI-jobs - serial jobs are working

Tuomo Kalliokoski Tuomo.Kalliokoski at uku.fi
Wed Jul 2 11:40:00 BST 2008

Hello everybody,

We are running Sun Grid Engine (SGE) version 6.0 and trying to set up the 
system like use case 3 ("prioritization with preemption") in document


I've successfully configured the immediate queue and the background queue. 
The system works and SGE suspends background jobs when needed. However, only 
serial jobs are actually suspended. Parallel jobs are marked with "S", 
but their processes keep running.

The MPI programs are molecular dynamics simulation programs GROMACS 
3.3.3/3.3.1 and AMBER9. I am using OpenMPI version 1.2.6.

I wrote to the Rocks-Discuss list and was advised to replace OpenSSH 
with the rsh included with SGE [1]. However, this does not solve the 
problem. It looks to me as if the job is started correctly on the compute 
node (eight 'sander.MPI' tasks that are actually doing the work):

   \_ sge_shepherd-18118 -bg
     \_ bash /opt/gridengine/default/spool/compute-0-0/job_scripts/18118
       \_ /share/apps/openmpi-1.2.6/bin/mpirun -np 8 
         \_ qrsh -inherit -noshell -nostdin -V compute-0-0.local /share/apps/openmpi-1.2.6/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_st
            \_ /opt/gridengine/utilbin/lx26-amd64/rsh -n -p 33103 compute-0-0.local exec '/opt/gridengine/utilbin/lx26-amd64/qrsh_starter'
  \_ sge_shepherd-18118 -bg
    \_ /opt/gridengine/utilbin/lx26-amd64/rshd -i
      \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter 
         \_ /share/apps/openmpi-1.2.6/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename compute-0-0.local --universe tkallio
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI


Exactly the same problem exists with GROMACS, so I guess it is not an 
AMBER-related issue.
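
For anyone who wants to reproduce the check on a compute node, here is a 
minimal sketch of how one might confirm that the MPI tasks keep running 
while SGE shows the job as "S". It assumes a Linux node with the procps 
"ps" command, where a process state starting with "T" means the process 
is stopped. The function names are hypothetical helpers, not part of SGE 
or Open MPI.

```python
import subprocess


def parse_ps_states(ps_output, command_name):
    """Return (pid, state) pairs for processes named command_name.

    Expects the text produced by `ps -eo pid,stat,comm`.
    """
    results = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        # Skip the header line and anything that is not "pid stat comm".
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        pid, state, comm = fields
        if comm.strip() == command_name:
            results.append((int(pid), state))
    return results


def running_despite_suspend(ps_output, command_name):
    """Return PIDs of matching processes that are NOT in a stopped state."""
    return [
        pid
        for pid, state in parse_ps_states(ps_output, command_name)
        if not state.startswith("T")
    ]


def current_ps_output():
    """Capture the process list of the local node (assumes procps `ps`)."""
    completed = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

Calling running_despite_suspend(current_ps_output(), "sander.MPI") on the 
node after the job has been marked "S" would list the tasks that were 
never stopped. An empty list would mean all of them are suspended.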

Thanks in advance for any help,

Tuomo Kalliokoski, Lic.Sc. (Pharm.)
Department of Pharmaceutical Chemistry
University of Kuopio, Finland

