[GE users] SGE unable to suspend MPI-jobs - serial jobs are working

Tuomo Kalliokoski Tuomo.Kalliokoski at uku.fi
Wed Jul 2 11:40:00 BST 2008



Hello everybody,

We are running Sun Grid Engine (SGE) version 6.0 and trying to set up the 
system like use case 3 ("prioritization with preemption") in the document

   http://www.sun.com/blueprints/1005/819-4325.pdf

I've successfully configured the immediate queue and the background queue. 
The system works, and SGE suspends background jobs when needed. However, 
only serial jobs are actually suspended. Parallel jobs are marked with "S", 
but they keep running.
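
For reference, the subordination is set up roughly along these lines 
(the queue names below are placeholders rather than our exact 
configuration; the superior queue lists the queues it should suspend):

   $ qconf -sq immediate.q | grep subordinate_list
   subordinate_list      background.q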

The MPI programs are the molecular dynamics simulation packages GROMACS 
3.3.3/3.3.1 and AMBER9. I am using OpenMPI version 1.2.6.
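
One way to check whether an OpenMPI 1.2 build has the grid engine 
launcher compiled in is shown below (the path is the install prefix from 
our cluster; the expected output is from memory and may differ):

   $ /share/apps/openmpi-1.2.6/bin/ompi_info | grep gridengine
   # a build with SGE support should list the gridengine "ras" and
   # "pls" components here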

I wrote to the Rocks-Discuss list and was advised there to replace OpenSSH 
with the rsh included with SGE [1]. However, this does not solve the 
problem. It looks to me that the job is started correctly on the compute 
node (eight 'sander.MPI' tasks are actually doing the work):

/opt/gridengine/bin/lx26-amd64/sge_execd
   \_ sge_shepherd-18118 -bg
     \_ bash /opt/gridengine/default/spool/compute-0-0/job_scripts/18118
       \_ /share/apps/openmpi-1.2.6/bin/mpirun -np 8 /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
         \_ qrsh -inherit -noshell -nostdin -V compute-0-0.local /share/apps/openmpi-1.2.6/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_st
            \_ /opt/gridengine/utilbin/lx26-amd64/rsh -n -p 33103 compute-0-0.local exec '/opt/gridengine/utilbin/lx26-amd64/qrsh_starter' '/opt/gridengine/default/s
   \_ sge_shepherd-18118 -bg
     \_ /opt/gridengine/utilbin/lx26-amd64/rshd -i
       \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/compute-0-0/active_jobs/18118.1/1.compute-0-0 noshell
          \_ /share/apps/openmpi-1.2.6/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename compute-0-0.local --universe tkallio
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
          \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
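
To see which processes actually receive the suspend, something like the 
following can be run on the compute node while the job is marked "S" 
(illustrative commands only; the process-group id has to be looked up by 
hand):

   # STAT 'T' means stopped; in our case the ranks keep running even
   # though qstat marks the job with 'S'
   $ ps -C sander.MPI -o pid,ppid,pgid,stat,cmd

   # sending SIGSTOP to the second shepherd's process group by hand
   # shows whether the ranks react to the signal at all (take the pgid
   # from the ps output above):
   $ kill -STOP -- -<pgid_of_second_shepherd>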


[1] 
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-June/031277.html

Exactly the same problem exists with GROMACS, so I guess it is not an 
AMBER-related issue.

Thanks in advance for any help,

-- 
Tuomo Kalliokoski, Lic.Sc. (Pharm.)
Department of Pharmaceutical Chemistry
University of Kuopio, Finland
