[GE users] SGE unable to suspend MPI-jobs - serial jobs are working

Reuti reuti at Staff.Uni-Marburg.DE
Wed Jul 2 14:53:07 BST 2008


Hi,

On 02.07.2008, at 12:40, Tuomo Kalliokoski wrote:

> Hello everybody,
>
> We are running Sun Grid Engine (SGE) version 6.0 and are trying to set
> up the system like use case 3 ("prioritization with preemption") in the
> document
>
>   http://www.sun.com/blueprints/1005/819-4325.pdf
>
> I have successfully configured the immediate queue and the background
> queue. The system works and SGE suspends background jobs when needed.
> However, only serial jobs are actually suspended. Parallel jobs are
> marked with "S", but they keep running.
>
> The MPI programs are the molecular dynamics codes GROMACS 3.3.3/3.3.1
> and AMBER9. I am using Open MPI version 1.2.6.

http://gridengine.sunsource.net/servlets/ReadMsg?listName=users&msgNo=25093
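
(As background to the setup described in the quoted question: preemption of
this kind is normally driven by a subordinate-queue relationship between the
two cluster queues. A minimal sketch of the relevant attribute in the
high-priority queue's configuration, using the illustrative names immediate.q
and background.q; the real thresholds and slot counts should follow the
blueprint, and the attribute is edited with "qconf -mq immediate.q":

  subordinate_list      background.q=1

With a threshold of 1, background.q is suspended on a host as soon as a
single slot of immediate.q is occupied there.)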

-- Reuti


> I wrote to the Rocks-Discuss list and got the advice there to replace
> OpenSSH with the rsh that ships with SGE [1]. However, this does not
> solve the problem. It looks to me as if the job is started correctly on
> the compute node (eight 'sander.MPI' processes are doing the actual
> work):
>
> /opt/gridengine/bin/lx26-amd64/sge_execd
>   \_ sge_shepherd-18118 -bg
>     \_ bash /opt/gridengine/default/spool/compute-0-0/job_scripts/18118
>       \_ /share/apps/openmpi-1.2.6/bin/mpirun -np 8 /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ qrsh -inherit -noshell -nostdin -V compute-0-0.local /share/apps/openmpi-1.2.6/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_st
>            \_ /opt/gridengine/utilbin/lx26-amd64/rsh -n -p 33103 compute-0-0.local exec '/opt/gridengine/utilbin/lx26-amd64/qrsh_starter' '/opt/gridengine/default/s
>  \_ sge_shepherd-18118 -bg
>    \_ /opt/gridengine/utilbin/lx26-amd64/rshd -i
>      \_ /opt/gridengine/utilbin/lx26-amd64/qrsh_starter /opt/gridengine/default/spool/compute-0-0/active_jobs/18118.1/1.compute-0-0 noshell
>         \_ /share/apps/openmpi-1.2.6/bin/orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename compute-0-0.local --universe tkallio
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>         \_ /share/apps/openmpi-1.2.6-amber9/exe/sander.MPI
>
>
> [1] https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-June/031277.html
>
> Exactly the same problem exists with GROMACS, so I guess it is not an
> AMBER-related issue.
>
> Thanks in advance for any help,
>
> -- 
> Tuomo Kalliokoski, Lic.Sc. (Pharm.)
> Department of Pharmaceutical Chemistry
> University of Kuopio, Finland
>
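
(A general note on the suspension mechanics, added for illustration: SGE
suspends a job by signalling the process group(s) under the job's
shepherd(s), SIGSTOP by default, so any part of a parallel job that ends up
outside those process groups simply keeps computing. One common workaround
is to give the low-priority queue an explicit suspend_method that stops
every process group hanging below the job's shepherds on the host. The
sketch below is an assumption-laden example, not a setting from this thread:
the script path is arbitrary, and it relies on the $job_id pseudo variable
from queue_conf(5) being passed to it, as in
"suspend_method /path/to/suspend_job.sh $job_id".

  #!/bin/sh
  # suspend_job.sh <job_id>  (illustrative sketch)
  # Stop every process group below this job's shepherds on the local host,
  # so the MPI ranks are halted together with the job script.
  JOB_ID=$1
  for shepherd in $(pgrep -f "sge_shepherd-${JOB_ID}"); do
      for child in $(pgrep -P "${shepherd}"); do
          pgid=$(ps -o pgid= -p "${child}" | tr -d ' ')
          [ -n "${pgid}" ] && kill -STOP -- "-${pgid}"
      done
  done

A matching resume_method script would send SIGCONT instead, and whether the
ranks were really stopped can be checked on the node with
"ps -o pid,pgid,stat,args -C sander.MPI"; stopped processes show state "T".)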


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



