[GE users] Job suspended every 60 seconds

Patrice Hamelin phamelin at clumeq.mcgill.ca
Wed Mar 9 13:12:29 GMT 2005



Reuti,

   Thanks for your answer, and sorry for not giving enough details.  I am 
using GE 6.0u1 with the MPICH-GM implementation.  I found nothing 
interesting in the qmaster messages file.  My script simply sends a 
SIGSTOP signal to all the MPI processes on every node that is a member 
of the job.  I have tested it with a simple communication program, but I 
still have to test it in a real production environment; that is the next 
step.  You will find my script below.  The "unsuspend" script simply 
sends a SIGCONT signal to the processes.

   I worked around the repeated suspension by creating a marker file at 
the first suspension; later invocations see the file and exit.

F=/tmp/suspend_MPI_job.$LOGNAME.log
touch "$F"

# Skip if a previous invocation already suspended the job
if [ -f "$TMPDIR/suspended" ]; then
   echo "`date` Job already suspended; exiting" >> "$F"
   exit
fi
#
# For each node
#
   for node in `cat $TMPDIR/machines | /usr/bin/uniq`
   do
#
# Create (or truncate) a file that contains PIDs of suspended processes
#
     > "$TMPDIR/$node"
#
# Determine processes to suspend: the first two lines of top output
# that match the user name
#
     for proc in `rsh $node top -b -n1 | grep $LOGNAME | head -2 | awk '{print $1}'`
     do
       echo "`date` Suspending process $proc on $node" >> "$F"
       echo $proc >> "$TMPDIR/$node"
# SIGSTOP by name; the numeric value (19 here) differs between platforms
       rsh $node kill -STOP $proc
     done
   done
touch "$TMPDIR/suspended"
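
For completeness, here is a minimal sketch of what the matching "unsuspend" script could look like.  It is not the script actually in use; it only assumes the layout created above (one PID file per node under $TMPDIR, plus the $TMPDIR/suspended marker).  The RSH variable and the resume_job function name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical resume ("unsuspend") counterpart to the suspend script.
# Assumes $TMPDIR/machines, one PID file per node in $TMPDIR/<node>,
# and the $TMPDIR/suspended marker left by the suspend script.

RSH=${RSH:-rsh}    # remote shell command; overridable (assumption)

resume_job() {
    dir=${TMPDIR:-/tmp}
    log=/tmp/suspend_MPI_job.${LOGNAME:-unknown}.log

    # Nothing to do unless the suspend script left its marker behind
    if [ ! -f "$dir/suspended" ]; then
        echo "`date` Job not suspended; nothing to resume" >> "$log"
        return 0
    fi

    for node in `sort -u "$dir/machines"`
    do
        # Skip nodes for which no PIDs were recorded
        [ -f "$dir/$node" ] || continue
        while read -r pid
        do
            echo "`date` Resuming process $pid on $node" >> "$log"
            # </dev/null keeps the remote shell from eating the PID list
            $RSH "$node" kill -CONT "$pid" < /dev/null
        done < "$dir/$node"
        rm -f "$dir/$node"
    done

    # Remove the marker so that a later suspend works again
    rm -f "$dir/suspended"
}

resume_job
```

Removing the marker at the end matters: otherwise the next real suspend request would hit the "already suspended" test and do nothing.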


Reuti wrote:
> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
> 
> 
>>   I wrote a script to suspend an MPI job in a queue and put its path 
>>in the "suspend method" field of the queue configuration.  My problem is 
>>that SGE keeps trying to suspend the job again every minute, even 
>>though I set up my queue like this:
>>
>>suspend_thresholds    load_avg=3.0
>>nsuspend              0
>>suspend_interval      INFINITY
> 
> 
> It may not be easy to suspend an MPI job at all (which MPI implementation?), 
> because of possible timeouts in the communication. What exactly are you doing 
> in your script, which version of SGE are you using, and are there any entries 
> in the messages files of the qmaster and/or execd? The 60 seconds looks just 
> like the default notify time - how did you submit your job? - Reuti
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 

-- 
Patrice Hamelin ing, M.Sc.A, CCNA
Systems Administrator
CLUMEQ Supercomputer Centre
McGill University
688 Sherbrooke Street West, Suite 710
Montreal, QC, Canada H3A 2S6
Tel: 514-398-3344
Fax: 514-398-2203
http://www.clumeq.mcgill.ca
