[GE users] Job suspended every 60 seconds

Patrice Hamelin phamelin at clumeq.mcgill.ca
Thu Mar 10 15:22:40 GMT 2005



Reuti,

   I tried renicing the processes through the Priority setting of the 
queues, and it shares the processors between processes in a 66%-33% 
ratio.  I also tried renicing the parallel job down to the lowest 
priority, 19, but the change does not propagate to the slave nodes: 
only the processes on the master node actually get the new scheduling 
priority.
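
A per-node renice along these lines might work around the propagation 
problem (untested sketch; it assumes rsh access, the $TMPDIR/machines 
file from the tight integration, and that $LOGNAME owns no other 
processes on the nodes; RSH is a variable only so the loop can be 
dry-run locally):

```shell
#!/bin/sh
# Untested sketch: renice all of the user's processes on every node
# of the job down to nice 19.  RSH is overridable for local dry-runs.
RSH=${RSH:-rsh}

renice_job() {
    machines_file=$1
    for node in `uniq "$machines_file"`; do
        # renice every PID owned by the job's user on that node
        for pid in `$RSH "$node" ps --user "$LOGNAME" -o pid --no-headers`; do
            $RSH "$node" renice 19 -p "$pid"
        done
    done
}

# e.g.: renice_job $TMPDIR/machines
```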

   I think I will stick with my suspension scripts, since I really 
need the higher-priority queue to get ALL the processors whenever it 
needs them.  I will warn the users of the lower-priority queue that 
their results may be bad if a suspension occurs.
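
For the record, combining my script with the process-group idea, the 
matched suspend/resume pair could look roughly like this (untested 
sketch; same assumptions about rsh, $TMPDIR/machines and $LOGNAME, 
and signals go to whole process groups rather than single PIDs):

```shell
#!/bin/sh
# Untested sketch: stop or continue a whole MPI job by signalling
# entire process groups on every node of the job.
# Use -STOP to suspend and -CONT to resume.
RSH=${RSH:-rsh}

signal_job() {
    signal=$1
    machines_file=$2
    for node in `uniq "$machines_file"`; do
        # one kill per process group owned by the user on that node
        for pgrp in `$RSH "$node" ps --user "$LOGNAME" -o pgrp --no-headers | sort -u`; do
            $RSH "$node" kill "$signal" -- "-$pgrp"
        done
    done
}

# signal_job -STOP $TMPDIR/machines   # suspend
# signal_job -CONT $TMPDIR/machines   # resume
```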

Thanks.

Patrice Hamelin wrote:
> Reuti,
> 
>   At first I was killing all the user's processes, but there was a 
> problem with that.  The suspend script runs under the user's own ID, 
> and not as sgeadmin, as I first thought.  The result was that the 
> shell running the kills was killing itself, leading to unwanted 
> results.
> 
>   I tested my suspension script with two different MPI codes: one 
> that does only communication, and another that computes a Jacobi 
> integration in parallel.  The main problem in getting the PIDs is 
> that I really have to target the two processes that are running the 
> user's code and eating 99% or so of the CPU each; I have to suspend 
> only those two processes.  I verified that the processes are in the 
> T state after the kill -19 command was sent to them.
> 
>   I agree that there may be weird results doing that kind of 
> operation on an MPICH job, and I will also test the queue priority 
> 19 that you mentioned.  It looks promising!
> 
> Ciao!
> 
> Reuti wrote:
> 
>> Hi Patrice,
>>
>> I really think it's not a good idea to suspend an MPICH-GM job. 
>> IMO the easier solution would be a special cluster queue with a 
>> priority of 19 for these jobs.  Then any other job running on the 
>> nodes in another queue with a priority of 0 will get most of the 
>> CPU time.
>>
>> But anyway, if I understand your script correctly: you want to 
>> suspend all jobs from a user on a node by selecting him/her by the 
>> $LOGNAME in top?  Then the user's name may not appear in any other 
>> field at all, and only one job per user per node is the 
>> limitation.  And: head -2 will list only the first two lines - at 
>> least I get only two blank lines with it (which platform/OS are 
>> you using?).
>>
>> Whichever you decide - your script, or a special cluster queue for 
>> MPICH-GM - the ps command is better suited, because there you can 
>> specify a user and an output format, hence the complete:
>>
>> top -b -n1 | grep $LOGNAME | head -2 | awk '{print $1}'
>>
>> can be:
>>
>> ps --user $LOGNAME -o pid --no-headers
>>
>>
>> Next enhancement is not to stop each process on its own, but the whole 
>> process group (if you have a tight integration according to the Howto 
>> for MPICH, which also has a hint for MPICH-GM) [of course: it's 
>> untested]:
>>
>> for proc in `rsh $nodes ps --user $LOGNAME -o pgrp --no-headers | 
>> uniq` ; do
>>     rsh $nodes kill -19 -- -$proc
>> done
>>
>> If there is only one job on the node, you wouldn't need the loop 
>> at all now.  Did you verify on the nodes that your job is really 
>> suspended by your script, by looking at e.g. the "ps -e f" output 
>> for the STAT field, which shows T for stopped jobs?
>>
>>
>> Cheers - Reuti
>>
>>
>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>
>>
>>> Reuti,
>>>
>>>   Thanks for your answer, and sorry for not giving enough 
>>> details.  I am using GE 6.0u1 with the MPICH-GM implementation. 
>>> I found nothing interesting in the qmaster messages file.  My 
>>> script simply sends SIGSTOP signals to all the MPI processes on 
>>> all nodes that are members of the job.  I tested it with a simple 
>>> communication program, but I still have to test it in a real 
>>> production environment as a next step.  You will find my script 
>>> below.  The "unsuspend" script simply sends a SIGCONT signal to 
>>> the processes.
>>>
>>>   I prevent re-suspension by creating a marker file at the first 
>>> suspension.
>>>
>>> F=/tmp/suspend_MPI_job.$LOGNAME.log
>>> touch $F
>>>
>>> if [ -f $TMPDIR/suspended ];then
>>>   echo "`date` Job already suspended; exiting" >> $F
>>>   exit
>>> fi
>>> #
>>> # For each node
>>> #
>>>   for nodes in `cat $TMPDIR/machines | /usr/bin/uniq`
>>>   do
>>> #
>>> # Create a file that contains PIDs of suspended processes
>>> #
>>>     touch $TMPDIR/$nodes
>>>     > $TMPDIR/$nodes
>>> #
>>> # Determine processes to suspend
>>> #
>>>     for proc in `rsh $nodes top -b -n1 | grep $LOGNAME | head -2 | 
>>> awk '{print $1}'`
>>>     do
>>>       echo "`date` Suspending process $proc on $nodes" >> $F
>>>       echo $proc >> $TMPDIR/$nodes
>>>       rsh $nodes kill -19 $proc
>>>     done
>>>   done
>>> touch $TMPDIR/suspended
>>>
>>>
>>> Reuti wrote:
>>>
>>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>>
>>>>
>>>>
>>>>>  I wrote a script to suspend MPI jobs in a queue and included 
>>>>> its path in the "suspend_method" field of the queue 
>>>>> configuration.  My problem is that SGE keeps trying to suspend 
>>>>> the job again every minute, even though I set up my queue like:
>>>>>
>>>>> suspend_thresholds    load_avg=3.0
>>>>> nsuspend              0
>>>>> suspend_interval      INFINITY
>>>>
>>>>
>>>>
>>>> It may not be easy to suspend an MPI job at all (which MPI 
>>>> implementation?), because of possible timeouts in the 
>>>> communication.  What are you doing in your script exactly, which 
>>>> version of SGE, and: are there any entries in the messages files 
>>>> of the qmaster and/or execd?  Just 60 seconds is exactly the 
>>>> default notify time - how did you submit your job? - Reuti
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
> 

-- 
Patrice Hamelin ing, M.Sc.A, CCNA
Systems Administrator
CLUMEQ Supercomputer Centre
McGill University
688 Sherbrooke Street West, Suite 710
Montreal, QC, Canada H3A 2S6
Tel: 514-398-3344
Fax: 514-398-2203
http://www.clumeq.mcgill.ca

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



