[GE users] Job suspended every 60 seconds

Reuti reuti at staff.uni-marburg.de
Thu Mar 10 16:22:54 GMT 2005


Mmh, the reprioritize_interval in the scheduler is set to 0:0:0, so that 
SGE is not changing the priority on its own? For me it's working, and I 
get a nice of 19 on all slave nodes of a parallel job. Are some jobs 
going to the wrong queue?

My idea was not to renice the jobs during their execution, but to start 
them at nice 19 on all nodes right away. When there is only one job, it 
will get most of the CPU time anyway, even at nice 19.
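
That nice-19 start maps onto the queue's priority attribute; a sketch, 
assuming a dedicated cluster queue for the MPICH-GM jobs (the name 
mpich.q is hypothetical):

```shell
# Sketch only: edit the dedicated MPICH-GM queue (hypothetical name)
#
#   qconf -mq mpich.q
#
# and in the queue configuration set the priority attribute, so every
# job dispatched to this queue is started at nice 19 on all its nodes:
#
#   priority              19
#
# Jobs in other queues keep priority 0 and therefore win most of the
# CPU time whenever they share a node with an mpich.q job.
```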

Cheers - Reuti

Patrice Hamelin wrote:
> Reuti,
> 
>   I tried to renice the processes with the Priority of the queues, and it 
> shares the processors between processes (66%-33% ratio).  I also tried 
> to renice the parallel job to the lowest priority, 19, but it is not 
> propagated to the slave nodes.  Only the master node's processes get the 
> changed scheduling priority.
> 
>   I think I will stick to my suspension scripts, since I really need the 
> higher-priority queue to get ALL the processors whenever they need them.  I 
> will warn the users of the lower-priority queue that the results can be 
> bad if suspension occurs.
> 
> Thanks.
> 
> Patrice Hamelin wrote:
> 
>> Reuti,
>>
>>   At first I was killing all of the user's processes, but there was a 
>> problem with that.  The suspend script runs as the user ID 
>> itself, and not as sgeadmin, as I first thought.  The result was that 
>> the shell running the kills was killing itself, leading to unwanted 
>> results.
>>
>>   I tested my suspension script with two different MPI codes, one that 
>> does only communication, and another one that computes Jacobi 
>> integration in parallel.  The main problem in getting the PIDs is that 
>> I really have to target the two processes that are running the 
>> user's code and eating 99% or so of the CPU each; I have to suspend 
>> only those two processes.  I verified that the processes are in the T 
>> state after the kill -19 command was sent to them.
>>
>>   I agree that there may be weird results doing that kind of operation 
>> on an MPICH job, and I will also test the queue priority of 19 that you 
>> mentioned.  It looks promising!
>>
>> Ciao!
>>
>> Reuti wrote:
>>
>>> Hi Patrice,
>>>
>>> I really think it's not a good idea to suspend an MPICH-GM job. IMO 
>>> the easier solution would be to have a special cluster queue with a 
>>> priority of 19 for these jobs. Then any other job running on the nodes 
>>> in another queue with a priority of 0 will get most of the CPU time.
>>>
>>> But anyway, if I understand your script correctly: you want to 
>>> suspend all jobs of a user on a node by selecting him/her by the 
>>> $LOGNAME in top? Then the user's name may not appear in any other 
>>> field at all, and only one job per user per node is the limitation. 
>>> And: head -2 will list only the first two lines - at least I get only 
>>> two blank lines with it (which platform/OS are you using?).
>>>
>>> Whether you decide to use it, or get a special cluster queue for 
>>> MPICH-GM: better suited is the ps command, because there you can 
>>> specify a user and output format, hence the complete:
>>>
>>> top -b -n1 | grep $LOGNAME | head -2 | awk '{print $1}'
>>>
>>> can be:
>>>
>>> ps --user $LOGNAME -o pid --no-headers
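
The ps variant is easy to try on any Linux box (procps ps); a minimal 
sketch, with a fallback added in case $LOGNAME is unset:

```shell
# List the PIDs of all processes owned by the current user, without a
# header line - the ps-based replacement for the top|grep|head pipeline.
LOGNAME=${LOGNAME:-$(id -un)}    # fallback: LOGNAME may be unset in scripts
ps --user "$LOGNAME" -o pid --no-headers
```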
>>>
>>>
>>> The next enhancement is not to stop each process on its own, but the 
>>> whole process group (if you have a tight integration according to the 
>>> Howto for MPICH, which also has a hint for MPICH-GM) [of course: it's 
>>> untested]:
>>>
>>> for proc in `rsh $nodes ps --user $LOGNAME -o pgrp --no-headers | 
>>> uniq` ; do
>>>     rsh $nodes kill -19 -- -$proc
>>> done
>>>
>>> If there is only one job on the node, you wouldn't need the loop at 
>>> all now. Did you verify on the nodes that your job is really 
>>> suspended by your script, by looking in e.g. the "ps -e f" output 
>>> for the STAT field, which shows T for stopped jobs?
>>>
>>>
>>> Cheers - Reuti
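
The group stop is easy to verify on a single node without rsh; a minimal 
sketch (Linux, using util-linux setsid), with a setsid'd sleep standing in 
for an MPI slave process. Note that -19 is SIGSTOP and -18 is SIGCONT on 
Linux, so the symbolic names are safer:

```shell
# Put a stand-in "slave" into its own process group, stop the whole
# group with SIGSTOP, verify the T state, then resume and clean up.
setsid sleep 60 &                    # new session => its own process group
pid=$!
sleep 1
pgrp=$(ps -o pgrp= -p "$pid" | tr -d ' ')

kill -STOP -- "-$pgrp"               # negative PID targets the whole group
sleep 1
ps -o stat= -p "$pid"                # first letter is T while stopped

kill -CONT -- "-$pgrp"               # resume the group
sleep 1
ps -o stat= -p "$pid"                # back to S (sleeping)

kill -TERM -- "-$pgrp" 2>/dev/null   # clean up the stand-in
```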
>>>
>>>
>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>
>>>
>>>> Reuti,
>>>>
>>>>   Thanks for your answer, and sorry for not giving enough details.  I 
>>>> am using GE 6.0u1 with the MPICH-GM implementation.  I found nothing 
>>>> interesting in the qmaster messages file.  My script simply sends a 
>>>> SIGSTOP signal to all the MPI processes on all nodes that are members 
>>>> of the job.  I tested it with a simple communication program, but I 
>>>> still have to test it in a real production environment, next step.  You 
>>>> will find my script below.  The "unsuspend" script simply sends the 
>>>> SIGCONT signal to the processes.
>>>>
>>>>   I fooled the re-suspension by creating a file at the first 
>>>> suspension.
>>>>
>>>> F=/tmp/suspend_MPI_job.$LOGNAME.log
>>>> touch $F
>>>>
>>>> if [ -f $TMPDIR/suspended ];then
>>>>   echo "`date` Job already suspended; exiting" >> $F
>>>>   exit
>>>> fi
>>>> #
>>>> # For each node
>>>> #
>>>>   for nodes in `cat $TMPDIR/machines | /usr/bin/uniq`
>>>>   do
>>>> #
>>>> # Create a file that contains PIDs of suspended processes
>>>> #
>>>>     touch $TMPDIR/$nodes
>>>>     > $TMPDIR/$nodes
>>>> #
>>>> # Determine processes to suspend
>>>> #
>>>>     for proc in `rsh $nodes top -b -n1 | grep $LOGNAME | head -2 | 
>>>> awk '{print $1}'`
>>>>     do
>>>>       echo "`date` Suspending process $proc on $nodes" >> $F
>>>>       echo $proc >> $TMPDIR/$nodes
>>>>       rsh $nodes kill -19 $proc
>>>>     done
>>>>   done
>>>> touch $TMPDIR/suspended
>>>>
>>>>
>>>> Reuti wrote:
>>>>
>>>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>>>
>>>>>
>>>>>
>>>>>>  I wrote a script to suspend MPI jobs in a queue and included its 
>>>>>> path in the "suspend_method" field of the queue configuration.  My 
>>>>>> problem is that SGE keeps trying to suspend the job again every 
>>>>>> minute, even though I set up my queue like:
>>>>>>
>>>>>> suspend_thresholds    load_avg=3.0
>>>>>> nsuspend              0
>>>>>> suspend_interval      INFINITY
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> It may not be easy to suspend an MPI job at all (which MPI 
>>>>> implementation?), because of possible timeouts in the communication. 
>>>>> What are you doing in your script exactly, which version of SGE, and 
>>>>> are there any entries in the messages files of the qmaster and/or 
>>>>> execd? Just 60 seconds is exactly the default notify time - how did 
>>>>> you submit your job? - Reuti
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>>>
>>>>
>>>> -- 
>>>> Patrice Hamelin ing, M.Sc.A, CCNA
>>>> Systems Administrator
>>>> CLUMEQ Supercomputer Centre
>>>> McGill University
>>>> 688 Sherbrooke Street West, Suite 710
>>>> Montreal, QC, Canada H3A 2S6
>>>> Tel: 514-398-3344
>>>> Fax: 514-398-2203
>>>> http://www.clumeq.mcgill.ca
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
> 






More information about the gridengine-users mailing list