[GE users] Job suspended every 60 seconds

Reuti reuti at staff.uni-marburg.de
Tue Mar 15 13:45:53 GMT 2005


Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
> Hi Reuti,
> 
> thanks for the correction.
> 
> We have a bit of overlapping functionality. Setting 
> reprioritize_interval to 0 disables
> reprioritization during the job's runtime. However, a job is started 
> with a certain number
> of tickets.
> The execd uses this initial ticket amount to figure out the job's nice 
> value. This initial setting can be
> disabled with the cluster setting. Unfortunately the cluster setting is 
> not documented in the man
> pages yet, but it will be with the next update.
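> 
> For illustration, the two settings might be applied roughly like this
> (a sketch; reprioritize_interval is described in sched_conf(5), while
> reprioritize is the as-yet-undocumented cluster setting mentioned
> above):
> 
>   qconf -msconf   # scheduler config: set "reprioritize_interval 0:0:0"
>                   # to stop reprioritization during the job's runtime
>   qconf -mconf    # cluster config: set "reprioritize false" to also
>                   # skip the initial renice done by the execd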

Okay - I see - Thx - Reuti

> The cluster setting could also be used as a host parameter, to enable 
> reprioritization for
> certain hosts and disable it for others. This is not implemented, but 
> could be done. One
> can already specify reprioritize for a given host, but it is ignored.
> 
> Stephan
> 
> Reuti wrote:
> 
>> Stephan, small typo I guess - should be:
>>
>> reprioritize false
>>
>>
>> But anyway, in libs/sched/sgeee.c I see:
>>
>> bool update_execd = ( reprioritize_interval == 0 || (now >= (past + reprioritize_interval)));
>>
>> Why not:
>>
>> bool update_execd = ( reprioritize_interval != 0 && (now >= (past + reprioritize_interval)));
>>
>> since for "reprioritize_interval == 0" the second expression seems 
>> always to be true - maybe I'm wrong here. But the man page sched_conf 
>> states that 0:0:0 turns it off already.
>>
>> So what's the truth? - Reuti
>>
>>
>> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>>
>>> You should also set the
>>>
>>> reprioritze false
>>>
>>> setting in the cluster configuration (qconf -mconf)
>>>
>>> Stephan
>>>
>>> Reuti wrote:
>>>
>>>> Mmh, the reprioritize_interval in the scheduler is set to 0:0:0, so 
>>>> that SGE is not changing the priority on its own? For me it's 
>>>> working, and I get a nice of 19 on all slave nodes of a parallel job. 
>>>> Are some jobs going to the wrong queue?
>>>>
>>>> My idea was not to renice the jobs during their execution, but to 
>>>> start them already at nice 19 on all nodes. When there is only one 
>>>> job, it will get most of the CPU time anyway, even at nice 19.
>>>>
>>>> Cheers - Reuti
>>>>
>>>> Patrice Hamelin wrote:
>>>>
>>>>> Reuti,
>>>>>
>>>>>   I tried to renice the processes with the priority setting of the 
>>>>> queues, and it shares the processors between processes (a 66%-33% 
>>>>> ratio).  I also tried to renice the parallel job to the lowest 
>>>>> priority, 19, but it is not propagated to the slave nodes.  Only the 
>>>>> master node's processes get the new scheduling priority.
>>>>>
>>>>>   I think I will stick to my suspension scripts, since I really need 
>>>>> the higher-priority queue to get ALL the processors whenever it 
>>>>> needs them.  I would warn the users of the lower-priority queue that 
>>>>> the results can be bad if suspension occurs.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Patrice Hamelin wrote:
>>>>>
>>>>>> Reuti,
>>>>>>
>>>>>>   At first I was killing all the user's processes, but there was a 
>>>>>> problem with that.  The suspend script runs as the user ID 
>>>>>> itself, and not as sgeadmin, as I first thought.  The result was 
>>>>>> that the shell running the kills was killing itself, leading to 
>>>>>> unwanted results.
>>>>>>
>>>>>>   I tested my suspension script with two different MPI codes, one 
>>>>>> that does only communication, and another that computes a 
>>>>>> Jacobi integration in parallel.  The main problem in getting the 
>>>>>> PIDs is that I really have to target the two processes that are 
>>>>>> running the user's code and eating 99% or so of the CPU each; I 
>>>>>> have to suspend only those two processes.  I verified that the 
>>>>>> processes are in the T state after the kill -19 command was sent to them.
>>>>>>
>>>>>>   I agree that there may be weird results from doing that kind of 
>>>>>> operation on an MPICH job, and I will also test the queue priority 
>>>>>> of 19 that you mentioned.  It looks promising!
>>>>>>
>>>>>> Ciao!
>>>>>>
>>>>>> Reuti wrote:
>>>>>>
>>>>>>> Hi Patrice,
>>>>>>>
>>>>>>> I really think it's not a good idea to suspend an MPICH-GM job. 
>>>>>>> IMO the easier solution would be a special cluster queue 
>>>>>>> with a priority of 19 for them, so any other job running on the 
>>>>>>> nodes in another queue with a priority of 0 will get most of the 
>>>>>>> CPU time.
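>>>>>>>
>>>>>>> As a rough sketch (queue names made up), the nice value comes from
>>>>>>> the queue's "priority" attribute in the queue configuration:
>>>>>>>
>>>>>>>   qconf -mq lowprio.q   # set "priority 19" for the MPICH-GM jobs
>>>>>>>   qconf -mq normal.q    # keep "priority 0" for everything else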
>>>>>>>
>>>>>>> But anyway, if I understand your script correctly: you want to 
>>>>>>> suspend all jobs of a user on a node by selecting him/her via 
>>>>>>> $LOGNAME in the top output? Then the user's name may not appear 
>>>>>>> in any other field at all, and only one job per user per node is 
>>>>>>> the limit. And: head -2 will list only the first two lines; at 
>>>>>>> least I get only two blank lines with it (which platform/OS are 
>>>>>>> you using?).
>>>>>>>
>>>>>>> Whether you decide to use it or set up a special cluster queue for 
>>>>>>> MPICH-GM: the ps command is better suited, because there you can 
>>>>>>> specify a user and an output format. Hence the complete:
>>>>>>>
>>>>>>> top -b -n1 | grep $LOGNAME | head -2 | awk '{print $1}'
>>>>>>>
>>>>>>> can be:
>>>>>>>
>>>>>>> ps --user $LOGNAME -o pid --no-headers
>>>>>>>
>>>>>>>
>>>>>>> The next enhancement is not to stop each process on its own, but 
>>>>>>> the whole process group (if you have a tight integration according 
>>>>>>> to the Howto for MPICH, which also has a hint for MPICH-GM) [of 
>>>>>>> course: it's untested]:
>>>>>>>
>>>>>>> for proc in `rsh $nodes ps --user $LOGNAME -o pgrp --no-headers | sort -u` ; do
>>>>>>>     rsh $nodes kill -19 -- -$proc    # stop the whole process group
>>>>>>> done
>>>>>>>
>>>>>>> If there is only one job on the node, you wouldn't need the loop 
>>>>>>> at all now. Did you verify on the nodes that your job is really 
>>>>>>> suspended by your script, by looking in the e.g. "ps -e 
>>>>>>> f" output at the STAT field, which will show T for stopped jobs?
>>>>>>>
>>>>>>>
>>>>>>> Cheers - Reuti
>>>>>>>
>>>>>>>
>>>>>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>>>>>
>>>>>>>
>>>>>>>> Reuti,
>>>>>>>>
>>>>>>>>   Thanks for your answer, and sorry for not giving enough details.  
>>>>>>>> I am using GE 6.0u1 with the MPICH-GM implementation. I found 
>>>>>>>> nothing interesting in the qmaster messages file.  My script 
>>>>>>>> simply sends SIGSTOP signals to all the MPI processes on all 
>>>>>>>> nodes that are members of the job.  I tested it with a simple 
>>>>>>>> communication program, but I still have to test it in a real 
>>>>>>>> production environment as a next step.  You will find my script 
>>>>>>>> below.  The "unsuspend" script simply sends a SIGCONT signal to the 
>>>>>>>> processes.
>>>>>>>>
>>>>>>>>   I worked around the re-suspension by creating a marker file at 
>>>>>>>> the first suspension.
>>>>>>>>
>>>>>>>> F=/tmp/suspend_MPI_job.$LOGNAME.log
>>>>>>>> touch $F
>>>>>>>>
>>>>>>>> if [ -f $TMPDIR/suspended ];then
>>>>>>>>   echo "`date` Job already suspended; exiting" >> $F
>>>>>>>>   exit
>>>>>>>> fi
>>>>>>>> #
>>>>>>>> # For each node
>>>>>>>> #
>>>>>>>>   for nodes in `cat $TMPDIR/machines | /usr/bin/uniq`
>>>>>>>>   do
>>>>>>>> #
>>>>>>>> # Create a file that contains PIDs of suspended processes
>>>>>>>> #
>>>>>>>>     touch $TMPDIR/$nodes
>>>>>>>>     > $TMPDIR/$nodes
>>>>>>>> #
>>>>>>>> # Determine processes to suspend
>>>>>>>> #
>>>>>>>>     for proc in `rsh $nodes top -b -n1 | grep $LOGNAME | head -2 | awk '{print $1}'`
>>>>>>>>     do
>>>>>>>>       echo "`date` Suspending process $proc on $nodes" >> $F
>>>>>>>>       echo $proc >> $TMPDIR/$nodes
>>>>>>>>       rsh $nodes kill -19 $proc   # -19 = SIGSTOP on Linux
>>>>>>>>     done
>>>>>>>>   done
>>>>>>>> touch $TMPDIR/suspended
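>>>>>>>>
>>>>>>>> The matching "unsuspend" could then just replay the recorded PIDs
>>>>>>>> with SIGCONT and remove the marker file - a sketch along the same
>>>>>>>> lines, untested:
>>>>>>>>
>>>>>>>> F=/tmp/unsuspend_MPI_job.$LOGNAME.log
>>>>>>>> for nodes in `cat $TMPDIR/machines | /usr/bin/uniq`
>>>>>>>> do
>>>>>>>>   for proc in `cat $TMPDIR/$nodes`
>>>>>>>>   do
>>>>>>>>     echo "`date` Resuming process $proc on $nodes" >> $F
>>>>>>>>     rsh $nodes kill -18 $proc   # -18 = SIGCONT on Linux
>>>>>>>>   done
>>>>>>>> done
>>>>>>>> rm -f $TMPDIR/suspended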
>>>>>>>>
>>>>>>>>
>>>>>>>> Reuti wrote:
>>>>>>>>
>>>>>>>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>  I wrote a script to suspend MPI jobs in a queue and included 
>>>>>>>>>> its path in the "suspend_method" field of the queue 
>>>>>>>>>> configuration.  My problem is that SGE keeps trying to suspend 
>>>>>>>>>> the job again every minute, even though I set up my queue like:
>>>>>>>>>>
>>>>>>>>>> suspend_thresholds    load_avg=3.0
>>>>>>>>>> nsuspend              0
>>>>>>>>>> suspend_interval      INFINITY
>>>>>>>>>
>>>>>>>>> It may not be easy to suspend a MPI job at all (which MPI 
>>>>>>>>> implementation?), because of possible timeouts in the 
>>>>>>>>> communication. What are you doing in your script exactly, which 
>>>>>>>>> version of SGE, and are there any entries in the messages files 
>>>>>>>>> of the qmaster and/or execd? 60 seconds is just like the default 
>>>>>>>>> notify time - how did you submit your job? - Reuti
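>>>>>>>>>
>>>>>>>>> For reference, a sketch (the PE name is made up): the notify
>>>>>>>>> delay is the queue's "notify" attribute (default 00:00:60), and
>>>>>>>>> it only applies to jobs submitted with the -notify flag:
>>>>>>>>>
>>>>>>>>>   qsub -notify -pe mpich 4 job.sh   # SIGUSR1/SIGUSR2 arrive
>>>>>>>>>                                     # "notify" seconds before
>>>>>>>>>                                     # suspend/kill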
>>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Patrice Hamelin ing, M.Sc.A, CCNA
>>>>>>>> Systems Administrator
>>>>>>>> CLUMEQ Supercomputer Centre
>>>>>>>> McGill University
>>>>>>>> 688 Sherbrooke Street West, Suite 710
>>>>>>>> Montreal, QC, Canada H3A 2S6
>>>>>>>> Tel: 514-398-3344
>>>>>>>> Fax: 514-398-2203
>>>>>>>> http://www.clumeq.mcgill.ca
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>>
>>
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list