[GE users] Job suspended every 60 seconds

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Tue Mar 15 12:43:00 GMT 2005



Hi Reuti,

thanks for the correction.

We have a bit of overlapping functionality here. Setting 
reprioritize_interval to 0 disables reprioritization during the 
job's runtime. However, a job gets started with a certain amount 
of tickets, and the execd uses this initial ticket amount to 
figure out its nice value. This initial renicing can be disabled 
with the cluster setting. Unfortunately, the cluster setting is 
not yet documented in the man pages, but it will be with the 
next update.

The cluster setting could also be used as a host parameter to 
enable reprioritization for certain hosts and disable it for 
others. This is not implemented yet, but could be done. One can 
already specify reprioritize for a given host, but it is ignored.
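Put concretely - a sketch of the two settings involved, assuming the 
SGE 6.0 attribute names discussed in this thread:

```
# scheduler configuration (edit with: qconf -msconf)
# 0:0:0 turns off reprioritization during job runtime
reprioritize_interval   0:0:0

# cluster configuration (edit with: qconf -mconf)
# false disables the initial renicing the execd derives
# from the job's starting ticket amount
reprioritize            false
```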

Stephan

Reuti wrote:

> Stephan, small typo I guess - should be:
>
> reprioritize false
>
>
> But anyway, in libs/sched/sgeee.c I see:
>
> bool update_execd = ( reprioritize_interval == 0 || (now >= (past + 
> reprioritize_interval)));
>
> Why not:
>
> bool update_execd = ( reprioritize_interval != 0 && (now >= (past + 
> reprioritize_interval)));
>
> since for "reprioritize_interval == 0" the second expression 
> always seems to be true - maybe I'm wrong in this place. But: the 
> man page sched_conf states that 0:0:0 is turning it off already.
>
> What's now the truth? - Reuti
>
>
> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>> You should also set the
>>
>> reprioritze false
>>
>> setting in the cluster configuration (qconf -mconf)
>>
>> Stephan
>>
>> Reuti wrote:
>>
>>> Mmh, the reprioritize_interval in the scheduler is set to 0:0:0, so 
>>> that SGE is not changing the priority on its own? For me it's 
>>> working and I get a nice of 19 on all slave nodes of a parallel job. 
>>> Are some jobs going to the wrong queue?
>>>
>>> My idea was not to renice the jobs during their execution, but to 
>>> start them already with 19 on all nodes. When there is only one 
>>> job, it will get most of the CPU time anyway, even at nice 19.
>>>
>>> Cheers - Reuti
>>>
>>> Patrice Hamelin wrote:
>>>
>>>> Reuti,
>>>>
>>>>   I tried to renice the processes with the priority setting of the 
>>>> queues, and it shares the processors between processes (66%-33% 
>>>> ratio).  I also tried to renice the parallel job to the lower 
>>>> priority of 19, but it is not propagated to the slave nodes.  Only 
>>>> the master node processes have a higher scheduling priority.
>>>>
>>>>   I think I will stick to my suspension scripts, since I really 
>>>> need the higher priority queue to get all the processors whenever 
>>>> it needs them.  I would warn the users of the lower priority queue 
>>>> that the results can be bad if suspension occurs.
>>>>
>>>> Thanks.
>>>>
>>>> Patrice Hamelin wrote:
>>>>
>>>>> Reuti,
>>>>>
>>>>>   At first I was killing all the user's processes, but there was a 
>>>>> problem with that.  The suspend script runs as the user ID 
>>>>> itself, and not as sgeadmin, as I first thought.  The result was 
>>>>> that the shell running the kills was killing itself, leading to 
>>>>> unwanted results.
>>>>>
>>>>>   I tested my suspension script with two different MPI codes, one 
>>>>> that does only communication, and another one that computes a 
>>>>> Jacobi integration in parallel.  The main problem in getting the 
>>>>> PIDs is that I really have to target the two processes that are 
>>>>> running the user's code and eating 99% or so of the CPU each.  I 
>>>>> have to suspend only those two processes.  I verified that the 
>>>>> processes are in the T state after the kill -19 command was sent 
>>>>> to them.
>>>>>
>>>>>   I agree that there may be weird results doing that kind of 
>>>>> operation on an MPICH job, and I will also test the queue 
>>>>> priority of 19 that you mentioned.  It looks promising!
>>>>>
>>>>> Ciao!
>>>>>
>>>>> Reuti wrote:
>>>>>
>>>>>> Hi Patrice,
>>>>>>
>>>>>> I really think it's not a good idea to suspend an MPICH-GM job. 
>>>>>> IMO the easier solution would be to have a special cluster queue 
>>>>>> with a priority of 19 for them. That way, any other job running 
>>>>>> on the nodes in another queue with a priority of 0 will get most 
>>>>>> of the CPU time.
>>>>>>
>>>>>> But anyway, if I understand your script correctly: you want to 
>>>>>> suspend all jobs from a user on a node by selecting him/her by 
>>>>>> the $LOGNAME in top? Then the user's name may not appear in any 
>>>>>> other field at all, and only one job per user per node is the 
>>>>>> limitation. And: head -2 will list only the first two lines; at 
>>>>>> least I get only two blank lines with it (which platform/OS are 
>>>>>> you using?).
>>>>>>
>>>>>> Whether you decide to use it or get a special cluster queue for 
>>>>>> MPICH-GM: the ps command is better suited, because there you can 
>>>>>> specify a user and an output format; hence the complete:
>>>>>>
>>>>>> top -b -n1 | grep $LOGNAME | head -2 | awk '{print $1}'
>>>>>>
>>>>>> can be:
>>>>>>
>>>>>> ps --user $LOGNAME -o pid --no-headers
>>>>>>
>>>>>>
>>>>>> The next enhancement is not to stop each process on its own, but 
>>>>>> the whole process group (if you have a tight integration 
>>>>>> according to the Howto for MPICH, which also has a hint for 
>>>>>> MPICH-GM) [of course: it's untested]:
>>>>>>
>>>>>> for proc in `rsh $nodes ps --user $LOGNAME -o pgrp --no-headers | 
>>>>>> uniq` ; do
>>>>>>     rsh $nodes kill -19 -- -$proc
>>>>>> done
>>>>>>
>>>>>> If there is only one job on the node, you wouldn't need the loop 
>>>>>> at all now. Did you verify on the nodes that your job is really 
>>>>>> suspended by your script, by looking in the e.g. "ps -e f" 
>>>>>> output for the STAT field, which will show T for stopped 
>>>>>> processes?
>>>>>>
>>>>>>
>>>>>> Cheers - Reuti
>>>>>>
>>>>>>
>>>>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>>>>
>>>>>>
>>>>>>> Reuti,
>>>>>>>
>>>>>>>   Thanks for your answer, and sorry for not giving enough 
>>>>>>> details.  I am using GE 6.0u1 with the MPICH-GM implementation.  
>>>>>>> I found nothing interesting in the qmaster message file.  My 
>>>>>>> script simply sends SIGSTOP signals to all the MPI processes on 
>>>>>>> all nodes that are members of the job.  I tested it with a 
>>>>>>> simple communication program, but I still have to test it in a 
>>>>>>> real production environment, next step.  You will find my 
>>>>>>> script below.  The "unsuspend" script simply sends SIGCONT 
>>>>>>> signals to the processes.
>>>>>>>
>>>>>>>   I fooled the re-suspension by creating a marker file at the 
>>>>>>> first suspension.
>>>>>>>
>>>>>>> F=/tmp/suspend_MPI_job.$LOGNAME.log
>>>>>>> touch $F
>>>>>>>
>>>>>>> if [ -f $TMPDIR/suspended ];then
>>>>>>>   echo "`date` Job already suspended; exiting" >> $F
>>>>>>>   exit
>>>>>>> fi
>>>>>>> #
>>>>>>> # For each node
>>>>>>> #
>>>>>>>   for nodes in `cat $TMPDIR/machines | /usr/bin/uniq`
>>>>>>>   do
>>>>>>> #
>>>>>>> # Create a file that contains PIDs of suspended processes
>>>>>>> #
>>>>>>>     touch $TMPDIR/$nodes
>>>>>>>     > $TMPDIR/$nodes
>>>>>>> #
>>>>>>> # Determine processes to suspend
>>>>>>> #
>>>>>>>     for proc in `rsh $nodes top -b -n1 | grep $LOGNAME | head -2 
>>>>>>> | awk '{print $1}'`
>>>>>>>     do
>>>>>>>       echo "`date` Suspending process $proc on $nodes" >> $F
>>>>>>>       echo $proc >> $TMPDIR/$nodes
>>>>>>>       rsh $nodes kill -19 $proc
>>>>>>>     done
>>>>>>>   done
>>>>>>> touch $TMPDIR/suspended
>>>>>>>
>>>>>>>
>>>>>>> Reuti wrote:
>>>>>>>
>>>>>>>> Quoting Patrice Hamelin <phamelin at clumeq.mcgill.ca>:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>  I wrote a script to suspend MPI jobs in a queue and included 
>>>>>>>>> its path in the "suspend_method" field of the queue 
>>>>>>>>> configuration.  My problem is that SGE keeps trying to 
>>>>>>>>> suspend the job again every minute, even though I set up my 
>>>>>>>>> queue like:
>>>>>>>>>
>>>>>>>>> suspend_thresholds    load_avg=3.0
>>>>>>>>> nsuspend              0
>>>>>>>>> suspend_interval      INFINITY
>>>>>>>>
>>>>>>>> It may not be easy to suspend a MPI job at all (which MPI 
>>>>>>>> implementation?), because of possible timeouts in the 
>>>>>>>> communication. What are you doing in your script exactly, 
>>>>>>>> which version of SGE, and: are there any entries in the 
>>>>>>>> messages files of the qmaster and/or execd? Just 60 seconds is 
>>>>>>>> just like the default notify time - how did you submit your 
>>>>>>>> job? - Reuti
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------- 
>>>>>>>>
>>>>>>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>>>>>> For additional commands, e-mail: 
>>>>>>>> users-help at gridengine.sunsource.net
>>>>>>>>
>>>>>>>
>>>>>>> -- 
>>>>>>> Patrice Hamelin ing, M.Sc.A, CCNA
>>>>>>> Systems Administrator
>>>>>>> CLUMEQ Supercomputer Centre
>>>>>>> McGill University
>>>>>>> 688 Sherbrooke Street West, Suite 710
>>>>>>> Montreal, QC, Canada H3A 2S6
>>>>>>> Tel: 514-398-3344
>>>>>>> Fax: 514-398-2203
>>>>>>> http://www.clumeq.mcgill.ca
>>>>>>>

