[GE users] SGE6 does not backfill

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Sat Apr 16 11:01:38 BST 2005


Hello,

I understand the bug now. Signaling a PE job that is in transferring state 
most likely kills the job (80%).
A signal can be generated by:

qmod -s "*"
qmod -us "*"

the scheduler, through the suspend_thresholds settings

the qmaster calendar.

Because the PE startup process is now faster, this should happen only 
very rarely. Based on my data, a misplaced qmod -s or qmod -us is the 
only remaining way to kill a PE job. The scheduler should be too slow 
to trigger it.
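
As a rough sketch of how a misplaced qmod -s could be avoided, one might
check for jobs in transfer state before suspending. This is not part of
Grid Engine: the function names are made up for illustration, and it
assumes the default qstat listing, with two header lines and the job state
in the fifth column (a lowercase "t" marks a transferring job).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Engine.

# Reads qstat-style output on stdin and succeeds if any job's state
# contains a lowercase "t" (transferring). Assumes two header lines and
# the state in column 5, as in the default qstat listing.
has_transferring_jobs() {
    awk 'NR > 2 && $5 ~ /t/ { found = 1 } END { exit(found ? 0 : 1) }'
}

# Suspends all queues only when no job is currently being transferred.
safe_suspend_all() {
    if qstat -u "*" | has_transferring_jobs; then
        echo "jobs in transfer state, not suspending" >&2
        return 1
    fi
    qmod -s "*"
}
```

A job can still enter transfer state between the check and the qmod call,
so this only narrows the window; it does not close it.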

Cheers,
Stephan

Christian Bolliger wrote:

> Hi Juha
>
> How many files are open on your master system (lsof | wc -l)?
> Have you also checked the hard file descriptor limit ('ulimit -Hn' 
> or 'limit -h descriptors')?
>
> What happened here is that we lost jobs because of the hard file 
> descriptor limit (but we have 256 nodes).
>
> Best regards
> Christian
>
> Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>
>> Hi Juha,
>>
>> I just checked in a fix for one of the problems we found while
>> evaluating your problem report. PE jobs should now start much faster
>> and the likelihood of losing jobs should be nearly 0.
>> Could you download the latest changes and test them in your env.?
>>
>> Thank you very much for your detailed problem analysis and your help
>> with testing.
>>
>> Kind Regards,
>> Stephan
>>
>> Juha Jäykkä wrote:
>>
>>  
>>
>>>> For me it's only dropped if there is something running in all slots
>>>> of a queue. But not for a reservation. - Reuti
>>>>  
>>>>     
>>>
>>> Perhaps this is the cause of my trouble then. Any ideas how to fix this?
>>> It's quite frustrating to have a cluster which has 23 CPUs out of 24
>>> just doing nothing (current situation), because one job reserves all
>>> the CPUs and backfill does not work at all.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>  
>>
>