[GE users] Using qdel leaves queues in error status

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Tue May 27 15:12:50 BST 2008


Hi Filipe,

On Tue, 27 May 2008, Filipe Brandenburger wrote:

> Hi Andreas,
>
> Thank you very much for your answer. I will consider moving the local
> queues (and as long as I'm at it, the binaries as well) to local disk.

Good.

> Andreas.Haas at Sun.COM wrote:
>> Another possibility is to upgrade to 6.1u4. That way you would
>>    752      6288953   scalability issue with qdel and very large array jobs
>
> That's great!
>
> I found the bug report for this issue here:
> http://gridengine.sunsource.net/issues/show_bug.cgi?id=752
> However, I couldn't find the actual patch (#6288953). Could you please
> point me to it?
>
> I was wondering if this patch would be simple and non-intrusive enough
> that I could apply it to 6.0, because the grid right now is very busy
> and it's probably going to be quite long until I will be able to upgrade
> it to 6.1.
>

#6288953 is not a patch ID. It is the ID used by the bug tracking
tool that we use inside Sun, so you can't access it unless you have
an SGE license/support contract.

As for the patches, you would need at least 6.0u11 if you stay with the 6.0 version

    http://gridengine.sunsource.net/downloads/60/download.html

or you could upgrade to the most recent 6.1 release, which is 6.1u4

    http://gridengine.sunsource.net/downloads/61/download.html

but note that #752 specifically mentions array jobs. Do you have array jobs?
If not, you should not expect #752 to bring you any improvement.

>
> There is another thing about this problem that I would like to try to
> understand. It happened twice, but the first time processes got the HUP
> signal, and the second time they got the KILL signal:
>
>> 05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972300.1 failed on host s14.mydomain.com assumedly after job because: job 7972300.1 died through signal HUP (1)
>
>> 05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011707.1 failed on host j05.mydomain.com assumedly after job because: job 8011707.1 died through signal KILL (9)
>
> I would like to understand what caused this difference in behaviour,
> since I don't really like the idea of having processes (specially lots
> of them) being killed with SIGKILL.

Are you using the -notify submit option?
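
In case it helps: as far as I remember the qsub man page, with -notify
the job gets a SIGUSR2 warning shortly before the SIGKILL (and a SIGUSR1
before a SIGSTOP), with the delay taken from the queue's notify setting.
Below is a minimal Python sketch of a job that uses that warning to stop
cleanly. The names state, on_warning and run are made up for illustration,
and the signal mapping is my assumption from the man page, so please check
it against your version.

```python
import signal
import time

# Shared flag set by the signal handler; hypothetical name.
state = {"stop": False}

def on_warning(signum, frame):
    """Record that a warning signal arrived; do the real cleanup in run()."""
    state["stop"] = True

# Assumption: with -notify, the warning ahead of SIGKILL arrives as SIGUSR2.
signal.signal(signal.SIGUSR2, on_warning)

def run(steps, work=lambda i: time.sleep(0.01)):
    """Process up to `steps` work items, stopping early once warned.

    Returns the number of items actually completed.
    """
    done = 0
    for i in range(steps):
        if state["stop"]:
            break  # leave the loop so results can be flushed before SIGKILL
        work(i)
        done += 1
    return done

if __name__ == "__main__":
    print("completed", run(100), "steps")
```

The handler only sets a flag, so the main loop decides when it is safe to
stop. That keeps the handler trivial and avoids doing I/O inside it.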

> Is it something with qdel that
> activates the KILL signal, like the -f argument?

Don't think so.
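
Going back to the two qmaster messages lines you quoted: if you want to
see how often each signal shows up, a small script can tally them. This
is only a sketch. The pattern is modelled on just those two lines, the
field names (daemon, host, level) are my own labels, and other message
formats are simply skipped.

```python
import re
from collections import Counter

# Pattern built only from the two message lines quoted in this thread.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<daemon>[^|]*)\|(?P<host>[^|]*)\|(?P<level>[^|]*)\|"
    r"job (?P<job>\d+)\.(?P<task>\d+) failed on host (?P<exec_host>\S+) .*"
    r"died through signal (?P<signal>\w+) \((?P<signo>\d+)\)"
)

def parse_line(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

def count_signals(lines):
    """Count how many matching lines report each signal name."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec:
            counts[rec["signal"]] += 1
    return counts

# The two lines quoted earlier in this mail.
SAMPLE = [
    "05/20/2008 09:29:44|qmaster|sgemaster|W|job 7972300.1 failed on host "
    "s14.mydomain.com assumedly after job because: job 7972300.1 died "
    "through signal HUP (1)",
    "05/21/2008 18:42:10|qmaster|sgemaster|W|job 8011707.1 failed on host "
    "j05.mydomain.com assumedly after job because: job 8011707.1 died "
    "through signal KILL (9)",
]

if __name__ == "__main__":
    for name, n in sorted(count_signals(SAMPLE).items()):
        print(name, n)
```

To run it over a real messages file, pass the open file object to
count_signals() instead of SAMPLE.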

Regards,
Andreas
