[GE users] SGE Rescheduling

Sreenath Nampally sreenath at tigr.ORG
Thu Aug 3 14:13:51 BST 2006



Dan,

This is exactly the same issue that I am seeing. The rescheduled job
starts even before the original job is killed. After the reschedule, I
can see both the original and the rescheduled job running for some time
before the original is killed. This is causing some issues for us.

Is it possible that SGE issued a KILL signal to the original job but,
for some reason, the job was not killed and stayed on the exechost?
I may have seen this behavior as well. In that case, SGE does not know
that the original jobs are still running, so it continues scheduling
more jobs to this exechost, which causes heavy load on the exechost.

BTW, how are you compensating for this asynchronous rescheduling?

Thanks
Sree


Gruhn Daniel J Contractor AF/A9IT wrote:

>One additional thing, I don't think the bug with rescheduling is fixed yet.
>That bug is that rescheduling seems to be an asynchronous process.  That is,
>the rescheduled job may be able to get started before the original job is
>killed.  In my case this makes a difference and I have to compensate for it.
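
One possible way to guard against two overlapping runs is a lock that the job takes at startup. This is only a sketch and not necessarily what Dan does: LOCK_DIR is a hypothetical path that would have to live on a filesystem shared by all execution hosts, and the helper names are made up for illustration.

```shell
#!/bin/sh
# Sketch only. LOCK_DIR is a hypothetical path; it must be visible from
# every execution host for the guard to mean anything.
LOCK_DIR="${LOCK_DIR:-./myjob.lock}"

acquire_lock() {
    # $1 = maximum number of seconds to wait for the lock.
    # mkdir either creates the directory or fails, so it serves as a lock.
    waited=0
    until mkdir "$LOCK_DIR" 2>/dev/null; do
        if [ "$waited" -ge "$1" ]; then
            return 1    # still held by another run; give up
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

release_lock() {
    # Remove the lock directory; ignore the error if it is already gone.
    rmdir "$LOCK_DIR" 2>/dev/null
}
```

A job script could call `acquire_lock 300 || exit 1` before doing any work and call `release_lock` when it finishes. Since SIGKILL cannot be trapped, the original run can only release the lock on a reschedule if it was submitted with -notify and releases it from its SIGUSR2 handler. Otherwise the lock goes stale and the rescheduled run gives up after the timeout.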
>
>Dan
>
>//SIGNED//
>Daniel J.Gruhn, CTR (Group W Inc.)
>HQ USAF/A9IT
>Studies & Analyses, Assessments and Lessons Learned
>
>
>-----Original Message-----
>From: Reuti [mailto:reuti at staff.uni-marburg.de] 
>Sent: Thursday, August 03, 2006 7:33 AM
>To: users at gridengine.sunsource.net
>Subject: Re: [GE users] SGE Rescheduling
>
>Hi,
>
>On 02.08.2006 at 23:14, Sreenath Nampally wrote:
>
>  
>
>>Hello,
>>
>>Could someone explain the sequence of events that happen in SGE (both 
>>on qmaster and exec host) when a job is rescheduled  and suspended? 
>>What signals are sent to the job ?
>>    
>>
>
>If the job gets suspended, it will get a SIGSTOP, which you can't catch. But
>you can submit the job with -notify to get a warning beforehand, which you
>can catch. Have a look at `man qsub`; you can even redefine the
>signal: see `man sge_conf`, section execd_params. But be aware that the signal
>will be sent to the whole process group, and this might need proper handling
>in the job script and the compiled program.
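
As a rough illustration of catching the -notify warnings in a job script, here is a minimal sketch. It assumes the default signals documented in `man qsub` (SIGUSR1 ahead of SIGSTOP, SIGUSR2 ahead of SIGKILL) and that they have not been redefined in execd_params. NOTIFY_LOG, STOP_REQUESTED and the function names are made up for the example.

```shell
#!/bin/sh
#$ -notify
# Sketch only. Assumes the default -notify signals: SIGUSR1 before
# SIGSTOP (suspend) and SIGUSR2 before SIGKILL.

# Hypothetical log file; any location writable by the job will do.
NOTIFY_LOG="${NOTIFY_LOG:-./notify.log}"
STOP_REQUESTED=0

on_suspend_warning() {
    # Runs on SIGUSR1: the job is about to be suspended.
    echo "suspend warning" >> "$NOTIFY_LOG"
}

on_kill_warning() {
    # Runs on SIGUSR2: the job is about to be killed.
    echo "kill warning" >> "$NOTIFY_LOG"
    STOP_REQUESTED=1
}

install_notify_handlers() {
    trap on_suspend_warning USR1
    trap on_kill_warning USR2
}

install_notify_handlers
```

The main work loop of the script can then check STOP_REQUESTED and clean up before the kill arrives. Because the signal goes to the whole process group, any program started by the script receives it too, and a program that does not handle SIGUSR1 or SIGUSR2 is terminated by the default action.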
>
>If you reschedule a job, it will be killed, and before this you can also
>get a warning via -notify. But I think you will only get the information
>about the kill, not the reason (that it will be rescheduled). Only during
>the next run can you test whether the variable RESTARTED is 1. If you
>need more sophisticated handling, you can also try to use the
>checkpointing interface.
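
The RESTARTED check could look roughly like the following in a job script. This is a sketch: the function names and the "resume"/"fresh" labels are made up, and the fallback to 0 only exists so the snippet also runs outside of SGE, where the variable is not set.

```shell
#!/bin/sh
# Sketch only. RESTARTED comes from the SGE job environment; it is
# treated as 0 here when it is not set at all.

is_restarted() {
    [ "${RESTARTED:-0}" = "1" ]
}

run_mode() {
    # Print which path the job should take on this run.
    if is_restarted; then
        echo "resume"
    else
        echo "fresh"
    fi
}

echo "run mode: $(run_mode)"
```

On a "resume" run the script would typically look for partial output left behind by the previous run before starting over.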
>
>HTH - Reuti
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>For additional commands, e-mail: users-help at gridengine.sunsource.net
