[GE users] sgeexecd stop Doesn't Stop Jobs

Reuti reuti at staff.uni-marburg.de
Tue Oct 21 13:22:04 BST 2008


On 21.10.2008 at 13:36, Reuti wrote:

> On 21.10.2008 at 06:17, Ron Chen wrote:
>
>> Even if the behaviour is intentional, I think it is not desired.
>>
>> I think we should either keep the shepherds running, or kill both  
>> the job and the shepherd.
>
> I agree with this, especially as the job is removed from the qstat output.
>
>> Another *really important thing* is: when the execd restarts, does  
>> it read back the jobs that are already running?
> Yes, it looks like it:
>
> 10/21/2008 13:34:27|  main|pc15370|I|registered at qmaster host  
> "pc15370.Chemie.Uni-Marburg.DE"
> 10/21/2008 13:34:27|  main|pc15370|I|starting up GE 6.2 (lx24-x86)
> 10/21/2008 13:34:27|  main|pc15370|I|successfully started PDC and PTF
> 10/21/2008 13:34:27|  main|pc15370|I|checking for old jobs
> 10/21/2008 13:34:27|  main|pc15370|I|found directory of job  
> "active_jobs/87.1"
> 10/21/2008 13:34:27|  main|pc15370|I|shepherd for job active_jobs/ 
> 87.1 has pid "24767" and is  alive
>
> and qdel is also working fine.

To clarify: this was after a soft shutdown, while the shepherd was  
still running.

With a complete shutdown, all job information is lost when sgeexecd  
restarts. The logic seems to be: execd looks in active_jobs in the  
spool directory, finds the job information and looks for the  
corresponding shepherd. As it can't find it, the directory is  
cleared and the job is judged as failed.
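
As a rough illustration of that restart logic (a sketch only, not the  
actual execd code): the function below scans active_jobs under a spool  
directory, reads a shepherd pid from each job directory, and either  
keeps the job or clears the directory and marks the job as failed. The  
names check_old_jobs and pid_alive, the file name "pid" and the  
"alive"/"failed" labels are assumptions made for the example.

```python
import os
import shutil
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def check_old_jobs(spool_dir, pid_file_name="pid"):
    """Classify each directory under <spool_dir>/active_jobs.

    pid_file_name is an assumed name for the file holding the
    shepherd's pid; adjust it to whatever the spool really contains.
    """
    results = {}
    active = Path(spool_dir) / "active_jobs"
    if not active.is_dir():
        return results
    for job_dir in sorted(active.iterdir()):
        if not job_dir.is_dir():
            continue
        try:
            pid = int((job_dir / pid_file_name).read_text().strip())
        except (OSError, ValueError):
            pid = None  # no usable pid recorded for this job
        if pid is not None and pid_alive(pid):
            results[job_dir.name] = "alive"   # shepherd found, keep job
        else:
            shutil.rmtree(job_dir)            # shepherd gone, clear dir
            results[job_dir.name] = "failed"
    return results
```

Calling check_old_jobs() on a spool directory returns a dict that maps  
each job directory name (for example "87.1") to "alive" or "failed".  
Note that it deletes the directories of jobs whose shepherd is gone,  
so it should only be tried on a copy of a spool directory.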

-- Reuti


>
> -- Reuti
>
>
>>
>>  -Ron
>>
>>
>> --- On Tue, 10/21/08, Daniel Templeton <Dan.Templeton at Sun.COM> wrote:
>>> I just noticed some behavior that doesn't seem right,
>>> and I wanted a
>>> second opinion.  If I submit a job and then stop the
>>> execution daemon
>>> where the job is running using "sgeexecd stop",
>>> the execution daemon and
>>> the shepherd are killed, but the job itself keeps running.
>>> If I ask
>>> qacct what happened, it lists the job as having failed
>>> before writing
>>> the exit status.  The behavior is the same for 6.0u10 and
>>> 6.2.
>>>
>>> I seem to recall having had a conversation about this
>>> before, and I seem
>>> to think that this behavior was intentional, so I'd
>>> like confirmation
>>> one way or another before I file an issue.
>>>
>>> Daniel
>>>
>>> -------------------------------------------------------------------- 
>>> -
>>> To unsubscribe, e-mail:
>>> users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail:
>>> users-help at gridengine.sunsource.net
>>
>>
>
>
>

