[GE users] SGE 6.2u3 issues with segfaulting qmaster and sge_execd's killing running jobs after a migration

templedf dan.templeton at sun.com
Wed Sep 30 15:24:02 BST 2009


Unfortunately, without a debug version, the segfault lines alone aren't
terribly useful.  Try turning on debug level 2 and reproducing the crash.
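
That said, the kernel's segfault line can be decoded a little even without symbols. Here is a minimal Python sketch, assuming the x86-64 Linux log format seen in the quoted message below and the standard x86 page-fault error-code bits. The name decode_segfault is made up for this example, and the line has to be passed in without mail quote markers.

```python
import re

# Format of the kernel segfault line as it appears in the quoted log
# (x86-64 Linux; the numeric fields after "segfault at" are hex).
SEGFAULT_RE = re.compile(
    r"(?P<prog>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-fA-F]+) "
    r"rip (?P<rip>[0-9a-fA-F]+) rsp (?P<rsp>[0-9a-fA-F]+) "
    r"error (?P<err>[0-9a-fA-F]+)"
)


def decode_segfault(line):
    """Parse one segfault line and decode the x86 page-fault error code.

    Hypothetical helper for illustration; returns None if the line
    does not match the expected format.
    """
    # Collapse whitespace so a line wrapped by a mail client still matches.
    match = SEGFAULT_RE.search(" ".join(line.split()))
    if match is None:
        return None
    err = int(match.group("err"), 16)
    return {
        "program": match.group("prog"),
        "pid": int(match.group("pid")),
        "address": int(match.group("addr"), 16),
        "rip": int(match.group("rip"), 16),
        "page_present": bool(err & 0x1),  # False: the page was not mapped
        "write": bool(err & 0x2),         # False: the access was a read
        "user_mode": bool(err & 0x4),     # True: the fault came from user space
    }
```

For the lines in this thread, "error 4" at address 0x80 decodes to a user-mode read of an unmapped page, which usually points to a NULL pointer being dereferenced at a small offset. It does not say where in the code that happens, so a debug trace is still needed.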

The qmaster will tell the execd to kill running jobs if the qmaster has 
already migrated those jobs from a downed execd, and then that downed 
execd comes back up with the jobs still running.
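
The rule described above can be sketched roughly as follows. This is only an illustrative Python sketch, not the actual qmaster code; the function name, the data structures, and the sample state are all made up for the example.

```python
def find_unexpected_jobs(expected_by_host, host, reported_jobs):
    """Return the jobs an execd reports as running that the qmaster does
    not expect on that host (hypothetical helper, not an SGE API)."""
    expected = expected_by_host.get(host, set())
    return [job for job in reported_jobs if job not in expected]


# Hypothetical qmaster view after it has moved job 255285.1 away from a
# host it considered down: nothing is expected to run there any more.
expected_by_host = {"47p011": set()}

# The execd comes back up and still reports the job as running, so the
# job shows up as unexpected and would be killed.
to_kill = find_unexpected_jobs(expected_by_host, "47p011", ["255285.1"])
```

The point of the sketch is that the comparison is driven by the qmaster's view of the cluster: if its spooled state no longer lists a job on a host, any report of that job from that host is treated as stale.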

Daniel

craffi wrote:
> Hi folks,
>
> Two quick things. I've got 2 virtual machines (RHEL 4): one can't run
> the SGE qmaster for more than 60 seconds without crashing, while the
> second runs SGE just fine. Has anyone seen segfaults like this before?
> Here are the log messages:
>
> sge_qmaster[25857]: segfault at 0000000000000080 rip 00000035c1878d80 rsp 0000000047caa978 error 4
> sge_qmaster[26061]: segfault at 0000000000000080 rip 00000035c1878d80 rsp 0000000048384978 error 4
>
>
> Second issue - sge_execd kills running jobs after SGE qmaster migration:
>
>
> Can someone explain to me what happens behind the scenes when the  
> qmaster and an execd disagree about a running job? After we migrated  
> the sge_qmaster to the second virtual host, SGE killed off a bunch of  
> running jobs for reasons we can't quite understand.
>
> In my mind the only thing that could trigger the running job kill is
> some sort of disagreement in the spooling database. Has anyone seen
> this, or can a developer explain the decision process for how an execd
> decides a job is invalid and needs to be killed?
>
>   
>> 09/30/2009 09:31:56|worker|lctcvh6002|E|execd@47p011 reports running job (255285.1/20.47p011) in queue "all.q@47p011" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:56|worker|lctcvh6002|E|execd@52p01c reports running job (255285.1/20.52p01c) in queue "all.q@52p01c" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:57|worker|lctcvh6002|E|execd@47p01e reports running job (255285.1/20.47p01e) in queue "all.q@47p01e" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:57|worker|lctcvh6002|E|execd@47p01e reports running job (255285.1/master) in queue "all.q@47p01e" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:58|worker|lctcvh6002|E|execd@50p022 reports running job (255285.1/20.50p022) in queue "all.q@50p022" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@55p014 reports running job (255285.1/20.55p014) in queue "all.q@55p014" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@52p029 reports running job (255285.1/20.52p029) in queue "all.q@52p029" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@50p018 reports running job (255285.1/20.50p018) in queue "all.q@50p018" that was not supposed to be there - killing
>>
>> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@47p029 reports running job (255285.1/20.47p029) in queue "all.q@47p029" that was not supposed to be there - killing
>
>
> -Chris
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=219787
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=219789