[GE users] SGE 6.2u3 issues with segfaulting qmaster and sge_execd's killing running jobs after a migration

craffi dag at sonsorol.org
Wed Sep 30 15:16:12 BST 2009


Hi folks,

Two quick things. I've got two virtual machines (RHEL 4). One of them
can't run the SGE qmaster for more than 60 seconds without crashing,
while the second runs SGE just fine.

Has anyone seen segfaults like this before? Here are the log messages:

sge_qmaster[25857]: segfault at 0000000000000080 rip 00000035c1878d80 rsp 0000000047caa978 error 4
sge_qmaster[26061]: segfault at 0000000000000080 rip 00000035c1878d80 rsp 0000000048384978 error 4
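
For reading lines like these, here is a minimal sketch, assuming the
standard x86 page-fault error-code bits (bit 0 = page was present,
bit 1 = write access, bit 2 = user mode). Under that assumption,
"error 4" is a user-mode read of an unmapped page, and a fault address
of 0x80 would be consistent with a NULL pointer plus a small offset.
Both crashes also report the same rip. The function name
decode_segfault is hypothetical and not part of SGE.

```python
import re

# Matches the kernel message format shown above (older kernels print
# "rip"/"rsp", newer ones "ip"/"sp"; only the older form is handled here).
SEGV_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Split one segfault log line into fields and decode the error code."""
    # Collapse whitespace so a line wrapped by the mailer still matches.
    m = SEGV_RE.search(" ".join(line.split()))
    if m is None:
        return None
    # Assumption: the error code is printed in hex, like the addresses.
    err = int(m.group("err"), 16)
    return {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "page_present": bool(err & 1),              # bit 0
        "access": "write" if err & 2 else "read",   # bit 1
        "mode": "user" if err & 4 else "kernel",    # bit 2
    }


if __name__ == "__main__":
    sample = ("sge_qmaster[25857]: segfault at 0000000000000080 "
              "rip 00000035c1878d80 rsp 0000000047caa978 error 4")
    print(decode_segfault(sample))
```

This only classifies the fault. Finding the actual culprit would still
need a core file or a backtrace from the crashing qmaster.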


Second issue - sge_execd kills running jobs after SGE qmaster migration:


Can someone explain to me what happens behind the scenes when the  
qmaster and an execd disagree about a running job? After we migrated  
the sge_qmaster to the second virtual host, SGE killed off a bunch of  
running jobs for reasons we can't quite understand.

In my mind, the only thing that could trigger the kill of a running
job is some sort of disagreement in the spooling database. Has anyone
seen this, or can a developer explain the decision process by which a
job reported by an execd is judged invalid and killed?

> 09/30/2009 09:31:56|worker|lctcvh6002|E|execd@47p011 reports running job (255285.1/20.47p011) in queue "all.q@47p011" that was not supposed to be there - killing
> 09/30/2009 09:31:56|worker|lctcvh6002|E|execd@52p01c reports running job (255285.1/20.52p01c) in queue "all.q@52p01c" that was not supposed to be there - killing
> 09/30/2009 09:31:57|worker|lctcvh6002|E|execd@47p01e reports running job (255285.1/20.47p01e) in queue "all.q@47p01e" that was not supposed to be there - killing
> 09/30/2009 09:31:57|worker|lctcvh6002|E|execd@47p01e reports running job (255285.1/master) in queue "all.q@47p01e" that was not supposed to be there - killing
> 09/30/2009 09:31:58|worker|lctcvh6002|E|execd@50p022 reports running job (255285.1/20.50p022) in queue "all.q@50p022" that was not supposed to be there - killing
> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@55p014 reports running job (255285.1/20.55p014) in queue "all.q@55p014" that was not supposed to be there - killing
> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@52p029 reports running job (255285.1/20.52p029) in queue "all.q@52p029" that was not supposed to be there - killing
> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@50p018 reports running job (255285.1/20.50p018) in queue "all.q@50p018" that was not supposed to be there - killing
> 09/30/2009 09:31:59|worker|lctcvh6002|E|execd@47p029 reports running job (255285.1/20.47p029) in queue "all.q@47p029" that was not supposed to be there - killing
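
To see at a glance which jobs and hosts were affected, here is a rough
sketch that groups "not supposed to be there" messages by job. It
assumes the pipe-separated layout seen in the excerpt above
(date time|thread|host|level|message) and reads "255285.1/20.47p011"
as job.task/pe-task-id, which is my interpretation and not a documented
format. It accepts both "@" and the archive's " at " rewriting. The
name summarize_kills is hypothetical.

```python
import re
from collections import defaultdict

KILL_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2})\|"
    r"(?P<thread>[^|]+)\|(?P<qmaster>[^|]+)\|(?P<level>[A-Z])\|"
    r"execd(?:@| at )(?P<host>\S+) reports running job "
    r"\((?P<job>\d+)\.(?P<task>\d+)/(?P<pe_task>[^)]+)\) "
    r'in queue "(?P<queue>[^"]+)" that was not supposed to be there - killing'
)


def summarize_kills(text):
    """Map (job, task) -> list of (execd host, pe task id) that were killed."""
    # Drop mail quote markers, then rejoin lines wrapped by the mailer.
    unquoted = re.sub(r"^>[ \t]?", "", text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    kills = defaultdict(list)
    for m in KILL_RE.finditer(flat):
        key = (m.group("job"), m.group("task"))
        kills[key].append((m.group("host"), m.group("pe_task")))
    return dict(kills)


if __name__ == "__main__":
    import sys
    # Usage: python summarize_kills.py < messages
    for (job, task), hits in summarize_kills(sys.stdin.read()).items():
        hosts = sorted({host for host, _ in hits})
        print(f"job {job}.{task}: {len(hits)} kills on {len(hosts)} hosts")
```

Run over the excerpt above, every entry groups under job 255285.1 (one
"master" entry plus the per-host tasks), so this looks like a single
parallel job being torn down across its hosts.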


-Chris

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=219787
