[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

reuti reuti at staff.uni-marburg.de
Mon May 17 16:39:28 BST 2010


Hi,

On 17.05.2010 at 16:24, ah_sunsource wrote:

> Hi *,
> 
> in our case the problem is back and seems to be related to finishing
> parallel jobs (every time such a job ends, SGE master is segfaulting):
> 
> [pan] /root # dmesg
> [...]
> sge_qmaster[29665] general protection rip:58801d rsp:4888fbb0 error:0
> sge_qmaster[10990] general protection rip:58801d rsp:48c11bb0 error:0
> sge_qmaster[14694]: segfault at 000065642e68666d rip 000000000058801d rsp 0000000047ff6bb0 error 4
> sge_qmaster[22962]: segfault at 0000000065642e6c rip 000000000058801d rsp 0000000040b93bb0 error 4
> sge_qmaster[23236] general protection rip:58801d rsp:483ddbb0 error:0
> sge_qmaster[24762]: segfault at 0000000000000034 rip 000000000058801d rsp 0000000048481bb0 error 4
> sge_qmaster[25262]: segfault at 000000000000656b rip 000000000058801d rsp 0000000047b42bb0 error 4
> sge_qmaster[26433]: segfault at 0000003200000004 rip 000000000058801d rsp 00000000488eebb0 error 4
> sge_qmaster[26557]: segfault at 00004d3834383639 rip 000000000058801d rsp 0000000048c0ebb0 error 4
> 
> Currently it is more or less reproducible, whereas the same
> configuration had survived months without a single crash before ... Here

Just out of curiosity, since I'm also searching for the cause of "random" crashes: was there a recent kernel update?
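
Independently of the kernel question, the fault addresses in the quoted dmesg lines may be worth a closer look. Every crash reports the same instruction pointer (rip 58801d), and several of the fault addresses turn into printable ASCII when read as little-endian bytes. Below is a minimal Python sketch of that decoding, not a diagnosis. The names FAULT_RE, decode_fault_address and summarize are made up for this illustration; the sample lines are copied from the quote above.

```python
import re

# Sample "segfault at" lines copied from the quoted dmesg output.
DMESG = """\
sge_qmaster[14694]: segfault at 000065642e68666d rip 000000000058801d rsp 0000000047ff6bb0 error 4
sge_qmaster[22962]: segfault at 0000000065642e6c rip 000000000058801d rsp 0000000040b93bb0 error 4
sge_qmaster[25262]: segfault at 000000000000656b rip 000000000058801d rsp 0000000047b42bb0 error 4
sge_qmaster[26557]: segfault at 00004d3834383639 rip 000000000058801d rsp 0000000048c0ebb0 error 4
"""

# Hypothetical helper pattern: captures the fault address and the rip value.
FAULT_RE = re.compile(r"segfault at ([0-9a-f]+) rip ([0-9a-f]+)")


def decode_fault_address(hex_addr):
    """Interpret a 64-bit address as little-endian bytes and show them as text.

    Trailing zero bytes are dropped; non-printable bytes are shown as '.'.
    """
    value = int(hex_addr, 16)
    raw = value.to_bytes(8, "little").rstrip(b"\x00")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


def summarize(dmesg_text):
    """Return (fault address, rip, decoded text) for every segfault line."""
    return [
        (m.group(1), m.group(2), decode_fault_address(m.group(1)))
        for m in FAULT_RE.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    for addr, rip, text in summarize(DMESG):
        print(f"addr={addr} rip={rip} ascii={text!r}")
```

For the sample lines this gives strings such as "mfh.de" and "l.de", which resemble the tail of the ifh.de host names in the log. That could hint at a pointer being overwritten by string data, but it is only a guess until a core dump or backtrace is available.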

-- Reuti


> are the log lines around such a crash (Job 786578 was a 32 slot wide PE
> job):
> 
> [...]
> 05/17/2010 16:16:32|worker|pan|I|task 1.pax4f at pax4f.ifh.de of job 786578.1 finished
> 05/17/2010 16:16:33|worker|pan|I|task 1.pax4c at pax4c.ifh.de of job 786578.1 finished
> 05/17/2010 16:16:33|worker|pan|I|task 1.pax4a at pax4a.ifh.de of job 786578.1 finished
> 05/17/2010 16:16:33|worker|pan|I|task 1.pax46 at pax46.ifh.de of job 786578.1 finished
> 05/17/2010 16:16:33|worker|pan|I|task 1.pax47 at pax47.ifh.de of job 786578.1 finished
> 05/17/2010 16:16:34|event_|pan|P|event_master000: runs: 60.77r/s (clients: 1.00 mod: 0.20/s ack: 0.20/s blocked: 0.00 busy: 0.18 | events: 182.58/s added: 182.24/s skipt: 0.33/s) out: 0.00m/s APT: 0.0002s/m idle: 98.83% wait: 0.00% time: 14.99s
> 05/17/2010 16:16:34| timer|pan|P|timer000: runs: 0.40r/s (pending: 204.00 executed: 0.33/s) out: 0.00m/s APT: 0.0004s/m idle: 99.98% wait: 0.00% time: 14.99s
> 05/17/2010 16:17:01|  main|pan|W|local configuration pan.ifh.de not defined - using global configuration
> 05/17/2010 16:17:01|  main|pan|I|using "/usr/gridengine/default/spool" for execd_spool_dir
> [...]
> 
> Cheers,
> Andreas
> -- 
> | Andreas Haupt             | E-Mail: andreas.haupt at desy.de
> |  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
> |  Platanenallee 6          | Phone:  +49/33762/7-7359
> |  D-15738 Zeuthen          | Fax:    +49/33762/7-7216
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257607
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
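
To check whether every crash really follows the end of a parallel job, the qmaster messages file could be scanned for the entries logged just before each restart. The sketch below assumes the pipe-separated layout visible in the quoted log lines (timestamp|thread|host|level|message) and assumes that a run of entries from the "main" thread marks a fresh start of sge_qmaster; both are inferences from the excerpt, not documented behaviour. parse_line and entries_before_restart are hypothetical helper names.

```python
from datetime import datetime


def parse_line(line):
    """Split one log line into its five fields, or return None if it does not fit."""
    # maxsplit=4 keeps any '|' inside the message text intact.
    parts = line.strip().split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {
        "time": when,
        "thread": thread.strip(),
        "host": host,
        "level": level,
        "message": message,
    }


def entries_before_restart(lines, context=3):
    """Collect the last `context` entries preceding each presumed restart.

    Assumption: a restart shows up as a 'main' thread entry that directly
    follows an entry from some other thread.
    """
    entries = [e for e in map(parse_line, lines) if e is not None]
    windows = []
    for i, entry in enumerate(entries):
        if i > 0 and entry["thread"] == "main" and entries[i - 1]["thread"] != "main":
            windows.append(entries[max(0, i - context):i])
    return windows


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py <path to the qmaster messages file>
    with open(sys.argv[1]) as handle:
        for window in entries_before_restart(handle):
            for e in window:
                print(e["time"], e["thread"], e["message"])
            print("--- presumed restart ---")
```

If most of the windows end with "task ... finished" entries of a PE job, that would support the observation above; if not, the timing may be coincidental.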

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257613



