[GE users] sge_qmaster 6.2u5 daemon: repeating segfaults

ah_sunsource ahaupt at ifh.de
Mon May 17 15:24:27 BST 2010


Hi *,

In our case the problem is back, and it seems to be related to parallel
jobs finishing: every time such a job ends, the SGE master (sge_qmaster)
segfaults:

[pan] /root # dmesg
[...]
sge_qmaster[29665] general protection rip:58801d rsp:4888fbb0 error:0
sge_qmaster[10990] general protection rip:58801d rsp:48c11bb0 error:0
sge_qmaster[14694]: segfault at 000065642e68666d rip 000000000058801d rsp 0000000047ff6bb0 error 4
sge_qmaster[22962]: segfault at 0000000065642e6c rip 000000000058801d rsp 0000000040b93bb0 error 4
sge_qmaster[23236] general protection rip:58801d rsp:483ddbb0 error:0
sge_qmaster[24762]: segfault at 0000000000000034 rip 000000000058801d rsp 0000000048481bb0 error 4
sge_qmaster[25262]: segfault at 000000000000656b rip 000000000058801d rsp 0000000047b42bb0 error 4
sge_qmaster[26433]: segfault at 0000003200000004 rip 000000000058801d rsp 00000000488eebb0 error 4
sge_qmaster[26557]: segfault at 00004d3834383639 rip 000000000058801d rsp 0000000048c0ebb0 error 4
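
A note on reading the dmesg output above, offered as a sketch and not as a diagnosis: the instruction pointer (rip 58801d) is the same in every crash, while the fault addresses vary. Several of them decode to printable ASCII when read as little-endian bytes, which would be consistent with a pointer having been overwritten by string data. The helper names `parse_line` and `addr_as_text` below are hypothetical, and the regular expressions only assume the two kernel message formats shown above.

```python
import re

# The two kernel message styles that appear in the dmesg output above.
SEGFAULT_RE = re.compile(
    r"sge_qmaster\[(\d+)\]: segfault at ([0-9a-f]+) rip ([0-9a-f]+)"
)
GP_RE = re.compile(r"sge_qmaster\[(\d+)\] general protection rip:([0-9a-f]+)")


def parse_line(line):
    """Return (pid, rip, fault_addr or None) for a crash line, else None."""
    m = SEGFAULT_RE.search(line)
    if m:
        return int(m.group(1)), int(m.group(3), 16), int(m.group(2), 16)
    m = GP_RE.search(line)
    if m:
        # General protection faults carry no fault address in this format.
        return int(m.group(1)), int(m.group(2), 16), None
    return None


def addr_as_text(addr):
    """Decode a 64-bit address as little-endian ASCII, or None if not printable."""
    raw = addr.to_bytes(8, "little").rstrip(b"\x00")
    if raw and all(0x20 <= b < 0x7F for b in raw):
        return raw.decode("ascii")
    return None


if __name__ == "__main__":
    sample = [
        "sge_qmaster[29665] general protection rip:58801d rsp:4888fbb0 error:0",
        "sge_qmaster[14694]: segfault at 000065642e68666d rip 000000000058801d rsp 0000000047ff6bb0 error 4",
        "sge_qmaster[26433]: segfault at 0000003200000004 rip 000000000058801d rsp 00000000488eebb0 error 4",
        "sge_qmaster[26557]: segfault at 00004d3834383639 rip 000000000058801d rsp 0000000048c0ebb0 error 4",
    ]
    for line in sample:
        pid, rip, addr = parse_line(line)
        text = addr_as_text(addr) if addr is not None else None
        print(pid, hex(rip), hex(addr) if addr is not None else "-", text)
```

Read this way, 000065642e68666d decodes to "mfh.de", 0000000065642e6c to "l.de", 000000000000656b to "ke" and 00004d3834383639 to "96848M". The address 0000000000000034 also decodes to a printable character ("4"), but a small offset from a NULL pointer is an equally plausible reading, so the decoding is only a hint.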

At the moment the crash is more or less reproducible, whereas the same
configuration previously ran for months without a single crash. Here
are the qmaster log lines around one such crash (job 786578 was a
32-slot PE job):

[...]
05/17/2010 16:16:32|worker|pan|I|task 1.pax4f at pax4f.ifh.de of job 786578.1 finished
05/17/2010 16:16:33|worker|pan|I|task 1.pax4c at pax4c.ifh.de of job 786578.1 finished
05/17/2010 16:16:33|worker|pan|I|task 1.pax4a at pax4a.ifh.de of job 786578.1 finished
05/17/2010 16:16:33|worker|pan|I|task 1.pax46 at pax46.ifh.de of job 786578.1 finished
05/17/2010 16:16:33|worker|pan|I|task 1.pax47 at pax47.ifh.de of job 786578.1 finished
05/17/2010 16:16:34|event_|pan|P|event_master000: runs: 60.77r/s (clients: 1.00 mod: 0.20/s ack: 0.20/s blocked: 0.00 busy: 0.18 | events: 182.58/s added: 182.24/s skipt: 0.33/s) out: 0.00m/s APT: 0.0002s/m idle: 98.83% wait: 0.00% time: 14.99s
05/17/2010 16:16:34| timer|pan|P|timer000: runs: 0.40r/s (pending: 204.00 executed: 0.33/s) out: 0.00m/s APT: 0.0004s/m idle: 99.98% wait: 0.00% time: 14.99s
05/17/2010 16:17:01|  main|pan|W|local configuration pan.ifh.de not defined - using global configuration
05/17/2010 16:17:01|  main|pan|I|using "/usr/gridengine/default/spool" for execd_spool_dir
[...]
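
The excerpt above can be used to estimate how long the master was down. The following is a minimal sketch, assuming that the log fields are `date|thread|host|level|message` as they appear in the excerpt, and assuming (this is not confirmed by the log itself) that the first `main` thread message following worker or timer output marks the restart of sge_qmaster. The function names are hypothetical.

```python
from datetime import datetime


def parse_qmaster_line(line):
    """Split one log line into its five fields, or return None."""
    # maxsplit=4 keeps any "|" inside the message text (the profiling
    # lines contain one) from being treated as a field separator.
    parts = line.split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"time": when, "thread": thread.strip(), "host": host,
            "level": level, "message": message}


def restart_gap_seconds(lines):
    """Seconds between the last non-main message and the next main message.

    Assumption: a "main" thread message after other threads' output means
    the daemon was started again.
    """
    prev = None
    for line in lines:
        rec = parse_qmaster_line(line)
        if rec is None:
            continue
        if rec["thread"] == "main" and prev and prev["thread"] != "main":
            return (rec["time"] - prev["time"]).total_seconds()
        prev = rec
    return None


if __name__ == "__main__":
    excerpt = [
        "05/17/2010 16:16:33|worker|pan|I|task 1.pax47 at pax47.ifh.de of job 786578.1 finished",
        "05/17/2010 16:16:34| timer|pan|P|timer000: runs: 0.40r/s (pending: 204.00 executed: 0.33/s) out: 0.00m/s APT: 0.0004s/m idle: 99.98% wait: 0.00% time: 14.99s",
        "05/17/2010 16:17:01|  main|pan|W|local configuration pan.ifh.de not defined - using global configuration",
    ]
    print(restart_gap_seconds(excerpt))
```

For the lines shown, the gap is between 16:16:34 and 16:17:01, so the crash happened within about a second of the last task-finished messages of job 786578.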

Cheers,
Andreas
-- 
| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=257607
