[GE users] seg fault with SGE 6.2u5 server

tvsingh tvsingh at ucla.edu
Fri Dec 10 16:56:44 GMT 2010


Hello there,

We have a decent-sized cluster that executes about 3000 jobs per day on average. I have been looking closely at this setup for the last couple of weeks and noticed the following errors in the system's messages file:

Dec  9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4
Dec  9 10:28:48 localhost kernel: sge_qmaster[20826]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484a5988 error 4
Dec  9 10:52:03 localhost kernel: sge_qmaster[21880]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000486ac988 error 4
Dec 10 00:55:46 localhost kernel: sge_qmaster[7994]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484df988 error 4
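
For anyone decoding these lines: all four crashes report the same faulting address (0x80) and the same instruction pointer, so they appear to hit the same code path each time. On x86, the "error" value is the page-fault error code, and 4 means a user-mode read of a page that is not mapped. An address this close to zero is typical of reading a field through a NULL pointer, though that is an inference and not something the log proves. The following is a minimal Python sketch for pulling those fields out of such lines; the function names are hypothetical, and the pattern assumes the exact x86_64 message layout quoted above.

```python
import re

# Matches the x86_64 kernel segfault line layout quoted above (assumed format).
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the three low bits of the x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def parse_segfault(line):
    """Return a dict of fields from one segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    info = {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "address": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "rsp": int(m.group("rsp"), 16),
    }
    # The kernel prints the error code in hexadecimal.
    info.update(decode_error(int(m.group("err"), 16)))
    return info


line = (
    "Dec  9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at "
    "0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4"
)
info = parse_segfault(line)
print(info)
```

Running every segfault line through parse_segfault and comparing the "rip" values is a quick way to confirm whether the crashes share one faulting instruction.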

The server runs the SGE 6.2u5 binaries on CentOS 5.x. I have also noticed that the memory usage of the qmaster often keeps increasing for no visible reason, which eventually causes the server to crash.
It does not seem to be caused by heavy load, since at other times the system runs normally even when the load (the system's job throughput per hour) is much higher.
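
To check whether the memory growth tracks job throughput, one option is to record the qmaster's resident memory at regular intervals and compare it with the job counts. The sketch below is one possible way to do that on Linux, reading the VmRSS field from /proc/<pid>/status. The function names are hypothetical, and how the PID is found and how often samples are taken are left as assumptions for the reader.

```python
def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB".
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open("/proc/%d/status" % pid) as fh:
        return parse_vmrss_kib(fh.read())


# Example with made-up status text, only to show the parsing step.
sample = "Name:\tsge_qmaster\nVmPeak:\t  900000 kB\nVmRSS:\t  123456 kB\n"
print(parse_vmrss_kib(sample))
```

Calling read_rss_kib with the qmaster's PID from a periodic job (for example, once a minute) and logging the result with a timestamp gives a series that can be lined up against the crash times in the messages file.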

Any help will be much appreciated,

Thanks in advance,
TV Singh




More information about the gridengine-users mailing list