[GE users] qmaster problem on sge 6.5

jmarshall John.Marshall at ec.gc.ca
Wed Nov 10 01:55:45 GMT 2010


On 11/09/2010 08:45 PM, udo wrote:
> Dear SGE community,
>
> Recently I started experiencing severe problems with the qmaster of SGE 6.2u5.
> The symptoms look like this:
>
> [2221661.821388] sge_qmaster[24247] general protection ip:55d12b
> sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]
> [2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp
> 00002b23dca06990 error 4 in sge_qmaster[400000+237000]
> [2243450.822069] sge_qmaster[12770] general protection ip:55c698
> sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]
>
>
> [179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp
> 00007f81d10f7990 error 4 in sge_qmaster[400000+237000]
> [181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip
> 000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]
> [193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip
> 000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]
>
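
Five of the six kernel messages quoted above report the same instruction
pointer (55c698), which hints at a single crashing code path. A minimal
sketch for tallying such lines, assuming the wrapped lines have been
re-joined first; the regular expression and helper names are illustrative,
not part of any SGE tool:

```python
import re
from collections import Counter

# Covers both styles seen above:
#   "segfault at 24 ip 000000000055c698 sp ... error 4 in sge_qmaster[400000+237000]"
#   "general protection ip:55c698 sp:... error:0 in sge_qmaster[400000+237000]"
FAULT_RE = re.compile(
    r"(?P<kind>segfault|general protection)"
    r"(?: at (?P<addr>[0-9a-f]+))?"
    r" ip[: ](?P<ip>[0-9a-f]+)"
    r" sp[: ](?P<sp>[0-9a-f]+)"
    r" error[: ](?P<err>[0-9a-f]+)"
    r" in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def parse_fault(line):
    """Return the fields of one kernel fault line, or None if it does not match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    addr = m.group("addr")
    return {
        "kind": m.group("kind"),
        "addr": int(addr, 16) if addr is not None else None,
        "ip": ip,
        "offset": ip - base,  # offset of the faulting instruction in the mapping
        "error": int(m.group("err"), 16),
    }

def describe_page_fault(error):
    """Decode x86 page-fault error bits; only meaningful for 'segfault' lines."""
    access = "write" if error & 2 else "read"
    mode = "user" if error & 4 else "kernel"
    cause = "protection violation" if error & 1 else "page not mapped"
    return "%s-mode %s, %s" % (mode, access, cause)

SAMPLE = [
    "[2221661.821388] sge_qmaster[24247] general protection ip:55d12b "
    "sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]",
    "[2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp "
    "00002b23dca06990 error 4 in sge_qmaster[400000+237000]",
    "[179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp "
    "00007f81d10f7990 error 4 in sge_qmaster[400000+237000]",
]

faults = [f for f in map(parse_fault, SAMPLE) if f]
# Count how often each instruction pointer shows up.
for ip, count in Counter(hex(f["ip"]) for f in faults).most_common():
    print(ip, count)
```

If the sge_qmaster binary still carries symbols, the most frequent address
could then be handed to `addr2line -e` to look for the crashing function.
A fault address of 24 with error 4 (a user-mode read of an unmapped page)
would be consistent with a NULL pointer dereference, though that is only a
guess from the numbers.
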
> The only thing I have done is upgrade from SuSE 10.2 to 11.2, and I should add
That sounds like a big change. Check the difference in libc versions. It might be the
problem.
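
One quick way to record the libc version on each system is sketched below.
It reports the C library that the running Python interpreter is linked
against, which on a typical installation is the system libc; it says
nothing about which libc sge_qmaster was built against. The function name
is only illustrative.

```python
import platform

def libc_version():
    """Return (library, version) of the libc the running Python is linked against."""
    lib, version = platform.libc_ver()
    # platform.libc_ver() returns empty strings when it cannot tell.
    return lib or "unknown", version or "unknown"

print(platform.platform())
print("libc: %s %s" % libc_version())
```

Running it on a machine that still has the old release and on the upgraded
server gives two values to compare.
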

John
> that the SGE server is also a file server with 4-port bonding, i.e. I have an
> Intel 4-port gigabit card whose ports I combined into one bonded interface. I
> also have one more gigabit port, which I use for communication with the nodes.
> The cluster consists mostly of Opteron nodes with 8, 16, or 32 cores.
>
> Crashes happen once in a while; the interval between them can be anything
> from a few minutes to a few hours.
> Right now I am running a script which checks whether qmaster is running and
> starts it if it is not, but that is not the best solution I can imagine.
>
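
For reference, a watchdog of the kind described above might look roughly
like this. It is a sketch under assumptions: the start command path is
guessed from the spool directory shown in this message, and `check_once`
is a hypothetical helper, not something shipped with SGE.

```python
import subprocess
import time

PROCESS_NAME = "sge_qmaster"
# Assumed location of the startup script; adjust to the local SGE_ROOT and cell.
START_CMD = ["/opt/sge6/core/common/sgemaster", "start"]

def is_running(name=PROCESS_NAME):
    """True if a process with exactly this name exists (pgrep -x exits 0)."""
    return subprocess.call(["pgrep", "-x", name], stdout=subprocess.DEVNULL) == 0

def start_qmaster():
    """Run the (assumed) startup command and return its exit status."""
    return subprocess.call(START_CMD)

def check_once(running=is_running, start=start_qmaster, log=print):
    """Restart qmaster if it is down; return True when a restart was attempted."""
    if running():
        return False
    # Log a timestamp so the crash times can be correlated with other logs.
    log(time.strftime("%Y-%m-%d %H:%M:%S") + " qmaster not running, restarting")
    start()
    return True
```

Calling `check_once()` from cron every minute or so keeps the daemon up,
and the logged timestamps can later be matched against the kernel messages.
It only hides the crash, of course; it does not cure it.
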
> Any suggestions on how to cure this are very welcome.
>
> Regards,
> Viktor
> p.s.
> [20:40:53]udo at rupc-caip-03:~>qstat -v
> GE 6.2u5
> usage: qstat [options]
>
> Below are all of today's messages from the queueing system; 3 or 4 crashes
> happened during that time. I can't see anything suspicious in the qmaster logs:
> /opt/sge6/core/spool/qmaster>tail -130 messages
>
>
> 11/09/2010 01:56:51|worker|rupc-caip-03|W|job 3919.1 failed on host n105
> assumedly after job because: job 3919.1 died through signal TERM (15)
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|read job database with 6 entries
> in 0 seconds
> 11/09/2010 07:39:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|read job database with 6 entries
> in 0 seconds
> 11/09/2010 08:00:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 13:34:25|worker|rupc-caip-03|W|job 3920.1 failed on host n101
> assumedly after job because: job 3920.1 died through signal TERM (15)
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|read job database with 5 entries
> in 0 seconds
> 11/09/2010 13:52:02|  main|rupc-caip-03|W|removing reference to no longer
> existing job 3920 of user "sinisa"
>
> 11/09/2010 13:52:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|read job database with 4 entries
> in 0 seconds
> 11/09/2010 14:13:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 14:47:45|worker|rupc-caip-03|W|job 3935.1 failed on host n114
> assumedly after job because: job 3935.1 died through signal TERM (15)
> 11/09/2010 15:11:46|worker|rupc-caip-03|W|job 3938.1 failed on host n114
> assumedly after job because: job 3938.1 died through signal TERM (15)
> 11/09/2010 15:21:10|worker|rupc-caip-03|E|tightly integrated parallel task
> 3940.1 task 1.n114 failed - killing job
> 11/09/2010 15:22:26|worker|rupc-caip-03|W|job 3940.1 failed on host n114
> assumedly after job because: job 3940.1 died through signal TERM (15)
> 11/09/2010 15:27:46|worker|rupc-caip-03|W|job 3939.1 failed on host n114
> assumedly after job because: job 3939.1 died through signal TERM (15)
> 11/09/2010 15:33:07|worker|rupc-caip-03|W|job 3943.1 failed on host n114
> assumedly after job because: job 3943.1 died through signal TERM (15)
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|read job database with 7 entries
> in 1 seconds
> 11/09/2010 17:39:03|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
>
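
Each "starting up GE" entry in the log above marks a qmaster start, so the
gaps between them give a rough picture of how often it goes down. A small
sketch for pulling those times out of the messages file, assuming the
unwrapped `date time|thread|host|level|message` layout seen above; the
function names are illustrative:

```python
from datetime import datetime

def startup_times(lines):
    """Collect the timestamps of all 'starting up ...' entries."""
    times = []
    for line in lines:
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) == 5 and fields[4].startswith("starting up"):
            times.append(datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S"))
    return times

def intervals(times):
    """Time between consecutive startups."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Startup entries copied from the log quoted above.
SAMPLE = [
    "11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)",
    "11/09/2010 08:00:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)",
    "11/09/2010 13:52:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)",
    "11/09/2010 14:13:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)",
    "11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)",
]

for gap in intervals(startup_times(SAMPLE)):
    print(gap)
```

On the real file, `startup_times(open(path))` would do the same, with
`path` pointing at the messages file in the qmaster spool directory.
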
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294409
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294411
