[GE users] qmaster problem on sge 6.5

udo udo at physics.rutgers.edu
Wed Nov 10 01:45:53 GMT 2010


Dear SGE community,

Recently I started experiencing severe problems with the qmaster of SGE 6.2u5.
The symptoms look like these:

[2221661.821388] sge_qmaster[24247] general protection ip:55d12b
sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]
[2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp
00002b23dca06990 error 4 in sge_qmaster[400000+237000]
[2243450.822069] sge_qmaster[12770] general protection ip:55c698
sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]


[179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp
00007f81d10f7990 error 4 in sge_qmaster[400000+237000]
[181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip
000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]
 [193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip
000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]
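
For readers comparing such kernel lines, here is a minimal Python sketch that decodes one "segfault at" line. It assumes the usual Linux x86-64 message layout and the standard x86 page-fault error-code bits; the function and field names are made up for illustration and are not part of SGE. The wrapped line has to be passed as one string (embedded newlines are fine).

```python
import re

# Layout assumed: "segfault at ADDR ip IP sp SP error N in NAME[BASE+SIZE]"
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<name>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing one kernel segfault line, or None if no match."""
    # Collapse whitespace so a line wrapped by the mail client still matches.
    m = SEGFAULT_RE.search(" ".join(line.split()))
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_addr": int(m.group("addr"), 16),
        "ip": ip,
        # Offset of the faulting instruction from the start of the mapping.
        "offset_in_binary": ip - base,
        # x86 page-fault error code: bit 0 = page was present,
        # bit 1 = write access, bit 2 = fault happened in user mode.
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }
```

Read this way, "error 4" is a user-mode read of an unmapped address, and the instruction pointer 55c698 recurs in every segfault line above, which suggests the same code location each time.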

The only thing I have changed is an upgrade from SuSE 10.2 to 11.2. I should
also mention that the SGE server is also a file server with 4-port bonding,
i.e. I have an Intel 4-port gigabit card whose ports I combined into one
bonded interface. I also have one more gigabit port, which I use for
communication with the nodes.
The cluster consists mostly of Opteron nodes with 8, 16, or 32 cores.

The crash happens once in a while; the interval between crashes can be
anything from a few minutes to a few hours.
Right now I am running a script which checks whether qmaster is running and
starts it if it is not, but this is not the best solution I can imagine.
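
A watchdog of the kind described could be sketched in Python roughly as follows. This is not the actual script from this post: the check command (`pgrep -x sge_qmaster`) and the start command path are assumptions, and the start path in particular is a hypothetical placeholder to be replaced with whatever command normally starts qmaster on the system.

```python
import subprocess
import time

# Both commands are assumptions for illustration, not from the original post.
CHECK_CMD = ["pgrep", "-x", "sge_qmaster"]
START_CMD = ["/path/to/start_qmaster.sh"]  # hypothetical placeholder


def ensure_running(check_cmd=CHECK_CMD, start_cmd=START_CMD):
    """Run the start command if the check command reports no running daemon."""
    check = subprocess.run(check_cmd, stdout=subprocess.DEVNULL)
    if check.returncode == 0:
        return "running"
    # Log the time so that restarts can be matched against the kernel log.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), "qmaster not running, restarting")
    start = subprocess.run(start_cmd)
    return "restarted" if start.returncode == 0 else "restart failed"
```

Calling `ensure_running()` periodically, for example from cron, gives the check-and-restart behaviour; it only hides the crash and does nothing about its cause.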

Any suggestions on how to cure this are very welcome.
 
Regards,
Viktor
p.s.
[20:40:53]udo@rupc-caip-03:~>qstat -v
GE 6.2u5
usage: qstat [options]

Below I give all of today's error messages from the queueing system; 3 or 4
crashes happened during that time. I can't see anything suspicious in the
qmaster logs:
/opt/sge6/core/spool/qmaster>tail -130 messages


11/09/2010 01:56:51|worker|rupc-caip-03|W|job 3919.1 failed on host n105
assumedly after job because: job 3919.1 died through signal TERM (15)
11/09/2010 07:39:02|  main|rupc-caip-03|I|read job database with 6 entries
in 0 seconds
11/09/2010 07:39:02|  main|rupc-caip-03|E|error opening file
"/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
directory
11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
set to 8192
11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
set to 8192
11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
descriptors for communication
11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will accept max. 99
dynamic event clients
11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
11/09/2010 08:00:02|  main|rupc-caip-03|I|read job database with 6 entries
in 0 seconds
11/09/2010 08:00:02|  main|rupc-caip-03|E|error opening file
"/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
directory
11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
set to 8192
11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
set to 8192
11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
descriptors for communication
11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will accept max. 99
dynamic event clients
11/09/2010 08:00:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
11/09/2010 13:34:25|worker|rupc-caip-03|W|job 3920.1 failed on host n101
assumedly after job because: job 3920.1 died through signal TERM (15)
11/09/2010 13:52:02|  main|rupc-caip-03|I|read job database with 5 entries
in 0 seconds
11/09/2010 13:52:02|  main|rupc-caip-03|W|removing reference to no longer
existing job 3920 of user "sinisa"

11/09/2010 13:52:02|  main|rupc-caip-03|E|error opening file
"/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
directory
11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
set to 8192
11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
set to 8192
11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
descriptors for communication
11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will accept max. 99
dynamic event clients
11/09/2010 13:52:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
11/09/2010 14:13:02|  main|rupc-caip-03|I|read job database with 4 entries
in 0 seconds
11/09/2010 14:13:02|  main|rupc-caip-03|E|error opening file
"/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
directory
11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
set to 8192
11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
set to 8192
11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
descriptors for communication
11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will accept max. 99
dynamic event clients
11/09/2010 14:13:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
11/09/2010 14:47:45|worker|rupc-caip-03|W|job 3935.1 failed on host n114
assumedly after job because: job 3935.1 died through signal TERM (15)
11/09/2010 15:11:46|worker|rupc-caip-03|W|job 3938.1 failed on host n114
assumedly after job because: job 3938.1 died through signal TERM (15)
11/09/2010 15:21:10|worker|rupc-caip-03|E|tightly integrated parallel task
3940.1 task 1.n114 failed - killing job
11/09/2010 15:22:26|worker|rupc-caip-03|W|job 3940.1 failed on host n114
assumedly after job because: job 3940.1 died through signal TERM (15)
11/09/2010 15:27:46|worker|rupc-caip-03|W|job 3939.1 failed on host n114
assumedly after job because: job 3939.1 died through signal TERM (15)
11/09/2010 15:33:07|worker|rupc-caip-03|W|job 3943.1 failed on host n114
assumedly after job because: job 3943.1 died through signal TERM (15)
11/09/2010 17:39:03|  main|rupc-caip-03|I|read job database with 7 entries
in 1 seconds
11/09/2010 17:39:03|  main|rupc-caip-03|E|error opening file
"/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
directory
11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster hard descriptor limit is
set to 8192
11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster soft descriptor limit is
set to 8192
11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will use max. 8172 file
descriptors for communication
11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will accept max. 99
dynamic event clients
11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
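
To line up restarts with the kernel messages, the "starting up" records can be pulled out of such a messages file. The Python sketch below assumes the `date time|thread|host|level|message` layout visible in the excerpt above and treats any line without a leading timestamp as a continuation of the previous record, which undoes the mail wrapping. The function names are made up for illustration.

```python
import re
from datetime import datetime

# A record starts with "MM/DD/YYYY HH:MM:SS|"; anything else is a continuation.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def parse_messages(text):
    """Return a list of (timestamp, thread, host, level, message) tuples."""
    joined = []
    for line in text.splitlines():
        if RECORD_START.match(line):
            joined.append(line.strip())
        elif joined and line.strip():
            joined[-1] += " " + line.strip()
    records = []
    for rec in joined:
        stamp, thread, host, level, message = rec.split("|", 4)
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
        records.append((when, thread.strip(), host, level, message))
    return records


def startup_times(text):
    """Return the timestamp of every 'starting up' record, in file order."""
    return [r[0] for r in parse_messages(text) if r[4].startswith("starting up")]
```

The excerpt above contains five such records, so the differences between consecutive values returned by `startup_times` give a rough picture of how often qmaster had to be restarted that day.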

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294409