[GE users] qmaster problem on sge 6.5

ah_sunsource ahaupt at ifh.de
Wed Nov 10 15:00:27 GMT 2010


Hi Udo,

This sounds like the infamous segfault problem of 6.2. It occurred at our
site almost regularly when tightly integrated parallel jobs finished.

The fix was posted some months ago on this list:
http://markmail.org/message/njw5ukra3vknfpun - it helped in our case.
Thanks to Dave again for providing the patch! :-)

Cheers,
Andreas

On Tue, 2010-11-09 at 20:45 -0500, udo wrote:
> Dear SGE community,
> 
> Recently I started experiencing severe problems with the qmaster of SGE 6.2u5.
> The symptoms look like this:
> 
> [2221661.821388] sge_qmaster[24247] general protection ip:55d12b
> sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]
> [2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp
> 00002b23dca06990 error 4 in sge_qmaster[400000+237000]
> [2243450.822069] sge_qmaster[12770] general protection ip:55c698
> sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]
> 
> 
> [179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp
> 00007f81d10f7990 error 4 in sge_qmaster[400000+237000]
> [181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip
> 000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]
> [193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip
> 000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]
> 
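To see whether these crashes share a fault site, the kernel lines can be grouped by instruction pointer. The following is only a minimal sketch: the lines are copied from the report above with the mail line wraps re-joined, and the regular expression is an assumption that covers just the two message shapes shown here.

```python
import re
from collections import Counter

# Kernel lines copied from the report above, with the mail line wraps re-joined.
KERNEL_LINES = [
    "[2221661.821388] sge_qmaster[24247] general protection ip:55d12b sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]",
    "[2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp 00002b23dca06990 error 4 in sge_qmaster[400000+237000]",
    "[2243450.822069] sge_qmaster[12770] general protection ip:55c698 sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]",
    "[179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp 00007f81d10f7990 error 4 in sge_qmaster[400000+237000]",
    "[181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip 000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]",
    "[193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip 000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]",
]

# Matches both "ip:55d12b" and "ip 000000000055c698", plus the
# "name[base+size]" mapping at the end of the line.
FAULT_RE = re.compile(
    r"\bip[: ]([0-9a-f]+)\b.*\bin (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]"
)

def fault_sites(lines):
    """Count crashes per (binary, instruction pointer, offset into mapping)."""
    counts = Counter()
    for line in lines:
        m = FAULT_RE.search(line)
        if not m:
            continue  # not a fault line in one of the two known shapes
        ip = int(m.group(1), 16)
        base = int(m.group(3), 16)
        counts[(m.group(2), ip, ip - base)] += 1
    return counts

sites = fault_sites(KERNEL_LINES)
for (binary, ip, offset), hits in sites.most_common():
    print(f"{binary}: ip=0x{ip:x} offset=0x{offset:x} hits={hits}")
```

Five of the six reported faults are at ip 0x55c698, which fits a single recurring bug better than random memory corruption. If the binary is not position-independent (the load base 0x400000 suggests this, but that is an assumption), the ip value can be handed directly to a tool such as addr2line, provided an unstripped sge_qmaster is available.
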
> The only thing I have changed is an upgrade from SuSE 10.2 to 11.2. I should
> also mention that the SGE server is also the file server and uses 4-port
> bonding, i.e. I have an Intel 4-port gigabit card whose ports I combined into
> one bonded interface. I also have one more gigabit port, which I use for
> communication with the nodes.
> The cluster consists mostly of Opteron nodes with 8, 16, or 32 cores.
> 
> Crashes happen once in a while; the interval between them can be anywhere
> from a few minutes to a few hours.
> Right now I am running a script which checks whether qmaster is running and
> starts it if it is not, but that is not the best solution I can imagine.
> 
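For reference, a check-and-restart script of the kind described above could look roughly like the sketch below. It is not the poster's script. The start command path is an assumption derived from the spool path in this report (adjust it to the local $SGE_ROOT/$SGE_CELL), and `check_once` and `watch` are hypothetical helper names. The process check relies on `pgrep -x`.

```python
import subprocess
import time

# Assumption: start script location; adjust to the local $SGE_ROOT/$SGE_CELL.
START_CMD = ["/opt/sge6/core/common/sgemaster", "start"]

def qmaster_running():
    """Return True if a process named exactly sge_qmaster exists."""
    result = subprocess.run(["pgrep", "-x", "sge_qmaster"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

def start_qmaster():
    """Try to start qmaster; failures are left for the next check to notice."""
    subprocess.run(START_CMD, check=False)

def check_once(is_running=qmaster_running, start=start_qmaster, now=time.time):
    """Restart qmaster if it is down; return the restart timestamp or None."""
    if is_running():
        return None
    stamp = now()
    start()
    return stamp

def watch(interval=60):
    """Poll forever and print a timestamped line for every restart attempt."""
    while True:
        stamp = check_once()
        if stamp is not None:
            when = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(stamp))
            print(when, "sge_qmaster was down, restart attempted", flush=True)
        time.sleep(interval)
```

Calling `watch()` starts the loop. Printing a timestamp for each restart gives a record that can later be compared with the kernel messages. Such a watchdog only hides the symptom; it does not replace the patch mentioned at the top of this reply.
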
> Any suggestions on how to cure this are very welcome.
>  
> Regards,
> Viktor
> p.s.
> [20:40:53]udo at rupc-caip-03:~>qstat -v
> GE 6.2u5
> usage: qstat [options]
> 
> Below I give all error messages of the queueing system for today, during which
> 3 or 4 crashes happened. I can't see anything suspicious in the qmaster logs:
> /opt/sge6/core/spool/qmaster>tail -130 messages
> 
> 
> 11/09/2010 01:56:51|worker|rupc-caip-03|W|job 3919.1 failed on host n105
> assumedly after job because: job 3919.1 died through signal TERM (15)
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|read job database with 6 entries
> in 0 seconds
> 11/09/2010 07:39:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|read job database with 6 entries
> in 0 seconds
> 11/09/2010 08:00:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 08:00:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 13:34:25|worker|rupc-caip-03|W|job 3920.1 failed on host n101
> assumedly after job because: job 3920.1 died through signal TERM (15)
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|read job database with 5 entries
> in 0 seconds
> 11/09/2010 13:52:02|  main|rupc-caip-03|W|removing reference to no longer
> existing job 3920 of user "sinisa"
> 
> 11/09/2010 13:52:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 13:52:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|read job database with 4 entries
> in 0 seconds
> 11/09/2010 14:13:02|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 14:13:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 11/09/2010 14:47:45|worker|rupc-caip-03|W|job 3935.1 failed on host n114
> assumedly after job because: job 3935.1 died through signal TERM (15)
> 11/09/2010 15:11:46|worker|rupc-caip-03|W|job 3938.1 failed on host n114
> assumedly after job because: job 3938.1 died through signal TERM (15)
> 11/09/2010 15:21:10|worker|rupc-caip-03|E|tightly integrated parallel task
> 3940.1 task 1.n114 failed - killing job
> 11/09/2010 15:22:26|worker|rupc-caip-03|W|job 3940.1 failed on host n114
> assumedly after job because: job 3940.1 died through signal TERM (15)
> 11/09/2010 15:27:46|worker|rupc-caip-03|W|job 3939.1 failed on host n114
> assumedly after job because: job 3939.1 died through signal TERM (15)
> 11/09/2010 15:33:07|worker|rupc-caip-03|W|job 3943.1 failed on host n114
> assumedly after job because: job 3943.1 died through signal TERM (15)
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|read job database with 7 entries
> in 1 seconds
> 11/09/2010 17:39:03|  main|rupc-caip-03|E|error opening file
> "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or
> directory
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster hard descriptor limit is
> set to 8192
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster soft descriptor limit is
> set to 8192
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will use max. 8172 file
> descriptors for communication
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will accept max. 99
> dynamic event clients
> 11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> 
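Since the crash itself leaves no record in this file, one way to narrow it down is to list every "starting up GE" record together with the last record a non-main thread wrote before it. The sketch below assumes the on-disk format is one record per line, `date time|thread|host|level|message`, as the excerpt above suggests. The sample is a few records from that excerpt with the mail wraps re-joined, and `parse_line` and `restarts` are hypothetical helper names.

```python
from datetime import datetime

# A few records from the excerpt above, with the mail line wraps re-joined.
SAMPLE = """\
11/09/2010 01:56:51|worker|rupc-caip-03|W|job 3919.1 failed on host n105 assumedly after job because: job 3919.1 died through signal TERM (15)
11/09/2010 07:39:02|  main|rupc-caip-03|I|read job database with 6 entries in 0 seconds
11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
11/09/2010 15:21:10|worker|rupc-caip-03|E|tightly integrated parallel task 3940.1 task 1.n114 failed - killing job
11/09/2010 15:33:07|worker|rupc-caip-03|W|job 3943.1 failed on host n114 assumedly after job because: job 3943.1 died through signal TERM (15)
11/09/2010 17:39:03|  main|rupc-caip-03|I|read job database with 7 entries in 1 seconds
11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
"""

def parse_line(line):
    """Split one record into (time, thread, host, level, message) or None."""
    parts = line.split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, thread.strip(), host, level, message.strip()

def restarts(text):
    """Return [(startup time, last non-main record before it or None), ...]."""
    results = []
    last = None
    for line in text.splitlines():
        rec = parse_line(line)
        if rec is None:
            continue  # skip blank or malformed lines
        when, thread, _host, _level, message = rec
        if message.startswith("starting up GE"):
            results.append((when, last))
            last = None  # do not reuse the record for the next startup
        elif thread != "main":
            last = (when, message)
    return results

for started, last in restarts(SAMPLE):
    if last is None:
        print(f"{started}  restart, no earlier worker record")
    else:
        print(f"{started}  restart, last worker record {last[0]}: {last[1][:40]}")
```

The gap between the last worker record and the restart is only an upper bound on when the crash happened. Still, if the last record before most restarts concerns a finishing parallel job, that would match the pattern described at the top of this reply.
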
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294409
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
-- 
| Andreas Haupt             | E-Mail: andreas.haupt at desy.de
|  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6          | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen          | Fax:    +49/33762/7-7216

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294515
