[GE users] qmaster problem on sge 6.5

udo udo at physics.rutgers.edu
Wed Nov 10 15:22:13 GMT 2010


Hi, Andreas,

I think you are 100% right!
If you look at the log I sent in the original e-mail, I think the crashes
correlate with tightly integrated jobs finishing! Yeah!
I will look at the patch and install it.
Let me try it. Otherwise I was starting to lose confidence in SGE.
Actually I was happy with 6.0u4 (a very stable release); the only reason I
wanted to move to 6.2 is per-user resource limits! If I had that luxury in
6.0u4 I wouldn't have moved forward.
Regards,
v
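
A rough way to check that correlation against the qmaster messages file is
sketched below. It is only a minimal Python sketch, assuming the
pipe-separated record format seen in the quoted log
(timestamp|thread|host|level|text). The function names are made up for
illustration. It reports the last worker-thread entry logged before each
"starting up GE" line, which hints at what qmaster was doing before it was
restarted but proves nothing about the cause.

```python
from datetime import datetime


def parse_line(line):
    """Split one messages record into (timestamp, thread, level, text).

    Returns None for lines that do not match the assumed
    'MM/DD/YYYY HH:MM:SS|thread|host|level|text' layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        ts = datetime.strptime(parts[0].strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return ts, parts[1].strip(), parts[3].strip(), parts[4].strip()


def last_event_before_restarts(lines):
    """For each qmaster start-up, return (start time, last worker entry).

    The worker entry is a (timestamp, text) tuple, or None when no
    worker-thread message was logged since the previous start-up.
    """
    results = []
    last_worker = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue  # skip blank or wrapped continuation lines
        ts, thread, _level, text = rec
        if thread == "worker":
            last_worker = (ts, text)
        elif text.startswith("starting up GE"):
            results.append((ts, last_worker))
            last_worker = None
    return results


if __name__ == "__main__":
    import sys

    # Usage: python restarts.py /opt/sge6/core/spool/qmaster/messages
    with open(sys.argv[1]) as fh:
        for started, event in last_event_before_restarts(fh):
            print(started, "<-", event if event else "no worker entry")
```

If most restarts follow a "tightly integrated parallel task ... failed" or
"job ... failed" entry by a short interval, that would be consistent with
the explanation above.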

> -----Original Message-----
> From: ah_sunsource [mailto:ahaupt at ifh.de]
> Sent: Wednesday, November 10, 2010 10:00
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] qmaster problem on sge 6.5
> 
> Hi Udo,
> 
> sounds like the infamous segfault problem of 6.2 ... It occurred at our
> site almost regularly when tightly integrated parallel jobs finished.
> 
> The fix was posted some months ago on this list:
> http://markmail.org/message/njw5ukra3vknfpun - it helped in our case.
> Thanks to Dave again for providing the patch! :-)
> 
> Cheers,
> Andreas
> 
> On Tue, 2010-11-09 at 20:45 -0500, udo wrote:
> > Dear SGE community,
> >
> > Recently I started experiencing severe problems with the qmaster of
> > SGE 6.2u5. The symptoms look like this:
> >
> > [2221661.821388] sge_qmaster[24247] general protection ip:55d12b sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]
> > [2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp 00002b23dca06990 error 4 in sge_qmaster[400000+237000]
> > [2243450.822069] sge_qmaster[12770] general protection ip:55c698 sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]
> >
> >
> > [179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp 00007f81d10f7990 error 4 in sge_qmaster[400000+237000]
> > [181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip 000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]
> > [193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip 000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]
> >
> > The only thing I have done is upgrade from SuSE 10.2 to 11.2. I should
> > also say that the SGE server is also a file server with 4-port bonding,
> > i.e. I have an Intel 4-port gigabit card whose ports I combined into one
> > bonded interface. I also have one more gigabit port which I use for
> > communication with the nodes.
> > The cluster is mostly Opteron nodes with 8, 16, or 32 cores.
> >
> > The crash happens once in a while, but the interval can be anything from
> > a few minutes to a few hours.
> > Right now I am running a script which checks whether qmaster is running
> > and, if not, starts it, but it is not the best solution I can imagine.
> >
> > Any suggestions on how to cure this are very welcome.
> >
> > Regards,
> > Viktor
> > p.s.
> > [20:40:53]udo at rupc-caip-03:~>qstat -v
> > GE 6.2u5
> > usage: qstat [options]
> >
> > Below I give all the messages from the queueing system for today, during
> > which 3 or 4 crashes happened. I can't see anything suspicious in the
> > qmaster logs:
> > /opt/sge6/core/spool/qmaster>tail -130 messages
> >
> >
> > 11/09/2010 01:56:51|worker|rupc-caip-03|W|job 3919.1 failed on host n105 assumedly after job because: job 3919.1 died through signal TERM (15)
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|read job database with 6 entries in 0 seconds
> > 11/09/2010 07:39:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|read job database with 6 entries in 0 seconds
> > 11/09/2010 08:00:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 13:34:25|worker|rupc-caip-03|W|job 3920.1 failed on host n101 assumedly after job because: job 3920.1 died through signal TERM (15)
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|read job database with 5 entries in 0 seconds
> > 11/09/2010 13:52:02|  main|rupc-caip-03|W|removing reference to no longer existing job 3920 of user "sinisa"
> >
> > 11/09/2010 13:52:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|read job database with 4 entries in 0 seconds
> > 11/09/2010 14:13:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 14:47:45|worker|rupc-caip-03|W|job 3935.1 failed on host n114 assumedly after job because: job 3935.1 died through signal TERM (15)
> > 11/09/2010 15:11:46|worker|rupc-caip-03|W|job 3938.1 failed on host n114 assumedly after job because: job 3938.1 died through signal TERM (15)
> > 11/09/2010 15:21:10|worker|rupc-caip-03|E|tightly integrated parallel task 3940.1 task 1.n114 failed - killing job
> > 11/09/2010 15:22:26|worker|rupc-caip-03|W|job 3940.1 failed on host n114 assumedly after job because: job 3940.1 died through signal TERM (15)
> > 11/09/2010 15:27:46|worker|rupc-caip-03|W|job 3939.1 failed on host n114 assumedly after job because: job 3939.1 died through signal TERM (15)
> > 11/09/2010 15:33:07|worker|rupc-caip-03|W|job 3943.1 failed on host n114 assumedly after job because: job 3943.1 died through signal TERM (15)
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|read job database with 7 entries in 1 seconds
> > 11/09/2010 17:39:03|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> >
> > ------------------------------------------------------
> > http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294409
> >
> > To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
> --
> | Andreas Haupt             | E-Mail: andreas.haupt at desy.de
> |  DESY Zeuthen             | WWW:    http://www-zeuthen.desy.de/~ahaupt
> |  Platanenallee 6          | Phone:  +49/33762/7-7359
> |  D-15738 Zeuthen          | Fax:    +49/33762/7-7216
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294515
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294520

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


