[GE users] qmaster problem on sge 6.5

udo udo at physics.rutgers.edu
Wed Nov 10 06:24:09 GMT 2010


The 32-bit SuSE 10.2 has the following (I no longer have a 64-bit 10.2 installation to compare against):
[01:16:37]udo at rupc01:~>ls -l /lib/libc-2.5.so
-rwxr-xr-x 1 root root 1491141 2007-11-21 12:36 /lib/libc-2.5.so

:~>uname -a
Linux rupc01 2.6.31.5-default #4 SMP Sun Jul 18 08:04:30 EDT 2010 i686 athlon i386 GNU/Linux
:~>more /etc/issue
Welcome to openSUSE 10.2 (i586) - Kernel \r (\l).


The 64-bit 11.2, by contrast, has:

:~>uname -a
Linux rupc-caip-03 2.6.31.14-0.4-desktop #1 SMP PREEMPT 2010-10-25 08:45:30 +0200 x86_64 x86_64 x86_64 GNU/Linux
:~>more /etc/issue
Welcome to openSUSE 11.2 "Emerald" - Kernel \r (\l).

:~>ls -l /lib64/libc-2.10.1.so
-rwxr-xr-x 1 root root 1408560 2010-10-27 03:34 /lib64/libc-2.10.1.so

What should be done?
Would reinstalling SGE help, or is something else needed?
Best,
v


> -----Original Message-----
> From: jmarshall [mailto:John.Marshall at ec.gc.ca]
> Sent: Tuesday, November 09, 2010 20:56
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] qmaster problem on sge 6.5
> 
> On 11/09/2010 08:45 PM, udo wrote:
> > Dear SGE community,
> >
> > Recently I started experiencing severe problems with the qmaster of SGE 6.5.
> > Symptoms like these:
> >
> > [2221661.821388] sge_qmaster[24247] general protection ip:55d12b sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]
> > [2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp 00002b23dca06990 error 4 in sge_qmaster[400000+237000]
> > [2243450.822069] sge_qmaster[12770] general protection ip:55c698 sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]
> >
> >
> > [179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp 00007f81d10f7990 error 4 in sge_qmaster[400000+237000]
> > [181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip 000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]
> > [193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip 000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]
> >
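
One way to read the kernel messages above is to group them by instruction pointer, since repeated crashes at the same address would point at one code path in sge_qmaster rather than random memory corruption. The following is only a minimal sketch in Python: the function name crash_sites and the regular expression are my own, and it assumes each kernel message sits on a single line, as in the sample.

```python
import re
from collections import Counter

# Kernel messages copied from this thread, one message per line.
SAMPLE = """\
[2221661.821388] sge_qmaster[24247] general protection ip:55d12b sp:7ff8d66f78c0 error:0 in sge_qmaster[400000+237000]
[2240208.823037] sge_qmaster[12529]: segfault at 24 ip 000000000055c698 sp 00002b23dca06990 error 4 in sge_qmaster[400000+237000]
[2243450.822069] sge_qmaster[12770] general protection ip:55c698 sp:2acd8fa06990 error:0 in sge_qmaster[400000+237000]
[179722.467392] sge_qmaster[5645]: segfault at 24 ip 000000000055c698 sp 00007f81d10f7990 error 4 in sge_qmaster[400000+237000]
[181072.467118] sge_qmaster[12839]: segfault at 192d00001934 ip 000000000055c698 sp 00007f90376f7990 error 4 in sge_qmaster[400000+237000]
[193402.467308] sge_qmaster[13899]: segfault at 7f2c0000000f ip 000000000055c698 sp 00007f2ce40f7990 error 4 in sge_qmaster[400000+237000]
"""

# Matches both message shapes seen above: "segfault at ADDR ip IP" and
# "general protection ip:IP". The trailing "[base+size]" must be present.
PATTERN = re.compile(
    r"sge_qmaster\[\d+\]:? "
    r"(?:segfault at [0-9a-f]+ ip (?P<ip1>[0-9a-f]+)"
    r"|general protection ip:(?P<ip2>[0-9a-f]+))"
    r".*? in sge_qmaster\[[0-9a-f]+\+[0-9a-f]+\]"
)

def crash_sites(log_text):
    """Return a Counter mapping instruction pointer (int) to crash count."""
    counts = Counter()
    for match in PATTERN.finditer(log_text):
        ip = int(match.group("ip1") or match.group("ip2"), 16)
        counts[ip] += 1
    return counts

if __name__ == "__main__":
    for ip, n in crash_sites(SAMPLE).most_common():
        print(f"ip 0x{ip:x}: {n} crash(es)")
```

On the six messages above this reports five crashes at 0x55c698 and one at 0x55d12b. If a binary with symbols is available, those addresses could then be looked up with a tool such as addr2line or gdb, though whether that works depends on how this particular sge_qmaster was built.
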
> > The only thing I've done is upgrade from SuSE 10.2 to 11.2.
> That sounds like a big change. Check the difference in libc versions. It
> might be the problem.
> 
> John
> > I should also say that the SGE server is also a file server with 4-port
> > bonding, i.e. I have an Intel 4-port gigabit card whose ports I combined
> > into one bonded interface. I also have one more gigabit port, which I use
> > for communication with the nodes.
> > The cluster is mostly Opteron nodes with 8, 16, or 32 cores.
> >
> > Crashes happen once in a while, but the interval between them can be
> > anywhere from a few minutes to a few hours.
> > Right now I am running a script which checks whether the qmaster is
> > running and starts it if it is not, but this is not the best solution I
> > can imagine.
> >
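
For reference, a check-and-restart workaround of the kind described above could look roughly like the sketch below. This is not the script actually in use on this cluster. The start command path is an assumption derived from the spool directory shown in the logs (SGE root /opt/sge6, cell "core"), the function names are hypothetical, and the process check relies on pgrep being installed.

```python
import subprocess

# Assumption: the usual startup script location for SGE_ROOT=/opt/sge6
# and cell "core"; adjust to the local installation.
START_CMD = ["/opt/sge6/core/common/sgemaster", "start"]

def qmaster_running():
    """Return True if a process named exactly sge_qmaster exists."""
    result = subprocess.run(
        ["pgrep", "-x", "sge_qmaster"],
        stdout=subprocess.DEVNULL,
        check=False,
    )
    # pgrep exits with status 0 when at least one process matched.
    return result.returncode == 0

def start_qmaster():
    """Run the (assumed) startup script; a failure is not raised."""
    subprocess.run(START_CMD, check=False)

def check_once(is_running=qmaster_running, start=start_qmaster):
    """Start the qmaster if it is down. Return True if a start was attempted."""
    if is_running():
        return False
    start()
    return True
```

Calling check_once() from cron every minute or so would reproduce the workaround. It only hides the crashes, so it is worth recording the time of every restart to correlate with the kernel messages.
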
> > Any suggestions on how to cure this are very welcome.
> >
> > Regards,
> > Viktor
> > p.s.
> > [20:40:53]udo at rupc-caip-03:~>qstat -v
> > GE 6.2u5
> > usage: qstat [options]
> >
> > Below I give all of today's error messages from the queueing system,
> > during which 3 or 4 crashes happened. I can't see anything suspicious in
> > the qmaster logs:
> > /opt/sge6/core/spool/qmaster>tail -130 messages
> >
> >
> > 11/09/2010 01:56:51|worker|rupc-caip-03|W|job 3919.1 failed on host n105 assumedly after job because: job 3919.1 died through signal TERM (15)
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|read job database with 6 entries in 0 seconds
> > 11/09/2010 07:39:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 07:39:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|read job database with 6 entries in 0 seconds
> > 11/09/2010 08:00:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 08:00:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 13:34:25|worker|rupc-caip-03|W|job 3920.1 failed on host n101 assumedly after job because: job 3920.1 died through signal TERM (15)
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|read job database with 5 entries in 0 seconds
> > 11/09/2010 13:52:02|  main|rupc-caip-03|W|removing reference to no longer existing job 3920 of user "sinisa"
> >
> > 11/09/2010 13:52:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 13:52:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|read job database with 4 entries in 0 seconds
> > 11/09/2010 14:13:02|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 14:13:02|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> > 11/09/2010 14:47:45|worker|rupc-caip-03|W|job 3935.1 failed on host n114 assumedly after job because: job 3935.1 died through signal TERM (15)
> > 11/09/2010 15:11:46|worker|rupc-caip-03|W|job 3938.1 failed on host n114 assumedly after job because: job 3938.1 died through signal TERM (15)
> > 11/09/2010 15:21:10|worker|rupc-caip-03|E|tightly integrated parallel task 3940.1 task 1.n114 failed - killing job
> > 11/09/2010 15:22:26|worker|rupc-caip-03|W|job 3940.1 failed on host n114 assumedly after job because: job 3940.1 died through signal TERM (15)
> > 11/09/2010 15:27:46|worker|rupc-caip-03|W|job 3939.1 failed on host n114 assumedly after job because: job 3939.1 died through signal TERM (15)
> > 11/09/2010 15:33:07|worker|rupc-caip-03|W|job 3943.1 failed on host n114 assumedly after job because: job 3943.1 died through signal TERM (15)
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|read job database with 7 entries in 1 seconds
> > 11/09/2010 17:39:03|  main|rupc-caip-03|E|error opening file "/opt/sge6/core/spool/qmaster/./sharetree" for reading: No such file or directory
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster hard descriptor limit is set to 8192
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster soft descriptor limit is set to 8192
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will use max. 8172 file descriptors for communication
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|qmaster will accept max. 99 dynamic event clients
> > 11/09/2010 17:39:03|  main|rupc-caip-03|I|starting up GE 6.2u5 (lx24-amd64)
> >

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=294443

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
