[GE users] sge master dying

Iwona Sakrejda isakrejda at lbl.gov
Wed Jun 13 19:05:37 BST 2007



Sorry again, wrong paste, I hope this has what you need...



Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -1220095328 (LWP 16919)]
[New Thread -1317024848 (LWP 17016)]
[New Thread -1306534992 (LWP 17015)]
[New Thread -1296041040 (LWP 17014)]
[New Thread -1283458128 (LWP 17013)]
[New Thread -1265304656 (LWP 16926)]
[New Thread -1254814800 (LWP 16925)]
[New Thread -1244324944 (LWP 16924)]
[New Thread -1233835088 (LWP 16923)]
[New Thread -1223345232 (LWP 16922)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
(gdb) cont
Continuing.

Program received signal SIGBUS, Bus error.
[Switching to Thread -1317024848 (LWP 17016)]
0x0809a007 in hgroup_mod ()
(gdb) where
#0  0x0809a007 in hgroup_mod ()
#1  0x0806d563 in sge_gdi_add_mod_generic ()
#2  0x0806bb40 in sge_c_gdi_mod ()
#3  0x08068cf4 in sge_c_gdi ()
#4  0x080a941c in do_gdi_request ()
#5  0x080a9239 in sge_qmaster_process_message ()
#6  0x0806710c in message_thread ()
#7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
#8  0xb754ad2a in clone () from /lib/tls/libc.so.6
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919


Iwona Sakrejda wrote:
> Sorry, been a while since I did active development and debugging.
> Here it is (and actually I can do qconf for users and queues,
> just this qconf for hostgroups is giving me grief...)
>
>
> iwona
>
>
> [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
>
> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
> Reading symbols from /lib/libdl.so.2...done.
> Loaded symbols for /lib/libdl.so.2
> Reading symbols from /lib/tls/libm.so.6...done.
> Loaded symbols for /lib/tls/libm.so.6
> Reading symbols from /lib/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread -1220095328 (LWP 16569)]
> [New Thread -1317291088 (LWP 16704)]
> [New Thread -1306801232 (LWP 16703)]
> [New Thread -1296307280 (LWP 16702)]
> [New Thread -1285555280 (LWP 16701)]
> [New Thread -1265304656 (LWP 16575)]
> [New Thread -1254814800 (LWP 16574)]
> [New Thread -1244324944 (LWP 16573)]
> [New Thread -1233835088 (LWP 16572)]
> [New Thread -1223345232 (LWP 16570)]
> Loaded symbols for /lib/tls/libpthread.so.0
> Reading symbols from /lib/tls/libc.so.6...done.
> Loaded symbols for /lib/tls/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
> Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
> Reading symbols from /lib/libnss_dns.so.2...done.
> Loaded symbols for /lib/libnss_dns.so.2
> Reading symbols from /lib/libresolv.so.2...done.
> Loaded symbols for /lib/libresolv.so.2
> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
> (gdb) cont
> Continuing.
>
> Program received signal SIGBUS, Bus error.
> [Switching to Thread -1317291088 (LWP 16704)]
> 0x0809a007 in hgroup_mod ()
> (gdb) quit
>
>
> Rayson Ho wrote:
>> Use the gdb sub-command "where" to show the stack trace...
>>
>> Rayson
>>
>>
>>
>> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>> Here is what I see when it crashes while attached to gdb:
>>> [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
>>> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>>> Copyright 2004 Free Software Foundation, Inc.
>>> GDB is free software, covered by the GNU General Public License, and you are
>>> welcome to change it and/or distribute copies of it under certain conditions.
>>> Type "show copying" to see the conditions.
>>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>>> This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
>>>
>>> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
>>> Reading symbols from /lib/libdl.so.2...done.
>>> Loaded symbols for /lib/libdl.so.2
>>> Reading symbols from /lib/tls/libm.so.6...done.
>>> Loaded symbols for /lib/tls/libm.so.6
>>> Reading symbols from /lib/tls/libpthread.so.0...done.
>>> [Thread debugging using libthread_db enabled]
>>> [New Thread -1220095328 (LWP 16569)]
>>> [New Thread -1317291088 (LWP 16704)]
>>> [New Thread -1306801232 (LWP 16703)]
>>> [New Thread -1296307280 (LWP 16702)]
>>> [New Thread -1285555280 (LWP 16701)]
>>> [New Thread -1265304656 (LWP 16575)]
>>> [New Thread -1254814800 (LWP 16574)]
>>> [New Thread -1244324944 (LWP 16573)]
>>> [New Thread -1233835088 (LWP 16572)]
>>> [New Thread -1223345232 (LWP 16570)]
>>> Loaded symbols for /lib/tls/libpthread.so.0
>>> Reading symbols from /lib/tls/libc.so.6...done.
>>> Loaded symbols for /lib/tls/libc.so.6
>>> Reading symbols from /lib/ld-linux.so.2...done.
>>> Loaded symbols for /lib/ld-linux.so.2
>>> Reading symbols from /lib/libnss_files.so.2...done.
>>> Loaded symbols for /lib/libnss_files.so.2
>>> Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>> Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>> Reading symbols from /lib/libnss_dns.so.2...done.
>>> Loaded symbols for /lib/libnss_dns.so.2
>>> Reading symbols from /lib/libresolv.so.2...done.
>>> Loaded symbols for /lib/libresolv.so.2
>>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>> (gdb) cont
>>> Continuing.
>>>
>>> Program received signal SIGBUS, Bus error.
>>> [Switching to Thread -1317291088 (LWP 16704)]
>>> 0x0809a007 in hgroup_mod ()
>>> (gdb) quit
>>>
>>>
>>>
>>> Rayson Ho wrote:
>>> > Can you attach qmaster with a debugger, so that we can get the stack
>>> > trace when it dies??
>>> >
>>> > Rayson
>>> >
>>> >
>>> >
>>> > On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>> >> Hi,
>>> >>
>>> >> I am running SGE 6.0u4 on RHEL3 and it's been running OK for a year
>>> >> or so.
>>> >> Last week I tried qconf -mhgrp and this command repeatedly kills all the
>>> >> sge processes on the headnode. I connected with strace to the sgeadmin
>>> >> before it died and I only see:
>>> >> Process 16727 attached - interrupt to quit
>>> >> futex(0xb03bebf8, FUTEX_WAIT, 16822, NULL) = -1 EINTR (Interrupted system call)
>>> >> +++ killed by SIGBUS +++
>>> >>
>>> >> Nothing exciting in the logs, it's just going about its business...
>>> >>
>>> >> Suggestions on how to approach this problem would be appreciated...
>>> >>
>>> >> Thank You,
>>> >>
>>> >> iwona
>>> >>
>>> >> ---------------------------------------------------------------------
>>> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> >> For additional commands, e-mail: users-help at gridengine.sunsource.net




