[GE users] sge master dying

Iwona Sakrejda isakrejda at lbl.gov
Thu Jun 14 18:09:13 BST 2007


Here it is, including the thread info...


[root@pc2533 root]# gdb -p 17325
GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu".
Attaching to process 17325
Reading symbols from /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster...done.
Using host libthread_db library "/lib/tls/libthread_db.so.1".
Reading symbols from /lib/libdl.so.2...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/tls/libm.so.6...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libpthread.so.0...done.
[Thread debugging using libthread_db enabled]
[New Thread -1220095328 (LWP 17325)]
[New Thread -1364210768 (LWP 17633)]
[New Thread -1353716816 (LWP 17632)]
[New Thread -1343226960 (LWP 17631)]
[New Thread -1330644048 (LWP 17630)]
[New Thread -1265304656 (LWP 17332)]
[New Thread -1254814800 (LWP 17331)]
[New Thread -1244324944 (LWP 17330)]
[New Thread -1233835088 (LWP 17329)]
[New Thread -1223345232 (LWP 17328)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libc.so.6...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
(gdb) cont
Continuing.

Program received signal SIGBUS, Bus error.
[Switching to Thread -1353716816 (LWP 17632)]
0x0809a007 in hgroup_mod ()
(gdb) where
#0  0x0809a007 in hgroup_mod ()
#1  0x0806d563 in sge_gdi_add_mod_generic ()
#2  0x0806bb40 in sge_c_gdi_mod ()
#3  0x08068cf4 in sge_c_gdi ()
#4  0x080a941c in do_gdi_request ()
#5  0x080a9239 in sge_qmaster_process_message ()
#6  0x0806710c in message_thread ()
#7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
#8  0xb754ad2a in clone () from /lib/tls/libc.so.6
(gdb) info threads
  10 Thread -1223345232 (LWP 17328)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  9 Thread -1233835088 (LWP 17329)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  8 Thread -1244324944 (LWP 17330)  0xb75018c1 in gettimeofday () from /lib/tls/libc.so.6
  7 Thread -1254814800 (LWP 17331)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
  6 Thread -1265304656 (LWP 17332)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  5 Thread -1330644048 (LWP 17630)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  4 Thread -1343226960 (LWP 17631)  0xb75b1c84 in sigwait () from /lib/tls/libpthread.so.0
* 3 Thread -1353716816 (LWP 17632)  0x0809a007 in hgroup_mod ()
  2 Thread -1364210768 (LWP 17633)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
  1 Thread -1220095328 (LWP 17325)  0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 17325
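
To make the thread listing easier to read, here is a rough sketch in Python (not part of SGE or gdb) that tallies the "info threads" output from this session by the function each thread is stopped in. The helper name parse_info_threads and the regular expression are assumptions based only on the line format shown in this session; the mail-wrapped lines have been rejoined by hand, and other gdb versions may print the listing differently.

```python
import re
from collections import Counter

# "info threads" output from this session, with mail-wrapped lines rejoined.
INFO_THREADS = """\
  10 Thread -1223345232 (LWP 17328)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  9 Thread -1233835088 (LWP 17329)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  8 Thread -1244324944 (LWP 17330)  0xb75018c1 in gettimeofday () from /lib/tls/libc.so.6
  7 Thread -1254814800 (LWP 17331)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
  6 Thread -1265304656 (LWP 17332)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  5 Thread -1330644048 (LWP 17630)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
  4 Thread -1343226960 (LWP 17631)  0xb75b1c84 in sigwait () from /lib/tls/libpthread.so.0
* 3 Thread -1353716816 (LWP 17632)  0x0809a007 in hgroup_mod ()
  2 Thread -1364210768 (LWP 17633)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
  1 Thread -1220095328 (LWP 17325)  0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
"""

# Assumed line shape: optional "*", gdb thread number, "Thread <id> (LWP <n>)",
# a hex address, then "in <function> ()".
THREAD_RE = re.compile(
    r"^(?P<current>\*)?\s*(?P<num>\d+)\s+Thread\s+-?\d+\s+"
    r"\(LWP\s+(?P<lwp>\d+)\)\s+0x[0-9a-f]+\s+in\s+(?P<func>\S+)\s+\(\)"
)


def parse_info_threads(text):
    """Return one dict per thread line; lines that do not match are skipped."""
    threads = []
    for line in text.splitlines():
        m = THREAD_RE.match(line)
        if m is None:
            continue
        threads.append({
            "num": int(m.group("num")),
            "lwp": int(m.group("lwp")),
            # Drop the glibc symbol-version suffix, e.g. "@@GLIBC_2.3.2".
            "func": m.group("func").split("@@")[0],
            "current": m.group("current") is not None,
        })
    return threads


if __name__ == "__main__":
    threads = parse_info_threads(INFO_THREADS)
    # Tally threads by the function they are stopped in.
    for func, count in Counter(t["func"] for t in threads).most_common():
        print(f"{count:2d}  {func}")
    # Report the thread gdb marked with "*" (the one that took the signal).
    for t in threads:
        if t["current"]:
            print(f"current thread: {t['num']} (LWP {t['lwp']}) in {t['func']}")
```

On this sample the tally is four threads in pthread_cond_timedwait, two in __lll_mutex_lock_wait, one each in gettimeofday, sigwait and pthread_join, and the thread that received SIGBUS (LWP 17632) in hgroup_mod.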


Rayson Ho wrote:
> Hi,
>
> I almost forgot that qmaster is threaded... can you use the gdb
> sub-command "info threads" to display the status of all threads??
>
> Rayson
>
>
>
> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>> Sorry again, wrong paste, I hope this has what you need...
>>
>>
>>
>> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster,
>> process 16919
>> Reading symbols from /lib/libdl.so.2...done.
>> Loaded symbols for /lib/libdl.so.2
>> Reading symbols from /lib/tls/libm.so.6...done.
>> Loaded symbols for /lib/tls/libm.so.6
>> Reading symbols from /lib/tls/libpthread.so.0...done.
>> [Thread debugging using libthread_db enabled]
>> [New Thread -1220095328 (LWP 16919)]
>> [New Thread -1317024848 (LWP 17016)]
>> [New Thread -1306534992 (LWP 17015)]
>> [New Thread -1296041040 (LWP 17014)]
>> [New Thread -1283458128 (LWP 17013)]
>> [New Thread -1265304656 (LWP 16926)]
>> [New Thread -1254814800 (LWP 16925)]
>> [New Thread -1244324944 (LWP 16924)]
>> [New Thread -1233835088 (LWP 16923)]
>> [New Thread -1223345232 (LWP 16922)]
>> Loaded symbols for /lib/tls/libpthread.so.0
>> Reading symbols from /lib/tls/libc.so.6...done.
>> Loaded symbols for /lib/tls/libc.so.6
>> Reading symbols from /lib/ld-linux.so.2...done.
>> Loaded symbols for /lib/ld-linux.so.2
>> Reading symbols from /lib/libnss_files.so.2...done.
>> Loaded symbols for /lib/libnss_files.so.2
>> Reading symbols from
>> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>> Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
>> Reading symbols from /lib/libnss_dns.so.2...done.
>> Loaded symbols for /lib/libnss_dns.so.2
>> Reading symbols from /lib/libresolv.so.2...done.
>> Loaded symbols for /lib/libresolv.so.2
>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>> (gdb) cont
>> Continuing.
>>
>> Program received signal SIGBUS, Bus error.
>> [Switching to Thread -1317024848 (LWP 17016)]
>> 0x0809a007 in hgroup_mod ()
>> (gdb) where
>> #0  0x0809a007 in hgroup_mod ()
>> #1  0x0806d563 in sge_gdi_add_mod_generic ()
>> #2  0x0806bb40 in sge_c_gdi_mod ()
>> #3  0x08068cf4 in sge_c_gdi ()
>> #4  0x080a941c in do_gdi_request ()
>> #5  0x080a9239 in sge_qmaster_process_message ()
>> #6  0x0806710c in message_thread ()
>> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
>> #8  0xb754ad2a in clone () from /lib/tls/libc.so.6
>> (gdb) quit
>> The program is running.  Quit anyway (and detach it)? (y or n) y
>> Detaching from program:
>> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919
>>
>>
>> Iwona Sakrejda wrote:
>> > Sorry, been a while since I did active development and debugging.
>> > Here it is (and actually I can do qconf for users and queues,
>> > just this qconf for hostgroups is giving me grief...)
>> >
>> >
>> > iwona
>> >
>> >
>> > [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
>> > GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>> > Copyright 2004 Free Software Foundation, Inc.
>> > GDB is free software, covered by the GNU General Public License, and
>> > you are
>> > welcome to change it and/or distribute copies of it under certain
>> > conditions.
>> > Type "show copying" to see the conditions.
>> > There is absolutely no warranty for GDB.  Type "show warranty" for
>> > details.
>> > This GDB was configured as "i386-redhat-linux-gnu"...Using host
>> > libthread_db library "/lib/tls/libthread_db.so.1".
>> >
>> > Attaching to program:
>> > /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
>> > Reading symbols from /lib/libdl.so.2...done.
>> > Loaded symbols for /lib/libdl.so.2
>> > Reading symbols from /lib/tls/libm.so.6...done.
>> > Loaded symbols for /lib/tls/libm.so.6
>> > Reading symbols from /lib/tls/libpthread.so.0...done.
>> > [Thread debugging using libthread_db enabled]
>> > [New Thread -1220095328 (LWP 16569)]
>> > [New Thread -1317291088 (LWP 16704)]
>> > [New Thread -1306801232 (LWP 16703)]
>> > [New Thread -1296307280 (LWP 16702)]
>> > [New Thread -1285555280 (LWP 16701)]
>> > [New Thread -1265304656 (LWP 16575)]
>> > [New Thread -1254814800 (LWP 16574)]
>> > [New Thread -1244324944 (LWP 16573)]
>> > [New Thread -1233835088 (LWP 16572)]
>> > [New Thread -1223345232 (LWP 16570)]
>> > Loaded symbols for /lib/tls/libpthread.so.0
>> > Reading symbols from /lib/tls/libc.so.6...done.
>> > Loaded symbols for /lib/tls/libc.so.6
>> > Reading symbols from /lib/ld-linux.so.2...done.
>> > Loaded symbols for /lib/ld-linux.so.2
>> > Reading symbols from /lib/libnss_files.so.2...done.
>> > Loaded symbols for /lib/libnss_files.so.2
>> > Reading symbols from
>> > /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>> > Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>> > Reading symbols from /lib/libnss_dns.so.2...done.
>> > Loaded symbols for /lib/libnss_dns.so.2
>> > Reading symbols from /lib/libresolv.so.2...done.
>> > Loaded symbols for /lib/libresolv.so.2
>> > 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>> > (gdb) cont
>> > Continuing.
>> >
>> > Program received signal SIGBUS, Bus error.
>> > [Switching to Thread -1317291088 (LWP 16704)]
>> > 0x0809a007 in hgroup_mod ()
>> > (gdb) quit
>> >
>> >
>> > Rayson Ho wrote:
>> >> Use the gdb sub-command "where" to show the stack trace...
>> >>
>> >> Rayson
>> >>
>> >>
>> >>
>> >> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>> >>> Here is what I see when it crashes while attached to gdb:
>> >>> [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
>> >>> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>> >>> Copyright 2004 Free Software Foundation, Inc.
>> >>> GDB is free software, covered by the GNU General Public License, and
>> >>> you are
>> >>> welcome to change it and/or distribute copies of it under certain
>> >>> conditions.
>> >>> Type "show copying" to see the conditions.
>> >>> There is absolutely no warranty for GDB.  Type "show warranty" for
>> >>> details.
>> >>> This GDB was configured as "i386-redhat-linux-gnu"...Using host
>> >>> libthread_db library "/lib/tls/libthread_db.so.1".
>> >>>
>> >>> Attaching to program:
>> >>> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster,
>> >>> process 16569
>> >>> Reading symbols from /lib/libdl.so.2...done.
>> >>> Loaded symbols for /lib/libdl.so.2
>> >>> Reading symbols from /lib/tls/libm.so.6...done.
>> >>> Loaded symbols for /lib/tls/libm.so.6
>> >>> Reading symbols from /lib/tls/libpthread.so.0...done.
>> >>> [Thread debugging using libthread_db enabled]
>> >>> [New Thread -1220095328 (LWP 16569)]
>> >>> [New Thread -1317291088 (LWP 16704)]
>> >>> [New Thread -1306801232 (LWP 16703)]
>> >>> [New Thread -1296307280 (LWP 16702)]
>> >>> [New Thread -1285555280 (LWP 16701)]
>> >>> [New Thread -1265304656 (LWP 16575)]
>> >>> [New Thread -1254814800 (LWP 16574)]
>> >>> [New Thread -1244324944 (LWP 16573)]
>> >>> [New Thread -1233835088 (LWP 16572)]
>> >>> [New Thread -1223345232 (LWP 16570)]
>> >>> Loaded symbols for /lib/tls/libpthread.so.0
>> >>> Reading symbols from /lib/tls/libc.so.6...done.
>> >>> Loaded symbols for /lib/tls/libc.so.6
>> >>> Reading symbols from /lib/ld-linux.so.2...done.
>> >>> Loaded symbols for /lib/ld-linux.so.2
>> >>> Reading symbols from /lib/libnss_files.so.2...done.
>> >>> Loaded symbols for /lib/libnss_files.so.2
>> >>> Reading symbols from
>> >>> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>> >>> Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>> >>> Reading symbols from /lib/libnss_dns.so.2...done.
>> >>> Loaded symbols for /lib/libnss_dns.so.2
>> >>> Reading symbols from /lib/libresolv.so.2...done.
>> >>> Loaded symbols for /lib/libresolv.so.2
>> >>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>> >>> (gdb) cont
>> >>> Continuing.
>> >>>
>> >>> Program received signal SIGBUS, Bus error.
>> >>> [Switching to Thread -1317291088 (LWP 16704)]
>> >>> 0x0809a007 in hgroup_mod ()
>> >>> (gdb) quit
>> >>>
>> >>>
>> >>>
>> >>> Rayson Ho wrote:
>> >>> > Can you attach qmaster with a debugger, so that we can get the stack
>> >>> > trace when it dies??
>> >>> >
>> >>> > Rayson
>> >>> >
>> >>> >
>> >>> >
>> >>> > On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>> >>> >> Hi,
>> >>> >>
>> >>> >> I am running SGE 6.0u4 on RHEL3 and it's been running OK for a year
>> >>> >> or so.
>> >>> >> Last week I tried qconf -mhgrp and this command repeatedly kills all the
>> >>> >> SGE processes on the headnode. I connected with strace to the sgeadmin
>> >>> >> before it died and I only see:
>> >>> >> Process 16727 attached - interrupt to quit
>> >>> >> futex(0xb03bebf8, FUTEX_WAIT, 16822, NULL) = -1 EINTR (Interrupted system call)
>> >>> >> +++ killed by SIGBUS +++
>> >>> >>
>> >>> >> Nothing exciting in the logs, it's just going about its business...
>> >>> >>
>> >>> >> Suggestions on how to approach this problem would be appreciated...
>> >>> >>
>> >>> >> Thank you,
>> >>> >>
>> >>> >> iwona
>> >>> >>
>> >>> >>
>> >>> >> ---------------------------------------------------------------------
>> >>> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> >>> >> For additional commands, e-mail: users-help at gridengine.sunsource.net



