[GE users] sge master dying

Andreas.Haas at Sun.COM
Mon Jun 18 09:48:24 BST 2007


Hi Iwona,

You should upgrade to the most recent patch release, 6.0u10.
Since 6.0u4 there have been many changes to the source module

    daemons/qmaster/sge_hgroup_qmaster.c

which implements hgroup_mod(), the function that is crashing.

Regards,
Andreas
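
When checking whether an installed release is older than 6.0u10, note that
update names such as 6.0u4 and 6.0u10 do not order correctly as plain
strings. A minimal Python sketch, assuming version strings of the form
major.minor with an optional uN suffix (parse_sge_version is a hypothetical
helper for illustration, not part of Grid Engine):

```python
import re

# Assumed format: "<major>.<minor>" optionally followed by "u<update>",
# e.g. "6.0", "6.0u4", "6.0u10". Other formats are rejected.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:u(\d+))?$")


def parse_sge_version(version):
    """Turn '6.0u10' into (6, 0, 10); a missing update level counts as 0."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, update = match.groups()
    return (int(major), int(minor), int(update or 0))


# Plain string comparison puts "6.0u10" before "6.0u4" because '1' < '4'.
print("6.0u10" > "6.0u4")  # False

# Comparing the parsed tuples gives the intended ordering.
print(parse_sge_version("6.0u10") > parse_sge_version("6.0u4"))  # True
```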


On Thu, 14 Jun 2007, Iwona Sakrejda wrote:

> Here it includes the thread info...
>
>
> [root at pc2533 root]# gdb -p 17325
> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-redhat-linux-gnu".
> Attaching to process 17325
> Reading symbols from 
> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster...done.
> Using host libthread_db library "/lib/tls/libthread_db.so.1".
> Reading symbols from /lib/libdl.so.2...done.
> Loaded symbols for /lib/libdl.so.2
> Reading symbols from /lib/tls/libm.so.6...done.
> Loaded symbols for /lib/tls/libm.so.6
> Reading symbols from /lib/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread -1220095328 (LWP 17325)]
> [New Thread -1364210768 (LWP 17633)]
> [New Thread -1353716816 (LWP 17632)]
> [New Thread -1343226960 (LWP 17631)]
> [New Thread -1330644048 (LWP 17630)]
> [New Thread -1265304656 (LWP 17332)]
> [New Thread -1254814800 (LWP 17331)]
> [New Thread -1244324944 (LWP 17330)]
> [New Thread -1233835088 (LWP 17329)]
> [New Thread -1223345232 (LWP 17328)]
> Loaded symbols for /lib/tls/libpthread.so.0
> Reading symbols from /lib/tls/libc.so.6...done.
> Loaded symbols for /lib/tls/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> Reading symbols from 
> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
> Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
> Reading symbols from /lib/libnss_dns.so.2...done.
> Loaded symbols for /lib/libnss_dns.so.2
> Reading symbols from /lib/libresolv.so.2...done.
> Loaded symbols for /lib/libresolv.so.2
> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
> (gdb) cont
> Continuing.
>
> Program received signal SIGBUS, Bus error.
> [Switching to Thread -1353716816 (LWP 17632)]
> 0x0809a007 in hgroup_mod ()
> (gdb) where
> #0  0x0809a007 in hgroup_mod ()
> #1  0x0806d563 in sge_gdi_add_mod_generic ()
> #2  0x0806bb40 in sge_c_gdi_mod ()
> #3  0x08068cf4 in sge_c_gdi ()
> #4  0x080a941c in do_gdi_request ()
> #5  0x080a9239 in sge_qmaster_process_message ()
> #6  0x0806710c in message_thread ()
> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
> #8  0xb754ad2a in clone () from /lib/tls/libc.so.6
> (gdb) info threads
> 10 Thread -1223345232 (LWP 17328)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> 9 Thread -1233835088 (LWP 17329)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> 8 Thread -1244324944 (LWP 17330)  0xb75018c1 in gettimeofday () from /lib/tls/libc.so.6
> 7 Thread -1254814800 (LWP 17331)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> 6 Thread -1265304656 (LWP 17332)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> 5 Thread -1330644048 (LWP 17630)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
> 4 Thread -1343226960 (LWP 17631)  0xb75b1c84 in sigwait () from /lib/tls/libpthread.so.0
> * 3 Thread -1353716816 (LWP 17632)  0x0809a007 in hgroup_mod ()
> 2 Thread -1364210768 (LWP 17633)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
> 1 Thread -1220095328 (LWP 17325)  0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
> (gdb) quit
> The program is running.  Quit anyway (and detach it)? (y or n) y
> Detaching from program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, 
> process 17325
>
>
> Rayson Ho wrote:
>> Hi,
>> 
>> I almost forgot that qmaster is threaded... can you use the gdb
>> sub-command "info threads" to display the status of all threads??
>> 
>> Rayson
>> 
>> 
>> 
>> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>> Sorry again, wrong paste, I hope this has what you need...
>>> 
>>> 
>>> 
>>> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster,
>>> process 16919
>>> Reading symbols from /lib/libdl.so.2...done.
>>> Loaded symbols for /lib/libdl.so.2
>>> Reading symbols from /lib/tls/libm.so.6...done.
>>> Loaded symbols for /lib/tls/libm.so.6
>>> Reading symbols from /lib/tls/libpthread.so.0...done.
>>> [Thread debugging using libthread_db enabled]
>>> [New Thread -1220095328 (LWP 16919)]
>>> [New Thread -1317024848 (LWP 17016)]
>>> [New Thread -1306534992 (LWP 17015)]
>>> [New Thread -1296041040 (LWP 17014)]
>>> [New Thread -1283458128 (LWP 17013)]
>>> [New Thread -1265304656 (LWP 16926)]
>>> [New Thread -1254814800 (LWP 16925)]
>>> [New Thread -1244324944 (LWP 16924)]
>>> [New Thread -1233835088 (LWP 16923)]
>>> [New Thread -1223345232 (LWP 16922)]
>>> Loaded symbols for /lib/tls/libpthread.so.0
>>> Reading symbols from /lib/tls/libc.so.6...done.
>>> Loaded symbols for /lib/tls/libc.so.6
>>> Reading symbols from /lib/ld-linux.so.2...done.
>>> Loaded symbols for /lib/ld-linux.so.2
>>> Reading symbols from /lib/libnss_files.so.2...done.
>>> Loaded symbols for /lib/libnss_files.so.2
>>> Reading symbols from
>>> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>> Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>> Reading symbols from /lib/libnss_dns.so.2...done.
>>> Loaded symbols for /lib/libnss_dns.so.2
>>> Reading symbols from /lib/libresolv.so.2...done.
>>> Loaded symbols for /lib/libresolv.so.2
>>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>> (gdb) cont
>>> Continuing.
>>> 
>>> Program received signal SIGBUS, Bus error.
>>> [Switching to Thread -1317024848 (LWP 17016)]
>>> 0x0809a007 in hgroup_mod ()
>>> (gdb) where
>>> #0  0x0809a007 in hgroup_mod ()
>>> #1  0x0806d563 in sge_gdi_add_mod_generic ()
>>> #2  0x0806bb40 in sge_c_gdi_mod ()
>>> #3  0x08068cf4 in sge_c_gdi ()
>>> #4  0x080a941c in do_gdi_request ()
>>> #5  0x080a9239 in sge_qmaster_process_message ()
>>> #6  0x0806710c in message_thread ()
>>> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
>>> #8  0xb754ad2a in clone () from /lib/tls/libc.so.6
>>> (gdb) quit
>>> The program is running.  Quit anyway (and detach it)? (y or n) y
>>> Detaching from program:
>>> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919
>>> 
>>> 
>>> Iwona Sakrejda wrote:
>>> > Sorry, been a while since I did active development and debugging.
>>> > Here it is (and actually I can do qconf for users and queues,
>>> > just this qconf for hostgroups is giving me grief...)
>>> >
>>> >
>>> > iwona
>>> >
>>> >
>>> > [root at pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
>>> > GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>>> > Copyright 2004 Free Software Foundation, Inc.
>>> > GDB is free software, covered by the GNU General Public License, and
>>> > you are
>>> > welcome to change it and/or distribute copies of it under certain
>>> > conditions.
>>> > Type "show copying" to see the conditions.
>>> > There is absolutely no warranty for GDB.  Type "show warranty" for
>>> > details.
>>> > This GDB was configured as "i386-redhat-linux-gnu"...Using host
>>> > libthread_db library "/lib/tls/libthread_db.so.1".
>>> >
>>> > Attaching to program:
>>> > /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
>>> > Reading symbols from /lib/libdl.so.2...done.
>>> > Loaded symbols for /lib/libdl.so.2
>>> > Reading symbols from /lib/tls/libm.so.6...done.
>>> > Loaded symbols for /lib/tls/libm.so.6
>>> > Reading symbols from /lib/tls/libpthread.so.0...done.
>>> > [Thread debugging using libthread_db enabled]
>>> > [New Thread -1220095328 (LWP 16569)]
>>> > [New Thread -1317291088 (LWP 16704)]
>>> > [New Thread -1306801232 (LWP 16703)]
>>> > [New Thread -1296307280 (LWP 16702)]
>>> > [New Thread -1285555280 (LWP 16701)]
>>> > [New Thread -1265304656 (LWP 16575)]
>>> > [New Thread -1254814800 (LWP 16574)]
>>> > [New Thread -1244324944 (LWP 16573)]
>>> > [New Thread -1233835088 (LWP 16572)]
>>> > [New Thread -1223345232 (LWP 16570)]
>>> > Loaded symbols for /lib/tls/libpthread.so.0
>>> > Reading symbols from /lib/tls/libc.so.6...done.
>>> > Loaded symbols for /lib/tls/libc.so.6
>>> > Reading symbols from /lib/ld-linux.so.2...done.
>>> > Loaded symbols for /lib/ld-linux.so.2
>>> > Reading symbols from /lib/libnss_files.so.2...done.
>>> > Loaded symbols for /lib/libnss_files.so.2
>>> > Reading symbols from
>>> > /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>> > Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>> > Reading symbols from /lib/libnss_dns.so.2...done.
>>> > Loaded symbols for /lib/libnss_dns.so.2
>>> > Reading symbols from /lib/libresolv.so.2...done.
>>> > Loaded symbols for /lib/libresolv.so.2
>>> > 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>> > (gdb) cont
>>> > Continuing.
>>> >
>>> > Program received signal SIGBUS, Bus error.
>>> > [Switching to Thread -1317291088 (LWP 16704)]
>>> > 0x0809a007 in hgroup_mod ()
>>> > (gdb) quit
>>> >
>>> >
>>> > Rayson Ho wrote:
>>> >> Use the gdb sub-command "where" to show the stack trace...
>>> >>
>>> >> Rayson
>>> >>
>>> >>
>>> >>
>>> >> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>> >>> Here is what I see when it crashes while attached to gdb:
>>> >>> [root at pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster
>>> >>> 16569
>>> >>> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>>> >>> Copyright 2004 Free Software Foundation, Inc.
>>> >>> GDB is free software, covered by the GNU General Public License, and
>>> >>> you are
>>> >>> welcome to change it and/or distribute copies of it under certain
>>> >>> conditions.
>>> >>> Type "show copying" to see the conditions.
>>> >>> There is absolutely no warranty for GDB.  Type "show warranty" for
>>> >>> details.
>>> >>> This GDB was configured as "i386-redhat-linux-gnu"...Using host
>>> >>> libthread_db library "/lib/tls/libthread_db.so.1".
>>> >>>
>>> >>> Attaching to program:
>>> >>> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster,
>>> >>> process 16569
>>> >>> Reading symbols from /lib/libdl.so.2...done.
>>> >>> Loaded symbols for /lib/libdl.so.2
>>> >>> Reading symbols from /lib/tls/libm.so.6...done.
>>> >>> Loaded symbols for /lib/tls/libm.so.6
>>> >>> Reading symbols from /lib/tls/libpthread.so.0...done.
>>> >>> [Thread debugging using libthread_db enabled]
>>> >>> [New Thread -1220095328 (LWP 16569)]
>>> >>> [New Thread -1317291088 (LWP 16704)]
>>> >>> [New Thread -1306801232 (LWP 16703)]
>>> >>> [New Thread -1296307280 (LWP 16702)]
>>> >>> [New Thread -1285555280 (LWP 16701)]
>>> >>> [New Thread -1265304656 (LWP 16575)]
>>> >>> [New Thread -1254814800 (LWP 16574)]
>>> >>> [New Thread -1244324944 (LWP 16573)]
>>> >>> [New Thread -1233835088 (LWP 16572)]
>>> >>> [New Thread -1223345232 (LWP 16570)]
>>> >>> Loaded symbols for /lib/tls/libpthread.so.0
>>> >>> Reading symbols from /lib/tls/libc.so.6...done.
>>> >>> Loaded symbols for /lib/tls/libc.so.6
>>> >>> Reading symbols from /lib/ld-linux.so.2...done.
>>> >>> Loaded symbols for /lib/ld-linux.so.2
>>> >>> Reading symbols from /lib/libnss_files.so.2...done.
>>> >>> Loaded symbols for /lib/libnss_files.so.2
>>> >>> Reading symbols from
>>> >>> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>> >>> Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>> >>> Reading symbols from /lib/libnss_dns.so.2...done.
>>> >>> Loaded symbols for /lib/libnss_dns.so.2
>>> >>> Reading symbols from /lib/libresolv.so.2...done.
>>> >>> Loaded symbols for /lib/libresolv.so.2
>>> >>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>> >>> (gdb) cont
>>> >>> Continuing.
>>> >>>
>>> >>> Program received signal SIGBUS, Bus error.
>>> >>> [Switching to Thread -1317291088 (LWP 16704)]
>>> >>> 0x0809a007 in hgroup_mod ()
>>> >>> (gdb) quit
>>> >>>
>>> >>>
>>> >>>
>>> >>> Rayson Ho wrote:
>>> >>> > Can you attach qmaster with a debugger, so that we can get the stack
>>> >>> > trace when it dies??
>>> >>> >
>>> >>> > Rayson
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>> >>> >> Hi,
>>> >>> >>
>>> >>> >> I am running SGE 6.0u4 on RHEL3, and it has been running OK for a year
>>> >>> >> or so.
>>> >>> >> Last week I tried qconf -mhgrp, and this command repeatedly kills all the
>>> >>> >> SGE processes on the head node. I connected with strace to the sgeadmin
>>> >>> >> process before it died, and I only see:
>>> >>> >> Process 16727 attached - interrupt to quit
>>> >>> >> futex(0xb03bebf8, FUTEX_WAIT, 16822, NULL) = -1 EINTR (Interrupted system call)
>>> >>> >> +++ killed by SIGBUS +++
>>> >>> >>
>>> >>> >> Nothing exciting in the logs; it's just going about its business...
>>> >>> >>
>>> >>> >> Suggestions on how to approach this problem would be appreciated...
>>> >>> >>
>>> >>> >> Thank you,
>>> >>> >>
>>> >>> >> iwona
>>> >>> >>
>>> >>> >>
>>> >>> >> ---------------------------------------------------------------------
>>> >>> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> >>> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>> >>> >>
>>> >>> >>
>>> >>> >
>
>
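
The "where" output quoted above can be reduced to a list of function names,
which makes it easier to compare traces from several crashes. A minimal
Python sketch, assuming frame lines in the format shown in this thread
(parse_backtrace is a hypothetical helper for illustration, not a gdb or
Grid Engine API):

```python
import re

# Matches gdb frame lines such as:
#   "#0  0x0809a007 in hgroup_mod ()"
#   "#7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0"
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(
    r"#(?P<index>\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(?P<func>[^\s(]+)\s*\("
)


def parse_backtrace(text):
    """Return (frame_index, function_name) pairs found in gdb `where` output."""
    frames = []
    for line in text.splitlines():
        # Strip mail quote markers ("> ", ">>> > ") before matching.
        stripped = line.lstrip("> \t")
        match = FRAME_RE.match(stripped)
        if match:
            frames.append((int(match.group("index")), match.group("func")))
    return frames


# A few frames copied from the trace in this thread, quote markers included.
SAMPLE = """\
> (gdb) where
> #0  0x0809a007 in hgroup_mod ()
> #1  0x0806d563 in sge_gdi_add_mod_generic ()
> #2  0x0806bb40 in sge_c_gdi_mod ()
> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
"""

frames = parse_backtrace(SAMPLE)
print([func for _, func in frames])
# ['hgroup_mod', 'sge_gdi_add_mod_generic', 'sge_c_gdi_mod', 'start_thread']
```

Frame #0 is the innermost one, so frames[0] names the function that
received the signal (hgroup_mod in the traces here).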

http://gridengine.info/

Registered office: Sun Microsystems GmbH, Sonnenallee 1, D-85551 Kirchheim-Heimstetten
District Court Munich (Amtsgericht Muenchen): HRB 161028
Managing Directors: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
Chairman of the Supervisory Board: Martin Haering




