[GE users] sge master dying

Rayson Ho rayrayson at gmail.com
Thu Jun 14 01:43:36 BST 2007



Hi,

I almost forgot that qmaster is threaded... Can you use the gdb
sub-command "info threads" to display the status of all threads?
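
If it is easier to script than to type at the prompt, here is a rough
sketch of how the thread list and every thread's stack could be captured
in one go. It assumes your gdb accepts -batch and -x (I have not checked
the 6.1 build you are running), and the file name qmaster.gdb and both
function names are made up for this example:

```shell
#!/bin/sh
# Sketch: capture all qmaster thread stacks at the moment it takes the SIGBUS.
# Assumptions: you supply the binary path and PID; qmaster.gdb and the
# function names below are invented for this example.

# Write the gdb commands to run, one per line.
write_gdb_cmds() {
    cat > "$1" <<'EOF'
continue
info threads
thread apply all where
EOF
}

# Attach in batch mode. gdb waits in "continue" until the process stops
# on a signal, then prints the thread list and a stack trace per thread.
dump_qmaster_threads() {
    binary="$1"
    pid="$2"
    cmds="${3:-qmaster.gdb}"
    write_gdb_cmds "$cmds"
    gdb -batch -x "$cmds" "$binary" "$pid"
}

# Example (as root; then trigger "qconf -mhgrp" from another shell):
#   dump_qmaster_threads /path/to/sge_qmaster 16919 > qmaster-threads.txt 2>&1
```

Typing the same three commands at the (gdb) prompt works just as well;
the script only saves retyping if the crash has to be reproduced again.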

Rayson



On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
> Sorry again, wrong paste, I hope this has what you need...
>
>
>
> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster,
> process 16919
> Reading symbols from /lib/libdl.so.2...done.
> Loaded symbols for /lib/libdl.so.2
> Reading symbols from /lib/tls/libm.so.6...done.
> Loaded symbols for /lib/tls/libm.so.6
> Reading symbols from /lib/tls/libpthread.so.0...done.
> [Thread debugging using libthread_db enabled]
> [New Thread -1220095328 (LWP 16919)]
> [New Thread -1317024848 (LWP 17016)]
> [New Thread -1306534992 (LWP 17015)]
> [New Thread -1296041040 (LWP 17014)]
> [New Thread -1283458128 (LWP 17013)]
> [New Thread -1265304656 (LWP 16926)]
> [New Thread -1254814800 (LWP 16925)]
> [New Thread -1244324944 (LWP 16924)]
> [New Thread -1233835088 (LWP 16923)]
> [New Thread -1223345232 (LWP 16922)]
> Loaded symbols for /lib/tls/libpthread.so.0
> Reading symbols from /lib/tls/libc.so.6...done.
> Loaded symbols for /lib/tls/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> Reading symbols from
> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
> Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
> Reading symbols from /lib/libnss_dns.so.2...done.
> Loaded symbols for /lib/libnss_dns.so.2
> Reading symbols from /lib/libresolv.so.2...done.
> Loaded symbols for /lib/libresolv.so.2
> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
> (gdb) cont
> Continuing.
>
> Program received signal SIGBUS, Bus error.
> [Switching to Thread -1317024848 (LWP 17016)]
> 0x0809a007 in hgroup_mod ()
> (gdb) where
> #0  0x0809a007 in hgroup_mod ()
> #1  0x0806d563 in sge_gdi_add_mod_generic ()
> #2  0x0806bb40 in sge_c_gdi_mod ()
> #3  0x08068cf4 in sge_c_gdi ()
> #4  0x080a941c in do_gdi_request ()
> #5  0x080a9239 in sge_qmaster_process_message ()
> #6  0x0806710c in message_thread ()
> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
> #8  0xb754ad2a in clone () from /lib/tls/libc.so.6
> (gdb) quit
> The program is running.  Quit anyway (and detach it)? (y or n) y
> Detaching from program:
> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919
>
>
> Iwona Sakrejda wrote:
> > Sorry, been a while since I did active development and debugging.
> > Here it is (and actually I can do qconf for users and queues,
> > just this qconf for hostgroups is giving me grief...)
> >
> >
> > iwona
> >
> >
> > [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
> > GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
> > Copyright 2004 Free Software Foundation, Inc.
> > GDB is free software, covered by the GNU General Public License, and
> > you are
> > welcome to change it and/or distribute copies of it under certain
> > conditions.
> > Type "show copying" to see the conditions.
> > There is absolutely no warranty for GDB.  Type "show warranty" for
> > details.
> > This GDB was configured as "i386-redhat-linux-gnu"...Using host
> > libthread_db library "/lib/tls/libthread_db.so.1".
> >
> > Attaching to program:
> > /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
> > Reading symbols from /lib/libdl.so.2...done.
> > Loaded symbols for /lib/libdl.so.2
> > Reading symbols from /lib/tls/libm.so.6...done.
> > Loaded symbols for /lib/tls/libm.so.6
> > Reading symbols from /lib/tls/libpthread.so.0...done.
> > [Thread debugging using libthread_db enabled]
> > [New Thread -1220095328 (LWP 16569)]
> > [New Thread -1317291088 (LWP 16704)]
> > [New Thread -1306801232 (LWP 16703)]
> > [New Thread -1296307280 (LWP 16702)]
> > [New Thread -1285555280 (LWP 16701)]
> > [New Thread -1265304656 (LWP 16575)]
> > [New Thread -1254814800 (LWP 16574)]
> > [New Thread -1244324944 (LWP 16573)]
> > [New Thread -1233835088 (LWP 16572)]
> > [New Thread -1223345232 (LWP 16570)]
> > Loaded symbols for /lib/tls/libpthread.so.0
> > Reading symbols from /lib/tls/libc.so.6...done.
> > Loaded symbols for /lib/tls/libc.so.6
> > Reading symbols from /lib/ld-linux.so.2...done.
> > Loaded symbols for /lib/ld-linux.so.2
> > Reading symbols from /lib/libnss_files.so.2...done.
> > Loaded symbols for /lib/libnss_files.so.2
> > Reading symbols from
> > /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
> > Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
> > Reading symbols from /lib/libnss_dns.so.2...done.
> > Loaded symbols for /lib/libnss_dns.so.2
> > Reading symbols from /lib/libresolv.so.2...done.
> > Loaded symbols for /lib/libresolv.so.2
> > 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
> > (gdb) cont
> > Continuing.
> >
> > Program received signal SIGBUS, Bus error.
> > [Switching to Thread -1317291088 (LWP 16704)]
> > 0x0809a007 in hgroup_mod ()
> > (gdb) quit
> >
> >
> > Rayson Ho wrote:
> >> Use the gdb sub-command "where" to show the stack trace...
> >>
> >> Rayson
> >>
> >>
> >>
> >> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
> >>> Here is what I see when it crashes while attached to gdb:
> >>> [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
> >>> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
> >>> Copyright 2004 Free Software Foundation, Inc.
> >>> GDB is free software, covered by the GNU General Public License, and
> >>> you are
> >>> welcome to change it and/or distribute copies of it under certain
> >>> conditions.
> >>> Type "show copying" to see the conditions.
> >>> There is absolutely no warranty for GDB.  Type "show warranty" for
> >>> details.
> >>> This GDB was configured as "i386-redhat-linux-gnu"...Using host
> >>> libthread_db library "/lib/tls/libthread_db.so.1".
> >>>
> >>> Attaching to program:
> >>> /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster,
> >>> process 16569
> >>> Reading symbols from /lib/libdl.so.2...done.
> >>> Loaded symbols for /lib/libdl.so.2
> >>> Reading symbols from /lib/tls/libm.so.6...done.
> >>> Loaded symbols for /lib/tls/libm.so.6
> >>> Reading symbols from /lib/tls/libpthread.so.0...done.
> >>> [Thread debugging using libthread_db enabled]
> >>> [New Thread -1220095328 (LWP 16569)]
> >>> [New Thread -1317291088 (LWP 16704)]
> >>> [New Thread -1306801232 (LWP 16703)]
> >>> [New Thread -1296307280 (LWP 16702)]
> >>> [New Thread -1285555280 (LWP 16701)]
> >>> [New Thread -1265304656 (LWP 16575)]
> >>> [New Thread -1254814800 (LWP 16574)]
> >>> [New Thread -1244324944 (LWP 16573)]
> >>> [New Thread -1233835088 (LWP 16572)]
> >>> [New Thread -1223345232 (LWP 16570)]
> >>> Loaded symbols for /lib/tls/libpthread.so.0
> >>> Reading symbols from /lib/tls/libc.so.6...done.
> >>> Loaded symbols for /lib/tls/libc.so.6
> >>> Reading symbols from /lib/ld-linux.so.2...done.
> >>> Loaded symbols for /lib/ld-linux.so.2
> >>> Reading symbols from /lib/libnss_files.so.2...done.
> >>> Loaded symbols for /lib/libnss_files.so.2
> >>> Reading symbols from
> >>> /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
> >>> Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
> >>> Reading symbols from /lib/libnss_dns.so.2...done.
> >>> Loaded symbols for /lib/libnss_dns.so.2
> >>> Reading symbols from /lib/libresolv.so.2...done.
> >>> Loaded symbols for /lib/libresolv.so.2
> >>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
> >>> (gdb) cont
> >>> Continuing.
> >>>
> >>> Program received signal SIGBUS, Bus error.
> >>> [Switching to Thread -1317291088 (LWP 16704)]
> >>> 0x0809a007 in hgroup_mod ()
> >>> (gdb) quit
> >>>
> >>>
> >>>
> >>> Rayson Ho wrote:
> >>> > Can you attach qmaster with a debugger, so that we can get the stack
> >>> > trace when it dies??
> >>> >
> >>> > Rayson
> >>> >
> >>> >
> >>> >
> >>> > On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
> >>> >> Hi,
> >>> >>
> >>> >> I am running SGE 6.0u4 on RHEL3 and it's been running OK for a year
> >>> >> or so.
> >>> >> Last week I tried qconf -mhgrp, and this command repeatedly kills all
> >>> >> the SGE processes on the headnode. I connected with strace to the
> >>> >> sgeadmin before it died and I only see:
> >>> >> Process 16727 attached - interrupt to quit
> >>> >> futex(0xb03bebf8, FUTEX_WAIT, 16822, NULL) = -1 EINTR (Interrupted
> >>> >> system call)
> >>> >> +++ killed by SIGBUS +++
> >>> >>
> >>> >> Nothing exciting in the logs, it's just going about its business...
> >>> >>
> >>> >> Suggestions on how to approach this problem would be appreciated...
> >>> >>
> >>> >> Thank You,
> >>> >>
> >>> >> iwona
> >>> >>
> >>> >>
> >>> ---------------------------------------------------------------------
> >>> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >>> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>> >>
> >>> >>
> >>> >
>
>




