[GE users] sge master dying

Iwona Sakrejda isakrejda at lbl.gov
Mon Jun 18 21:20:33 BST 2007



Hi,

Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
>
> you should upgrade to the most recent 6.0u10 patch version. Since 
> 6.0u4 there were many changes in the source module
>
>    daemons/qmaster/sge_hgroup_qmaster.c
>
> where the crashing hgroup_mod() is implemented.
I have scheduled a maintenance period for the upgrade and we are going to do
it, but I used "qconf -mhgrp <group name>" many times before with 6.0u4 and
it went through OK.

Could you shed some light on how this works, or point me towards some reading?
I am guessing that qconf on an admin node creates a temporary file (where?)
and that sge_qmaster then tries to read it. Who creates the new version in
the hostgroups subdirectory?
We rearranged the filesystems lately, so I hope that if I trace through all
the permissions, I can find the problem.
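
The permission trace mentioned above could be sketched roughly as follows.
This is only a minimal sketch, assuming flat-file spooling with the host group
files kept somewhere under a qmaster spool directory. The function names
report_permissions and not_writable_by are hypothetical helpers, not part of
SGE, and the directory to inspect has to be supplied for the local
installation.

```python
import os
import stat


def report_permissions(root):
    """Return (path, mode, uid, gid) for root and everything below it.

    The mode is rendered the way `ls -l` shows it (e.g. "drwxr-xr-x").
    """
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sort in place so the walk order (and the report) is stable.
        dirnames.sort()
        # Report the directory itself, then the files directly inside it.
        paths = [dirpath] + [os.path.join(dirpath, f) for f in sorted(filenames)]
        for path in paths:
            info = os.lstat(path)  # lstat: do not follow symlinks
            entries.append(
                (path, stat.filemode(info.st_mode), info.st_uid, info.st_gid)
            )
    return entries


def not_writable_by(entries, uid):
    """Return paths not owned by uid or lacking the owner-write bit.

    This is a coarse check: it ignores group and other write bits.
    """
    return [path for path, mode, owner, _gid in entries
            if owner != uid or mode[2] != "w"]
```

Calling not_writable_by(report_permissions(spool_dir), admin_uid), with the
spool directory and the numeric uid of the SGE admin user filled in, would list
the entries that this user does not own or cannot write as owner, which is
where a filesystem rearrangement might have changed something.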

And we will upgrade, but it would be good to understand what happened.

Thank You,

iwona


>
> Regards,
> Andreas
>
>
> On Thu, 14 Jun 2007, Iwona Sakrejda wrote:
>
>> Here it includes the thread info...
>>
>>
>> [root@pc2533 root]# gdb -p 17325
>> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>> Copyright 2004 Free Software Foundation, Inc.
>> GDB is free software, covered by the GNU General Public License, and you are
>> welcome to change it and/or distribute copies of it under certain conditions.
>> Type "show copying" to see the conditions.
>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>> This GDB was configured as "i386-redhat-linux-gnu".
>> Attaching to process 17325
>> Reading symbols from /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster...done.
>> Using host libthread_db library "/lib/tls/libthread_db.so.1".
>> Reading symbols from /lib/libdl.so.2...done.
>> Loaded symbols for /lib/libdl.so.2
>> Reading symbols from /lib/tls/libm.so.6...done.
>> Loaded symbols for /lib/tls/libm.so.6
>> Reading symbols from /lib/tls/libpthread.so.0...done.
>> [Thread debugging using libthread_db enabled]
>> [New Thread -1220095328 (LWP 17325)]
>> [New Thread -1364210768 (LWP 17633)]
>> [New Thread -1353716816 (LWP 17632)]
>> [New Thread -1343226960 (LWP 17631)]
>> [New Thread -1330644048 (LWP 17630)]
>> [New Thread -1265304656 (LWP 17332)]
>> [New Thread -1254814800 (LWP 17331)]
>> [New Thread -1244324944 (LWP 17330)]
>> [New Thread -1233835088 (LWP 17329)]
>> [New Thread -1223345232 (LWP 17328)]
>> Loaded symbols for /lib/tls/libpthread.so.0
>> Reading symbols from /lib/tls/libc.so.6...done.
>> Loaded symbols for /lib/tls/libc.so.6
>> Reading symbols from /lib/ld-linux.so.2...done.
>> Loaded symbols for /lib/ld-linux.so.2
>> Reading symbols from /lib/libnss_files.so.2...done.
>> Loaded symbols for /lib/libnss_files.so.2
>> Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>> Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
>> Reading symbols from /lib/libnss_dns.so.2...done.
>> Loaded symbols for /lib/libnss_dns.so.2
>> Reading symbols from /lib/libresolv.so.2...done.
>> Loaded symbols for /lib/libresolv.so.2
>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>> (gdb) cont
>> Continuing.
>>
>> Program received signal SIGBUS, Bus error.
>> [Switching to Thread -1353716816 (LWP 17632)]
>> 0x0809a007 in hgroup_mod ()
>> (gdb) where
>> #0  0x0809a007 in hgroup_mod ()
>> #1  0x0806d563 in sge_gdi_add_mod_generic ()
>> #2  0x0806bb40 in sge_c_gdi_mod ()
>> #3  0x08068cf4 in sge_c_gdi ()
>> #4  0x080a941c in do_gdi_request ()
>> #5  0x080a9239 in sge_qmaster_process_message ()
>> #6  0x0806710c in message_thread ()
>> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
>> #8  0xb754ad2a in clone () from /lib/tls/libc.so.6
>> (gdb) info threads
>> 10 Thread -1223345232 (LWP 17328)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>> 9 Thread -1233835088 (LWP 17329)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>> 8 Thread -1244324944 (LWP 17330)  0xb75018c1 in gettimeofday () from /lib/tls/libc.so.6
>> 7 Thread -1254814800 (LWP 17331)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> 6 Thread -1265304656 (LWP 17332)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>> 5 Thread -1330644048 (LWP 17630)  0xb75ae59b in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
>> 4 Thread -1343226960 (LWP 17631)  0xb75b1c84 in sigwait () from /lib/tls/libpthread.so.0
>> * 3 Thread -1353716816 (LWP 17632)  0x0809a007 in hgroup_mod ()
>> 2 Thread -1364210768 (LWP 17633)  0xb75b0939 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
>> 1 Thread -1220095328 (LWP 17325)  0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>> (gdb) quit
>> The program is running.  Quit anyway (and detach it)? (y or n) y
>> Detaching from program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 17325
>>
>>
>> Rayson Ho wrote:
>>> Hi,
>>>
>>> I almost forgot that qmaster is threaded... can you use the gdb
>>> sub-command "info threads" to display the status of all threads??
>>>
>>> Rayson
>>>
>>>
>>>
>>> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>> Sorry again, wrong paste, I hope this has what you need...
>>>>
>>>>
>>>>
>>>> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919
>>>> Reading symbols from /lib/libdl.so.2...done.
>>>> Loaded symbols for /lib/libdl.so.2
>>>> Reading symbols from /lib/tls/libm.so.6...done.
>>>> Loaded symbols for /lib/tls/libm.so.6
>>>> Reading symbols from /lib/tls/libpthread.so.0...done.
>>>> [Thread debugging using libthread_db enabled]
>>>> [New Thread -1220095328 (LWP 16919)]
>>>> [New Thread -1317024848 (LWP 17016)]
>>>> [New Thread -1306534992 (LWP 17015)]
>>>> [New Thread -1296041040 (LWP 17014)]
>>>> [New Thread -1283458128 (LWP 17013)]
>>>> [New Thread -1265304656 (LWP 16926)]
>>>> [New Thread -1254814800 (LWP 16925)]
>>>> [New Thread -1244324944 (LWP 16924)]
>>>> [New Thread -1233835088 (LWP 16923)]
>>>> [New Thread -1223345232 (LWP 16922)]
>>>> Loaded symbols for /lib/tls/libpthread.so.0
>>>> Reading symbols from /lib/tls/libc.so.6...done.
>>>> Loaded symbols for /lib/tls/libc.so.6
>>>> Reading symbols from /lib/ld-linux.so.2...done.
>>>> Loaded symbols for /lib/ld-linux.so.2
>>>> Reading symbols from /lib/libnss_files.so.2...done.
>>>> Loaded symbols for /lib/libnss_files.so.2
>>>> Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>>> Loaded symbols for /common/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>>> Reading symbols from /lib/libnss_dns.so.2...done.
>>>> Loaded symbols for /lib/libnss_dns.so.2
>>>> Reading symbols from /lib/libresolv.so.2...done.
>>>> Loaded symbols for /lib/libresolv.so.2
>>>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>>> (gdb) cont
>>>> Continuing.
>>>>
>>>> Program received signal SIGBUS, Bus error.
>>>> [Switching to Thread -1317024848 (LWP 17016)]
>>>> 0x0809a007 in hgroup_mod ()
>>>> (gdb) where
>>>> #0  0x0809a007 in hgroup_mod ()
>>>> #1  0x0806d563 in sge_gdi_add_mod_generic ()
>>>> #2  0x0806bb40 in sge_c_gdi_mod ()
>>>> #3  0x08068cf4 in sge_c_gdi ()
>>>> #4  0x080a941c in do_gdi_request ()
>>>> #5  0x080a9239 in sge_qmaster_process_message ()
>>>> #6  0x0806710c in message_thread ()
>>>> #7  0xb75abdd8 in start_thread () from /lib/tls/libpthread.so.0
>>>> #8  0xb754ad2a in clone () from /lib/tls/libc.so.6
>>>> (gdb) quit
>>>> The program is running.  Quit anyway (and detach it)? (y or n) y
>>>> Detaching from program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16919
>>>>
>>>>
>>>> Iwona Sakrejda wrote:
>>>> > Sorry, been a while since I did active development and debugging.
>>>> > Here it is (and actually I can do qconf for users and queues,
>>>> > just this qconf for hostgroups is giving me grief...)
>>>> >
>>>> >
>>>> > iwona
>>>> >
>>>> >
>>>> > [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
>>>> > GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>>>> > Copyright 2004 Free Software Foundation, Inc.
>>>> > GDB is free software, covered by the GNU General Public License, and you are
>>>> > welcome to change it and/or distribute copies of it under certain conditions.
>>>> > Type "show copying" to see the conditions.
>>>> > There is absolutely no warranty for GDB.  Type "show warranty" for details.
>>>> > This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
>>>> >
>>>> > Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
>>>> > Reading symbols from /lib/libdl.so.2...done.
>>>> > Loaded symbols for /lib/libdl.so.2
>>>> > Reading symbols from /lib/tls/libm.so.6...done.
>>>> > Loaded symbols for /lib/tls/libm.so.6
>>>> > Reading symbols from /lib/tls/libpthread.so.0...done.
>>>> > [Thread debugging using libthread_db enabled]
>>>> > [New Thread -1220095328 (LWP 16569)]
>>>> > [New Thread -1317291088 (LWP 16704)]
>>>> > [New Thread -1306801232 (LWP 16703)]
>>>> > [New Thread -1296307280 (LWP 16702)]
>>>> > [New Thread -1285555280 (LWP 16701)]
>>>> > [New Thread -1265304656 (LWP 16575)]
>>>> > [New Thread -1254814800 (LWP 16574)]
>>>> > [New Thread -1244324944 (LWP 16573)]
>>>> > [New Thread -1233835088 (LWP 16572)]
>>>> > [New Thread -1223345232 (LWP 16570)]
>>>> > Loaded symbols for /lib/tls/libpthread.so.0
>>>> > Reading symbols from /lib/tls/libc.so.6...done.
>>>> > Loaded symbols for /lib/tls/libc.so.6
>>>> > Reading symbols from /lib/ld-linux.so.2...done.
>>>> > Loaded symbols for /lib/ld-linux.so.2
>>>> > Reading symbols from /lib/libnss_files.so.2...done.
>>>> > Loaded symbols for /lib/libnss_files.so.2
>>>> > Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>>> > Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>>> > Reading symbols from /lib/libnss_dns.so.2...done.
>>>> > Loaded symbols for /lib/libnss_dns.so.2
>>>> > Reading symbols from /lib/libresolv.so.2...done.
>>>> > Loaded symbols for /lib/libresolv.so.2
>>>> > 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>>> > (gdb) cont
>>>> > Continuing.
>>>> >
>>>> > Program received signal SIGBUS, Bus error.
>>>> > [Switching to Thread -1317291088 (LWP 16704)]
>>>> > 0x0809a007 in hgroup_mod ()
>>>> > (gdb) quit
>>>> >
>>>> >
>>>> > Rayson Ho wrote:
>>>> >> Use the gdb sub-command "where" to show the stack trace...
>>>> >>
>>>> >> Rayson
>>>> >>
>>>> >>
>>>> >>
>>>> >> On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>> >>> Here is what I see when it crashes while attached to gdb:
>>>> >>> [root@pc2533 debug]# gdb /common/sge/6.0u4/bin/lx24-x86/sge_qmaster 16569
>>>> >>> GNU gdb Red Hat Linux (6.1post-1.20040607.17rh)
>>>> >>> Copyright 2004 Free Software Foundation, Inc.
>>>> >>> GDB is free software, covered by the GNU General Public License, and you are
>>>> >>> welcome to change it and/or distribute copies of it under certain conditions.
>>>> >>> Type "show copying" to see the conditions.
>>>> >>> There is absolutely no warranty for GDB.  Type "show warranty" for details.
>>>> >>> This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".
>>>> >>>
>>>> >>> Attaching to program: /chos/software/sge/6.0u4/bin/lx24-x86/sge_qmaster, process 16569
>>>> >>> Reading symbols from /lib/libdl.so.2...done.
>>>> >>> Loaded symbols for /lib/libdl.so.2
>>>> >>> Reading symbols from /lib/tls/libm.so.6...done.
>>>> >>> Loaded symbols for /lib/tls/libm.so.6
>>>> >>> Reading symbols from /lib/tls/libpthread.so.0...done.
>>>> >>> [Thread debugging using libthread_db enabled]
>>>> >>> [New Thread -1220095328 (LWP 16569)]
>>>> >>> [New Thread -1317291088 (LWP 16704)]
>>>> >>> [New Thread -1306801232 (LWP 16703)]
>>>> >>> [New Thread -1296307280 (LWP 16702)]
>>>> >>> [New Thread -1285555280 (LWP 16701)]
>>>> >>> [New Thread -1265304656 (LWP 16575)]
>>>> >>> [New Thread -1254814800 (LWP 16574)]
>>>> >>> [New Thread -1244324944 (LWP 16573)]
>>>> >>> [New Thread -1233835088 (LWP 16572)]
>>>> >>> [New Thread -1223345232 (LWP 16570)]
>>>> >>> Loaded symbols for /lib/tls/libpthread.so.0
>>>> >>> Reading symbols from /lib/tls/libc.so.6...done.
>>>> >>> Loaded symbols for /lib/tls/libc.so.6
>>>> >>> Reading symbols from /lib/ld-linux.so.2...done.
>>>> >>> Loaded symbols for /lib/ld-linux.so.2
>>>> >>> Reading symbols from /lib/libnss_files.so.2...done.
>>>> >>> Loaded symbols for /lib/libnss_files.so.2
>>>> >>> Reading symbols from /chos/software/sge/6.0u4/lib/lx24-x86/libspoolc.so...done.
>>>> >>> Loaded symbols for /software/sge/6.0u4/lib/lx24-x86/libspoolc.so
>>>> >>> Reading symbols from /lib/libnss_dns.so.2...done.
>>>> >>> Loaded symbols for /lib/libnss_dns.so.2
>>>> >>> Reading symbols from /lib/libresolv.so.2...done.
>>>> >>> Loaded symbols for /lib/libresolv.so.2
>>>> >>> 0xb75acd58 in pthread_join () from /lib/tls/libpthread.so.0
>>>> >>> (gdb) cont
>>>> >>> Continuing.
>>>> >>>
>>>> >>> Program received signal SIGBUS, Bus error.
>>>> >>> [Switching to Thread -1317291088 (LWP 16704)]
>>>> >>> 0x0809a007 in hgroup_mod ()
>>>> >>> (gdb) quit
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> Rayson Ho wrote:
>>>> >>> > Can you attach a debugger to qmaster, so that we can get the stack
>>>> >>> > trace when it dies?
>>>> >>> >
>>>> >>> > Rayson
>>>> >>> >
>>>> >>> >
>>>> >>> >
>>>> >>> > On 6/13/07, Iwona Sakrejda <isakrejda at lbl.gov> wrote:
>>>> >>> >> Hi,
>>>> >>> >>
>>>> >>> >> I am running SGE 6.0u4 on RHEL3 and it has been running OK for a year
>>>> >>> >> or so.
>>>> >>> >> Last week I tried qconf -mhgrp, and this command repeatedly kills all the
>>>> >>> >> SGE processes on the head node. I connected with strace to the sgeadmin
>>>> >>> >> before it died and I only see:
>>>> >>> >> Process 16727 attached - interrupt to quit
>>>> >>> >> futex(0xb03bebf8, FUTEX_WAIT, 16822, NULL) = -1 EINTR (Interrupted system call)
>>>> >>> >> +++ killed by SIGBUS +++
>>>> >>> >>
>>>> >>> >> Nothing exciting in the logs, it's just going about its business...
>>>> >>> >>
>>>> >>> >> Suggestions on how to approach this problem would be appreciated...
>>>> >>> >>
>>>> >>> >> Thank You,
>>>> >>> >>
>>>> >>> >> iwona
>>>> >>> >>
>>>> >>> >>
>>>> >>> >> ---------------------------------------------------------------------
>>>> >>> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>> >>> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>> >>> >>
>>>> >>> >>
> http://gridengine.info/
>
> Registered office: Sun Microsystems GmbH, Sonnenallee 1, D-85551
> Kirchheim-Heimstetten
> District Court of Munich (Amtsgericht Muenchen): HRB 161028
> Managing directors: Marcel Schneider, Wolfgang Engels, Dr. Roland Boemer
> Chairman of the Supervisory Board: Martin Haering
