[GE users] 6.0u4 qmaster crashing

Andy Schwierskott andy.schwierskott at sun.com
Fri Dec 16 15:11:31 GMT 2005


Mike,

unless you are on Solaris and use "coreadm", you don't get a core dump
if qmaster is started by user root and you have configured an admin user
(effectively, the OS treats qmaster like a suid-root binary and thus no
cores are created).

If you start qmaster and scheduler under the admin user account (for these
two daemons this is possible without any loss of functionality), the OS
will create a core dump (of course the core size limit needs to be well
above 0, i.e. large enough to hold the core file).
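
As a rough sketch of the two conditions above (not part of Grid Engine; the
function name core_dump_possible is made up for illustration), a small shell
check could look like the following. It assumes a shell whose "ulimit -c"
reports the core size limit, and it only tests for a limit of 0, not whether
the limit is large enough for a full qmaster core. It also ignores the
Solaris "coreadm" exception.

```shell
#!/bin/sh
# Hypothetical helper: report whether a daemon started from this shell
# could leave a core file, based on the two conditions described above.
#   $1: core size limit as printed by `ulimit -c` (a number or "unlimited")
#   $2: numeric id of the user starting the daemon
core_dump_possible() {
    limit=$1
    uid=$2
    if [ "$limit" = "0" ]; then
        echo "no: core size limit is 0"
        return 1
    fi
    if [ "$uid" = "0" ]; then
        # Started as root with an admin user configured: the OS treats
        # the process like a suid-root binary and writes no core.
        echo "no: started as root"
        return 1
    fi
    echo "yes"
    return 0
}

# Check the current shell; "|| true" keeps a negative answer from
# aborting a calling script that runs with "set -e".
core_dump_possible "$(ulimit -c)" "$(id -u)" || true
```

Run it in the shell of the admin user before starting the daemons; "yes"
only means that neither of the two obstacles above applies.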

The qmaster core dump will be in <qmaster_spool_dir>.
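
To pick up such a core file and print the stack trace in one step, something
along these lines may help. This is only a sketch: the helper latest_core and
the variables QMASTER_SPOOL_DIR and SGE_QMASTER_BIN are hypothetical names
that you would set yourself, and the file name pattern "core*" is an
assumption, since the actual core file name depends on the OS settings.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified file matching
# "core*" in the given directory, or nothing if there is none.
#   $1: directory to search (e.g. the qmaster spool directory)
latest_core() {
    ls -t "$1"/core* 2>/dev/null | head -n 1
}

# Only run gdb if both (assumed) variables are set:
#   QMASTER_SPOOL_DIR  the qmaster spool directory
#   SGE_QMASTER_BIN    full path of the sge_qmaster binary
if [ -n "${QMASTER_SPOOL_DIR:-}" ] && [ -n "${SGE_QMASTER_BIN:-}" ]; then
    core=$(latest_core "$QMASTER_SPOOL_DIR")
    if [ -n "$core" ]; then
        # -batch exits after the commands given with -ex have run;
        # "where" prints the stack trace of the crashed process.
        gdb -batch -ex where "$SGE_QMASTER_BIN" "$core"
    else
        echo "no core file found in $QMASTER_SPOOL_DIR"
    fi
fi
```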

Andy

> Mike Brown wrote:
>> I'm wondering what the best way to debug the qmaster is.  It seems that
>> when I try to set the debug level, only the qmaster (but not schedd) starts.
>> Are there any other suggestions besides reading this output?  I'm running 
>> 6.0u4 and noticing the qmaster dying every few days.  Nothing serious 
>> appears in the schedd or qmaster output.  I've upgraded to 6.0u7 in case 
>> that will fix anything, but may still need to debug.
>
> o Have you tried setting the debug level with the util/dl.csh or
>  util/dl.sh script?  This script sets the SGE_DEBUG_LEVEL variable.
>  If it is set, the qmaster does not daemonize and, depending on the debug
>  level, a lot of information is printed to stdout.
> Example:
>
> # qconf -km
> # source util/dl.csh 1
> # dl 1
> # echo $SGE_DEBUG_LEVEL
> 2 0 0 0 0 0 0 0
> # default/common/sgeqmaster
>   starting sge_qmaster
>     0  30641 16384     ****** starting localization procedure ... **********
>     1  30641 16384     could not get environment variable "GRIDPACKAGE"
>     2  30641 16384     could not get environment variable "GRIDLOCALEDIR"
>     3  30641 16384     environment LANGUAGE or LANG is not set; no language selected - using defaults
>     4  30641 16384     setlocale() returns "C"
>     5  30641 16384     locale directory: >/tools/testsuite/sge/locale<
>     6  30641 16384     package file:     >lx26-x86/gridengine.mo<
>     7  30641 16384     language (LANG):  >C<
> ...
>
>
> o Do you have a core file?
>
> If you have a core file you can use a debugger like gdb to find out in what 
> function the qmaster terminates:
>
> # gdb $SGE_ROOT/bin/lx24-x86/sge_qmaster core
> (gdb) where
> .. stacktrace will be printed.
>
> You can send us the stacktrace for further diagnosis.
>
> Richard
>
>
>> Thanks!
>> 
>> Mike
>> 
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>> 
>
>
>
