[GE users] Qmaster seg faulting after 6.1u2 upgrade

Roland Dittel Roland.Dittel at Sun.COM
Tue Aug 28 08:07:50 BST 2007



Hi Todd,

Can you please start the qmaster in debug mode and send me the output?
To debug our binaries, source $SGE_ROOT/util/dl.sh or
$SGE_ROOT/util/dl.csh, depending on your shell, and set
LD_LIBRARY_PATH to "$SGE_ROOT/lx24-amd64/lib". Then run "dl 2" and
start the qmaster manually with $SGE_ROOT/bin/lx24-amd64/sge_qmaster.
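
For a Bourne-style shell, the steps above could be wrapped roughly as
follows. This is only a sketch: the function name run_qmaster_debug and
the default log path are made up for illustration, and it assumes
SGE_ROOT is already set and that dl.sh defines the "dl" function.

```shell
#!/bin/sh
# Sketch only: run_qmaster_debug is a hypothetical helper, not part of SGE.
# Assumes a Bourne-compatible shell and that SGE_ROOT points at the installation.
run_qmaster_debug() {
    arch="lx24-amd64"                       # architecture directory named in the message
    logfile="${1:-/tmp/qmaster-debug.log}"  # hypothetical log location

    if [ ! -r "$SGE_ROOT/util/dl.sh" ]; then
        echo "cannot read $SGE_ROOT/util/dl.sh" >&2
        return 1
    fi

    . "$SGE_ROOT/util/dl.sh"                # expected to define the dl function
    LD_LIBRARY_PATH="$SGE_ROOT/$arch/lib"
    export LD_LIBRARY_PATH
    dl 2                                    # debug level 2, as requested

    # Capture stdout and stderr in one file so the output can be mailed back.
    "$SGE_ROOT/bin/$arch/sge_qmaster" > "$logfile" 2>&1
}
```

Calling run_qmaster_debug with no argument writes to the default log
path; pass a different file name as the first argument if /tmp is not
suitable. csh/tcsh users would source dl.csh by hand instead.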

Thanks
Roland

Heywood, Todd wrote:
> Actually, I was upgrading to 6.1u2 because usersets were being lost when
> SGE was stopped/restarted, which I figured was due to the 6.1 issue:
> 
> 2050     6422335   still used usersets/project/calendar/pe/checkpoint can be
> removed under certain conditions
> 
> I luckily had a backup of the entire SGE_ROOT directory, and am back to
> running 6.1. Sure enough, when I restarted SGE, the usersets were gone, and
> I had to redefine them.
> 
> One thing to watch out for when restoring the config from backup (inst_sge
> -rst) is that the backup accounting file overwrites the current accounting
> file, so you can lose accounting info.
> 
> Anyways, it seems I can't upgrade to 6.1u2 because qmaster then seg faults
> (AMD64/EM64T, kernel 2.4, 2.6, glibc >= 2.3.2).
> 
> Todd
> 
> 
> On 8/27/07 11:46 AM, "david zanella" <zanella at mayo.edu> wrote:
> 
>> I've had something similar happen twice.
>>
>> First time I lost a queue, tried applying patches and qmaster refused to
>> start. When it came up, it had lost most of the queues and hostgroups.
>>
>> It happened again where the qmaster crashed one weekend and refused to start.
>> Tracked it down to (at least) one old/defunct host in the config. I went into
>> the qmaster config files and manually removed all references to old/defunct
>> hosts. Kept restarting qmaster and watching the messages file (look closely).
>> Kept cleaning things up manually until it eventually started. I seem to
>> remember there were some discrepancies between hostnames and FQDNs that it
>> didn't like.
>>
>> Used a combination of find and rgrep to ferret out old/bad stuff.
>>
>> Not for the faint of heart...and make sure you save a backup of all the config
>> files somewhere...
>>
>>
>>> I just installed the 6.1u2 patch to our SGE 6.1 installation, and SGE will
>>> not start up:
>>>
>>> [root@bhmnode2 n1ge6]# default/common/sgemaster
>>>    starting sge_qmaster
>>>
>>> sge_qmaster didn't start!
>>> Please check the messages file
>>>
>>>    starting sge_schedd
>>> error: commlib error: can't connect to service (Connection refused)
>>> error: getting configuration: unable to contact qmaster using port 6444 on
>>> host "bhmnode2"
>>> error: can't get configuration from qmaster -- backgrounding
>>> [root@bhmnode2 n1ge6]#
>>>
>>>
>>> There is nothing in the .../spool/qmaster/messages file after the messages
>>> about shutting down 6.1 prior to the upgrade.
>>>
>>> However, /var/log/messages says sge_qmaster is seg faulting:
>>>
>>>
>>> Aug 27 10:18:50 bhmnode2 kernel: sge_qmaster[14591]: segfault at
>>> 0000038700000384 rip 00000039fa471d23 rsp 0000007fbfffd520 error 4
>>>
>>>
>>> I tried restoring from backup, and the backup also gives the same seg
>>> faulting behavior now!
>>>
>>> Any ideas/help greatly appreciated (ASAP).
>>>
>>> Thanks,
>>>
>>> Todd
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net


-- 
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Roland Dittel               Tel: +49 (0)941 3075-275 (x60275)
Software Engineering        Fax: +49 (0)941 3075-222 (x60222)
Sun Microsystems GmbH
Dr.-Leo-Ritter-Str. 7       mailto:roland.dittel at sun.com
D-93049 Regensburg          http://www.sun.com/gridware
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -




