[GE users] Qmaster seg faulting after 6.1u2 upgrade

Heywood, Todd heywood at cshl.edu
Mon Aug 27 17:31:44 BST 2007


Actually, I was upgrading to 6.1u2 because usersets were being lost when
SGE was stopped/restarted, which I figured was due to the 6.1 issue:

2050     6422335   still used usersets/project/calendar/pe/checkpoint can be
removed under certain conditions

I luckily had a backup of the entire SGEROOT directory, and am back to
running 6.1. Sure enough, when I restarted SGE, the usersets were gone, and
I had to redefine them.

One thing to watch out for when restoring the config from backup (inst_sge
-rst) is that the backup accounting file overwrites the current accounting
file, so you can lose accounting info.
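
A minimal sketch of one way to guard against that: copy the live accounting
file somewhere outside the SGE tree before running the restore. The function
name, the destination directory, and the $SGE_ROOT/default/common/accounting
path in the example call are assumptions for illustration; check where your
installation actually keeps its accounting file.

```shell
#!/bin/sh
# Sketch only: keep a timestamped copy of the live accounting file
# before restoring a configuration backup over it.

backup_accounting() {
    acct_file=$1      # live accounting file
    dest_dir=$2       # directory (ideally outside SGE_ROOT) for the copy

    if [ ! -f "$acct_file" ]; then
        echo "no accounting file at $acct_file" >&2
        return 1
    fi

    mkdir -p "$dest_dir" || return 1
    stamp=$(date +%Y%m%d-%H%M%S)
    # -p preserves timestamps and permissions on the copy
    cp -p "$acct_file" "$dest_dir/accounting.$stamp" || return 1
    # Print the path of the copy so the caller can record it
    echo "$dest_dir/accounting.$stamp"
}

# Example call (hypothetical paths, adjust to your cell name):
# backup_accounting "$SGE_ROOT/default/common/accounting" /var/tmp/sge-acct-backup
```

After the restore, the saved copy can be compared against the restored file
to see which accounting records would otherwise have been lost.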

Anyways, it seems I can't upgrade to 6.1u2 because qmaster then seg faults
(AMD64/EM64T, kernel 2.4, 2.6, glibc >= 2.3.2).

Todd


On 8/27/07 11:46 AM, "david zanella" <zanella at mayo.edu> wrote:

> I've had something similar happen twice.
> 
> First time I lost a queue, tried applying patches and qmaster refused to
> start. When it came up, it had lost most of the queues and hostgroups.
> 
> It happened again when the qmaster crashed one weekend and refused to start.
> Tracked it down to (at least) one old/defunct host in the config. I went into
> the qmaster config files and manually removed all references to old/defunct
> hosts. Kept restarting qmaster and watching the messages file (look closely).
> Kept cleaning things up manually until it eventually started. I seem to
> remember there were some discrepancies between hostnames and FQDNs that it
> didn't like.
> 
> Used a combination of find and rgrep to ferret out old/bad stuff.
> 
> Not for the faint of heart...and make sure you save a backup of all the config
> files somewhere...
> 
> 
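
The find-and-grep hunt described above could look roughly like the sketch
below. It only lists files that mention a given hostname and changes nothing.
The function name and the example paths are assumptions, and it only applies
if the qmaster configuration is spooled as plain text files; -w is used so
that a short name does not also match longer names that merely start with it.

```shell
#!/bin/sh
# Sketch only: list files under the given directories that mention a host.

find_host_refs() {
    host=$1
    shift
    # -l: print file names only, -w: whole-word match, -F: fixed string.
    # grep exits non-zero when nothing matches; that is not an error here.
    find "$@" -type f -exec grep -lwF -- "$host" {} + || true
}

# Example call (hypothetical host name and paths):
# find_host_refs oldnode1 "$SGE_ROOT/default/spool/qmaster" "$SGE_ROOT/default/common"
```

Running it once with the short hostname and once with the FQDN helps surface
the kind of hostname/FQDN mismatch mentioned above.
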
>> I just installed the 6.1u2 patch to our SGE 6.1 installation, and SGE will
>> not start up:
>> 
>> [root at bhmnode2 n1ge6]# default/common/sgemaster
>>    starting sge_qmaster
>> 
>> sge_qmaster didn't start!
>> Please check the messages file
>> 
>>    starting sge_schedd
>> error: commlib error: can't connect to service (Connection refused)
>> error: getting configuration: unable to contact qmaster using port 6444 on
>> host "bhmnode2"
>> error: can't get configuration from qmaster -- backgrounding
>> [root at bhmnode2 n1ge6]#
>> 
>> 
>> There is nothing in the .../spool/qmaster/messages file after the messages
>> about shutting down 6.1 prior to the upgrade.
>> 
>> However, /var/log/messages says sge_qmaster is seg faulting:
>> 
>> 
>> Aug 27 10:18:50 bhmnode2 kernel: sge_qmaster[14591]: segfault at
>> 0000038700000384 rip 00000039fa471d23 rsp 0000007fbfffd520 error 4
>> 
>> 
>> I tried restoring from backup, and the backup also gives the same seg
>> faulting behavior now!
>> 
>> Any ideas/help greatly appreciated (ASAP).
>> 
>> Thanks,
>> 
>> Todd
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
> 



