[GE users] sge master dying

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Fri Jun 22 14:33:13 BST 2007


On Thu, 21 Jun 2007, Iwona Sakrejda wrote:

> Hi,
>
> please see my replies below....
>
>
> Andreas.Haas at Sun.COM wrote:
>>  ... actually, to be entirely sure, you could pass each hostname to
>>
>>    $SGE_ROOT/utilbin/<arch>/gethostbyname
>> 
>> to check whether the C library call gethostbyname() can also cope with them. 
>> If that works fine, we can be sure the problem is not so trivial ...
> I checked all the hosts in all the hostgroups that way and they resolve fine. 
> Here is how I did it:
> cat @athlon03|grep hostlist|sed 's/ /\n/'g|grep pc|awk '{print " /common/sge/6.0u4/utilbin/lx24-x86/gethostbyname "$1}'|sh
> and so on, group by group, which I hope rules out possible typos (all my 
> host names start with "pc").
>
> For each host I get something like this:
>
> Host Address(es): 128.55.37.73
> Hostname: pc2834.nersc.gov
> Aliases:  pc2834

Ok.
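
For anyone repeating this check, the one-liner above can be turned into a
small reusable helper. The following is only a sketch under a few
assumptions: the function names hgroup_hosts and check_hgroup are made up
for illustration, an empty hostgroup is assumed to be stored as
"hostlist NONE", a continued hostlist is assumed to end in a separate
trailing backslash (as in the qconf output), and the gethostbyname utility
is assumed to return a non-zero exit status when a name does not resolve.

```shell
#!/bin/sh
# Print one host name per line from a hostgroup definition read on stdin.
# Assumption: continuation lines are marked by a trailing " \" field.
hgroup_hosts() {
    awk '
        /^hostlist/ { in_list = 1; start = 2 }   # skip the "hostlist" keyword
        in_list {
            cont = ($NF == "\\")                 # list continues on next line
            n = cont ? NF - 1 : NF
            for (i = start; i <= n; i++)
                if ($i != "NONE") print $i       # NONE = empty group (assumed)
            start = 1
            if (!cont) in_list = 0
        }
    '
}

# Run a resolver command on every host of a hostgroup file.
# $1: hostgroup file, remaining arguments: resolver command.
# Assumption: the resolver exits non-zero if the name cannot be resolved.
check_hgroup() {
    file=$1
    shift
    hgroup_hosts < "$file" | while read -r host; do
        "$@" "$host" </dev/null >/dev/null 2>&1 || echo "FAILED: $host"
    done
}
```

Called as

    check_hgroup @debug /common/sge/6.0u4/utilbin/lx24-x86/gethostbyname

from the hostgroups spool directory, this should print one FAILED line per
host that does not resolve and nothing if all of them do.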

>
> Some of my hostgroups (not the ones that I attempted to modify) are empty and 
> that's ok, right?

Shouldn't be a problem.

>> Actually I'm curious to see the host group before the 'qconf -mhgrp' change
>> and the new host group configuration. 
> I tried two different hostgroups on 4 or so occasions and the master always 
> crashed.
> I did not experiment more because that upsets users.
>
> The hostgroup I was trying to modify looks as follows:
>
> [root at pc2533 hostgroups]# cat @debug
> # Version: SGE 6.0u4
> #
> # DO NOT MODIFY THIS FILE MANUALLY!
> #
> group_name  @debug
> hostlist    pc2632.nersc.gov pc2104.nersc.gov pc0920.nersc.gov pc0922.nersc.gov pc0928.nersc.gov
>
> and qconf -shgrp @debug shows it as:
>
> pc2609 74% qconf -shgrp @debug
> group_name @debug
> hostlist pc2632.nersc.gov pc2104.nersc.gov pc0920.nersc.gov pc0922.nersc.gov \
>        pc0928.nersc.gov
>
> I was trying to add a space and pc2302.nersc.gov in the line above.

Strange. I had expected a far less trivial configuration, such as nested 
hostgroups.

>
> Actually when I made a typo (a "," instead of " " for the separator) I got
> a message about the problem and the master survived that without any 
> problems.

Unlike adding a host, syntax errors can already be caught on the client side.

>> I guess you are using that host group in a cluster queue configuration.
> yes.

It could be that the issue with your change is that a new queue instance object
for host pc2302.nersc.gov must be created by qmaster upon the change. The
code in hgroup_mod() does a couple of verifications to ensure qmaster database
consistency. To perform those verifications, queue instance objects
get temporarily created and then reverted again in the same function.

    http://gridengine.sunsource.net/source/browse/gridengine/source/daemons/qmaster/sge_hgroup_qmaster.c?rev=1.29&view=markup

I have already looked for cases where free()'d memory might get 
accessed, but the code seems clean. A couple of other functions are 
called by hgroup_mod(), but the qmaster crash happens directly in that function.

I still suggest you try to reproduce the error in a more recent version 
of Grid Engine. If you hit the same error in 6.0u10 or 6.1 you can be
sure we will hunt that beast. As for the setup, it should be sufficient
to reuse the hostgroup and the cluster queues from your 6.0u4 cluster.
Installing execution daemons shouldn't be needed.
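
A rough sketch of how the hostgroups and cluster queues could be carried
over to a test qmaster and the change replayed there. The function names
are made up for illustration. The qconf options used (-shgrpl, -shgrp,
-sql, -sq, -Ahgrp, -Aq, -aattr) should be checked against the qconf man
page of the installed version. It is also an assumption that -aattr goes
through the same qmaster code path as 'qconf -mhgrp'. Queues that refer to
objects missing in the test cell (parallel environments, user lists,
calendars and so on) will probably be rejected until those are added too.

```shell
#!/bin/sh
# Dump all hostgroups and cluster queues of the current cell into $1.
export_config() {
    dir=$1
    mkdir -p "$dir/hgrp" "$dir/queue"
    qconf -shgrpl | while read -r g; do
        qconf -shgrp "$g" </dev/null > "$dir/hgrp/$g"
    done
    qconf -sql | while read -r q; do
        qconf -sq "$q" </dev/null > "$dir/queue/$q"
    done
}

# Load the dumped objects into the test cell: hostgroups first,
# because the cluster queues refer to them.
import_config() {
    dir=$1
    for f in "$dir"/hgrp/*; do
        if [ -f "$f" ]; then qconf -Ahgrp "$f"; fi
    done
    for f in "$dir"/queue/*; do
        if [ -f "$f" ]; then qconf -Aq "$f"; fi
    done
}

# Replay the change: append host $1 to the hostlist of hostgroup $2.
add_host_to_hgroup() {
    qconf -aattr hostgroup hostlist "$1" "$2"
}
```

On the 6.0u4 master one would run export_config into some directory, and
on the test master import_config on a copy of that directory followed by
something like

    add_host_to_hgroup pc2302.nersc.gov @debug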

Regards,
Andreas
