[GE users] sge master dying

Iwona Sakrejda isakrejda at lbl.gov
Sat Jun 23 01:04:52 BST 2007


Hi,

So I am almost ready to run the test, just wanted to make sure that I 
don't mess up my production cluster.

Andreas.Haas at Sun.COM wrote:
> As for the setup it should be sufficient
> to resume the hostgroup and the cluster queues from your 6.0u4 cluster.
> Installing execution daemons shouldn't be needed. 

So if I set up a 6.0u10 master on a test node (a node that belongs to my
production cluster) and give it the same definitions as I have on the
6.0u4 master, that will not confuse the execution nodes? Just wanted to
be really, really sure....

Thank You...

iwona


> On Thu, 21 Jun 2007, Iwona Sakrejda wrote:
>
>> Hi,
>>
>> please see below my replies....
>>
>>
>> Andreas.Haas at Sun.COM wrote:
>>>  ... actually, to be entirely sure, you could pass each hostname to
>>>
>>>    $SGE_ROOT/utilbin/<arch>/gethostbyname
>>>
>>> to check whether the C library call gethostbyname() can also cope
>>> with them. If that works fine, we can be sure it is not so trivial ...
>> I checked all the hosts in all the hostgroups that way and they
>> resolve fine. Here is how I did it:
>>
>> cat @athlon03|grep hostlist|sed 's/ /\n/'g|grep pc|awk '{print " /common/sge/6.0u4/utilbin/lx24-x86/gethostbyname "$1}'|sh
>>
>> and so on, group by group, which I hope eliminates possible typos
>> (all my host names start with "pc").
>>
>> For each host I get something like this:
>>
>> Host Address(es): 128.55.37.73
>> Hostname: pc2834.nersc.gov
>> Aliases:  pc2834
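
The group-by-group check described above could also be scripted so that
only failures are reported. What follows is a minimal sketch under stated
assumptions, not the method used in this thread: the function names
hgrp_hosts and check_hosts are hypothetical, the default resolver path
assumes the lx24-x86 architecture directory mentioned above, an empty host
list is assumed to be written as NONE, and long host lists are assumed to
continue with a trailing backslash as in 'qconf -shgrp' output.

```shell
#!/bin/sh
# Hypothetical helpers; names and defaults are assumptions for illustration.

# Print one hostname per line from a hostgroup file (a spool file or
# saved 'qconf -shgrp' output). Follows trailing-backslash continuations.
hgrp_hosts() {
    # $1: path to the hostgroup file
    awk '
        /^#/ { next }                              # skip comment lines
        $1 == "hostlist" { active = 1; $1 = "" }   # start of the host list
        active {
            more = sub(/\\[ \t]*$/, "")            # strip trailing "\", note if present
            n = split($0, f, " ")
            for (i = 1; i <= n; i++)
                if (f[i] != "NONE") print f[i]     # NONE: assumed empty-list marker
            active = more                          # keep going only after a "\"
        }
    ' "$1"
}

# Run a resolver on every host of a hostgroup and report only failures.
check_hosts() {
    # $1: hostgroup file
    # $2: resolver command (optional; default path is an assumption)
    resolver=${2:-"$SGE_ROOT/utilbin/lx24-x86/gethostbyname"}
    hgrp_hosts "$1" | while read -r host; do
        "$resolver" "$host" </dev/null >/dev/null 2>&1 ||
            echo "UNRESOLVED: $host"
    done
}
```

Run from the hostgroups spool directory, 'check_hosts @debug' would print
nothing if every host resolves. Passing the resolver as a second argument
makes it easy to try the loop with a different lookup tool.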
>
> Ok.
>
>>
>> Some of my hostgroups (not the ones that I attempted to modify) are 
>> empty and that's ok, right?
>
> Shouldn't be a problem.
>
>>> Actually I'm curious to see the host group before the 'qconf -mhgrp' 
>>> change
>>> and the new host group configuration. 
>> I tried two different hostgroups on 4 or so occasions and the master
>> always crashed.
>> I did not experiment more because that upsets users.
>>
>> The hostgroup I was trying to modify looks as follows:
>>
>> [root at pc2533 hostgroups]# cat @debug
>> # Version: SGE 6.0u4
>> #
>> # DO NOT MODIFY THIS FILE MANUALLY!
>> #
>> group_name  @debug
>> hostlist    pc2632.nersc.gov pc2104.nersc.gov pc0920.nersc.gov pc0922.nersc.gov pc0928.nersc.gov
>>
>> and qconf -shgrp @debug shows it as:
>>
>> pc2609 74% qconf -shgrp @debug
>> group_name @debug
>> hostlist pc2632.nersc.gov pc2104.nersc.gov pc0920.nersc.gov pc0922.nersc.gov \
>>        pc0928.nersc.gov
>>
>> I was trying to add a space and pc2302.nersc.gov in the line above.
>
> Strange. I had expected a far less trivial configuration, such as
> nested hostgroups.
>
>>
>> Actually when I made a typo (a "," instead of " " for the separator) 
>> I got
>> a message about the problem and the master survived that without any 
>> problems.
>
> Unlike adding a host, syntax errors can already be caught on the client side.
>
>>> I guess you are using that host group in a cluster queue configuration.
>> yes.
>
> It could be that the issue with your change is that a new queue instance
> object for host pc0928.nersc.gov must be created by qmaster upon the
> change. The code in hgroup_mod() does a couple of verifications to ensure
> qmaster database consistency. To perform those verifications, queue
> instance objects get temporarily created and then reverted again in the
> same function.
>
>    
> http://gridengine.sunsource.net/source/browse/gridengine/source/daemons/qmaster/sge_hgroup_qmaster.c?rev=1.29&view=markup 
>
>
> I have already looked for cases where already free()'d memory might
> get accessed, but the code seems clean. A couple of other functions
> are called by hgroup_mod(), but the qmaster crash happens
> directly in that function.
>
> I still suggest you try to reproduce the error in a more recent
> version of Grid Engine. If you hit the same error in 6.0u10 or 6.1, you
> can be sure we will hunt that beast. As for the setup, it should be
> sufficient
> to resume the hostgroup and the cluster queues from your 6.0u4 cluster.
> Installing execution daemons shouldn't be needed.
>
> Regards,
> Andreas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
