[GE users] sge master dying

Iwona Sakrejda isakrejda at lbl.gov
Wed Jun 20 23:17:44 BST 2007





Andreas.Haas at Sun.COM wrote:
> Note, there is a certain chance that a hostname resolution problem with one
> of the hostnames in your host groups is causing the crash.

I checked, and all the hostnames are in /etc/hosts with correct IP addresses.
That should be good enough, right?
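
One way to double-check this beyond reading /etc/hosts is to ask the resolver
itself. The following is only a minimal sketch in Python, assuming that what
matters is that each name resolves and that its address maps back to the same
name. check_host() and its injectable resolve/reverse parameters are
hypothetical helpers made up for this illustration; they are not part of Grid
Engine.

```python
import socket


def check_host(name, resolve=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Return a list of resolution problems for one hostname (empty if none).

    resolve and reverse default to the system resolver; they are parameters
    only so the check can be exercised with fake lookups.
    """
    try:
        canonical, _aliases, addresses = resolve(name)
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        return [f"{name}: forward lookup failed ({exc})"]

    problems = []
    for addr in addresses:
        try:
            rev_name, rev_aliases, _ = reverse(addr)
        except OSError as exc:
            problems.append(f"{name}: reverse lookup of {addr} failed ({exc})")
            continue
        # Accept the address if it maps back to the name we asked for
        # or to the canonical name returned by the forward lookup.
        known = {n.lower() for n in [rev_name, *rev_aliases]}
        if name.lower() not in known and canonical.lower() not in known:
            problems.append(f"{name}: {addr} maps back to {rev_name}")
    return problems


# Hypothetical usage, run on the qmaster host with the names from a host group:
#   for host in ["node1", "node2"]:
#       print(host, check_host(host) or "ok")
```

An empty result for every name in the host groups would suggest the resolver
sees the hosts consistently; it does not by itself prove that qmaster does.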

And thanks a lot for the detailed explanation.

Iwona

> On Mon, 18 Jun 2007, Iwona Sakrejda wrote:
>
>> Hi,
>>
>> Andreas.Haas at Sun.COM wrote:
>>> Hi Iwona,
>>>
>>> You should upgrade to the most recent 6.0u10 patch version. Since
>>> 6.0u4 there have been many changes in the source module
>>>
>>>    daemons/qmaster/sge_hgroup_qmaster.c
>>>
>>> where the crashing hgroup_mod() is implemented.
>> I scheduled a maintenance period for the upgrade and we are going to
>> do it, but I used "qconf -mhgrp <group name>" many times before with
>> 6.0u4 and it went through OK.
>>
>> Could you shed some light on how it works, or point me towards some
>> reading? I am guessing qconf on an admin node creates a temporary file
>> (where?) and then sge_master tries to read it? Who creates the new
>> version in the hostgroups subdirectory?
>
> Well, it works differently. Interaction of qconf with qmaster is
> entirely request based. It uses the Grid Engine GDI protocol, which is
> based on TCP.
>
> Upon qconf -mhgrp, qconf first sends a GDI GET request whose reply is
> the present host group configuration of <group name>. This configuration
> is then written to a temporary local file on the machine where qconf
> was launched. Next, the EDITOR is forked to allow the configuration to be
> changed. When the EDITOR has exited, the changed configuration is read in
> by qconf and parsed into a GDI MOD request. This request is then sent
> to qmaster, which performs the modification in its in-memory
> database. As part of processing the GDI MOD request, qmaster also
> spools the new host group configuration to disk (classic or BDB spooling),
> but in your case that stage is never reached: the crashing hgroup_mod()
> would have to return successfully before hgroup_spool() is called to
> perform the qmaster spooling.
>
>> We did some rearrangement of the filesystems lately, so I hope that if
>> I could trace through all the permissions, I could find a problem.
>
> Actually, one can rule out permission problems with qmaster spool files
> for the reasons above.
>
>>
>> And we will upgrade, but it would be good to understand what happened.
>
> I can't tell you the exact reason. My assumption is that one of the many
> changes in daemons/qmaster/sge_hgroup_qmaster.c since 6.0u4 fixed some
> data corruption problem during hgroup_mod() that crashes qmaster only
> under certain conditions. Investigating this, however, would cost me
> some time, and 6.0u4 dates from May 2005.
>
> To be sure that upgrading to 6.0u10 (January 2007) will finally
> resolve your problem, I suggest you set up a small 6.0u10 test
> environment for the sole purpose of verifying that qconf -mhgrp doesn't
> crash qmaster. When you do this, make sure you use exactly the same
> hostnames you are using in your production cluster.
>
> Note, there is a certain chance that a hostname resolution problem with
> one of the hostnames in your host groups is causing the crash.
>
> Regards,
> Andreas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
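
The request flow Andreas describes for qconf -mhgrp (GDI GET, temporary file,
EDITOR, GDI MOD, then spooling) can be modeled roughly as below. This is only
a toy sketch in Python, not Grid Engine code: ToyQmaster, gdi_get, gdi_mod and
qconf_mhgrp are hypothetical names, a plain callable stands in for the forked
EDITOR, and the "crash" is simulated with an exception. Only the names
hgroup_mod and hgroup_spool are taken from the explanation above. The point it
illustrates is the ordering: if hgroup_mod fails, spooling is never attempted.

```python
import os
import tempfile


class ToyQmaster:
    """Toy stand-in for qmaster: an in-memory store plus a record of spooling."""

    def __init__(self, hgroups):
        self.hgroups = dict(hgroups)  # in-memory "database"
        self.spooled = []             # host groups written to "disk"

    def gdi_get(self, name):
        # Reply to a GET request with the present configuration.
        return self.hgroups[name]

    def gdi_mod(self, name, new_config):
        self.hgroup_mod(name, new_config)  # may fail (the simulated crash)
        self.hgroup_spool(name)            # reached only if the line above succeeds

    def hgroup_mod(self, name, new_config):
        # Assumption for the sketch: an empty configuration triggers the failure.
        if not new_config.strip():
            raise ValueError("simulated failure in hgroup_mod")
        self.hgroups[name] = new_config

    def hgroup_spool(self, name):
        self.spooled.append(name)


def qconf_mhgrp(qmaster, name, editor):
    """Client side: GET, write temp file, run the editor, read back, send MOD."""
    current = qmaster.gdi_get(name)
    fd, path = tempfile.mkstemp(text=True)  # temp file local to the client
    try:
        with os.fdopen(fd, "w") as f:
            f.write(current)
        editor(path)                         # stands in for forking the EDITOR
        with open(path) as f:
            changed = f.read()
    finally:
        os.remove(path)
    qmaster.gdi_mod(name, changed)
```

In this model the client only ever touches its own temporary file, which is
consistent with the point that permissions on the qmaster spool files cannot
explain a failure that happens before spooling.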




