[GE users] sge master dying
Andreas.Haas at Sun.COM
Wed Jun 20 11:04:03 BST 2007
On Mon, 18 Jun 2007, Iwona Sakrejda wrote:
> Andreas.Haas at Sun.COM wrote:
>> Hi Iwona,
>> you should upgrade to the most recent 6.0u10 patch version. Since 6.0u4
>> there were many changes in the source module
>> where the crashing hgroup_mod() is implemented.
> I scheduled a maintenance period for the upgrade and we are going to do it,
> but I did use "qconf -mhgrp <group name>" many times before with 6.0u4 and
> it would go through ok.
> Could you shed some light or point me towards some reading on how does
> it work? I am guessing qconf on an admin node is creating a temporary file
> (where?) and then sge_master is trying to read it? Who creates the new
> version in the hostgroups subdirectory?
Well, it works differently. Interaction of qconf with qmaster is entirely
request based. It uses the Grid Engine GDI protocol, which runs over TCP.
Upon qconf -mhgrp, qconf first sends a GDI GET request whose reply is
the present host group configuration of <group name>. This configuration
is then written into a temporary local file on the machine where qconf
was launched. Next the EDITOR is forked to allow the configuration to be
changed. When the EDITOR has exited, the changed configuration is read in
by qconf and parsed into a GDI MOD request. This request is then sent
to qmaster, which performs the modification in its in-memory database.
As part of processing the GDI MOD request, qmaster also spools the new
host group configuration to disk (classic or BDB spooling), but in your
case that stage is never reached: the crashing hgroup_mod() must return
successfully before hgroup_spool() is called to perform the qmaster
spooling.
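
To make the ordering concrete, here is a minimal Python sketch of the
sequence described above. It is a toy model, not Grid Engine code:
FakeQmaster, modify_hgroup() and the simplified two-line file layout are
assumptions made for illustration, and the "editor" is a plain callable
rather than a forked EDITOR process. The only point it demonstrates is
that the spool step runs after the in-memory modification has returned,
so a failure in the modification step never touches the spool.

```python
import os
import tempfile


class FakeQmaster:
    """Hypothetical stand-in for qmaster: in-memory host groups plus a spool."""

    def __init__(self, hgroups):
        self.hgroups = {name: list(hosts) for name, hosts in hgroups.items()}
        self.spooled = {}  # what has been written to "disk" so far

    def gdi_get(self, name):
        # GDI GET: reply with the present configuration of the host group.
        return list(self.hgroups[name])

    def gdi_mod(self, name, hosts):
        # GDI MOD: the in-memory change must succeed before spooling starts.
        self._hgroup_mod(name, hosts)
        self._hgroup_spool(name)

    def _hgroup_mod(self, name, hosts):
        if not hosts:
            # Stands in for "something goes wrong during modification".
            raise ValueError("modification failed")
        self.hgroups[name] = list(hosts)

    def _hgroup_spool(self, name):
        self.spooled[name] = list(self.hgroups[name])


def modify_hgroup(qmaster, name, edit):
    """Client side: GET, write temp file, run the editor, parse, send MOD."""
    hosts = qmaster.gdi_get(name)
    fd, path = tempfile.mkstemp(prefix="hgrp-", text=True)
    try:
        with os.fdopen(fd, "w") as f:
            f.write("group_name %s\nhostlist %s\n" % (name, " ".join(hosts)))
        edit(path)  # the real qconf forks $EDITOR at this point
        fields = {}
        with open(path) as f:
            for line in f:
                parts = line.split(None, 1)
                if parts:
                    fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        qmaster.gdi_mod(name, fields.get("hostlist", "").split())
    finally:
        os.remove(path)  # the temporary file lives only on the client machine
```

In this model the temporary file is never seen by the qmaster side; only
the parsed request is. That mirrors why permissions on the client's
temporary file or on the qmaster spool directory do not come into play
before hgroup_mod() has returned.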
> We did some rearrangement of the filesystems lately so I hope that if
> I could trace through all the permissions, I could find a problem.
Actually one can rule out permission problems with qmaster spool
files for the reasons above.
> And we will upgrade, but it would be good to understand what happened.
I can't tell you the exact reason. My assumption is that one of the many
changes in daemons/qmaster/sge_hgroup_qmaster.c since 6.0u4 fixed a
data corruption problem in hgroup_mod() that crashes qmaster only under
certain conditions. Investigating this, however, would take me some
time, and 6.0u4 dates from May 2005.
To be sure that upgrading to 6.0u10 (January 2007) will finally resolve
your problem, I suggest you set up a small 6.0u10 test environment for the
sole purpose of verifying that qconf -mhgrp doesn't crash qmaster. When you
do this, make sure you use exactly the same hostnames you are using in
your production cluster.
Note that there is some chance that a hostname resolution problem with one
of the hostnames in your host groups is causing the crash.
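
A quick way to look for such a problem is to try resolving every hostname
in the host group on the qmaster machine. The Python sketch below is only
a rough check under stated assumptions: unresolvable_hosts() is a
hypothetical helper, and it asks the system resolver through
socket.getaddrinfo(), which may not match how qmaster itself resolves
names. A clean result therefore does not rule the theory out.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.getaddrinfo):
    """Return (hostname, error text) pairs for names that fail to resolve.

    `resolve` is injectable so the function can be exercised without DNS.
    """
    failures = []
    for name in hostnames:
        try:
            resolve(name, None)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures.append((name, str(exc)))
    return failures
```

The list of names would be copied from the output of
qconf -shgrp <group name>; every name that comes back in the result is
worth checking by hand before the upgrade test.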