[GE users] switching off nodes

Dan Gruhn Dan.Gruhn at Group-W-Inc.com
Thu Feb 10 21:56:41 GMT 2005


I've worked a bit on the sgeexecd script, and it now seems to shut down
an SGE execution daemon okay as long as there are no jobs running.  I'm
attaching my updated script.  The changes I have made:

1) Output conforms to the Redhat/Fedora Core convention (probably not
useful for Solaris, etc.)

2) Added a comment so that chkconfig knows where to put the script in
the startup/shutdown chain (this would be nice to have).

3) Fixed the determination of the shepherd process name when calling
Shutdown so that it takes into account that the shepherd is named
"sge_shepherd-<JobId>".  Here are the lines:

            shepherdName="sge_shepherd-`echo $jobid | sed -e 's/\..*//'`"
            Shutdown $shepherdName $execd_spool_dir/active_jobs/$jobid/pid
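
For illustration, the name derivation in item 3 can be tried on its own.
This is only a minimal sketch: it assumes that the entries under
active_jobs are named "<JobId>.<TaskId>" (for example 1234.1), and the
helper name shepherd_name is hypothetical, not something in the sgeexecd
script.

```shell
#!/bin/sh
# Hypothetical helper: derive the shepherd process name from an
# active_jobs directory entry, assumed to look like "<JobId>.<TaskId>".
shepherd_name() {
    # Strip everything from the first dot onward, leaving only the job ID.
    echo "sge_shepherd-$(echo "$1" | sed -e 's/\..*//')"
}

shepherd_name 1234.1    # prints sge_shepherd-1234
```

An entry without a dot passes through unchanged, so a plain job ID also
yields a valid name.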
	
So the script now properly shuts down sgeexecd and, for each running
job, sge_shepherd-<JobId>.  However, sending SIGTERM to the shepherd does
not kill the submitted job.  Sending SIGTERM directly to the submitted
job will kill it, but the shepherd is not passing the signal on.
Because of this, I suspect things are getting hung up in the master: it
doesn't let go of the execution host, so when the host tries to come up
later, the master already thinks that it has an execution host by that
name.
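
One possible workaround, untested against SGE itself, would be to signal
the shepherd's children directly instead of relying on the shepherd to
forward SIGTERM.  The sketch below assumes that the submitted job runs as
a direct child of the process whose PID is stored in the pid file, and
that pgrep is available on the exec host; the helper name term_children
is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM to the direct children of the
# process whose PID is stored in the given pid file.
term_children() {
    pidfile=$1
    # Give up if the pid file is missing or unreadable.
    [ -r "$pidfile" ] || return 1
    parent=$(cat "$pidfile")
    # pgrep -P lists the processes whose parent PID matches.
    for child in $(pgrep -P "$parent"); do
        kill -TERM "$child" 2>/dev/null
    done
}
```

In the shutdown loop this might be called as
term_children $execd_spool_dir/active_jobs/$jobid/pid before the shepherd
itself is shut down.  Whether that is enough for the master to release
the execution host is a separate question.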

Rayson, I reproduce this by queuing the attached "trial" script and then
issuing the reboot command.  Exec host restart delay is about a minute. 
Is there any interaction here between max_unheard and the amount of time
the host is off?  I set max_unheard to over an hour at one point, but
usually I have it set to 30 seconds.
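
To make the comparison concrete, here is a small sketch that converts a
max_unheard value to seconds and compares it with the downtime.  It
assumes max_unheard is written as HH:MM:SS; the helper name
hms_to_seconds and the 60-second downtime (the "about a minute" above)
are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS interval to seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

max_unheard=$(hms_to_seconds 00:00:30)   # the usual 30-second setting
downtime=60                              # restart delay of about a minute

if [ "$downtime" -gt "$max_unheard" ]; then
    echo "host is down longer than max_unheard"
else
    echo "host returns within max_unheard"
fi
```

With the values above the first branch is taken.  The current setting
can be read from the output of qconf -sconf.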

Perhaps someone knows more about how the shepherd is supposed to work in
this situation.

Dan


On Thu, 2005-02-10 at 14:01, Jerome, Ron wrote:

> I have also seen this behavior with SGE 6.0 when rebooting nodes.  
> 
> Ron. 
> 
> > -----Original Message-----
> > From: raysonho at eseenet.com [mailto:raysonho at eseenet.com]
> > Sent: Thursday, February 10, 2005 1:41 PM
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] switching off nodes
> > 
> > >02/09/2005 09:04:37|execd|dgruhn-lx|E|commlib error: endpoint is not
> > >unique error (endpoint "dgruhn-lx.group-w-inc.com/execd/1" is already
> > >connected)
> > >02/09/2005 09:04:40|execd|dgruhn-lx|E|getting configuration: unable to
> > >contact qmaster using port 461 on host "alice.group-w-inc.com"
> > >02/09/2005 09:04:40|execd|dgruhn-lx|W|can't get configuration from
> > >qmaster -- waiting ...
> > >02/09/2005 09:04:41|execd|dgruhn-lx|E|there is already a client endpoint
> > >dgruhn-lx.group-w-inc.com/execd/1 connected to qmaster service
> > 
> > This may be a bug only in SGE 6.0, with SGE 5.3 it works fine. (the code
> > in
> > commlib is new, BTW)
> > 
> > How do you reproduce this?? How long did you wait to restart the exec
> > hosts??
> > 
> > Rayson
> > 
> > 
> > >
> > >
> > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> 


    [ Part 2, Text/X-SH (Name: "sgeexecd") 309 lines. ]
    [ Unable to print this part. ]


    [ Part 3, Text/X-SH (Name: "trial") 96 lines. ]
    [ Unable to print this part. ]


    [ Part 4: "Attached Text" ]



