Opened 15 years ago

Closed 9 years ago

#248 closed patch (fixed)

IZ1617: Bad check for jobs when removing execution hosts

Reported by: uddeborg Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0u4
Severity: minor Keywords: install
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1617]

        Issue #:      1617             Platform:     All     Reporter: uddeborg (uddeborg)
       Component:     gridengine          OS:        All
     Subcomponent:    install          Version:      6.0u4      CC:    None defined
        Status:       REOPENED         Priority:     P3
      Resolution:                     Issue type:    PATCH
                                   Target milestone: ---
      Assigned to:    dom (dom)
      QA Contact:     dom
          URL:
       * Summary:     Bad check for jobs when removing execution hosts
   Status whiteboard:
      Attachments:
                      Date/filename:                                             Description:                                                            Submitted by:
                      Fri May 13 00:28:00 -0700 2005: inst_execd_uninst.sh.patch Suggested patch for inst_execd_uninst.sh (text/plain)                   uddeborg
                      Wed Jun 8 08:35:00 -0700 2005: p                           Updated patch, taken after the changes made in issue 1627. (text/plain) uddeborg

     Issue 1617 blocks:
   Votes for issue 1617:


   Opened: Fri May 13 00:27:00 -0700 2005 
------------------------


When trying to remove some execution hosts (inst_sge -ux -host xxx), the script
complained that it could not move all jobs in the machine's queues, although I
knew for sure they were empty.  (And the machine had not been up for several days.)
I took a look in the relevant script, inst_execd_uninst.sh.

In several places a list of queues is extracted with this pipe:

    qstat -f | grep $exechost | cut -d" " -f1

There are some problems with this:

1. If one is using names without domains, it is not unlikely that one machine
   has a name which is a substring of another machine's name.  We have one
   named "cat" and another named "catoosa", for example. When searching for the
   former, the latter would also be found by the above line.

   To solve that, I suggest
   - fully resolved names are used, using ResolveHosts.
   - the grep is anchored.  The left side can be anchored with an @. If the
     "cut" is done before the "grep", the right side can be anchored as an
     end-of-line.

2. The same command is used to figure out if there are any jobs to
   suspend/reschedule.  But the command will also list empty queues. There ought
   to be a -ne flag to qstat in those cases.
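
Both problems can be shown without a running cluster. The following is only a
sketch: the two listings are simplified stand-ins for what `qstat -f` and
`qstat -f -ne` print (the exact column layout is an assumption), and
queues_on_host is a hypothetical helper, not a function from
inst_execd_uninst.sh.

```shell
#!/bin/sh
# Simplified stand-ins for `qstat -f` and `qstat -f -ne` output (format assumed).
full_listing='all.q@cat BIP 0/2 0.01 lx24-x86
all.q@catoosa BIP 1/2 0.50 lx24-x86
     42 0.55500 sleeper uddeborg r 05/13/2005 00:10:00 1'
ne_listing='all.q@catoosa BIP 1/2 0.50 lx24-x86
     42 0.55500 sleeper uddeborg r 05/13/2005 00:10:00 1'

# Hypothetical helper: print the queue instances on stdin whose host is exactly $1.
queues_on_host() {
    # Escape dots so a resolved name such as cat.example.com matches literally.
    host_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Cut first, then anchor with "@" on the left and end-of-line on the right.
    cut -d" " -f1 | grep "@${host_re}\$"
}

echo "unanchored:"
printf '%s\n' "$full_listing" | grep cat | cut -d" " -f1
echo "anchored:"
printf '%s\n' "$full_listing" | queues_on_host cat
echo "anchored, -ne listing:"
printf '%s\n' "$ne_listing" | queues_on_host cat || echo "(no queues with jobs on cat)"
```

The unanchored pipe reports the queue on "catoosa" when asked about "cat". The
anchored one does not, and on the -ne style listing it finds nothing for "cat",
which is the answer wanted when checking for jobs.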

The attached patch is for the 6.0U4 version.  Since finished binaries for that
version have not been released yet, I've tried a similar one on our 6.0U3
system, but have not tested this precise patch.

   ------- Additional comments from uddeborg Fri May 13 00:28:16 -0700 2005 -------
Created an attachment (id=59)
Suggested patch for inst_execd_uninst.sh

   ------- Additional comments from uddeborg Wed Jun 8 08:33:12 -0700 2005 -------
The resolution of bug 1627 also fixed the first subproblem reported here.  And
in a better way than I suggested!  The second subproblem remains.

   ------- Additional comments from uddeborg Wed Jun 8 08:35:14 -0700 2005 -------
Created an attachment (id=61)
Updated patch, taken after the changes made in issue 1627.

   ------- Additional comments from roland Fri Jun 17 07:17:00 -0700 2005 -------
take care of it

   ------- Additional comments from roland Mon Jun 20 05:57:26 -0700 2005 -------
The -ne switch only prints queues with scheduled jobs. Empty queues will be
ignored. Inside the script we want to suspend/disable... all queues, not only
queues with jobs.

If we add the -ne switch, a queue that was skipped because it was empty could
get a job after we have executed the disable/suspend code. In this case the
queue will be deleted while the job is running.

   ------- Additional comments from uddeborg Wed Jul 27 06:55:37 -0700 2005 -------
Point taken.  All queues should be suspended.  In SuspendQueue(), there should
be no "-ne".

But in SuspendJobs() and RescheduleJobs() it would still make sense with "-ne",
wouldn't it?  These functions don't do anything with the queues themselves, only
with jobs in them, if any.
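
A rough sketch of that split follows. The qstat function below is a stub that
only imitates the effect of -ne so the example runs anywhere, and
queues_to_suspend and queues_with_jobs are hypothetical names, not the
script's real functions.

```shell
#!/bin/sh
# Stub standing in for the real qstat so the sketch runs without a cluster.
# It only imitates the effect of -ne: the empty queue on "cat" is left out.
qstat() {
    case " $* " in
        *" -ne "*) printf '%s\n' 'all.q@dog BIP 1/2 0.50 lx24-x86' ;;
        *)         printf '%s\n' 'all.q@cat BIP 0/2 0.01 lx24-x86' \
                                 'all.q@dog BIP 1/2 0.50 lx24-x86' ;;
    esac
}

# Hypothetical helpers, not the script's real code.
# Queue handling wants every queue on the host, empty or not.
queues_to_suspend() { qstat -f | cut -d" " -f1 | grep "@$1\$"; }
# Job handling only cares about queues that actually hold jobs.
queues_with_jobs()  { qstat -f -ne | cut -d" " -f1 | grep "@$1\$"; }

queues_to_suspend cat
queues_with_jobs cat || echo "no jobs to suspend or reschedule on cat"
```

With this split the empty queue on "cat" is still listed for suspension, while
the job check no longer mistakes it for a queue with work in it.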

Attachments (2)

59 (2.6 KB) - added by dlove 9 years ago.
61 (1.8 KB) - added by dlove 9 years ago.


Change History (3)

Changed 9 years ago by dlove

Changed 9 years ago by dlove

comment:1 Changed 9 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

Fixed according to RD-2005-06-20-0 in Changelog
