Opened 14 years ago

Last modified 9 years ago

#355 new defect

IZ2047: potential qmaster sec. fault. (resurfaced)

Reported by: danielgomez
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u7
Severity:
Keywords: Sun Solaris qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2047]

Issue #:              2047
Platform:             Sun
OS:                   Solaris
Reporter:             danielgomez (danielgomez)
Component:            gridengine
Subcomponent:         qmaster
Version:              6.0u7
CC:                   None defined
Status:               NEW
Priority:             P3
Resolution:
Issue type:           DEFECT
Target milestone:     ---
Assigned to:          ernst (ernst)
QA Contact:           ernst
URL:
Summary:              potential qmaster sec. fault. (resurfaced)
Status whiteboard:
Attachments:
Issue 2047 blocks:
Votes for issue 2047:


   Opened: Thu Apr 27 09:19:00 -0700 2006 
------------------------


The grid master hung, and the only information about the event
in the qmaster/messages file was:

04/26/2006 22:28:59|qmaster|apollo|E|acknowledge timeout after 1200 seconds for event client (schedd:1) on host "apollo"
04/26/2006 22:28:59|qmaster|apollo|I|event client "scheduler" with id 1 deregistered

(Below is a larger excerpt around the error lines of interest.)

04/26/2006 22:07:29|qmaster|apollo|I|job 2320121.38 finished on host rlx-0-5-5.tigr.org
04/26/2006 22:07:42|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:42|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:44|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:44|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:59|qmaster|apollo|I|job 2320121.37 finished on host dell-0-4-7.tigr.org
04/26/2006 22:08:12|qmaster|apollo|I|job 2320121.40 finished on host rlx-0-2-6.tigr.org
04/26/2006 22:08:22|qmaster|apollo|I|job 2320121.41 finished on host dell-0-4-10.tigr.org
04/26/2006 22:08:30|qmaster|apollo|I|job 2320121.39 finished on host dell-0-4-2.tigr.org
04/26/2006 22:09:16|qmaster|apollo|E|error closing file "/local/n1ge/tigr/common/reporting": No space left on device
04/26/2006 22:09:22|qmaster|apollo|I|job 2320121.42 finished on host rlx-0-4-4.tigr.org
04/26/2006 22:28:59|qmaster|apollo|E|acknowledge timeout after 1200 seconds for event client (schedd:1) on host "apollo"
04/26/2006 22:28:59|qmaster|apollo|I|event client "scheduler" with id 1 deregistered
04/27/2006 08:43:22|qmaster|frog|W|cannot resolve local configuration name "suseme.tigr.org"
04/27/2006 08:43:22|qmaster|frog|W|cannot resolve local configuration name "susecube.tigr.org"
04/27/2006 08:43:22|qmaster|frog|W|cannot resolve local configuration name "dgomez-sol10.tigr.org"


I did a search and found a seemingly related earlier bug that was fixed in 6u4.
That issue's ID is 1579.

Not sure if this is a resurfacing of the same bug or not.

This is not directly related, but it may be adversely affecting the
stability of the qmaster. These are the only errors that I could find in
qmaster/schedd/messages:

04/26/2006 16:15:58|schedd|apollo|W|Jobs 2317953 & 2317867 dispatched to master/subordinated queues "fast.q@dell-0-1-6.tigr.org"/"default.q@dell-0-1-6.tigr.org". Suspend on subordinate to occur in same scheduling interval. Policy conflict!
04/27/2006 08:43:45|schedd|frog|I|using "/local/common/n1ge/tigr/spool" for execd_spool_dir

Again, this is an independent conflict in our environment that I'll address.
I'm just trying to give more context from around the time the qmaster hung.
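
For anyone triaging similar logs, here is a minimal sketch of how the E-level entries could be pulled out of a messages file. It assumes the pipe-delimited layout seen in the excerpts above (timestamp|daemon|host|level|message) and tolerates entries that were wrapped across lines when pasted. The names LogEntry, parse_messages and errors_only are hypothetical helpers written for this example, not part of Grid Engine.

```python
import re
from collections import namedtuple

# Hypothetical record type; field order follows the assumed layout
# MM/DD/YYYY HH:MM:SS|daemon|host|level|message
LogEntry = namedtuple("LogEntry", "timestamp daemon host level message")

# A new entry is assumed to start with a timestamp followed by a pipe.
ENTRY_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def parse_messages(lines):
    """Yield LogEntry tuples, appending wrapped continuation lines
    to the message of the entry they follow."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if ENTRY_START.match(line):
            if current is not None:
                yield LogEntry(*current)
            parts = line.split("|", 4)
            # Skip lines that look like entries but lack all five fields.
            current = parts if len(parts) == 5 else None
        elif current is not None and line.strip():
            current[4] += " " + line.strip()
    if current is not None:
        yield LogEntry(*current)


def errors_only(lines):
    """Return only the entries whose level field is 'E'."""
    return [entry for entry in parse_messages(lines) if entry.level == "E"]
```

Passing an open file object for the messages file to errors_only() should return just the error entries, each with its full message on one line.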
