Opened 15 years ago
Last modified 10 years ago
#355 new defect
IZ2047: potential qmaster sec. fault. (resurfaced)
Reported by: | danielgomez | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.0u7 |
Severity: | | Keywords: | Sun Solaris qmaster |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2047]
Issue #: 2047
Platform: Sun
Reporter: danielgomez (danielgomez)
Component: gridengine
OS: Solaris
Subcomponent: qmaster
Version: 6.0u7
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: ernst (ernst)
QA Contact: ernst
URL:
* Summary: potential qmaster sec. fault. (resurfaced)
Status whiteboard:
Attachments:
Issue 2047 blocks:
Votes for issue 2047:
Opened: Thu Apr 27 09:19:00 -0700 2006
------------------------

The grid master became hung, and the only information about the event in the qmaster/messages file was:

04/26/2006 22:28:59|qmaster|apollo|E|acknowledge timeout after 1200 seconds for event client (schedd:1) on host "apollo"
04/26/2006 22:28:59|qmaster|apollo|I|event client "scheduler" with id 1 deregistered

(Below is a larger context around the error lines of interest.)

04/26/2006 22:07:29|qmaster|apollo|I|job 2320121.38 finished on host rlx-0-5-5.tigr.org
04/26/2006 22:07:42|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:42|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:44|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:44|qmaster|apollo|E|denied: host "akela.tigr.org" is neither submit nor admin host
04/26/2006 22:07:59|qmaster|apollo|I|job 2320121.37 finished on host dell-0-4-7.tigr.org
04/26/2006 22:08:12|qmaster|apollo|I|job 2320121.40 finished on host rlx-0-2-6.tigr.org
04/26/2006 22:08:22|qmaster|apollo|I|job 2320121.41 finished on host dell-0-4-10.tigr.org
04/26/2006 22:08:30|qmaster|apollo|I|job 2320121.39 finished on host dell-0-4-2.tigr.org
04/26/2006 22:09:16|qmaster|apollo|E|error closing file "/local/n1ge/tigr/common/reporting": No space left on device
04/26/2006 22:09:22|qmaster|apollo|I|job 2320121.42 finished on host rlx-0-4-4.tigr.org
04/26/2006 22:28:59|qmaster|apollo|E|acknowledge timeout after 1200 seconds for event client (schedd:1) on host "apollo"
04/26/2006 22:28:59|qmaster|apollo|I|event client "scheduler" with id 1 deregistered
04/27/2006 08:43:22|qmaster|frog|W|cannot resolve local configuration name "suseme.tigr.org"
04/27/2006 08:43:22|qmaster|frog|W|cannot resolve local configuration name "susecube.tigr.org"
04/27/2006 08:43:22|qmaster|frog|W|cannot resolve local configuration name "dgomez-sol10.tigr.org"

I did a search and found a seemingly related earlier bug that was fixed in 6u4. That issue ID is 1579. Not sure if this is a resurfacing of the same bug or not.

The following is not directly related, but it may be adversely affecting the stability of the qmaster. These are the only errors that I could find in qmaster/schedd/messages:

04/26/2006 16:15:58|schedd|apollo|W|Jobs 2317953 & 2317867 dispatched to master/subordinated queues "fast.q@dell-0-1-6.tigr.org"/"default.q@dell-0-1-6.tigr.org". Suspend on subordinate to occur in same scheduling interval. Policy conflict!
04/27/2006 08:43:45|schedd|frog|I|using "/local/common/n1ge/tigr/spool" for execd_spool_dir

Again, this is an independent conflict in our environment that I'll address. Just trying to give more context at or near the point in time the qmaster hung.
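For anyone triaging similar hangs, here is a minimal sketch of how the error-level entries could be pulled out of a messages file. The field layout (timestamp|daemon|host|level|message) is only inferred from the lines quoted in this ticket, not taken from Grid Engine documentation, and the names `LogEntry`, `parse_line`, and `errors_only` are hypothetical helpers made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    """One messages-file line; layout assumed from the excerpts in this ticket."""
    timestamp: str
    daemon: str
    host: str
    level: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse a single line, returning None if it does not have five fields."""
    # Split on the first four pipes only, so a "|" inside the message survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogEntry(*parts)


def errors_only(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only entries whose level field is "E" (assumed to mean error)."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level == "E"]


# Sample lines copied from the qmaster/messages excerpt above.
sample = [
    '04/26/2006 22:09:16|qmaster|apollo|E|error closing file '
    '"/local/n1ge/tigr/common/reporting": No space left on device',
    '04/26/2006 22:09:22|qmaster|apollo|I|job 2320121.42 finished on host '
    'rlx-0-4-4.tigr.org',
    '04/26/2006 22:28:59|qmaster|apollo|E|acknowledge timeout after 1200 '
    'seconds for event client (schedd:1) on host "apollo"',
]

for entry in errors_only(sample):
    print(entry.timestamp, entry.message)
```

Run against the full file, this would list the "denied" host errors, the "No space left on device" error, and the acknowledge timeout in order, which makes the sequence leading up to the hang easier to read.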