Opened 15 years ago

Last modified 11 years ago

#380 new defect

IZ2109: sgemaster commlib error, can't connect to service with data in /sgeroot/default/spool/qmaster/jobs/00

Reported by: aahook Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0u8
Severity: Keywords: Linux qmaster


[Imported from gridengine issuezilla]

        Issue #:      2109             Platform:     All      Reporter: aahook (aahook)
       Component:     gridengine          OS:        Linux
     Subcomponent:    qmaster          Version:      6.0u8       CC:    None defined
        Status:       NEW              Priority:     P3
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
       * Summary:     sgemaster commlib error, can't connect to service with data in /sgeroot/default/spool/qmaster/jobs/00
   Status whiteboard:
                      Date/filename:                           Description:                            Submitted by:
                      Tue Nov 7 09:06:00 -0700 2006: 00.tar.gz contents of jobs directory (text/plain) aahook

     Issue 2109 blocks:
   Votes for issue 2109:

   Opened: Mon Nov 6 12:15:00 -0700 2006 

I have an intermittent problem with 6.0u8. I have a user that submits many
parallel jobs and on occasion the submission host loses contact with the
qmaster.  The messages they get are:
error: commlib error: can't connect to service (Connection refused)
error: getting configuration: unable to contact qmaster using port 701 on host

When I check on the grid master, it loses connection to the grid processes on
itself with the same messages. I can bring grid down with
/etc/init.d/sgemaster stop. I also have to kill a sched process. But when I try
to start it, I get the same message as above along with:
error: can't get configuration from qmaster -- backgrounding

The only way that I can get grid to start back up is to remove data under the
directory /SGEROOT/default/spool/qmaster/jobs/00
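
A less destructive variant of this workaround would be to move the spooled job data aside instead of deleting it, so it can still be inspected or attached to a report. The following is only a minimal sketch under assumptions not confirmed in this ticket: the helper name `move_aside` and the backup location are hypothetical, the qmaster is assumed to be stopped first, and nothing here is part of Grid Engine itself.

```python
import shutil
import time
from pathlib import Path


def move_aside(jobs_dir, backup_root):
    """Move every entry under jobs_dir into a timestamped folder below backup_root.

    Hypothetical helper: jobs_dir itself is left in place (empty), and the
    path of the newly created backup folder is returned.
    """
    jobs_dir = Path(jobs_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = Path(backup_root) / f"{jobs_dir.name}-{stamp}"
    # Fail loudly instead of mixing two backups in one folder.
    backup.mkdir(parents=True, exist_ok=False)
    for entry in sorted(jobs_dir.iterdir()):
        shutil.move(str(entry), str(backup / entry.name))
    return backup
```

For example, `move_aside("/SGEROOT/default/spool/qmaster/jobs/00", "/tmp/sge-aside")` would empty the 00 directory while keeping its former contents under the (assumed) /tmp/sge-aside location.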

I cannot reproduce the problem that initially caused the issue, but I can
reproduce the startup issues by occupying the 00 directory.

From looking around, I have seen similar issues reported for systems like my
grid master, which runs RHE 3.

   ------- Additional comments from aahook Mon Nov 6 12:50:40 -0700 2006 -------
In addition, all my queue information is lost when I bring qmon back up.

   ------- Additional comments from andreas Tue Nov 7 04:14:49 -0700 2006 -------
You write you can reproduce the start-up issue "by occupying" your


what exactly do you mean by this? What would I need to do to reproduce
your start-up?


   ------- Additional comments from andreas Tue Nov 7 04:16:23 -0700 2006 -------
... your start-up /issue/.

   ------- Additional comments from aahook Tue Nov 7 09:06:37 -0700 2006 -------
Created an attachment (id=81)
contents of jobs directory

   ------- Additional comments from aahook Tue Nov 7 09:12:53 -0700 2006 -------
I have attached the contents of what was in my jobs directory for the qmaster. I
don't know if this is possible, but if you place this in your jobs directory and
then stop and try to restart your grid server processes, you should see the
problems again.

I am a little afraid to reproduce the problem, as I rebuilt grid last night and
don't want to repeat that exercise any time soon.

About that, can I take a snapshot of my grid directories and, in case of an
accident, restore the snapshot?  Will that maintain all of my settings?
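
The snapshot idea asked about here could be sketched as a plain archive of a directory tree. This is an illustration only: the function names `snapshot` and `restore` are hypothetical, which directories need to be included is an assumption, and this thread does not confirm whether such a snapshot preserves all settings.

```python
import tarfile
from pathlib import Path


def snapshot(source_dir, archive_path):
    """Write a gzip-compressed tar archive of source_dir (hypothetical helper)."""
    source_dir = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Store paths relative to the directory name, not the absolute path.
        tar.add(str(source_dir), arcname=source_dir.name)
    return archive_path


def restore(archive_path, target_parent):
    """Unpack a snapshot below target_parent, creating it if needed."""
    target_parent = Path(target_parent)
    target_parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        tar.extractall(str(target_parent))
```

Taking the snapshot while the grid daemons are stopped would presumably avoid archiving half-written spool files, though that too is an assumption rather than something stated in this ticket.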


Attachments (1)

81 (6.7 KB) - added by dlove 11 years ago.


Change History (1)

Changed 11 years ago by dlove
