Opened 14 years ago

Last modified 8 years ago

#233 new defect

IZ1517: qmaster is not accepting connections if number of execd's exceed number of file descriptors

Reported by: crei Owned by:
Priority: normal Milestone:
Component: sge Version: 6.0u3
Severity: Keywords: communication
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1517]

            Issue #: 1517
           Platform: All
           Reporter: crei (crei)
          Component: gridengine
                 OS: All
       Subcomponent: communication
            Version: 6.0u3
                 CC: None defined
             Status: NEW
           Priority: P3
         Resolution:
         Issue type: DEFECT
   Target milestone: 6.2
        Assigned to: crei (crei)
         QA Contact: crei
                URL:
            Summary: qmaster is not accepting connections if number of execd's exceed number of file descriptors
  Status whiteboard:
        Attachments:

  Issue 1517 blocks:
  Votes for issue 1517:

   Opened: Tue Mar 22 10:51:00 -0700 2005 
------------------------


If the number of file descriptors is limited to a value below the number of
execds, qmaster behaves erratically and is practically unusable:

Response from a qmaster running with ulimit -n 1000 is sporadic at best.
Sometimes the command works immediately, sometimes it hangs for a while and
then works, and sometimes it terminates with the message
"got connect timeout: connect timeout error"

The internal commlib file descriptor limit (file descriptors used for communication)
seems to be broken.

   ------- Additional comments from andreas Fri Apr 15 05:47:15 -0700 2005 -------
WORKAROUND:
The number of file descriptors available to qmaster is logged in the qmaster
messages file. To work around the problem, ensure that qmaster has more file
descriptors available than there are execds in the cluster. Twice the number
of execds is certainly safe.
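
A minimal sketch of that check, assuming Python is available on the qmaster
host. The function name fd_headroom and its parameters are hypothetical and
not part of Grid Engine; the execd count has to be supplied by hand. The
factor of two is the rule of thumb from the comment above. Because the limit
is per process, the script only reports something useful when run from the
same environment that starts qmaster.

```python
import resource


def fd_headroom(execd_count, soft_limit=None, safety_factor=2):
    """Compare the file descriptor limit with the workaround's rule of thumb.

    execd_count   -- number of execds in the cluster (supplied by the caller)
    soft_limit    -- descriptor limit to test; defaults to this process's
                     soft RLIMIT_NOFILE, i.e. what "ulimit -n" reports
    safety_factor -- 2 follows the "two times the number of execds" advice

    Returns (required, available, ok).
    """
    if soft_limit is None:
        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

    required = safety_factor * execd_count

    # RLIM_INFINITY means the process has no descriptor limit at all.
    if soft_limit == resource.RLIM_INFINITY:
        return required, soft_limit, True

    return required, soft_limit, soft_limit >= required


if __name__ == "__main__":
    # Hypothetical cluster size, chosen only for illustration.
    required, available, ok = fd_headroom(execd_count=1500)
    status = "sufficient" if ok else "too low, raise it before starting qmaster"
    print(f"need at least {required} descriptors, limit is {available}: {status}")
```

With the limit from the report above (1000), any cluster of more than 500
execds fails this check, and one with more than 1000 execds is in the range
where the erratic behaviour was observed.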

   ------- Additional comments from andreas Mon Apr 25 03:12:30 -0700 2005 -------
*** Issue 1581 has been marked as a duplicate of this issue. ***

   ------- Additional comments from sgrell Tue Dec 6 08:16:35 -0700 2005 -------
Changed the Subcomponent.

Stephan

   ------- Additional comments from joga Thu Aug 2 08:48:43 -0700 2007 -------
planning to fix it in 6.2.
