Opened 9 years ago

Last modified 6 years ago

#510 new defect

IZ2556: "failed receiving gdi request" due to message backlog with large qmake jobs

Reported by: andreas
Owned by:
Priority: high
Milestone:
Component: sge
Version: 6.1
Severity:
Keywords: PC Linux qmaster
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=2556]

        Issue #:      2556             Platform:     PC       Reporter: andreas (andreas)
       Component:     gridengine          OS:        Linux
     Subcomponent:    qmaster          Version:      6.1         CC:    None defined
        Status:       NEW              Priority:     P2
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    ernst (ernst)
      QA Contact:     ernst
          URL:        http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=24088
       * Summary:     "failed receiving gdi request" due to message backlog with large qmake jobs
   Status whiteboard:
      Attachments:

     Issue 2556 blocks:
   Votes for issue 2556:


   Opened: Fri Apr 18 01:59:00 -0700 2008 
------------------------


We are running an application which uses parallel make ("-pe make" and
qmake). It has been running fine for smallish runs of 50-100 tasks. Recently
we have been testing better file servers, which allow (1) scaling up to
100-400 tasks per application run, and (2) running 2-4 application runs
simultaneously.

Now I'm seeing "Error: failed receiving gdi request" errors from qsub,
qstat, qconf, etc. Details:

1. SGE 6.1, all spooling is local (not NFS).

2. When the GDI problem occurs, "top" shows the qmaster node CPU cranking at
100%, mostly user time.

3. qping shows "messages in read buffer" growing steadily. When I tried
large scale runs, the backlog grew fast enough that qmaster's memory usage
rose rapidly.

4. The only relevant thing I see in the qmaster messages file (for example)
is:

03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "bhmnode2"


5. If you just wait for a couple of hours, the load on SGE comes back down,
*after* the application(s) finish. The applications finish with the expected
output. *However*, the jobs remain on the nodes, and qstat shows them still
in state "r". I have to force their deletion with "qdel -f" to get rid of
them. Then I see this (for example) in messages:

04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running state for job (5374165.1/1766.blade260) in queue "public.q@blade260" while job is in state 65536

6. The local runtime directories under /tmp on local disks are left behind
(not deleted).
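
For anyone trying to reproduce or triage this, the two signals described
above (error lines in the qmaster messages file, and the growing "messages
in read buffer" counter from qping) can be tracked with a small script. The
following is a minimal Python sketch and not part of SGE. The LogEntry field
names, the helper names, and the exact "messages in read buffer: <n>" shape
of the qping counter line are assumptions made for illustration; only the
pipe-separated layout of the messages lines is taken from the excerpts above.

```python
import re
from datetime import datetime
from typing import List, NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    """One line of the qmaster messages file (field names are hypothetical)."""
    when: datetime
    daemon: str
    host: str
    level: str
    text: str


def parse_messages_line(line: str) -> Optional[LogEntry]:
    """Split a 'timestamp|daemon|host|level|text' line; None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    try:
        when = datetime.strptime(parts[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(when, *parts[1:])


# Assumed shape of the qping counter line: "messages in read buffer: <n>"
_READ_BUFFER = re.compile(r"messages in read buffer\s*:\s*(\d+)")


def read_buffer_count(qping_output: str) -> Optional[int]:
    """Extract the read-buffer message count from captured qping text."""
    match = _READ_BUFFER.search(qping_output)
    return int(match.group(1)) if match else None


def growth_per_second(samples: List[Tuple[float, int]]) -> float:
    """Average backlog growth between the oldest and newest (seconds, count) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (c1 - c0) / (t1 - t0)
```

Feeding read_buffer_count() with qping output captured at regular intervals,
and passing the resulting (time, count) pairs to growth_per_second(), gives a
rough backlog growth rate that can be compared between the small and large
qmake runs. Filtering parse_messages_line() results on level "E" picks out
entries like the two quoted above.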

   ------- Additional comments from crei Fri Apr 18 02:36:29 -0700 2008 -------
Changed subcomponent: since qping is working, this is not a communication problem.

   ------- Additional comments from crei Fri Apr 18 02:37:21 -0700 2008 -------
Yes, I have to add a comment here
