Opened 13 years ago

Last modified 11 years ago

#510 new defect

IZ2556: "failed receiving gdi request" due to message backlog with large qmake jobs

Reported by: andreas
Owned by:
Priority: high
Milestone:
Component: sge
Version: 6.1
Severity:
Keywords: PC Linux qmaster


[Imported from gridengine issuezilla]

        Issue #:           2556
        Platform:          PC
        OS:                Linux
        Reporter:          andreas (andreas)
        Component:         gridengine
        Subcomponent:      qmaster
        Version:           6.1
        CC:                None defined
        Status:            NEW
        Priority:          P2
        Resolution:
        Issue type:        DEFECT
        Target milestone:  ---
        Assigned to:       ernst (ernst)
        QA Contact:        ernst
        Summary:           "failed receiving gdi request" due to message backlog with large qmake jobs
        Status whiteboard:
        Issue 2556 blocks:
        Votes for issue 2556:

   Opened: Fri Apr 18 01:59:00 -0700 2008

We are running an application which uses parallel make ("-pe make" and
qmake). It has been running fine for smallish, 50-100 task runs. Recently we
are testing better file servers, which allow (1) scaling up to 100-400 tasks
per application run, and (2) running 2-4 application runs simultaneously.

Now, I'm seeing "Error: failed receiving gdi request" errors from qsub,
qstat, qconf, etc. Details...

1. SGE 6.1, all spooling is local (not NFS).

2. When the GDI problem occurs, "top" shows the qmaster node CPU cranking at
100%, mostly user time.

3. qping shows "messages in read buffer" growing steadily. On the large-scale
runs, the backlog grew fast enough that qmaster's memory usage climbed
rapidly.
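
The growing read-buffer counter from observation 3 can be tracked from the command line. A minimal sketch, assuming qping is on PATH, that SGE_QMASTER_HOST and SGE_QMASTER_PORT are set for your cluster, and that qping's counter line contains the phrase "messages in read buffer" (the exact wording may vary between SGE versions); the extract_backlog helper is hypothetical:

```shell
# Hypothetical helper: pull the numeric value out of a qping -info line
# such as "messages in read buffer:      1234" (line wording is an assumption).
extract_backlog() {
  grep -i 'messages in read buffer' | awk -F: '{gsub(/[^0-9]/, "", $2); print $2}'
}

# Polling loop against a live qmaster (commented out here; requires a cluster):
# while true; do
#   qping -info "$SGE_QMASTER_HOST" "$SGE_QMASTER_PORT" qmaster 1 | extract_backlog
#   sleep 10
# done
```

Logging this counter alongside timestamps makes it easy to correlate the backlog growth with the start of the large qmake runs.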

4. The only relevant thing I see in the qmaster messages file is this (for example):

03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "bhmnode2"

5. If you just wait a couple of hours, the load on SGE comes back down,
*after* the application(s) finish. The applications finish with the expected
output. *However*, the jobs remain on the nodes, and qstat still shows them
in state "r". I have to force their deletion with "qdel -f" to get rid of
them. Then I see this (for example) in the messages file:

04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running state for job (5374165.1/1766.blade260) in queue "public.q@blade260" while job is in state 65536

6. The local runtime directories under /tmp on the local disks are left
behind (not cleaned up).

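Cleaning up the stuck jobs described above can be scripted. A minimal sketch, assuming the default qstat column layout (job-ID in column 1, state in column 5) and live cluster access; the stuck_ids helper is hypothetical:

```shell
# Hypothetical helper: from qstat output on stdin, keep rows whose state
# column reads "r" and print their job IDs. Assumes the default qstat
# layout: two header lines, then job-ID in column 1 and state in column 5.
stuck_ids() {
  awk 'NR > 2 && $5 == "r" { print $1 }'
}

# Usage against a live cluster (commented out here):
# qstat -u '*' | stuck_ids | while read -r id; do qdel -f "$id"; done
```

As the report notes, only the forced form ("qdel -f") removes these leftover jobs; a plain qdel leaves them in state "r".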
   ------- Additional comments from crei Fri Apr 18 02:36:29 -0700 2008 -------
Changed subcomponent; since qping is working, this is not a communication problem.

   ------- Additional comments from crei Fri Apr 18 02:37:21 -0700 2008 -------
Yes, I have to add a comment here
