IZ2556: "failed receiving gdi request" due to message backlog with large qmake jobs

We are running an application that uses parallel make ("-pe make" and
qmake). It has been running fine for smallish runs of 50-100 tasks. Recently
we have been testing better file servers, which allow (1) scaling up to
100-400 tasks per application run, and (2) running 2-4 application runs
simultaneously.

Now I'm seeing "Error: failed receiving gdi request" errors from qsub,
qstat, qconf, etc. Details:

1. SGE 6.1, all spooling is local (not NFS).

2. When the GDI problem occurs, "top" shows the qmaster node CPU cranking at
100%, mostly user time.

3. qping shows "messages in read buffer" growing steadily. When I tried
large-scale runs, the backlog grew fast enough that qmaster's memory usage
was increasing rapidly.

4. The only relevant thing I see in the qmaster messages file is, for example:

03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "bhmnode2"

4. If you just wait a couple of hours, the load on SGE comes back down,
*after* the application(s) finish. The applications finish with the expected
output. *However*, the jobs remain on the nodes, and qstat shows them still
in state "r". I have to force-delete them with "qdel -f" to get rid of them.
Then I see this (for example) in messages:

04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running state for job (5374165.1/1766.blade260) in queue "public.q@blade260" while job is in state 65536

5. The local runtime directories under /tmp on local disks are left (not

