[GE users] failed receiving gdi request

Heywood, Todd heywood at cshl.edu
Wed Apr 2 18:46:28 BST 2008


We are running an application that uses parallel make ("-pe make" and
qmake). It has been running fine for smallish runs of 50-100 tasks. Recently
we have been testing better file servers, which allow (1) scaling up to
100-400 tasks per application run, and (2) running 2-4 application runs
simultaneously.

Now, I'm seeing "Error: failed receiving gdi request" errors from qsub,
qstat, qconf, etc. Details...

1. SGE 6.1, all spooling is local (not NFS).

2. When the GDI problem occurs, "top" shows the qmaster node CPU cranking at
100%, mostly user time.

3. qping shows "messages in read buffer" growing steadily. When I tried
large-scale runs, the buffer grew so quickly that qmaster's memory usage
increased really fast.

4. The only relevant thing I see in the qmaster messages file (for example)
is:

03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "bhmnode2"


5. If you just wait for a couple of hours, the load on SGE comes back down,
*after* the application(s) finish. The applications finish with the expected
output. *However*, the jobs remain on the nodes, and qstat shows them still
in state "r". I have to force their deletion with "qdel -f" to get rid of
them. Then I see this (for example) in messages:

04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running state for job (5374165.1/1766.blade260) in queue "public.q@blade260" while job is in state 65536

6. The local runtime directories under /tmp on local disks are left behind
(not deleted).
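
As a rough illustration of how error entries like the two quoted above could
be pulled out of the qmaster messages file and counted per hour, here is a
minimal Python sketch. It assumes each entry sits on a single line in the
pipe-separated form seen above (timestamp|component|host|level|text). The
function names are hypothetical and are not part of SGE.

```python
from collections import Counter
from datetime import datetime


def parse_message_line(line):
    """Split one 'timestamp|component|host|level|text' entry.

    Returns a dict, or None if the line does not match the assumed layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, component, host, level, text = parts
    try:
        # Timestamps in the quoted entries look like 03/31/2008 23:45:52.
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"time": when, "component": component, "host": host,
            "level": level, "text": text}


def count_errors_by_hour(lines):
    """Count entries with level 'E', grouped by the hour they were logged."""
    counts = Counter()
    for line in lines:
        entry = parse_message_line(line)
        if entry and entry["level"] == "E":
            counts[entry["time"].strftime("%m/%d/%Y %H:00")] += 1
    return counts


# The two entries quoted in this message, each joined onto one line.
sample = [
    '03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "bhmnode2"',
    '04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running '
    'state for job (5374165.1/1766.blade260) in queue "public.q@blade260" '
    'while job is in state 65536',
]

for hour, n in sorted(count_errors_by_hour(sample).items()):
    print(hour, n)
```

Feeding the real messages file through count_errors_by_hour (for example via
open(path) as the lines argument) would show whether the errors cluster
around the times the large runs were active.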


I searched the archive and found something related from August 2006, but
there appeared to be no resolution.

I'd appreciate any ideas or help. I hope that this is not an SGE limitation
which would prevent us from using SGE. We do need to scale up our
applications as we scale up our file serving capability.

What are all the messages being sent to qmaster that are filling up its read
buffer?
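
To put a number on the read-buffer growth described in item 3, something
like the following Python sketch could be used. It assumes that qping output
(for example from "qping -info <host> <port> qmaster 1") has been captured
at known times and contains a line of the form "messages in read buffer: N".
That exact output layout is an assumption here, the function names are
hypothetical, and the sample values are made up purely for illustration.

```python
import re

# Assumed layout of the relevant qping output line.
READ_BUFFER_RE = re.compile(r"messages in read buffer:\s*(\d+)")


def read_buffer_count(qping_text):
    """Return the read-buffer message count found in the text, or None."""
    match = READ_BUFFER_RE.search(qping_text)
    return int(match.group(1)) if match else None


def growth_per_second(samples):
    """Average growth in messages per second between first and last sample.

    samples is a list of (seconds, qping_text) pairs in time order.
    Returns None if fewer than two usable samples are available.
    """
    points = [(t, read_buffer_count(text)) for t, text in samples]
    points = [(t, n) for t, n in points if n is not None]
    if len(points) < 2 or points[-1][0] == points[0][0]:
        return None
    (t0, n0), (t1, n1) = points[0], points[-1]
    return (n1 - n0) / (t1 - t0)


# Made-up captures taken 60 seconds apart (illustrative values only).
samples = [
    (0, "messages in read buffer:  120\nmessages in write buffer: 0\n"),
    (60, "messages in read buffer:  720\nmessages in write buffer: 0\n"),
]

print(growth_per_second(samples))  # 10.0
```

Comparing this rate across runs of different sizes would show how the
backlog scales with the number of tasks.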

Thanks,

Todd Heywood

