[GE users] failed receiving gdi request

Kevin Doman kdoman07 at gmail.com
Wed Apr 2 19:38:46 BST 2008


I had the exact same problem last week. In my case, a user was submitting
jobs that executed a binary inside his home directory. My quick way out of
this was to create a local /binary directory on each exec host and copy the
user's binary file over to those local directories.  ... I don't know what's
in that file, but it fixed my problem.
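
A minimal sketch of that copy step, assuming passwordless ssh/scp from an
admin host to every exec host. The host names and the source path below are
placeholders, not the real ones, and the function only builds the commands
so they can be reviewed before anything is run:

```python
import shlex


def staging_commands(hosts, source, local_dir="/binary"):
    """Build the ssh/scp commands that stage `source` into `local_dir` on each host.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    commands = []
    for host in hosts:
        # Create the local directory on the exec host (no error if it already exists).
        commands.append(["ssh", host, "mkdir", "-p", local_dir])
        # Copy the binary, preserving its mode and timestamps.
        commands.append(["scp", "-p", source, f"{host}:{local_dir}/"])
    return commands


if __name__ == "__main__":
    # Hypothetical exec hosts and binary path, for illustration only.
    hosts = ["blade260", "blade261"]
    for cmd in staging_commands(hosts, "/home/someuser/bin/app"):
        # To actually run a command: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

The job script then has to call the binary from the local directory rather
than from the home directory.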

K.


On Wed, Apr 2, 2008 at 12:46 PM, Heywood, Todd <heywood at cshl.edu> wrote:

> We are running an application which uses parallel make ("-pe make" and
> qmake). It has been running fine for smallish, 50-100 task runs. Recently
> we have been testing better file servers, which allow (1) scaling up to
> 100-400 tasks per application run, and (2) running 2-4 application runs
> simultaneously.
>
> Now, I'm seeing "Error: failed receiving gdi request" errors from qsub,
> qstat, qconf, etc. Details...
>
> 1. SGE 6.1, all spooling is local (not NFS).
>
> 2. When the GDI problem occurs, "top" shows the qmaster node CPU cranking
> at 100%, mostly user time.
>
> 3. qping shows "messages in read buffer" growing steadily. When I tried
> large scale runs, the growth rate was such that qmaster was increasing its
> memory usage really fast.
>
> 4. The only relevant thing I see in the qmaster messages file (for
> example) is:
>
> 03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 seconds for event client (schedd:1) on host "bhmnode2"
>
>
> 5. If you just wait for a couple of hours, the load on SGE comes back
> down, *after* the application(s) finish. The applications finish with the
> expected output. *However*, the jobs remain on the nodes, and qstat shows
> them still in state "r". I have to force their qdel with "-f" to get rid
> of them. Then I see this (for example) in messages:
>
> 04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running state for job (5374165.1/1766.blade260) in queue "public.q@blade260" while job is in state 65536
>
> 6. The local runtime directories under /tmp on local disks are left (not
> deleted).
>
>
> I did search the archive and did see something related in August 2006. But
> there appeared to be no resolution.
>
> I'd appreciate any idea or help. I hope that this is not an SGE limitation
> which would prevent us from using SGE. We do need to scale up our
> applications as we scale up our file serving capability.
>
> What are all the messages being sent to qmaster that fill up its read
> buffer?
>
> Thanks,
>
> Todd Heywood
>
>
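
The two log lines quoted above share the pipe-separated layout
"date time|daemon|host|level|message". A rough sketch for counting the "E"
entries per hour, assuming every line in the qmaster messages file follows
that layout (lines that do not are skipped). The function names are made up
for illustration and the sample lines are the ones from this thread:

```python
from collections import Counter
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    when: datetime
    daemon: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one messages-file line into its five fields, or return None."""
    # Split on the first four pipes only, so pipes inside the message survive.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, text = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return LogEntry(when, daemon, host, level, text)


def count_errors_per_hour(lines) -> Counter:
    """Count entries with level "E", bucketed by the hour they were logged."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level == "E":
            counts[entry.when.strftime("%m/%d/%Y %H:00")] += 1
    return counts


SAMPLE = [
    '03/31/2008 23:45:52|qmaster|bhmnode2|E|acknowledge timeout after 600 '
    'seconds for event client (schedd:1) on host "bhmnode2"',
    '04/01/2008 19:22:53|qmaster|bhmnode2|E|execd blade260 reports running '
    'state for job (5374165.1/1766.blade260) in queue "public.q@blade260" '
    'while job is in state 65536',
]

if __name__ == "__main__":
    for hour, n in sorted(count_errors_per_hour(SAMPLE).items()):
        print(hour, n)
```

Run against the real messages file (pass the open file object instead of
SAMPLE), this shows whether the errors cluster around the large runs.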
