[GE users] Failed receiving gdi request
sean at duke.edu
Tue Aug 1 15:25:56 BST 2006
Thomas Neumann wrote:
> Hello!
> Several weeks ago I installed a new self-developed tool which submits
> about 65 jobs at once. Since this tool has been running, I have had
> trouble with the qmaster several times. The problem is as follows:
> For some time everything runs fine: the new tool, users and older tools
> submit jobs, and the jobs run and finish correctly. After some time
> (so far I have not found anything that determines the exact time
> or condition) the qmaster slows down extremely (qstat takes about 1
> minute instead of the normal 1 to 2 seconds). Shortly after the
> slowdown the whole system fails, and when running qstat I only receive
> the message 'failed receiving gdi request'.
> We are currently running:
> * SGE 6.0u7
> * Linux 2.6 on 32-bit and 64-bit x86 systems
> * sge_qmaster is running in a 32-bit environment
> Analysing the problem, I came across the following things:
> * Even after the message 'failed receiving gdi request', the qmaster is
> still reachable by a qping.
I've seen this error a fair bit. In my case, SGE uses classic spooling
on a directory that's shared over NFS (for shadow master failover).
Sometimes SGE fixates on certain jobs (especially parallel jobs) and
rewrites their spool information over and over. SGE ends up hanging
while it waits on these writes, becomes slow to respond, and gives
the 'failed receiving gdi request' error.
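
One rough way to check whether the shared spool is the bottleneck is to time
small synchronous writes into the spool directory from the qmaster host. The
Python sketch below is only an approximation: the function name
time_spool_writes and its parameters are made up for illustration, and it
does not reproduce how SGE itself writes its spool files.

```python
import os
import tempfile
import time


def time_spool_writes(directory, count=20, size=4096):
    """Time `count` small write+fsync cycles in `directory`.

    Returns a list of per-write durations in seconds. This is a
    hypothetical probe, not part of SGE.
    """
    payload = b"x" * size
    durations = []
    for _ in range(count):
        # Use a scratch file so no real spool file is touched.
        fd, path = tempfile.mkstemp(prefix="spooltest-", dir=directory)
        try:
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data out to the file server
            durations.append(time.perf_counter() - start)
        finally:
            os.close(fd)
            os.unlink(path)
    return durations
```

Calling it with the path of the spool directory while the qmaster is slow,
and again while it is healthy, and comparing max(durations) should show
whether writes to the NFS share are stalling.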
I also sometimes see it when I run a command like 'qmod -sj "*"'. In
that case, the suspend does go through, but I'm guessing it takes SGE
long enough to respond that qmod thinks it failed.
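
To tell a slow qmaster apart from a genuinely failing command, it can help to
time the client command and record whether it succeeded, failed, or never came
back. This is a minimal sketch in Python; timed_run and its default timeout
are my own placeholders, and it says nothing about what timeout qmod or qstat
use internally.

```python
import subprocess
import time


def timed_run(cmd, timeout=120):
    """Run `cmd` (a list of arguments) and report how long it took.

    Returns (status, seconds), where status is "ok", "error"
    (non-zero exit code) or "timeout" (no exit within `timeout` seconds).
    """
    start = time.perf_counter()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
        )
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        status = "timeout"
    return status, time.perf_counter() - start
```

For example, timed_run(["qstat"]) run every few minutes would show the
response time creeping up from a second or two toward a minute before the
errors start.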