[GE users] Failed receiving gdi request

Andreas.Haas at Sun.COM
Tue Aug 1 11:07:44 BST 2006

Hi Thomas,

On Tue, 1 Aug 2006, Thomas Neumann wrote:

> Hello!
> Several weeks ago I installed a new self-developed tool which submits about 
> 65 jobs at once. Since this tool has been running, I have already had trouble 
> with the qmaster several times. The problem is as follows:

How are these jobs submitted?

> For some time everything runs fine: the new tool, users and older tools 
> submit jobs, and the jobs run and finish correctly. After some time (so far 
> I have not found anything that determines the exact time and 
> condition) the qmaster slows down extremely (qstat takes about 1 minute 
> instead of the normal 1 to 2 seconds). Shortly after the slowdown the 
> whole system fails, and when running qstat I only receive the message 'failed 
> receiving gdi request'.
> We are currently running:
> * SGE 6.0u7
> * Linux-2.6 on 32Bit and 64Bit x86-Systems
> * sge_qmaster is running in a 32Bit environment
> Analysing the problem, I came across the following things:
> * Even after the message 'failed receiving gdi request', the qmaster is still 
> reachable via qping.
> * The 'messages in read buffer' value grows steadily. Once I knew that, I 
> installed a watch on the value and found out that the 'failed...' message 
> starts at approximately 2500 messages in read buffer; the rest of the qping 
> output looks quite normal:

That's interesting. Have you found an indication that qmaster is particularly 
busy (swapping/high cpu load)? What kind of qmaster spooling do you use? Can 
you observe any other suspicious behaviour such as execd logging in your 
cluster while message count grows?

In principle there are only two possibilities for what this can mean: either 
certain messages sent to qmaster are not being processed at all, or there is
a qmaster bottleneck that causes messages to accumulate.

... having a means to get the list of all messages in the qmaster read buffer 
would help a lot, but unfortunately we have no such means. Have you looked 
at the qping -dump option? Possibly it can be used to assess what kind of 
messages accumulate over time.
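
For reference, a watch on the 'messages in read buffer' value like the one 
described above could be sketched roughly as follows. This is only a minimal 
sketch, not a tested tool: the function names are made up for illustration, 
the invocation "qping -info <host> <port> qmaster 1" is an assumption about 
how qping is called at your site, the label is matched exactly as it is 
quoted in this thread, and the threshold of 2500 is simply the value 
reported above.

```python
import re
import subprocess
import time

# Label text as quoted in this thread; the exact spacing of real qping
# output is an assumption, so any amount of whitespace is accepted.
READ_BUFFER_RE = re.compile(
    r"^\s*messages in read buffer:\s*(\d+)\s*$", re.MULTILINE
)


def parse_read_buffer(qping_output):
    """Return the 'messages in read buffer' count, or None if not found."""
    match = READ_BUFFER_RE.search(qping_output)
    return int(match.group(1)) if match else None


def watch(host, port, threshold=2500, interval=60):
    """Poll qping forever and log the read buffer count with a timestamp.

    host/port are site specific; the qping argument order used here
    (host, port, component name, component id) is an assumption.
    """
    while True:
        result = subprocess.run(
            ["qping", "-info", host, str(port), "qmaster", "1"],
            capture_output=True,
            text=True,
        )
        count = parse_read_buffer(result.stdout)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        if count is None:
            print(f"{stamp} could not read value (exit {result.returncode})")
        elif count >= threshold:
            print(f"{stamp} WARNING: {count} messages in read buffer")
        else:
            print(f"{stamp} {count} messages in read buffer")
        time.sleep(interval)
```

Calling watch() with your qmaster host and port and keeping the timestamped 
output would make it easier to correlate the growth of the message count 
with job submissions or execd activity.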

