[GE users] Failed receiving gdi request

Thomas Neumann Thomas.Neumann at exasol.com
Tue Aug 1 09:28:58 BST 2006



Hello!

Several weeks ago I installed a new, self-developed tool which submits 
about 65 jobs at once. Since this tool has been running, I have had 
trouble with the qmaster several times. The problem is as follows:

For some time everything runs fine: the new tool, users and older tools 
submit jobs, and the jobs run and finish correctly. After some time 
(so far I have not found anything that determines the exact time 
and condition) the qmaster slows down extremely (qstat takes about 1 
minute instead of the normal 1 to 2 seconds). Shortly after the 
slowdown the whole system fails, and when running qstat I only receive 
the message 'failed receiving gdi request'.

We are currently running:
* SGE 6.0u7
* Linux 2.6 on 32-bit and 64-bit x86 systems
* sge_qmaster is running in a 32-bit environment

Analysing the problem, I came across the following things:
* Even after the message 'failed receiving gdi request', the qmaster is 
still reachable by qping.
* The 'messages in read buffer' value grows steadily. Once I knew that, 
I set up a watch on the value and found that the 'failed..' message 
starts at approximately 2500 messages in the read buffer. The rest of 
the qping output looks quite normal:

[qping output after the 'failed..' message]
SIRM version: 0.1
SIRM message id: 1
start time: 07/19/2006 11:36:21 (1153301781)
run time[s]: 1112506
messages in read buffer: 8225       <--- This value grows rapidly.
messages in write buffer: 0
number of connected clients: 77
status: 0
info: TET: R (0.15) | EDT: R (0.14) | SIGT: R (1112505.43) | MT(1): R 
(0.14) | MT(2): R (0.54) | OK
Monitor: disabled

* Finally, several times when the problem occurred (but not always) I 
noticed 'hanging' jobs: the programs running in these jobs sit in a 
read_nocancel call for a long time.
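
A watch on the read buffer value like the one mentioned above could be 
sketched roughly as follows. This is only an illustration, not the script 
actually used: the function names are made up, the input is assumed to be 
the qping info text in the 'key: value' layout shown above, and the limit 
of 2500 is simply the value at which the failures were observed here.

```python
import re

# Assumption: failures were observed at roughly 2500 queued messages.
READ_BUFFER_LIMIT = 2500


def parse_qping_info(text):
    """Split 'key: value' lines of qping info output into a dict.

    Lines without a colon are skipped. Wrapped continuation lines may
    produce meaningless extra keys, which is harmless for this purpose.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def read_buffer_count(text):
    """Return the 'messages in read buffer' value as an int, or None."""
    raw = parse_qping_info(text).get("messages in read buffer")
    if raw is None:
        return None
    # Only the leading digits count; trailing annotations are ignored.
    match = re.match(r"\d+", raw)
    return int(match.group()) if match else None


def over_limit(text, limit=READ_BUFFER_LIMIT):
    """True if the read buffer count is present and at or above the limit."""
    count = read_buffer_count(text)
    return count is not None and count >= limit
```

Calling over_limit() on each captured qping output at a fixed interval 
would give an early warning before qstat starts to fail.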

I took a core dump and an strace of the qmaster when the problem appeared 
some time ago. Looking into them, the strace shows several gettimeofday 
and futex calls, and the core dump shows some rwlock_init calls, which 
would match the strace.
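
To put numbers on "several", the calls in an strace log can be tallied 
per syscall name. The following is a minimal sketch, assuming the usual 
strace text layout (an optional pid prefix and timestamp, then 
name(args) = result); the function name is hypothetical.

```python
import re
from collections import Counter

# Optional "[pid N] " or "N " prefix, optional HH:MM:SS[.usec] timestamp,
# then the syscall name directly before the opening parenthesis.
SYSCALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?:\d{2}:\d{2}:\d{2}(?:\.\d+)?\s+)?"
    r"([a-z_][a-z0-9_]*)\("
)


def count_syscalls(lines):
    """Count how often each syscall name starts a line of strace output.

    Signal lines and '<... resumed>' continuation lines are not counted.
    """
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_syscalls() and looking at 
.most_common(10) on the result shows whether gettimeofday and futex 
really dominate the trace.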

Restarting the qmaster didn't bring the system back to a stable state; 
only a complete system restart (qmaster and all execds shut down, and 
all job spool dirs cleaned manually) was successful.

Does somebody know how exactly this situation is caused and how to 
prevent it?

Thanks,
   Thomas


P.S.:
If required, I can send the strace, the core dump, etc. (all together 
approximately 30MB) separately.

