[GE users] Failed receiving gdi request
Thomas.Neumann at exasol.com
Tue Aug 1 09:28:58 BST 2006
Several weeks ago I installed a new self-developed tool which submits
about 65 jobs at once. Since this tool has been running, I have had
trouble with the qmaster several times. The problem is as follows:
For some time everything runs fine: the new tool, users and older tools
submit jobs, and the jobs run and finish correctly. After some time
(so far I have not found anything that determines the exact time and
condition) the qmaster slows down extremely (qstat takes about 1
minute instead of the normal 1 to 2 seconds). A short time after the
slowdown the whole system fails, and when running qstat I only receive
the message 'failed receiving gdi request'.
We are currently running:
* SGE 6.0u7
* Linux 2.6 on 32-bit and 64-bit x86 systems
* sge_qmaster is running in a 32-bit environment
Analysing the problem, I came across the following things:
* Even after the message 'failed receiving gdi request' appears, the
qmaster is still reachable with qping.
* The 'messages in read buffer' value grows steadily. Once I knew that,
I set up a watch on the value and found that the 'failed receiving gdi
request' message starts at approximately 2500 messages in the read
buffer. The rest of the qping output looks quite normal:
[qping after 'failed..' - message]
SIRM version: 0.1
SIRM message id: 1
start time: 07/19/2006 11:36:21 (1153301781)
run time[s]: 1112506
messages in read buffer: 8225 <--- This value grows rapidly.
messages in write buffer: 0
number of connected clients: 77
info: TET: R (0.15) | EDT: R (0.14) | SIGT: R (1112505.43) | MT(1): R (0.14) | MT(2): R (0.54) | OK
* Finally, several times (but not always) when the problem occurred, I
noticed 'hanging' jobs: the programs running in these jobs sit in a
read_nocancel call for a long time.
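A minimal sketch of such a watch on the 'messages in read buffer' value,
in Python. It only parses qping output that has already been captured as
text; how that text is obtained is deliberately left out. The function
names and the use of 2500 as a hard limit are assumptions made for
illustration and are not part of SGE.

```python
import re

# Approximate point at which the 'failed receiving gdi request' message
# started in the observation above; an assumption, not an SGE constant.
READ_BUFFER_LIMIT = 2500

_READ_BUFFER = re.compile(r"messages in read buffer:\s*(\d+)")


def read_buffer_count(qping_text):
    """Return the 'messages in read buffer' value, or None if it is absent."""
    match = _READ_BUFFER.search(qping_text)
    return int(match.group(1)) if match else None


def is_overloaded(qping_text, limit=READ_BUFFER_LIMIT):
    """True if the read buffer count has reached the (assumed) limit."""
    count = read_buffer_count(qping_text)
    return count is not None and count >= limit


# Excerpt of the qping output quoted above.
sample = """\
SIRM version: 0.1
SIRM message id: 1
messages in read buffer: 8225
messages in write buffer: 0
number of connected clients: 77
"""

if __name__ == "__main__":
    print(read_buffer_count(sample), is_overloaded(sample))
```

Running such a check periodically and logging the count with a timestamp
would show how fast the read buffer fills before the qmaster stops
answering.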
I made a core dump and an strace of the qmaster when the problem
appeared some time ago. Looking into them, the strace contains several
gettimeofday and futex calls, and the core dump shows some rwlock_init
calls, which would match the strace.
Restarting the qmaster didn't bring the system back to a stable state;
only a complete system restart (qmaster and all execds shut down, and
all job spool directories cleaned manually) was successful.
Does anybody know exactly what causes this situation and how to
prevent it?
If required, I can send the strace, the core dump, etc. (all together
approximately 30MB) separately.