[GE users] Failed receiving gdi request

Stephan Grell - Sun Germany - SSG - Software Engineer Stephan.Grell at Sun.COM
Tue Aug 1 12:20:10 BST 2006


Hi Thomas,

Since 6.0u7, the qmaster has a monitoring switch. It dumps out
information on what the threads are doing.

You can find information about it in the blog:

http://blogs.sun.com/roller/page/sgrell?entry=monitoring_the_qmaster


Once you have the output, you know what the qmaster is busy with.

Cheers,
Stephan

Thomas Neumann wrote:

> Hello !
>
> Several weeks ago I installed a new self-developed tool that submits 
> about 65 jobs at once. Since this tool has been running, I have had 
> trouble with the qmaster several times. The problem is as follows:
>
> For some time everything runs fine: the new tool, users, and older 
> tools submit jobs, and the jobs run and finish correctly. After 
> some time (so far I have not found anything that determines the 
> exact time and conditions) the qmaster slows down extremely (qstat 
> takes about 1 minute instead of the normal 1 to 2 seconds). Shortly 
> after the slowdown the whole system fails, and running qstat only 
> returns the message 'failed receiving gdi request'.
>
> We are currently running:
> * SGE 6.0u7
> * Linux-2.6 on 32Bit and 64Bit x86-Systems
> * sge_qmaster is running in a 32Bit environment
>
> Analysing the problem, I came across the following things:
> * Even after the message 'failed receiving gdi request', the qmaster 
> is still reachable by a qping.
> * The 'messages in read buffer' value grows steadily. Once I knew 
> that, I set up a watch on the value and found that the 'failed..' 
> message starts at approximately 2500 messages in read buffer. The 
> rest of the qping output looks quite normal:
>
> [qping after 'failed..' - message]
> SIRM version: 0.1
> SIRM message id: 1
> start time: 07/19/2006 11:36:21 (1153301781)
> run time[s]: 1112506
> messages in read buffer: 8225       <--- This value grows rapidly.
> messages in write buffer: 0
> number of connected clients: 77
> status: 0
> info: TET: R (0.15) | EDT: R (0.14) | SIGT: R (1112505.43) | MT(1): R (0.14) | MT(2): R (0.54) | OK
> Monitor: disabled
>
> * Finally, several times when the problem occurred (but not always), I 
> noticed 'hanging' jobs: the programs running in these jobs sat in a 
> read_nocancel call for a long time.
>
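
The watch on 'messages in read buffer' described in the quoted message could be sketched roughly as follows. This is only an illustration, not the script Thomas used: the function names and the sample text are hypothetical, the 2500 limit is simply the value reported above, and the code only parses qping output that has already been captured as text (how qping is invoked is left out on purpose).

```python
import re

# Limit taken from the report above: the 'failed..' message started
# at roughly 2500 messages in the read buffer.
READ_BUFFER_LIMIT = 2500


def read_buffer_count(qping_output):
    """Return the 'messages in read buffer' value from qping text, or None."""
    match = re.search(r"messages in read buffer:\s*(\d+)", qping_output)
    return int(match.group(1)) if match else None


def is_overloaded(qping_output, limit=READ_BUFFER_LIMIT):
    """True if the read buffer count was found and has reached the limit."""
    count = read_buffer_count(qping_output)
    return count is not None and count >= limit


# Hypothetical sample, shortened from the qping output quoted above.
sample = """\
SIRM version: 0.1
messages in read buffer: 8225
messages in write buffer: 0
number of connected clients: 77
"""

print(read_buffer_count(sample), is_overloaded(sample))
```

Run periodically against fresh qping output, a check like this would flag the growing backlog before qstat starts failing.
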
> I took a core dump and an strace of the qmaster when the problem 
> appeared some time ago. The strace contains several gettimeofday 
> and futex calls, and the core dump shows some rwlock_init calls 
> that would match the strace.
>
> Restarting the qmaster didn't bring the system back to a stable state. 
> Only a complete system restart (shutting down the qmaster and all 
> execds and manually cleaning all job spool dirs) was successful.
>
> Does anybody know exactly what causes this situation and how to 
> prevent it?
>
> Thanks,
>   Thomas
>
>
> P.S.:
> If required, I can send the strace, the core dump, etc. (all together 
> approximately 30MB) separately.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
