[GE users] Failed receiving gdi request

Andreas.Haas at Sun.COM Andreas.Haas at Sun.COM
Tue Aug 1 13:55:55 BST 2006


On Tue, 1 Aug 2006, Thomas Neumann wrote:

> Hello !
>
> Thanks for your answers. Since the system was restarted this morning, it runs 
> stable at the moment. When the problem occurs next time, I will try to get 
> the relevant info.
>
> Here is the info I can give you now:
> * My new script submits 65 jobs from shell doing a qsub for each job without 
> any delay in between.
> * The qmaster and all nodes in the cluster spool to local directories
> * While the system runs stable, the messages in read buffer never exceeded 150,
> even when there were about 200 jobs running. (My check script triggers an alarm
> when the messages in read buffer exceed 200.) Normally the messages in read
> buffer are close to 0.

Ok.

>
> Unfortunately, I haven't got any data concerning the qmaster host at the time of
> failure; I will collect it the next time. I didn't notice any unusual
> behaviour of nodes in the cluster, but I will have a closer look there the
> next time, too.

Good. If at all possible, please enhance your check script so that it
gives you a comprehensive overview of the load situation on your master
machine, e.g. by running top(1) in batch mode:

  # top -d2

Besides that, it would be interesting to monitor the number of jobs in your
Grid Engine cluster. To avoid influencing qmaster with additional qstat
requests, you could do this via the file system:

  # ls -altR $SGE_ROOT/default/spool/qmaster/jobs

or a corresponding db_stat(?) call in case you use BDB spooling.
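
A minimal sketch of such a check, assuming classic (file-based) spooling
under the path above. The function names are hypothetical, and counting
regular files is only an approximation of the job count. It samples the load
with uptime(1) instead of top(1), because top's batch-mode options differ
between implementations:

```shell
#!/bin/sh
# Hypothetical helpers for a check script; they only define functions.
# Assumption: classic spooling, job files below .../spool/qmaster/jobs.
JOBS_DIR="${JOBS_DIR:-${SGE_ROOT:-/nonexistent}/default/spool/qmaster/jobs}"

# Count regular files below the job spool directory without asking qmaster.
# Prints 0 if the directory does not exist.
count_spooled_jobs() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    # tr strips the padding some wc implementations add
    find "$dir" -type f | wc -l | tr -d ' '
}

# Print one timestamped sample: spooled job file count and system load.
log_sample() {
    printf '%s jobs=%s load: %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$(count_spooled_jobs "$1")" \
        "$(uptime 2>/dev/null || echo unknown)"
}
```

Calling log_sample "$JOBS_DIR" once a minute from the check script and
appending the output to a log file should be enough to correlate the job
count and load with the next failure.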

Regards,
Andreas
