[GE users] node(s) temporarily unavailable

Bill Knebel billk at metrumrg.com
Tue Mar 14 21:40:58 GMT 2006


I get the following error in the qmaster "messages" file upon submitting 
jobs when the cluster has been idle for a period of time.

qmaster|headnode|E|got max. unheard timeout for target "execd" on host 
"node15", can't delivering job "25434"

The same message is repeated for all nodes.  Eventually, the jobs move 
from the queue onto the nodes, but it takes some time. A "qstat -f" 
shortly after the jobs are submitted shows many nodes listed 
with a load average of NA and a state of "au". Eventually, all of the 
nodes come back and are available without any restart of SGE.
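
To watch which queue instances are affected, the "qstat -f" check can 
be scripted. What follows is only a rough sketch, not something taken 
from SGE itself: it assumes each queue-instance line has six 
whitespace-separated columns (queuename, qtype, used/tot., load_avg, 
arch, states) and that the states column is simply missing for healthy 
queues. The function names are made up for illustration.

```python
import subprocess


def unreachable_queues(qstat_output):
    """Return queue instances whose states column contains both 'a' and 'u'.

    Assumes the column layout described above; adjust the parsing if the
    local "qstat -f" output differs.
    """
    flagged = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Keep only queue-instance lines (name@host) that carry a states
        # column; header, separator, and job lines are skipped.
        if len(fields) != 6 or "@" not in fields[0]:
            continue
        states = fields[5]
        if "a" in states and "u" in states:
            flagged.append(fields[0])
    return flagged


def run_qstat():
    """Run "qstat -f" on the submit host and return its text output."""
    result = subprocess.run(
        ["qstat", "-f"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use on a host where qstat is installed:
#     for name in unreachable_queues(run_qstat()):
#         print(name)
```

Running this every few seconds right after a submit would show how 
long each node stays in the "au" state.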

Any suggestions as to why this problem is occurring?

Bill

-- 
Bill Knebel, PharmD, Ph.D.
Principal Scientist
Metrum Research Group
2 Tunxis Road
Suite 112
Tariffville, CT 06081
email: billk at metrumrg.com
tel: (860) 930-1370
