[GE users] node(s) temporarily unavailable

Bill Knebel billk at metrumrg.com
Thu Mar 16 22:37:22 GMT 2006


mac,

Thanks for the info.  The values seem to be fine in my queue: 
load_report_time is 40 seconds and max_unheard is 5 min.  This problem 
only seems to occur when submitting jobs.  We initially had our cluster 
set up with /etc/hosts mirrored across all nodes, including the headnode 
(qmaster).  We recently changed to DNS resolution rather than keeping 
/etc/hosts identical on every node.  We have also moved to IPv6 
addresses rather than the typical 192.168.x.x numbering.  I do not know 
enough about DNS and IPv6 to know whether this may be causing the problems.
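
For anyone who wants to script the two checks discussed in this thread,
here is a minimal Python sketch. It is not part of Grid Engine, and all
function names are hypothetical. It assumes that "qconf -sconf" prints
load_report_time and max_unheard as HH:MM:SS values, and that a node
counts as consistently resolved when every address it resolves to maps
back to the same short host name. Both assumptions should be checked
against your own installation.

```python
import socket


def to_seconds(hms):
    """Convert an HH:MM:SS string (assumed qconf time format) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def timing_ok(sconf_text):
    """True if load_report_time is shorter than max_unheard in the text."""
    values = {}
    for line in sconf_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] in ("load_report_time", "max_unheard"):
            values[fields[0]] = to_seconds(fields[1].strip())
    # Raises KeyError if either parameter is missing from the text.
    return values["load_report_time"] < values["max_unheard"]


def system_forward(name):
    """Resolve a name to a sorted list of address strings (IPv4 and IPv6)."""
    return sorted({info[4][0] for info in socket.getaddrinfo(name, None)})


def system_reverse(address):
    """Resolve an address back to a name; raises OSError if none exists."""
    return socket.getnameinfo((address, 0), socket.NI_NAMEREQD)[0]


def check_host(name, forward=system_forward, reverse=system_reverse):
    """Return (addresses, reverse_names, consistent) for one host name."""
    try:
        addresses = forward(name)
    except OSError:
        addresses = []  # forward lookup failed
    names = []
    for address in addresses:
        try:
            names.append(reverse(address))
        except OSError:
            names.append(None)  # no reverse record for this address
    # Assumption: compare only the short (unqualified) host names.
    short = name.split(".")[0].lower()
    consistent = bool(addresses) and all(
        n is not None and n.split(".")[0].lower() == short for n in names
    )
    return addresses, names, consistent
```

To use it, pass the captured output of "qconf -sconf" to timing_ok(),
and call check_host("node15") on the headnode and on each execution
host. The forward and reverse arguments exist only so the logic can be
exercised without a live resolver.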

Any ideas anybody?

Bill


McCalla, Mac wrote:

>hi Bill,
>
>See "qconf -sconf" output.  Also, if you haven't done this already, the
>man pages are quite good,
>e.g. "man sge_conf" will get you lots of description of the
>configuration parameters.
>
>mac 
>
>-----Original Message-----
>From: Bill Knebel [mailto:billk at metrumrg.com] 
>Sent: Wednesday, March 15, 2006 7:41 AM
>To: users at gridengine.sunsource.net
>Subject: Re: [GE users] node(s) temporarily unavailable
>
>Can you point me in the direction of where to find those parameters?
>
>Bill
>
>McCalla, Mac wrote:
>
>>Hi Bill,
>>
>>You might check configuration parameters to make sure that
>>load_report_time hasn't been set
>>higher than max_unheard for some reason.
>>
>>Mac McCalla 
>>
>>-----Original Message-----
>>From: Bill Knebel [mailto:billk at metrumrg.com] 
>>Sent: Tuesday, March 14, 2006 3:41 PM
>>To: users at gridengine.sunsource.net
>>Subject: [GE users] node(s) temporarily unavailable
>>
>>I get the following error in the qmaster "messages" file upon submitting
>>jobs when the cluster has been idle for a period of time.
>>
>>qmaster|headnode|E|got max. unheard timeout for target "execd" on host "node15", can't delivering job "25434"
>>
>>The same message is repeated for all nodes.  Eventually, the jobs move 
>>from the queue onto the nodes, but it does take some time. A "qstat -f" 
>>shortly after the jobs are submitted results in many nodes being listed
>>with a load average of NA and a state of "au". Eventually, all of the 
>>nodes come back and are available without any restart of sge.
>>
>>Any suggestions as to why this problem is occurring?
>>
>>Bill

-- 
Bill Knebel, PharmD, Ph.D.
Principal Scientist
Metrum Research Group
2 Tunxis Road
Suite 112
Tariffville, CT 06081
email: billk at metrumrg.com
tel: (860) 930-1370
