[GE users] sge v6.0u3 update from v6.0u1

Steve slitster at rcn.com
Tue Mar 22 00:57:07 GMT 2005


Probably not much help, but I had something very similar with SGE 6.0 u3:

The CPU on the qmaster was running at 50%, although there was nothing in
the queue. I attempted to fail over the master to a shadow node, which
failed miserably with "could not bind" errors, and SGE just collapsed on
the whole cluster. I then tried to restart the qmaster a number of times
but got "could not bind" errors again.
Eventually, the qmaster came up, but all the queues I'd configured were
gone (even though I could see them in the qinstances file). I had to
rebuild the queues, and then everything came back.
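
For anyone who wants to check whether a "could not bind" error is really a
port conflict, here is a minimal sketch in Python (my own illustration, not
part of SGE). It assumes the qmaster port is 536, as in the quoted message
below, and the helper name can_bind is made up. It does not set SO_REUSEADDR,
so a connection lingering in TIME_WAIT may also show up as a failure, and
binding a port below 1024 needs root.

```python
import errno
import socket


def can_bind(port, host=""):
    """Try to bind a TCP socket to (host, port).

    Returns (True, None) on success, or (False, reason) where reason is
    the symbolic errno name, e.g. "EADDRINUSE" or "EACCES".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True, None
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc))
    finally:
        # Always release the socket so the check itself does not hold the port.
        sock.close()


if __name__ == "__main__":
    # 536 is the qmaster port from the quoted error message (an assumption
    # for any other installation; adjust to your sge_qmaster setting).
    ok, reason = can_bind(536)
    if ok:
        print("port 536 can be bound")
    else:
        print("port 536 cannot be bound: %s" % reason)
```

If this reports EADDRINUSE while no qmaster is running, something else is
holding the port; EACCES just means the check was not run as root.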

The qmaster ran fine for a week, but now the CPU is back up to 50% with
nothing in the queue. I'm just waiting for the running jobs to complete
before attempting a failover again.

Steve



Viktor Oudovenko wrote:

>Linux Suse 7.3 with new kernel:
>
>rupc-cs04b:/opt/SGE/default/common # uname -a
>Linux rupc-cs04b 2.4.28 #9 SMP Wed Dec 8 14:52:03 EST 2004 i686 unknown
>
>It is not. I just made a fresh installation of 6.0u4 and it worked, but the
>previous one, which I want to keep because all the hosts and all the settings
>are defined there, does not want to start.
>
>You know, the key word here is crash. Something was written somewhere so that
>qmaster does not want to start. It is not a problem of busy ports; the
>problem is that the master does not start!
>Any help and ideas are welcome! I am really running out of time.
>Best,
>v
>
>
>  
>
>>-----Original Message-----
>>From: Ovid Jacob [mailto:ovid.jacob at sun.com] 
>>Sent: Monday, March 21, 2005 17:20
>>To: users at gridengine.sunsource.net
>>Cc: Ovid.Jacob at sun.com
>>Subject: Re: [GE users] sge v6.0u3 update from v6.0u1
>>
>>
>>Viktor,
>>
>>What OS are you running?
>>
>>Check that port 536 is not used by some other process:
>>
>>grep 536 /etc/services
>>
>>If you get a non-empty string, try changing the ports to 
>>something like
>>
>>sge_qmaster 836/tcp #SGE_PORT
>>sge_execd 837/tcp #SGE_PORT
>>
>>
>>Viktor Oudovenko wrote:
>>    
>>
>>>Hi, guys,
>>>
>>>Did anybody meet this problem:
>>>
>>>rupc-cs04b:/opt/SGE/default/spool/qmaster # /etc/init.d/sgemaster start
>>>   starting sge_qmaster
>>>   starting sge_schedd
>>>error: commlib error: got read error (closing connection)
>>>error: commlib error: can't connect to service (socket error errno=111)
>>>error: getting configuration: unable to contact qmaster using port 536 on host "rupc-cs04b"
>>>can't get configuration from qmaster -- waiting ...
>>>error: can't connect to service
>>>can't get configuration from qmaster -- waiting ...
>>>error: can't connect to service
>>>can't get configuration from qmaster -- waiting ...
>>>error: can't connect to service
>>>error: can't get configuration from qmaster -- backgrounding
>>>
>>>
>>>After the server crash I could not start SGE 6.0u1; qmaster did not want to
>>>start. I have upgraded 6.0u1 to 6.0u3 and got the messages above.
>>>
>>>
>>>In qmaster messages I have:
>>>
>>>
>>>rupc-cs04b:/opt/SGE/default/spool/qmaster # more messages
>>>03/21/2005 15:56:47|qmaster|rupc-cs04b|E|wrong cull version, read 0x00000000, but expected actual version 0x10020000
>>>03/21/2005 15:56:47|qmaster|rupc-cs04b|E|error in init_packbuffer: wrong cull version
>>>rupc-cs04b:/opt/SGE/default/spool/qmaster #
>>>
>>>
>>>Any ideas how to fix this? It is VERY urgent! Please help! Thank you
>>>anybody for attention and help!
>>>
>>>Best,
>>>viktor
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>>For additional commands, e-mail: users-help at gridengine.sunsource.net
>>>
>>-- 
>>
>>
>>take care,
>>ovid
>>
>>----------------------------------------------------------------
>>	         "Your Windows system is my other computer."
>>                            Grid Engineering
>>
>>http://namefinder.sfbay.sun.com/NameFinder?view=sunEmployees&nfquery=ovid+jacob
>>                          http://tent.sfbay:88/
>>                          http://www.mishkan.com
>>                          ovid.jacob at sun.com
>>                          x84774 (650.786.4774)
>>-----------------------------------------------------------------
>



