[GE users] qmaster not starting

reuti reuti at staff.uni-marburg.de
Tue Feb 10 22:30:54 GMT 2009


On 10.02.2009 at 14:46, sangamesh wrote:

> On Tue, Feb 10, 2009 at 3:49 PM, reuti <reuti at staff.uni-marburg.de> wrote:
>> Hi,
>>
>> On 10.02.2009 at 06:23, sangamesh wrote:
>>
>>> The cluster is running Rocks 5 and SGE 6.2. SGE was upgraded to 6.2
>>> a long time ago and had been working fine.
>>>
>>> Now sge_qmaster is failing to start, and it is not reporting any
>>> errors either:
>>>
>>> # /etc/init.d/sgemaster.locuz62 start
>>> # ps -ef | grep sge
>>> root      5905  3229  0 10:44 pts/1    00:00:00 grep sge
>>>
>>> The qmaster messages contain the following:
>>>
>>> # vi /opt/n1ge62/default62/spool/qmaster/messages
>>>
>>> 01/31/2009 18:19:06|  main|locuzcluster|E|jvm thread is not running
>>> 02/01/2009 17:35:56|  main|locuzcluster|I|read job database with 0 entries in 0 seconds
>>> 02/01/2009 17:35:56|  main|locuzcluster|E|error opening file "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster hard descriptor limit is set to 8192
>>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster soft descriptor limit is set to 8192
>>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will use max. 8172 file descriptors for communication
>>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will accept max. 99 dynamic event clients
>>> 02/01/2009 17:35:56|  main|locuzcluster|I|starting up SGE 6.2 (lx24-amd64)
>>> ....
>>> .....
>>> 02/01/2009 20:38:27|  main|locuzcluster|E|jvm thread is not running
>>> 02/02/2009 09:25:25|  main|locuzcluster|I|read job database with 0 entries in 0 seconds
>>> 02/02/2009 09:25:25|  main|locuzcluster|E|error opening file "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such file or directory
>>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster hard descriptor limit is set to 8192
>>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster soft descriptor limit is set to 8192
>>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will use max. 8172 file descriptors for communication
>>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will accept max. 99 dynamic event clients
>>> 02/02/2009 09:25:25|  main|locuzcluster|I|starting up SGE 6.2 (lx24-amd64)
>>> 02/02/2009 09:26:06|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
>>> 02/02/2009 09:26:46|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
>>> ..
>>> 02/02/2009 12:26:56|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
>>> 02/02/2009 12:27:36|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
>>> 02/02/2009 15:30:46|  main|locuzcluster|E|jvm thread is not running
>>>
>>> I think the hostname of the master has not changed either:
>>> # cat /opt/n1ge62/default62/common/act_qmaster
>>> locuzcluster
>>>
>>> 127.0.0.1       localhost.localdomain   localhost
>>> 172.16.99.1     locuzcluster.local locuzcluster # originally frontend-0-0
>>> 172.16.99.254   compute-0-0.local compute-0-0 c0-0
>>> 10.129.150.45   locuzcluster.org
>>>
>>> SGE is not reporting any errors, so I don't understand why it is
>>> unable to start.
>>> Can someone tell me what the problem could be?
>>
>> Is there any file in /tmp with a panic message from the qmaster?
>>
> The content of /tmp:
>
> # ls /tmp/
> cpu_temp_fan_speed.last  gconfd-root         post-99-done.debug
> dstate                   hostqueue370        pre-09-prep-kernel-source.debug
> ekopath_crash_rQe098     hsperfdata_root     pre-10-src-install.debug
> execd_messages.2490      iptest              screens
> execd_messages.2491      keyring-JrHxhR      ssh-NfUDH29822
> execd_messages.2511      mapping-root        virtual-root.uOdIAm
> execd_messages.2589      mpd2.logfile_root
> execd_messages.2594      orbit-root          post-50-insert-pci.debug
> expect.log               post-50-news.debug
>
>
> # cat /tmp/execd_messages.2594
> 02/02/2009 15:33:49|  main|locuzcluster|E|can't connect to service
> 02/02/2009 15:33:49|  main|locuzcluster|E|can't get configuration from qmaster -- backgrounding
> 02/02/2009 15:33:51|  main|locuzcluster|E|commlib error: can't connect to service (Connection refused)
> 02/02/2009 15:34:54|  main|locuzcluster|E|getting configuration: unable to contact qmaster using port 538 on host "locuzcluster"
>
> # cat /tmp/hostqueue370
> group_name  @allhosts
> hostlist    NONE
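
The execd log above says it cannot reach the qmaster on port 538 of host "locuzcluster", so one thing worth ruling out is a mismatch between the name stored in act_qmaster and the hosts table. Below is a minimal shell sketch of that comparison. The function names read_act_qmaster and lookup_in_hosts are made up for illustration, and the sketch only inspects a hosts-format file; it does not query DNS or NIS and does not test whether anything is listening on the port.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SGE) for comparing act_qmaster
# against a hosts-format file.

# Print the host name stored on the first line of an act_qmaster file,
# with surrounding whitespace removed.
read_act_qmaster() {
    # $1 = path to act_qmaster
    head -n 1 "$1" | tr -d '[:space:]'
}

# Print every address that a hosts-format file assigns to a name.
# Comments starting with "#" are ignored; names are matched exactly,
# so "locuzcluster" does not match "locuzcluster.org".
lookup_in_hosts() {
    # $1 = hosts file, $2 = host name to look up
    awk -v name="$2" '
        { sub(/#.*/, "") }    # strip trailing comments, fields are re-split
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "$1"
}
```

For example, assuming the host entries quoted earlier come from /etc/hosts, running lookup_in_hosts /etc/hosts "$(read_act_qmaster /opt/n1ge62/default62/common/act_qmaster)" should print exactly one address. No output, or more than one address, would suggest the name is not mapped cleanly.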

Right, nothing there. Did you change any RQS (resource quota sets) before this happened?

-- Reuti
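
Since the messages file mixes informational and error entries, it can help to look at the error-level lines on their own. The following is a rough sketch that assumes the "|"-separated layout seen in the quoted log (timestamp, thread, host, level, text) and one entry per line in the real file; the function name errors_only is an illustrative choice, not an SGE tool.

```shell
#!/bin/sh
# Hypothetical helper: print only the error-level ("E") entries of an
# SGE messages file. Assumes fields are separated by "|" and that the
# fourth field holds the level, as in the log quoted in this thread.
errors_only() {
    # $1 = path to a messages file
    awk -F'|' '$4 == "E"' "$1"
}
```

A possible way to use it is errors_only /opt/n1ge62/default62/spool/qmaster/messages | cut -d'|' -f5- | sort | uniq -c | sort -rn, which groups identical error texts and shows the most frequent ones first.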


> I don't think this is a panic message from gridengine.
>
> Thanks,
> Sangamesh
>> -- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103279

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


