[GE users] qmaster not starting

sangamesh forum.san at gmail.com
Tue Feb 10 13:46:05 GMT 2009


On Tue, Feb 10, 2009 at 3:49 PM, reuti <reuti at staff.uni-marburg.de> wrote:
> Hi,
>
> Am 10.02.2009 um 06:23 schrieb sangamesh:
>
>>      The cluster is running Rocks 5 and SGE 6.2. SGE was upgraded
>> to 6.2 a long time ago and had been working fine.
>>
>> Now sge_qmaster fails to start, and it does not report any errors
>> either:
>>
>> # /etc/init.d/sgemaster.locuz62 start
>> # ps -ef | grep sge
>> root      5905  3229  0 10:44 pts/1    00:00:00 grep sge
>>
>> The qmaster messages file contains the following:
>>
>> # vi /opt/n1ge62/default62/spool/qmaster/messages
>>
>> 01/31/2009 18:19:06|  main|locuzcluster|E|jvm thread is not running
>> 02/01/2009 17:35:56|  main|locuzcluster|I|read job database with 0
>> entries in 0 seconds
>> 02/01/2009 17:35:56|  main|locuzcluster|E|error opening file
>> "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such
>> file or directory
>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster hard descriptor
>> limit is set to 8192
>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster soft descriptor
>> limit is set to 8192
>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will use max. 8172
>> file descriptors for communication
>> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will accept max. 99
>> dynamic event clients
>> 02/01/2009 17:35:56|  main|locuzcluster|I|starting up SGE 6.2 (lx24-
>> amd64)
>> ....
>> .....
>> 02/01/2009 20:38:27|  main|locuzcluster|E|jvm thread is not running
>> 02/02/2009 09:25:25|  main|locuzcluster|I|read job database with 0
>> entries in 0 seconds
>> 02/02/2009 09:25:25|  main|locuzcluster|E|error opening file
>> "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such
>> file or directory
>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster hard descriptor
>> limit is set to 8192
>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster soft descriptor
>> limit is set to 8192
>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will use max. 8172
>> file descriptors for communication
>> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will accept max. 99
>> dynamic event clients
>> 02/02/2009 09:25:25|  main|locuzcluster|I|starting up SGE 6.2 (lx24-
>> amd64)
>> 02/02/2009 09:26:06|worker|locuzcluster|E|no execd known on host
>> locuzcluster.org to send conf notification
>> 02/02/2009 09:26:46|worker|locuzcluster|E|no execd known on host
>> locuzcluster.org to send conf notification
>> ..
>> 02/02/2009 12:26:56|worker|locuzcluster|E|no execd known on host
>> locuzcluster.org to send conf notification
>> 02/02/2009 12:27:36|worker|locuzcluster|E|no execd known on host
>> locuzcluster.org to send conf notification
>> 02/02/2009 15:30:46|  main|locuzcluster|E|jvm thread is not running
>>
>> I believe the hostname of the master has not changed either:
>> # cat /opt/n1ge62/default62/common/act_qmaster
>> locuzcluster
>>
>> 127.0.0.1       localhost.localdomain   localhost
>> 172.16.99.1     locuzcluster.local locuzcluster # originally frontend-0-0
>> 172.16.99.254   compute-0-0.local compute-0-0 c0-0
>> 10.129.150.45   locuzcluster.org
>>
>> SGE is not reporting any errors, so I don't understand why it is
>> unable to start.
>> Can someone tell me what the problem could be?
>
> Is there any file in /tmp with a panic message from the qmaster?
>
The contents of /tmp:

# ls /tmp/
cpu_temp_fan_speed.last  gconfd-root        post-99-done.debug
dstate                   hostqueue370       pre-09-prep-kernel-source.debug
ekopath_crash_rQe098     hsperfdata_root    pre-10-src-install.debug
execd_messages.2490      iptest             screens
execd_messages.2491      keyring-JrHxhR     ssh-NfUDH29822
execd_messages.2511      mapping-root       virtual-root.uOdIAm
execd_messages.2589      mpd2.logfile_root
execd_messages.2594      orbit-root         post-50-insert-pci.debug
expect.log                                  post-50-news.debug


# cat /tmp/execd_messages.2594
02/02/2009 15:33:49|  main|locuzcluster|E|can't connect to service
02/02/2009 15:33:49|  main|locuzcluster|E|can't get configuration from
qmaster -- backgrounding
02/02/2009 15:33:51|  main|locuzcluster|E|commlib error: can't connect
to service (Connection refused)
02/02/2009 15:34:54|  main|locuzcluster|E|getting configuration:
unable to contact qmaster using port 538 on host "locuzcluster"
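
The last line suggests checking whether anything accepts connections on that port at all. A minimal sketch of such a check, assuming bash and the coreutils `timeout` command are available (`port_open` is only an illustrative name, not part of SGE):

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection; "port_open" is a hypothetical helper.
port_open() {
    local host=$1 port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example, with the host and port named in the execd log:
#   port_open locuzcluster 538 && echo "reachable" || echo "not reachable"
```

A "not reachable" result right after running the init script would be consistent with sge_qmaster exiting during startup rather than with a network problem.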

# cat /tmp/hostqueue370
group_name  @allhosts
hostlist    NONE

I don't think any of these is a panic message from Grid Engine.
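
Since the messages files mix informational and error entries, a small filter makes the error lines easier to spot. This is only a sketch: it assumes each entry sits on a single line in the file itself (the wrapping above presumably comes from the mail client) and that the fourth `|`-separated field is the level; `sge_errors` is a made-up helper name.

```shell
# Print only error-level entries from an SGE messages file.
# Assumed layout, as seen above: "date time|thread|host|level|text".
sge_errors() {
    awk -F'|' '$4 == "E"' "$1"
}

# Example:
#   sge_errors /opt/n1ge62/default62/spool/qmaster/messages | tail -n 20
```

The same function should work on the execd_messages.* files in /tmp, since they use the same field layout.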

Thanks,
Sangamesh
> -- Reuti
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103138
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103190