[GE users] qmaster not starting

sangamesh forum.san at gmail.com
Tue Feb 10 05:23:44 GMT 2009


Hello all,

     The cluster runs Rocks 5 and SGE 6.2. SGE was upgraded to 6.2 a
long time ago and had been working fine since then.

Now sge_qmaster fails to start, and it does not report any errors either:

# /etc/init.d/sgemaster.locuz62 start
# ps -ef | grep sge
root      5905  3229  0 10:44 pts/1    00:00:00 grep sge

The qmaster messages file contains the following:

# vi /opt/n1ge62/default62/spool/qmaster/messages

01/31/2009 18:19:06|  main|locuzcluster|E|jvm thread is not running
02/01/2009 17:35:56|  main|locuzcluster|I|read job database with 0 entries in 0 seconds
02/01/2009 17:35:56|  main|locuzcluster|E|error opening file "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such file or directory
02/01/2009 17:35:56|  main|locuzcluster|I|qmaster hard descriptor limit is set to 8192
02/01/2009 17:35:56|  main|locuzcluster|I|qmaster soft descriptor limit is set to 8192
02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will use max. 8172 file descriptors for communication
02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will accept max. 99 dynamic event clients
02/01/2009 17:35:56|  main|locuzcluster|I|starting up SGE 6.2 (lx24-amd64)
....
.....
02/01/2009 20:38:27|  main|locuzcluster|E|jvm thread is not running
02/02/2009 09:25:25|  main|locuzcluster|I|read job database with 0 entries in 0 seconds
02/02/2009 09:25:25|  main|locuzcluster|E|error opening file "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such file or directory
02/02/2009 09:25:25|  main|locuzcluster|I|qmaster hard descriptor limit is set to 8192
02/02/2009 09:25:25|  main|locuzcluster|I|qmaster soft descriptor limit is set to 8192
02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will use max. 8172 file descriptors for communication
02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will accept max. 99 dynamic event clients
02/02/2009 09:25:25|  main|locuzcluster|I|starting up SGE 6.2 (lx24-amd64)
02/02/2009 09:26:06|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
02/02/2009 09:26:46|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
..
02/02/2009 12:26:56|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
02/02/2009 12:27:36|worker|locuzcluster|E|no execd known on host locuzcluster.org to send conf notification
02/02/2009 15:30:46|  main|locuzcluster|E|jvm thread is not running
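
To make the error entries easier to spot among the informational ones, a
filter along the following lines could be used. This is only a sketch: it
assumes the pipe-separated layout visible in the log above
(timestamp|thread|host|level|text), and the function name qmaster_entries is
made up for illustration.

```shell
# Print only the entries of one severity level from a qmaster messages file.
# Assumption: fields are separated by "|" and the 4th field is the level,
# as in the excerpt above (I = informational, E = error).
qmaster_entries() {
    file=$1
    level=${2:-E}    # default to error entries
    awk -F'|' -v lvl="$level" '$4 == lvl' "$file"
}

# Example call (path taken from this post), showing the last few errors:
# qmaster_entries /opt/n1ge62/default62/spool/qmaster/messages E | tail -n 5
```

Matching on the fourth field rather than grepping for "|E|" avoids picking
up lines where that string merely occurs inside the message text.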

As far as I can tell, the hostname of the master has not changed either:
# cat /opt/n1ge62/default62/common/act_qmaster
locuzcluster

The hosts file (/etc/hosts) contains:

127.0.0.1       localhost.localdomain   localhost
172.16.99.1     locuzcluster.local locuzcluster # originally frontend-0-0
172.16.99.254   compute-0-0.local compute-0-0 c0-0
10.129.150.45   locuzcluster.org
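
A quick way to confirm that the name recorded in act_qmaster still matches
the machine's own name is sketched below. It compares short names only (the
part before the first dot), which is an assumption on my side and may not
reflect how SGE itself resolves host names; check_act_qmaster is a
hypothetical helper name.

```shell
# Compare the name stored in act_qmaster with the local host name.
# Assumption: comparing the short form (text before the first dot) is enough.
check_act_qmaster() {
    act_file=$1
    local_name=${2:-$(hostname)}    # optional override, mainly for testing
    recorded=$(head -n 1 "$act_file" | tr -d '[:space:]')
    if [ "${recorded%%.*}" = "${local_name%%.*}" ]; then
        echo "match: $recorded"
    else
        echo "mismatch: act_qmaster=$recorded host=$local_name"
        return 1
    fi
}

# Example call (path taken from this post):
# check_act_qmaster /opt/n1ge62/default62/common/act_qmaster
# To see which address a name resolves to on Linux:
# getent hosts locuzcluster
```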

SGE does not print any errors on startup, so I cannot tell why it is
unable to start. Can someone tell me what the problem could be?

Thanks,
Sangamesh

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103088
