[GE users] qmaster not starting

reuti reuti at staff.uni-marburg.de
Tue Feb 10 10:19:23 GMT 2009


Hi,

On 10.02.2009 at 06:23, sangamesh wrote:

>      The cluster is running Rocks 5 and SGE 6.2. SGE was upgraded
> to 6.2 a long time ago and had been working fine.
>
> Now sge_qmaster fails to start, and it does not report any errors
> either:
>
> # /etc/init.d/sgemaster.locuz62 start
> # ps -ef | grep sge
> root      5905  3229  0 10:44 pts/1    00:00:00 grep sge
>
> The qmaster messages contain the following:
>
> # vi /opt/n1ge62/default62/spool/qmaster/messages
>
> 01/31/2009 18:19:06|  main|locuzcluster|E|jvm thread is not running
> 02/01/2009 17:35:56|  main|locuzcluster|I|read job database with 0
> entries in 0 seconds
> 02/01/2009 17:35:56|  main|locuzcluster|E|error opening file
> "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such
> file or directory
> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster hard descriptor
> limit is set to 8192
> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster soft descriptor
> limit is set to 8192
> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will use max. 8172
> file descriptors for communication
> 02/01/2009 17:35:56|  main|locuzcluster|I|qmaster will accept max. 99
> dynamic event clients
> 02/01/2009 17:35:56|  main|locuzcluster|I|starting up SGE 6.2 (lx24-amd64)
> ....
> .....
> 02/01/2009 20:38:27|  main|locuzcluster|E|jvm thread is not running
> 02/02/2009 09:25:25|  main|locuzcluster|I|read job database with 0
> entries in 0 seconds
> 02/02/2009 09:25:25|  main|locuzcluster|E|error opening file
> "/opt/n1ge62/default62/spool/qmaster/./sharetree" for reading: No such
> file or directory
> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster hard descriptor
> limit is set to 8192
> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster soft descriptor
> limit is set to 8192
> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will use max. 8172
> file descriptors for communication
> 02/02/2009 09:25:25|  main|locuzcluster|I|qmaster will accept max. 99
> dynamic event clients
> 02/02/2009 09:25:25|  main|locuzcluster|I|starting up SGE 6.2 (lx24-amd64)
> 02/02/2009 09:26:06|worker|locuzcluster|E|no execd known on host
> locuzcluster.org to send conf notification
> 02/02/2009 09:26:46|worker|locuzcluster|E|no execd known on host
> locuzcluster.org to send conf notification
> ..
> 02/02/2009 12:26:56|worker|locuzcluster|E|no execd known on host
> locuzcluster.org to send conf notification
> 02/02/2009 12:27:36|worker|locuzcluster|E|no execd known on host
> locuzcluster.org to send conf notification
> 02/02/2009 15:30:46|  main|locuzcluster|E|jvm thread is not running
>
> I think the hostname of the master has not changed either:
> # cat /opt/n1ge62/default62/common/act_qmaster
> locuzcluster
>
> 127.0.0.1       localhost.localdomain   localhost
> 172.16.99.1     locuzcluster.local locuzcluster # originally frontend-0-0
> 172.16.99.254   compute-0-0.local compute-0-0 c0-0
> 10.129.150.45   locuzcluster.org
>
> SGE is not reporting any errors, so I don't understand why it is not
> able to start.
> Can someone tell me what the problem could be?

Is there any file in /tmp with a panic message from the qmaster?
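
A quick way to look for such a file could be something like the sketch below. It assumes the file name contains "qmaster" (for example qmaster_messages.<pid>), which may differ on your installation; the function names are only illustrative and not part of SGE.

```shell
# Print every regular file in a directory whose name contains "qmaster".
# Assumption: a qmaster that cannot write to its spool directory leaves its
# messages in /tmp under a name such as qmaster_messages.<pid>.
list_qmaster_tmp_files() {
    dir="${1:-/tmp}"
    for f in "$dir"/*qmaster*; do
        # An unmatched glob stays literal, so test that the file really exists.
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Show the last 20 lines of each candidate file.
show_qmaster_tmp_files() {
    list_qmaster_tmp_files "${1:-/tmp}" | while IFS= read -r f; do
        printf '== %s ==\n' "$f"
        tail -n 20 "$f"
    done
}
```

Running `show_qmaster_tmp_files /tmp` as root right after a failed start should print whatever the qmaster wrote there. If nothing is listed, the qmaster probably did not leave a message in /tmp under a name matching that pattern.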

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=103138
