[GE users] qmaster -- backgrounding

Richard Polich rpolich at sfbrgenetics.org
Tue Dec 9 22:43:43 GMT 2008


Thank you. I looked in .../default/spool/qmaster/messages and noticed:

12/03/2008 15:54:01|qmaster|medusa|C|!!!!!!!!!! EV_id not found in element !!!!!!!!!!
12/03/2008 20:27:01|qmaster|medusa|I|read job database with 64922 entries in 953 seconds
12/03/2008 20:31:45|qmaster|medusa|W|removing reference to no longer existing job 1533116 of user "lfs"
12/03/2008 20:31:46|qmaster|medusa|I|qmaster hard descriptor limit is set to 65536
12/03/2008 20:31:46|qmaster|medusa|I|qmaster soft descriptor limit is set to 65536
12/03/2008 20:31:46|qmaster|medusa|I|qmaster will use max. 65516 file descriptors for communication
12/03/2008 20:31:46|qmaster|medusa|I|qmaster will accept max. 99 dynamic event clients
12/03/2008 20:31:46|qmaster|medusa|I|starting up 6.0u8
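
For anyone scanning a long messages file, here is a minimal sketch of a filter for the critical and warning entries. It assumes each entry is a single line with five "|"-separated fields (timestamp, daemon, host, severity letter, text), which is what the excerpt above looks like; the names MessageEntry, parse_line and select_levels are made up for this example and are not part of Grid Engine. Lines that do not fit that layout, such as wrapped continuation fragments, are skipped.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple, Optional


class MessageEntry(NamedTuple):
    """One parsed line of a qmaster messages file (hypothetical helper type)."""
    timestamp: datetime
    daemon: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[MessageEntry]:
    """Split 'date time|daemon|host|level|text'; return None if it does not fit."""
    # Split at most four times so any '|' inside the message text is preserved.
    parts = line.rstrip("\r\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, daemon, host, level, text = parts
    try:
        timestamp = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return MessageEntry(timestamp, daemon, host, level, text)


def select_levels(lines: Iterable[str], levels=("C", "W")) -> List[MessageEntry]:
    """Keep only entries whose severity letter is in `levels`."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.level in levels]


if __name__ == "__main__":
    sample = [
        "12/03/2008 20:27:01|qmaster|medusa|I|read job database with 64922 entries in 953 seconds",
        '12/03/2008 20:31:45|qmaster|medusa|W|removing reference to no longer existing job 1533116 of user "lfs"',
    ]
    for entry in select_levels(sample):
        print(entry.timestamp, entry.level, entry.text)
```

To run it against a real file, pass the open file object to select_levels in place of the sample list.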

I believe that after I stopped and restarted sge_qmaster and sge_schedd,
sge_qmaster went into the background but eventually started. I also
noticed we had ~64,000 jobs in Eqw state. I removed those with a qdel
script. How do I change the hard and soft descriptor limits of 65536? I
could not find where they are set. We have 2,650 processors in our ranch.
Our qmaster and nodes are now running fine.

Sorry for the delay. Thank you, Richard

craffi wrote:
> Hi Richard,
>
> First check the qmaster messages file in
> $SGE_ROOT/$SGE_CELL/spool/qmaster/messages
>
> ... then look in /tmp on the qmaster host to see if there are any  
> "panic" SGE error messages. Checking selinux or other system logs  
> can't hurt as well.
>
> Potential things to look at, in the general class of things that cause  
> SGE to fail to start or to exit immediately on error would be:
>
> - /etc/hosts entry that is incorrect or conflicts with DNS
> - something odd with $SGE_ROOT/$SGE_CELL/common/act_qmaster
> - forward and reverse DNS name resolution issues on qmaster host
> - firewall blocking port
> - SELINUX being aggressive
> - an old/dead sge_qmaster daemon that has not been properly killed
> - any other old SGE execd or sge_schedd daemons improperly cleared  
> from previous startup attempts
> - filesystem permission issues or corruption
> - setuid or root_squash settings on NFS mounted filesystems
>
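
The forward and reverse name resolution item in the list above can be checked with a short script. This is only a sketch: check_name_resolution is a made-up helper, it compares short host names only (an assumption about what counts as a match), and it consults whatever resolver order the host uses, so it does not by itself tell /etc/hosts apart from DNS.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Resolve name -> address -> name and report whether they agree.

    Returns (ok, message). The resolver functions are parameters so the
    check can be exercised without touching the network.
    """
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, f"forward lookup of {hostname} failed: {exc}"
    try:
        primary, aliases, _addresses = reverse(address)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, f"reverse lookup of {address} failed: {exc}"

    # Compare on the short name so 'medusa' matches 'medusa.example.org'.
    wanted = hostname.lower().split(".")[0]
    found = {name.lower().split(".")[0] for name in [primary, *aliases]}
    if wanted in found:
        return True, f"{hostname} -> {address} -> {primary}"
    return False, f"{hostname} -> {address} -> {primary} (name mismatch)"


if __name__ == "__main__":
    ok, message = check_name_resolution(socket.gethostname())
    print("OK" if ok else "PROBLEM", message)
```

Running it on the qmaster host, and again with the name stored in act_qmaster as the argument, covers the first three items of the list.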
>
> Really the best thing is to look for something specific in a log or  
> messages file. Ideally you'd see something like "connection refused",  
> "gethostbyname() failure..." or other items that suggest a specific  
> type of problem.
>
> A last resort option is sourcing the debug files and restarting with  
> verbose debug data enabled.
>
> -Chris
>
>
>
>
> On Dec 3, 2008, at 9:49 PM, rpolich at sfbrgenetics.org wrote:
>
>   
>> I can't get sge_qmaster started. Receiving ...
>> daemonize error: timeout while waiting for daemonize state
>> #error getting configuration failed receiving gdi state
>> error: can't get configuration from qmaster -- backgrounding
>>
>> Running gridengine 6.0 u8 with classic spooling on a Solaris X86  
>> 2.10 system. Any ideas?
>> Thanks
>> Richard Polich
>>     
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91011
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=91999