[GE users] qmaster -- backgrounding
rpolich at sfbrgenetics.org
Tue Dec 9 22:43:43 GMT 2008
Thank you. I looked in .../default/spool/qmaster/messages and noticed:
12/03/2008 15:54:01|qmaster|medusa|C|!!!!!!!!!! EV_id not found in
12/03/2008 20:27:01|qmaster|medusa|I|read job database with 64922
entries in 953 seconds
12/03/2008 20:31:45|qmaster|medusa|W|removing reference to no longer
existing job 1533116 of user "lfs"
12/03/2008 20:31:46|qmaster|medusa|I|qmaster hard descriptor limit is
set to 65536
12/03/2008 20:31:46|qmaster|medusa|I|qmaster soft descriptor limit is
set to 65536
12/03/2008 20:31:46|qmaster|medusa|I|qmaster will use max. 65516 file
descriptors for communication
12/03/2008 20:31:46|qmaster|medusa|I|qmaster will accept max. 99 dynamic
12/03/2008 20:31:46|qmaster|medusa|I|starting up 6.0u8
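In case it helps anyone sifting through a long messages file, here is a rough Python sketch for pulling out the non-informational entries. It assumes only what the excerpt above shows: fields separated by "|" in the order timestamp, daemon, host, one-letter level, message. The names parse_messages and by_level are made up for this example, and treating "E" as a level worth keeping is an assumption (only C, W and I appear in the excerpt). Lines that do not match the pattern are folded into the previous entry, which covers wrapped lines in a pasted excerpt.

```python
from collections import namedtuple

# Field order taken from the excerpt: timestamp|daemon|host|level|message
Entry = namedtuple("Entry", "timestamp daemon host level message")


def parse_messages(lines):
    """Parse messages-file lines into Entry tuples.

    A line that does not split into five '|'-separated fields with a
    one-letter level is treated as a continuation of the previous entry.
    """
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        parts = line.split("|", 4)
        if len(parts) == 5 and len(parts[3]) == 1:
            entries.append(Entry(*parts))
        elif entries and line.strip():
            last = entries[-1]
            entries[-1] = last._replace(
                message=last.message + " " + line.strip()
            )
    return entries


def by_level(entries, levels=("C", "E", "W")):
    """Keep only entries whose level is in `levels` (default set is an assumption)."""
    return [e for e in entries if e.level in levels]
```

Typical use would be something like by_level(parse_messages(open(path))) with path pointing at your own messages file.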
I believe that when I stopped and restarted sge_qmaster and sge_schedd,
sge_qmaster went into the background but eventually started. I also noticed
we had ~64,000 jobs in Eqw state. I removed those with a qdel script.
How do I change the hard and soft descriptor limit of 65536? I could not
find where this is set. We have 2,650 processors in our ranch. Our
qmaster and nodes are now running fine.
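On the descriptor question, one thing that can at least be checked from the outside: on Unix a process inherits its open-file limits (RLIMIT_NOFILE) from whatever started it, so the numbers in the log may simply reflect the environment qmaster was launched from. That is an assumption about qmaster, not something I have confirmed. The Python sketch below only shows how a process reads its own soft and hard limits and adjusts the soft one. The function names are made up for the example.

```python
import resource


def descriptor_limits():
    """Return (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def set_soft_limit(new_soft):
    """Set the soft open-file limit, leaving the hard limit unchanged.

    An unprivileged process may move the soft limit anywhere up to the
    hard limit, so a request above the hard limit is rejected here.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and new_soft > hard:
        raise ValueError("requested soft limit exceeds hard limit")
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)
```

Running descriptor_limits() from the same shell and user that starts the daemons shows what a child process would inherit there.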
Sorry for the delay. Thank you, Richard
> Hi Richard,
> First check the qmaster messages file in $SGE_ROOT/$SGE_CELL/spool/
> ... then look in /tmp on the qmaster host to see if there are any
> "panic" SGE error messages. Checking selinux or other system logs
> can't hurt as well.
> Potential things to look at, in the general class of things that cause
> SGE to fail to start or to exit immediately on error would be:
> - /etc/hosts entry that is incorrect or conflicts with DNS
> - something odd with $SGE_ROOT/$SGE_CELL/common/act_qmaster
> - forward and reverse DNS name resolution issues on qmaster host
> - firewall blocking port
> - SELINUX being aggressive
> - an old/dead sge_qmaster daemon that has not been properly killed
> - any other old SGE execd or sge_schedd daemons improperly cleared
> from previous startup attempts
> - filesystem permission issues or corruption
> - setuid or root_squash settings on NFS mounted filesystems
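The /etc/hosts and forward/reverse DNS items in the list above can be checked mechanically. Below is a minimal Python sketch, under the assumption that a hostname is "consistent" when it resolves to an address and that address maps back to the same name (compared case-insensitively, and also by short name, which is a simplification). The function name check_name_resolution is made up. The two resolver parameters default to the standard socket calls and exist only so the check can be exercised without a network.

```python
import socket


def check_name_resolution(hostname,
                          forward=socket.gethostbyname,
                          reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward-then-reverse lookup of hostname."""
    try:
        address = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, "forward lookup failed: %s" % exc
    try:
        primary, aliases, _addresses = reverse(address)
    except OSError as exc:  # socket.herror is a subclass of OSError
        return False, "reverse lookup of %s failed: %s" % (address, exc)

    names = {primary.lower()} | {alias.lower() for alias in aliases}
    short_names = {name.split(".")[0] for name in names}
    wanted = hostname.lower()
    if wanted in names or wanted.split(".")[0] in short_names:
        return True, "%s -> %s -> %s" % (hostname, address, primary)
    return False, "%s resolves to %s, which maps back to %s" % (
        hostname, address, primary)
```

Run it on the qmaster host against the name stored in act_qmaster, and again from an execution host, and compare the results.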
> Really the best thing is to look for something specific in a log or
> messages file. Ideally you'd see something like "connection refused",
> "gethostbyname() failure..." or other items that suggest a specific
> type of problem.
> A last resort option is sourcing the debug files and restarting with
> verbose debug data enabled.
> On Dec 3, 2008, at 9:49 PM, rpolich at sfbrgenetics.org wrote:
>> I can't get sge_qmaster started. Receiving ...
>> daemonize error: timeout while waiting for daemonize state
>> #error getting configuration failed receiving gdi state
>> error: can't get configuration from qmaster -- backgrounding
>> Running gridengine 6.0 u8 with classic spooling on a Solaris X86
>> 2.10 system. Any ideas?
>> Richard Polich
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].