[GE users] qmaster dying again....
isakrejda at lbl.gov
Wed Jul 18 15:36:37 BST 2007
Andreas.Haas at Sun.COM wrote:
> Hi Iwona,
> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my
>> qmaster started dying.
> You did this as foreseen?
Yes, all went through ok, no problems encountered during the upgrade.
I was very happy about that.
>> When I look at the logs I see messages:
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is
>> set to 8192
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is
>> set to 8192
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file
>> descriptors for communication
>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99
>> dynamic event clients
> That is fine. It says qmaster got enough file descriptors available.
My cluster consists of ~250 nodes, 2 CPUs each. We run one job per CPU.
We routinely have a few thousand jobs pending, and at peak it goes up to
I am not sure what file descriptors and dynamic event clients are used for....
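For context on those log lines: the qmaster needs one file descriptor per open connection (execds, qsub/qstat clients, event clients such as the scheduler), so the limits it reports are just the per-process descriptor limits it inherited at start-up. A minimal, hedged way to see what a process started from your shell would inherit (standard POSIX shell, nothing SGE-specific; the 8192 in the log suggests someone already raised the usual RHEL3 default of 1024):

```shell
# Show the file descriptor limits a child process of this shell inherits.
# Values vary per system and per /etc/security/limits.conf settings.
ulimit -Sn   # soft limit on open file descriptors
ulimit -Hn   # hard limit (soft limit can be raised up to this)
```

If the soft limit printed here is far below what the qmaster log reports, the daemon start script is raising it itself.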
>> Other than that nothing special.
>> Also when I restart the qmaster I get messages:
>> [root at pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>> starting sge_qmaster
>> starting sge_schedd
>> daemonize error: timeout while waiting for daemonize state
> That means the scheduler is having some problem during start-up. From the
> message one cannot say what is causing it, but it could be due to
> qmaster in turn having problems.
I am restarting them after the crash, when the cluster is fully loaded. Is
it possible that it just needs more time to re-read all the info about
running and pending jobs? Where would the scheduler print any messages
about the problems it is having?
>> starting sge_shadowd
>> error: getting configuration: failed receiving gdi request
> Next indication for a crashed or sick qmaster.
>> starting up GE 6.0u11 (lx24-x86)
>> How bad is any of that, could crashes be related to it?
> Very likely.
>> I am running on RHEL3 .
> Have you tried some other OS?
We will be upgrading shortly, but at this time I have no choice; I have
to keep the cluster running with the OS I have.
Yesterday I gathered some more empirical evidence about the crashes -
it might be just a coincidence. The story is long and related to a
filesystem we are running, but here is the part related to SGE.
Sometimes on the client host the filesystem daemons get killed, and that
leaves the SGE processes on the client defunct - still there, but the
master cannot communicate with them. qdel will not dispose of the user's
job, and the load is not reported. The easiest fix is to just reboot the
node - it does not happen very often, just a few nodes per day at most.
But even if I reboot the node, the client will not start properly unless
I delete the local spool directory. I did not figure out which files are
responsible, but if I delete the whole local spool, the directory gets
recreated and everything is ok, so that's what I have been doing: reboot,
delete the local spool, restart the SGE client.
Yesterday I decided to streamline my procedure and delete that local
spool directory before rebooting the node. The moment I delete that
local spool, the master that runs on a different host crashes right away.
I managed to crash it a few times, then I went back to my old procedure
- first reboot, then remove the local scratch - and all has been running
well. (The startup messages about problems are still there, but once
started, SGE runs well and I do not see any other problems.)
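For anyone hitting the same thing, the order that worked for me can be sketched as a dry-run script. The host name, spool path, and init-script name below are hypothetical placeholders, not values from this cluster:

```shell
#!/bin/sh
# Dry-run sketch of the order that did NOT crash the qmaster:
# reboot first, only then remove the local spool and restart the execd.
# NODE and SPOOL are made-up examples; substitute your own.
RUN=echo                      # set RUN= (empty) to actually execute
NODE=pc0001
SPOOL=/var/spool/sge/$NODE

$RUN ssh "$NODE" reboot                              # step 1: reboot the sick node
# ...wait for the node to come back up...
$RUN ssh "$NODE" rm -rf "$SPOOL"                     # step 2: clear the local spool
$RUN ssh "$NODE" /etc/rc.d/init.d/sgeexecd start     # step 3: restart the client
```

Deleting the spool while the node (and its execd's connection to the qmaster) is still alive is what reliably crashed my master, so the reboot has to come first.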
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net