[GE users] qmaster dying again....

Andreas.Haas at Sun.COM
Wed Jul 18 18:35:54 BST 2007


Hi Iwona,

On Wed, 18 Jul 2007, Iwona Sakrejda wrote:

>
>
> Andreas.Haas at Sun.COM wrote:
>> Hi Iwona,
>> 
>> On Tue, 17 Jul 2007, Iwona Sakrejda wrote:
>> 
>>> Hi,
>>> 
>>> A few days ago I upgraded from 6.0u4 to 6.0u11 and this morning my qmaster 
>>> started dying.
>> 
>> You did the upgrade as described here?
>>
>>    http://gridengine.sunsource.net/install60patch.txt
> Yes, all went through ok, no problems encountered during the upgrade.
> I was very happy about that.

Ok.

>> 
>> 
>>> When I look at the logs I see messages:
>>> 
>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster hard descriptor limit is set 
>>> to 8192
>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster soft descriptor limit is set 
>>> to 8192
>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will use max. 8172 file 
>>> descriptors for communication
>>> 07/17/2007 10:37:24|qmaster|pc2533|I|qmaster will accept max. 99 dynamic 
>>> event clients
>> 
>> That is fine. It says qmaster has enough file descriptors available.
> My cluster consists of ~250 nodes, 2 CPUs each. We run 1 job per CPU.
> We routinely have a few thousand jobs pending and at peak it goes up to ~15k.
> I am not sure what file descriptors and dynamic event clients are used for....

Dynamic event clients are only needed for DRMAA clients and when

    qsub -sync y

is used. Usually the default of 99 is an ample amount. The same is true for 
the 8192 file descriptors: if you estimate one file descriptor per execution 
node, you still have nearly 8000 spare fds (8192 minus your ~250 nodes) for 
client commands connecting to qmaster. So this can safely be excluded as the 
root of your qmaster problem.
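If you ever want to double-check that, one crude way is to count the 
descriptors the running qmaster actually holds, for instance (the PID 
12345 below is only a placeholder for whatever pgrep prints on your host):

    # pgrep -f sge_qmaster
    12345
    # ls /proc/12345/fd | wc -l

As long as that count stays far below the 8172 mentioned in the log, 
descriptors are not your bottleneck.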

>> 
>>> Other than that nothing special.
>>> 
>>> Also when I restart the qmaster I get messages:
>>> [root@pc2533 qmaster]# /etc/rc.d/init.d/sgemaster start
>>>  starting sge_qmaster
>>>  starting sge_schedd
>>> daemonize error: timeout while waiting for daemonize state
>> 
>> That means the scheduler is having some problem during start-up. From the 
>> message alone one cannot say what is causing it, but it could be due 
>> to qmaster in turn having problems.
> I am restarting them after the crash when the cluster is fully loaded. Is it 
> possible that it just needs more time to re-read all
> the info about running and pending jobs?

Actually this I would rule out.

> Where would the scheduler print any 
> messages about problems it is having?

To investigate the problem I suggest you launch qmaster and the scheduler 
separately as binaries rather than via the sgemaster script. All you need 
is two root shells with the Grid Engine environment (settings.{sh|csh}) 
sourced.

Then you do this:

    # setenv SGE_ND
    # $SGE_ROOT/bin/lx24-x86/sge_qmaster

If you see that everything went well with qmaster start-up (e.g. test whether 
qhost gives you reasonable output), you continue by launching the scheduler 
from the other shell:

    # setenv SGE_ND
    # $SGE_ROOT/bin/lx24-x86/sge_schedd

but my expectation is that qmaster will already report some problem and exit.
Normally qmaster does not return to the prompt with SGE_ND in the environment, 
since it prevents daemonizing and keeps the daemon in the foreground.
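
In case your root shell is sh/bash rather than csh, the equivalent would 
look roughly like this, where /path/to/sge stands for your actual SGE_ROOT 
directory and "default" for your cell name:

    # . /path/to/sge/default/common/settings.sh
    # export SGE_ND=1
    # $SGE_ROOT/bin/lx24-x86/sge_qmaster

A qhost or qstat -f from a third shell then tells you quickly whether 
qmaster answers requests at all.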

>>>  starting sge_shadowd
>>> error: getting configuration: failed receiving gdi request
>> 
>> Another indication of a crashed or sick qmaster.
>>
>>>  starting up GE 6.0u11 (lx24-x86)
>>> 
>>> How bad is any of that? Could the crashes be related to it?
>> 
>> Very likely.
>> 
>>> I am running on RHEL3 .
>> 
>> Have you tried some other OS?
> We will be upgrading shortly but at this time I have no choice, I have to 
> keep the cluster
> running with the OS I have.
>
> Yesterday I gathered some more empirical evidence about the crashes - might 
> be just
> a coincidence. The story is long and related to a filesystem we are using 
> (GPFS) but here is the part related to SGE.

Actually I'm not aware of any problem with GPFS, but it could be related.
Is qmaster spooling located on the GPFS volume? Are you using classic or 
BDB spooling?
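
If you are not sure, both pieces of information should be recorded in the 
bootstrap file of your cell, something along the lines of (assuming the 
usual cell name "default"):

    # grep -E 'spooling_method|qmaster_spool_dir' $SGE_ROOT/default/common/bootstrap

spooling_method reads either "classic" or "berkeleydb" there, and 
qmaster_spool_dir tells you where the qmaster spool actually lives.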


> Sometimes on the client host the filesystem daemons get killed and that 
> leaves the SGE processes on the client defunct - still there, but the master 
> cannot communicate with them. qdel will not dispose of the user's job, and the load is not reported.
> The easiest fix is to just reboot the node - it does not happen very often,
> just a few nodes per day at most.
>
> But even if I reboot the node, the client will not start properly unless I 
> clean the local spool directory. I did not figure out which files are interfering, 
> but if I delete the whole local spool, the directory gets recreated and 
> everything is OK, so that's what I have been doing: reboot, delete the local spool 
> subdirectory, restart the SGE client.

Usually there are no problems with execution nodes if local spooling is 
used. Ugh!
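
As a quick check that the execution daemons really spool onto local disk 
and not onto GPFS, you can look at execd_spool_dir in the cluster 
configuration, e.g.:

    # qconf -sconf | grep execd_spool_dir

and the same with "qconf -sconf <hostname>" for any host-specific 
overrides.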


> Yesterday I decided to streamline my procedure and delete that local
> spool directory before rebooting the node. The moment I delete that local
> spool, the master that runs on a different host crashes right away.
>
> I managed to crash it a few times, then I went back to my old procedure
> - first reboot, then remove the local scratch - and all has been running well.
>
> (The startup messages about problems are still there, but once started SGE 
> runs well and
> I do not see any other problems.)

Bah, ugh, yuck!!! Well, it sounds as if it would be a good idea to move away
from GPFS ... at least for SGE spooling. Can't you switch to a more 
conventional FS for that purpose?
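
Just to sketch what such a move could look like, assuming classic spooling 
and that /var/spool/sge is on a local disk (both paths below are made-up 
examples, so substitute your real ones):

    # /etc/rc.d/init.d/sgemaster stop
    # cp -a /gpfs/sge/default/spool/qmaster /var/spool/sge/qmaster
    # vi $SGE_ROOT/default/common/bootstrap
    # /etc/rc.d/init.d/sgemaster start

where the vi step means pointing qmaster_spool_dir at the new local 
directory.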

Regards,
Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



