[GE users] 6.2u4 - array job problems...

ccaamad m.c.dixon at leeds.ac.uk
Wed Dec 23 12:17:47 GMT 2009


Well, I'm a few hours into a holiday and our new ~600-core sge6.2u4 
cluster has started to have problems. I was wondering if anyone had any 
advice!

* Everything was ok when we had a few parallel jobs running.

* I've now got a task array user who had a ~1,000,000-task array to put 
through the system, where each task is quite short (order of seconds).

The system seemed to be coping, but I asked the task array user to 
resubmit with each task doing a bit more work (since a lot of time was 
being lost in scheduling and batch startup/end).
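
One way to make each task do more is to submit the array with a step size 
(qsub -t first-last:step) and have each task loop over its own slice of 
work items. The following is only a sketch of that idea, not what the user 
actually ran: it relies on the SGE_TASK_ID, SGE_TASK_STEPSIZE and 
SGE_TASK_LAST environment variables that Grid Engine sets for array tasks, 
and the function names are made up for illustration.

```python
import os


def items_for_task(task_id, step, last):
    """Return the work items covered by one array task.

    With "qsub -t 1-10:4" the task ids are 1, 5 and 9; the task with
    id 9 only covers items 9 and 10, so the range is clipped at `last`.
    """
    end = min(task_id + step - 1, last)
    return range(task_id, end + 1)


def task_range_from_env(env=os.environ):
    """Build the item range from the variables Grid Engine exports."""
    task_id = int(env["SGE_TASK_ID"])
    # Fall back to a single item if the step/last variables are absent.
    step = int(env.get("SGE_TASK_STEPSIZE", "1"))
    last = int(env.get("SGE_TASK_LAST", str(task_id)))
    return items_for_task(task_id, step, last)


if __name__ == "__main__" and "SGE_TASK_ID" in os.environ:
    for item in task_range_from_env():
        # Placeholder: the real per-item work would go here.
        print("processing item", item)
```

With a step of 1000, a 1,000,000-item array becomes 1000 scheduled tasks, 
each running for roughly 1000 times as long.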

I looked at it this morning and all hell seems to have broken loose:

* Some tasks from the old job that the task array user attempted to qdel 
are still being reported on some slots in "dr" state, hours later.

* qstat and qdel commands typically fail with the message:

failed receiving gdi request response for mid=2 (got syncron message receive timeout error).

Looking through for obvious parameters to tweak, I see that the cluster's 
bootstrap file has the following two interesting options that might help:

listener_threads        2
worker_threads          2

Would upping these be a good idea? I'm giving 20 each a go...
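
For checking what a bootstrap file currently contains before and after 
such a change, a small reader along these lines might help. It is a 
sketch that assumes the file is made up of whitespace-separated 
name/value lines like the two quoted above; treating lines that start 
with "#" as comments is an assumption, and parse_bootstrap is a 
hypothetical helper name.

```python
def parse_bootstrap(text):
    """Parse "name   value" lines into a dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and "#" comment lines carry no settings.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[name] = value
    return settings


# The two lines quoted from the cluster's bootstrap file.
EXCERPT = """\
listener_threads        2
worker_threads          2
"""

if __name__ == "__main__":
    for name, value in sorted(parse_bootstrap(EXCERPT).items()):
        print(name, "=", value)
```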

We've managed to qdel the task array jobs, but we're still getting endless 
messages of the following form in the qmaster's messages file, as if it 
were stuck in some loop:

12/23/2009 12:13:01|worker|sched1|E|execd@c3s0b6n1.arc1.leeds.ac.uk reports running job (2929.29440/master) in queue "c3s0.q@c3s0b6n1.arc1.leeds.ac.uk" that was not supposed to be there - killing
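
To see how many hosts and tasks are caught in this, the messages file can 
be summarised with a short script. This is a sketch based only on the one 
line quoted above: the meaning given to the "|"-separated fields (in 
particular that the third field is a host name) is a guess, and 
parse_message and count_by_host are illustrative names. The pattern 
accepts either "@" or the list archive's " at " in the execd name.

```python
import re
from collections import Counter

# Matches the text part of the "not supposed to be there" message.
REPORT_RE = re.compile(
    r"execd(?:@| at )(?P<host>\S+) reports running job "
    r"\((?P<job>\d+)\.(?P<task>\d+)/(?P<pe_task>[^)]*)\)"
)


def parse_message(line):
    """Return a dict for a matching message line, or None otherwise."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    # Assumed layout: timestamp|thread|origin host|level|text
    timestamp, thread, origin, level, text = parts
    match = REPORT_RE.search(text)
    if match is None:
        return None
    return {
        "timestamp": timestamp,
        "thread": thread,
        "origin": origin,
        "level": level,
        "host": match.group("host"),
        "job": int(match.group("job")),
        "task": int(match.group("task")),
    }


def count_by_host(lines):
    """Count matching messages per execution host."""
    counts = Counter()
    for line in lines:
        record = parse_message(line)
        if record is not None:
            counts[record["host"]] += 1
    return counts


SAMPLE = (
    "12/23/2009 12:13:01|worker|sched1|E|execd@c3s0b6n1.arc1.leeds.ac.uk "
    "reports running job (2929.29440/master) in queue "
    '"c3s0.q@c3s0b6n1.arc1.leeds.ac.uk" that was not supposed to be '
    "there - killing"
)

if __name__ == "__main__":
    print(parse_message(SAMPLE))
```

Feeding count_by_host the lines of the messages file gives a per-host 
tally, which shows whether the loop involves a handful of nodes or the 
whole cluster.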

Looking on c3s0b6n1, there are no user processes running. The only sge 
process is sge_execd, yet when I stop/start it, I get the message:

# service sgeexecd stop
configuration c3s0b6n1.arc1.leeds.ac.uk not defined
    Shutting down Grid Engine execution daemon
    Shutting down Grid Engine shepherd of job 2929.21282
    Shutting down Grid Engine shepherd of job 2929.28583
    Shutting down Grid Engine shepherd of job 2929.29440
    Shutting down Grid Engine shepherd of job 2929.41648
    Shutting down Grid Engine shepherd of job 2932.290813
    Shutting down Grid Engine shepherd of job 2933.374297
# service sgeexecd start
    starting sge_execd

Those shepherds were not running!

Any ideas?


Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK

