[GE users] 6.2u4 - array job problems...
dan.templeton at sun.com
Wed Dec 23 14:30:07 GMT 2009
Could your file system be wonky? Maybe the NFS server is having
problems, or the file system is full or some such? Generally that much
chaos comes from an external source, usually the file system.
There's nothing about an array job that should cause any problem for the
qmaster, even one with a million tasks (unless you have several hundred
thousand slots in your cluster). The reason your array tasks wouldn't
delete is that the execds had died.
In most cases it's a (really) bad idea to tweak anything in the
bootstrap file, especially the thread counts. If you were supposed to
change those settings, they would be configuration parameters accessible
via qconf. :) In any case, those thread counts deal with incoming GDI
requests. That's not your issue.
My money is still on the file system. In fact, I'll go double or
nothing that you're using classic spooling, with the spool directories
on the NFS share.
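For reference, the spooling method is recorded in the same bootstrap file, so it's a one-line check. A hedged sketch, assuming the usual $SGE_ROOT/$SGE_CELL layout:

```shell
# Report the qmaster's spooling setup from the bootstrap file.
# Assumes $SGE_ROOT is set; SGE_CELL falls back to "default" if unset.
BOOTSTRAP="${SGE_ROOT}/${SGE_CELL:-default}/common/bootstrap"
grep -E '^spooling_(method|params)' "$BOOTSTRAP" \
    || echo "bootstrap not readable at $BOOTSTRAP"
# "classic" spooling writes every job-state change as flat files; with the
# spool directories on an NFS mount, that traffic can swamp the file server.
```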
> Well, I'm a few hours into a holiday and our new sge6.2u4 cluster has
> started to have problems on a ~600 core cluster. I was wondering if anyone
> had any advice!
> * Everything was ok when we had a few parallel jobs running.
> * I've now got a task array user with a ~1,000,000-task array to put
> through the system, where each task is quite short (on the order of seconds).
> The system seemed to be coping, but I asked the task array user to
> resubmit with each task doing a bit more work (since a lot of time was
> being lost in scheduling and batch startup/shutdown).
> I look at it this morning and all hell seems to have broken loose:
> * Some tasks from the old job that task array user attempted to qdel are
> still being reported on some slots in "dr" state, hours later.
> * qstat and qdel commands typically fail with the message:
> failed receiving gdi request response for mid=2 (got syncron message receive timeout error).
> Looking through for obvious parameters to tweak, I see that the cluster's
> bootstrap file has the following two interesting options that might help:
> listener_threads 2
> worker_threads 2
> Would upping these be a good idea? I'm giving 20 each a go...
> We've managed to qdel the task array jobs, but we're still getting endless
> messages of the following form in the qmaster's messages file, which seem
> to be stuck in some loop:
> 12/23/2009 12:13:01|worker|sched1|E|execd at c3s0b6n1.arc1.leeds.ac.uk reports running job (2929.29440/master) in queue "c3s0.q at c3s0b6n1.arc1.leeds.ac.uk" that was not supposed to be there - killing
> Looking on c3s0b6n1, there are no user processes running. The only sge
> process is sge_execd, yet when I stop/start it, I get the message:
> # service sgeexecd stop
> configuration c3s0b6n1.arc1.leeds.ac.uk not defined
> Shutting down Grid Engine execution daemon
> Shutting down Grid Engine shepherd of job 2929.21282
> Shutting down Grid Engine shepherd of job 2929.28583
> Shutting down Grid Engine shepherd of job 2929.29440
> Shutting down Grid Engine shepherd of job 2929.41648
> Shutting down Grid Engine shepherd of job 2932.290813
> Shutting down Grid Engine shepherd of job 2933.374297
> # service sgeexecd start
> starting sge_execd
> Those shepherds were not running!
> Any ideas?
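On the scheduling-overhead point above (a million tasks of a few seconds each), the usual fix is to submit the array with a step size so each task processes a chunk of work items. A minimal sketch of such a wrapper (the script name batch_task.sh and the do_work command are hypothetical; adjust to the workload):

```shell
# Submit 1,000,000 work items as 10,000 array tasks of 100 items each:
#
#   qsub -t 1-1000000:100 batch_task.sh
#
# Grid Engine exports SGE_TASK_ID, SGE_TASK_LAST and (when the range has a
# step) SGE_TASK_STEPSIZE; the defaults below only let the sketch run
# outside SGE.
first=${SGE_TASK_ID:-1}
step=${SGE_TASK_STEPSIZE:-100}
last=${SGE_TASK_LAST:-1000000}
end=$(( first + step - 1 ))
if [ "$end" -gt "$last" ]; then end=$last; fi
i=$first
while [ "$i" -le "$end" ]; do
    : # ./do_work "$i"    # hypothetical per-item command goes here
    i=$(( i + 1 ))
done
echo "processed items ${first}..${end}"
```

The qmaster then schedules one task per chunk instead of one per work item, cutting the scheduling and shepherd start/stop overhead by the step-size factor.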