[GE users] 6.2u4 - array job problems...

ccaamad m.c.dixon at leeds.ac.uk
Wed Jan 6 12:02:28 GMT 2010

On Wed, 23 Dec 2009, templedf wrote:

> Could your file system be wonky?  Maybe the NFS server is having
> problems, or the file system is full or some such?  Generally that much
> chaos comes from an external source, usually the file system.

Yes, you're right. As we've got a box with 8 spiffy fast x64 cores 
dedicated to the scheduler, I'd naively set the scheduling interval to 1 
second. The NFS server couldn't cope with that many very short jobs 
starting and finishing at the same time.

I'll put the interval back up and ask my users (again) to do more work in 
each array task...
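
For illustration, here is a minimal sketch of what "more work in each 
array task" could look like. It assumes the job is submitted with a step 
size, for example `qsub -t 1-1000:50`, so that each task sees its first 
index in SGE_TASK_ID and the step in SGE_TASK_STEPSIZE. The function names 
and the `n_items` argument are hypothetical, and the per-item work itself 
is left to the caller.

```python
import os


def items_for_task(task_id, step, n_items):
    """Return the work-item indices one array task should process.

    task_id is the first index of this task's chunk, step is the
    chunk size, and n_items is the total number of work items
    (a hypothetical argument supplied by the user's own script).
    """
    end = min(task_id + step - 1, n_items)
    return list(range(task_id, end + 1))


def items_from_env(env, n_items):
    """Read the chunk for this task from Grid Engine's environment."""
    task_id = int(env["SGE_TASK_ID"])
    # Fall back to one item per task if no step size is set.
    step = int(env.get("SGE_TASK_STEPSIZE", "1"))
    return items_for_task(task_id, step, n_items)
```

Inside the job script, something like `items_from_env(os.environ, 1000)` 
would then give the 50 items to loop over, so 1000 items become 20 longer 
tasks rather than 1000 very short ones.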

> There's nothing about an array job that should cause any problem for the
> qmaster, even one with a million tasks (unless you have several hundred
> thousand slots in your cluster).  The reason your array tasks wouldn't
> delete is that the execds had died.

Nope - the execds were still running. The problem persisted even after the 
array jobs had stopped hammering SGE.

> In most cases it's a (really) bad idea to tweak anything in the
> bootstrap file, especially the thread counts.  If you were supposed to
> change those settings, they would be configuration parameters accessible
> via qconf. :)  In any case, those thread counts deal with incoming GDI
> requests.  That's not your issue.

Cheers - I'll put that back to the original settings :)

> My money is still on the file system.  In fact, I'll go double or
> nothing that you're using classic spooling and using the NFS share to
> store the spool directories.

Dead right. The people who installed the system said that they'd never 
seen a problem with classic spooling - guess they hadn't seen the right 
(wrong?) job mix.
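
For anyone wanting to check their own setup, a rough sketch of reading 
the spooling settings out of the bootstrap file (read only, no tweaking). 
It assumes the usual "key value" layout with `#` comment lines and the 
`spooling_method` and `qmaster_spool_dir` keys. The helper names and the 
list of shared mount prefixes are hypothetical.

```python
def parse_bootstrap(text):
    """Parse bootstrap-style 'key value' lines into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def classic_spool_on_shared(settings, shared_prefixes):
    """True if classic spooling points at one of the given mount prefixes.

    shared_prefixes is a hypothetical list of NFS mount points that
    the caller supplies; nothing here inspects the file system.
    """
    if settings.get("spooling_method") != "classic":
        return False
    spool_dir = settings.get("qmaster_spool_dir", "")
    return any(spool_dir.startswith(p) for p in shared_prefixes)
```

The text would normally come from the bootstrap file under the cell's 
common directory, opened read-only.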

As 6.2u5 is out/coming [Wow! _Another_ feature release??] with NUMA 
control in it, I guess this gives the opportunity to see if there's a 
better configuration for our environment. I'll send another message under 
a new heading covering this...


Mark Dixon                       Email    : m.c.dixon at leeds.ac.uk
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK

