[GE users] slow responses when large jobs finish

Andy Schwierskott andy.schwierskott at sun.com
Fri Sep 17 09:52:39 BST 2004


I think this is about two completely different things:

1. Sean's mail is about 6.0u1 and (I assume) a tightly integrated parallel job. 
2. Bernard's mail is about 5.3 and array jobs.

1. I think N1GE does not make use of threads for deleting jobs yet. This
    means that entering a qdel will directly cause qmaster to signal the
    tasks of the parallel job.

    ==> a bug or an RFE should be filed ("deletion of parallel jobs makes SGE
        unresponsive" or similar summary)

2. I think Bernard already asked about this on the mailing list. I'm still
    of the opinion that this is not an array job issue. However, chances
    are very low that we are going to fix anything in 5.3. All
    scalability/performance-related improvements will be done in 6.x.

    (Deletion of array jobs is still not very efficient on 6.0 if many tasks
    are running. This is a known issue; I think it's also documented in one
    of the qdel-related issues in Issuezilla.)
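
To illustrate point 1, here is a rough toy model in Python of why signalling
every task directly from the request path can stall other commands. This is
not qmaster code: the serve() function, the request tuples and the
"background" queue are all made up for illustration, and the only number
taken from this thread is Sean's 4 jobs x 30 slots. It simply counts how
many task signals are sent before a waiting qstat gets its answer.

```python
from collections import deque


def serve(requests, defer_deletes):
    """Toy single-loop 'master' (hypothetical, not SGE's implementation).

    requests: list of (kind, ntasks) tuples, kind is "qdel" or "qstat".
    Returns (waits, total), where waits holds, for each qstat, the number
    of task signals that had been sent before it was answered.
    """
    pending = deque(requests)
    background = deque()  # task signals postponed until the loop is idle
    signals_sent = 0
    waits = []
    while pending or background:
        if pending:
            kind, ntasks = pending.popleft()
            if kind == "qdel":
                if defer_deletes:
                    # Only queue the work; the request returns at once.
                    background.extend(range(ntasks))
                else:
                    # Signal every task before looking at the next request.
                    signals_sent += ntasks
            elif kind == "qstat":
                waits.append(signals_sent)
        else:
            # No client request waiting: work off one postponed signal.
            background.popleft()
            signals_sent += 1
    return waits, signals_sent


# Assumed workload: 4 parallel jobs of 30 tasks deleted, then one qstat.
requests = [("qdel", 30)] * 4 + [("qstat", 0)]
print(serve(requests, defer_deletes=False))
print(serve(requests, defer_deletes=True))
```

In this model the same 120 signals are sent either way. The only difference
is whether the qstat has to wait behind all of them (direct signalling) or
behind none of them (postponed signalling).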


> I have noticed that if there are a huge number of jobs exiting from SGE,
> sge_commd will get hung and there is no way for it to recover.
> When users submit a huge number of array jobs (for example 30,000 tasks)
> and the jobs either finish really quickly or they are deleted in one
> shot, then bad things happen.
> We are using 5.3p6 and have stopped using array jobs ever since
> (sticking with just simple jobs).
> We have local installation of SGE on each node.
> Cheers,
> Bernard
>> -----Original Message-----
>> From: Sean Dilda [mailto:agrajag at dragaera.net]
>> Sent: Thursday, September 16, 2004 14:22
>> To: users at gridengine.sunsource.net
>> Subject: [GE users] slow responses when large jobs finish
>> Has anyone else noticed slow responses from SGE commands when
>> large jobs are finishing?  I had a user just delete about 4
>> running jobs, each one was taking up 30 slots (so 120 slots
>> total).  For a few minutes afterwards, sge_qmaster was
>> effectively unresponsive.  Commands like 'qstat' would just
>> sit there until sge_qmaster becomes responsive again.  I've
>> noticed this kind of behavior before, but it was especially
>> bad this time.  Has anyone else noticed anything of this sort?
>> I'm running 6.0u1 with classic spooling (and sge_qmaster's
>> spool is over NFS).  I ran strace and it seemed that one of
>> the sge_qmaster threads
>> was busy doing a lot of file I/O related to the jobs that
>> were finishing.  This surprises me somewhat as I thought
>> making sge_qmaster threaded was supposed to help with
>> situations like this.  I understand that using NFS will slow
>> things down somewhat, but I can't imagine that 120 slots
>> worth of jobs would cause enough file I/O that sge_qmaster
>> would become effectively unresponsive for several minutes.
