[GE users] slow responses when large jobs finish
andy.schwierskott at sun.com
Fri Sep 17 09:52:39 BST 2004
I think this is about two completely different things:
1. Sean's mail is about 6.0u1 and (I assume) a tightly integrated parallel job.
2. Bernard's mail is about 5.3 and array jobs.
1. I think N1GE does not make use of threads for deleting jobs yet. This
means that entering a qdel will directly cause qmaster to signal the
tasks of the parallel job.
==> a bug or an RFE should be filed ("deletion of parallel jobs makes SGE
unresponsive" or similar summary)
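
To illustrate why this matters, here is a toy model in Python. It is not
Grid Engine code: the function names, the cost constants and the
single-loop structure are all assumptions made for illustration. The
request mix (four deletions of 30 tasks each, then one qstat) is borrowed
from Sean's report quoted further down. It compares a request loop that
signals every task inline with one that only queues the deletion and does
the signalling after the pending requests have been answered.

```python
# Toy model only: NOT Grid Engine code. All names and costs are hypothetical.
from collections import deque

SIGNAL_COST = 1  # assumed time units to signal one task of a parallel job
QSTAT_COST = 1   # assumed time units to answer one status query


def inline_deletion(requests):
    """One request loop: a delete signals every task before the next request runs."""
    clock = 0
    finished = {}
    for name, tasks in requests:
        # tasks > 0 marks a delete request, tasks == 0 marks a status query
        clock += tasks * SIGNAL_COST if tasks else QSTAT_COST
        finished[name] = clock
    return finished


def deferred_deletion(requests):
    """Deletes are only queued; signalling happens after pending requests are answered."""
    clock = 0
    finished = {}
    backlog = deque()
    for name, tasks in requests:
        if tasks:
            backlog.append((name, tasks))  # acknowledge now, signal later
        else:
            clock += QSTAT_COST
            finished[name] = clock
    for name, tasks in backlog:
        clock += tasks * SIGNAL_COST
        finished[name] = clock
    return finished


# Four deletions of 30-task jobs, followed by one status query.
requests = [("qdel-1", 30), ("qdel-2", 30), ("qdel-3", 30), ("qdel-4", 30), ("qstat", 0)]
print("qstat answered at t =", inline_deletion(requests)["qstat"])    # inline signalling
print("qstat answered at t =", deferred_deletion(requests)["qstat"])  # deferred signalling
```

In this model the total work is the same in both cases; only the time at
which the status query gets its answer changes.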
2. I think Bernard already asked about this on the mailing list. I'm still
of the opinion that this is not an array job issue, however.
Chances are very low that we are going to fix something for 5.3. All
scalability/performance related improvements will be done in 6.x.
(deletion of array jobs is still not super efficient on 6.0 if many tasks
are running - this is a known issue, I think it's also documented in one of
the qdel related issues in Issuezilla).
> I have noticed that if there are a huge number of jobs exiting from SGE,
> sge_commd will get hung and there is no way for it to recover.
> When users submit a huge number of array jobs (for example 30,000 tasks)
> and the jobs either finish really quickly or they are deleted in one
> shot, then bad things happen.
> We are using 5.3p6 and have stopped using array jobs ever since
> (sticking with just simple jobs).
> We have a local installation of SGE on each node.
>> -----Original Message-----
>> From: Sean Dilda [mailto:agrajag at dragaera.net]
>> Sent: Thursday, September 16, 2004 14:22
>> To: users at gridengine.sunsource.net
>> Subject: [GE users] slow responses when large jobs finish
>> Has anyone else noticed slow responses from SGE commands when
>> large jobs are finishing? I had a user just delete about 4
>> running jobs, each one was taking up 30 slots (so 120 slots
>> total). For a few minutes afterwards, sge_qmaster was
>> effectively unresponsive. Commands like 'qstat' would just
>> sit there until sge_qmaster became responsive again. I've
>> noticed this kind of behavior before, but it was especially
>> bad this time. Has anyone else noticed anything of this sort?
>> I'm running 6.0u1 with classic spooling (and sge_qmaster's
>> spool is over nfs). I ran strace and it seemed that one of the
>> sge_qmaster threads
>> was busy doing a lot of file I/O related to the jobs that
>> were finishing. This surprises me somewhat as I thought
>> making sge_qmaster threaded was supposed to help with
>> situations like this. I understand that using NFS will slow
>> things down somewhat, but I can't imagine that 120 slots
>> worth of jobs would cause enough file I/O that sge_qmaster
>> would become effectively unresponsive for several minutes.