[GE users] scalability problems

Andy Schwierskott andy.schwierskott at sun.com
Tue Apr 6 09:02:48 BST 2004


> >>When there are more than about 5000 jobs pending, our qmaster loses contact
> >>with all the exec hosts (about 100 hosts---all queues are shown as
> >>"temporarily unavailable"), and no more jobs are dispatched.  The qmaster host
> >>spends about 80% cpu time in iowait---probably disk I/O, since there's not much
> >>network traffic.
> >>
> >>We are in the process of switching from LSF.  The same host is also the LSF
> >>master, although the LSF queues are empty and there are only a handful of
> >>slaves still running.  When SGE "hangs", I can still submit and run an LSF job
> >>without delay, so the basics of the machine are ok.
> >>
> >>What can I tune to help this?  I read about max_jobs, but is that even in
> >>SGEEE v5.3p4?  Also, I don't really want submits to fail, since that would be
> >>perceived by users as a limitation compared to LSF.
> >>
> >>We routinely have ~20,000 jobs pending.  Help appreciated!
> >
> >
> > I had similar problems to yours.  My setup is a lot different than
> > yours, but I'm wondering if your problem might be related to mine.  If
> > you read my recent post
> > (http://gridengine.sunsource.net/servlets/ReadMsg?msgId=17237&listName=users),
> > you'll see that I used strace to determine that sge_qmaster was doing a
> > lot of disk I/O in its queue that it didn't need to be doing.  In my
> > case it was over NFS, which made it even worse.
> >
> > When I fixed my spool files like Andy indicated, it significantly
> > reduced the load sge_qmaster was producing.  It reduced it enough that
> > SGE seems to no longer be dropping nodes.  (I had a test case that could
> > reliably get SGE to drop random nodes)
> >
> > I don't know if you're seeing the same problem or not, but I think it
> > might be worth looking into.
> Interesting.  I looked at exec_hosts the other day and about five hosts had
> reschedule_unknown != NONE (low job numbers).  I meant to try the fix just
> now, but all the files have fixed themselves without shutting down qmaster.

And is the I/O wait of the qmaster process lower now?
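
A rough way to put a number on that, as a minimal sketch only: the script
below samples the aggregate "cpu" line of /proc/stat twice and reports the
share of time spent in iowait in between.  It assumes a Linux kernel that
exposes the iowait field (2.6 or later), and it measures the whole host
rather than the qmaster process alone.  The function names and the
5-second interval are illustrative choices, not part of Grid Engine.

```python
import time


def parse_cpu_line(line):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    if len(values) < 5:
        # Kernels without an iowait column (assumed: pre-2.6) cannot be measured this way.
        raise ValueError("no iowait field in this /proc/stat format")
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum at most 8 fields.
    return values[4], sum(values[:8])


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two 'cpu' line samples."""
    io0, total0 = parse_cpu_line(before)
    io1, total1 = parse_cpu_line(after)
    elapsed = total1 - total0
    return 100.0 * (io1 - io0) / elapsed if elapsed > 0 else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as f:
        return f.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling interval, chosen arbitrarily
    print("iowait: %.1f%%" % iowait_percent(first, read_cpu_line()))
```

Running it once with a large pending list and once after the spool files
are fixed should give two figures that can be compared directly.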

