[GE users] scalability problems
andy.schwierskott at sun.com
Tue Apr 6 09:02:48 BST 2004
> >>When there are more than about 5000 jobs pending, our qmaster loses contact
> >>with all the exec hosts (about 100 hosts---all queues are shown as
> >>"temporarily unavailable"), and no more jobs are dispatched. The qmaster host
> >>spends about 80% of CPU time in iowait, probably disk I/O, since there's not much
> >>network traffic.
> >>We are in the process of switching from LSF. The same host is also the LSF
> >>master, although the LSF queues are empty and there are only a handful of
> >>slaves still running. When SGE "hangs", I can still submit and run an LSF job
> >>without delay, so the basics of the machine are ok.
> >>What can I tune to help this? I read about max_jobs, but is that even in
> >>SGEEE v5.3p4? Also, I don't really want submits to fail, since that would be
> >>perceived by users as a limitation compared to LSF.
> >>We routinely have ~20,000 jobs pending. Help appreciated!
> > I had similar problems to yours. My setup is a lot different than
> > yours, but I'm wondering if your problem might be related to mine. If
> > you read my recent post
> > (http://gridengine.sunsource.net/servlets/ReadMsg?msgId=17237&listName=users),
> > you'll see that I used strace to determine that sge_qmaster was doing
> > a lot of disk I/O in its queue that it didn't need to be doing. In my
> > case it was over NFS, which made it even worse.
> > When I fixed my spool files like Andy indicated, it significantly
> > reduced the load sge_qmaster was producing. It reduced it enough that
> > SGE seems to no longer be dropping nodes. (I had a test case that could
> > reliably get SGE to drop random nodes)
> > I don't know if you're seeing the same problem or not, but I think it
> > might be worth looking into.
> Interesting. I looked at exec_hosts the other day and about five hosts had
> reschedule_unknown != NONE (low job numbers). I meant to try the fix just
> now, but all the files have fixed themselves without shutting down qmaster.
And do you have a lower I/O wait of the qmaster process now?
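
One way to put a number on that before and after the fix is to sample the
kernel's CPU counters twice and compare them. The sketch below is only an
illustration, not something from this thread: it assumes the qmaster host runs
Linux with a kernel that reports iowait in /proc/stat, and the function names
are made up for the example. Note that it measures host-wide iowait, not the
iowait caused by sge_qmaster alone.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            counters = [int(x) for x in fields[1:]]
            if len(counters) < 5:
                raise ValueError("kernel does not report an iowait column")
            return counters
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of elapsed CPU time spent in iowait between two samples."""
    # Use at most the first 8 columns (user, nice, system, idle, iowait,
    # irq, softirq, steal); guest time is already included in user time.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return deltas[4] / total  # column index 4 is iowait


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read the counters twice, 'interval' seconds apart (assumes Linux)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_fraction(before, after)


if __name__ == "__main__":
    print(f"host-wide iowait over 5s: {sample_iowait():.1%}")
```

Running it once while several thousand jobs are pending and again after the
spool files have been cleaned up would give two figures to compare.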