[GE users] SGE stalls with large queues
hoover at deadmoose.com
Wed May 12 18:55:36 BST 2010
We have had a problem where SGE stops working when there are lots of jobs in the queue (lots being in the 30000 range). With older versions, we noticed the sge master process growing until it hit 12GiB or more, then crashing.
We installed 6.2u5 in the hope that it would fix this. When I tested it yesterday, it worked much better (the process only grew to 1-2GiB or so), but it would still periodically hang, such that other requests, qmon, etc. were unable to communicate with it. All of the jobs had the same name assigned to them, and a qdel by that name would really push things into this state.
The frontend is running on a system with dual 4-core 2GHz AMD processors and 8GiB of RAM. It looked like it never got above about 120% of a CPU (I guess it uses some threading now).
We are using classic spooling to an NFS server, since we have things set up with a shadow in case of failover. There did not appear to be too much I/O on that server.
All of this is running under CentOS 5.4 in 64-bit mode.
Any tips or comments on experiences that others have had would be appreciated. I really would like to be able to queue up hundreds of thousands of jobs, but so far SGE doesn't seem to deal well with this. (I know I might be able to help things somewhat by using array jobs, and we'll probably look into that, but not everything would fit that model.)
By the way, I do have the accounting and scheduling info disabled.
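On the array-job idea mentioned above, here is a rough sketch of what I have in mind, not something we have run in production. It only builds the qsub command lines, using the standard -t (task range) and -N (job name) options, so that a large batch becomes a handful of array jobs instead of tens of thousands of separate ones. The script name work.sh, the job name "batch", and the chunk size are all made up for illustration; each task would pick its own work item from $SGE_TASK_ID.

```python
import shlex


def array_job_commands(script, total_tasks, chunk_size, name="batch"):
    """Return qsub command lines that cover tasks 1..total_tasks.

    Each command submits one array job of at most chunk_size tasks.
    'script', 'name' and 'chunk_size' are illustrative placeholders.
    """
    if total_tasks < 1 or chunk_size < 1:
        raise ValueError("total_tasks and chunk_size must be positive")

    commands = []
    for start in range(1, total_tasks + 1, chunk_size):
        # Last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, total_tasks)
        argv = ["qsub", "-N", name, "-t", f"{start}-{end}", script]
        # Quote each argument so the line is safe to paste into a shell.
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


if __name__ == "__main__":
    # 30000 tasks split into three array jobs of 10000 tasks each.
    for command in array_job_commands("work.sh", 30000, 10000):
        print(command)
```

With this, the 30000 jobs from my test would show up as three entries for the master to track rather than 30000, although whether that actually avoids the hang is something I would still need to test.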