[GE users] Scaling up GE for huge number of jobs
Gary L Fox
garylfox at hotmail.com
Wed Jan 2 20:23:57 GMT 2008
I have a Linux cluster running RH4 Update 4 across all nodes (about 70 nodes total).
We have SGE 6.0u10 running and have had very few problems for quite a while.
However, our users have recently added a new type of job they run and they run these new jobs by the tens of thousands at a time.
Currently, the queue contains 160K jobs.
Needless to say, things seem to be running in slow motion now. The scheduler is running at around 100% CPU constantly.
We were not getting any meaningful response in qmon or from the qsub and qstat commands, so I restarted SGE. I also increased schedule_interval from 15 seconds to 2 minutes. Between the restart and the increased interval, things seem to be working better: we can now get a response from qmon and qstat, and we can submit jobs too. But things are still very much in slow motion.
The cluster does not stay full of jobs. Some nodes have only one job running and a few even have no jobs. (Each node has 2 CPUs and would normally have 2 jobs running.)
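To put a number on the under-utilization described above, the running-job count per node can be compared with the 2 slots each node offers. The following is only a minimal sketch: the node names and job counts are hypothetical, and it assumes the per-node counts have already been collected by hand or from qstat output.

```python
def utilization(running_per_node, slots_per_node=2):
    """Return the fraction of slots in use across all nodes (0.0 to 1.0)."""
    total_slots = len(running_per_node) * slots_per_node
    if total_slots == 0:
        return 0.0
    return sum(running_per_node.values()) / total_slots


def underfilled(running_per_node, slots_per_node=2):
    """Return the sorted names of nodes running fewer jobs than they have slots."""
    return sorted(
        node for node, running in running_per_node.items()
        if running < slots_per_node
    )


# Hypothetical snapshot: one full node, one half-full node, one idle node.
snapshot = {"node01": 2, "node02": 1, "node03": 0}
print(utilization(snapshot))
print(underfilled(snapshot))
```

Tracking this figure before and after each scheduler change would show whether the change actually helps keep the nodes full.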
We also have noticed that jobs from different users do not balance out (through fair share) as they have in the past. Newly submitted jobs remain at the bottom of the queue with a priority of 0.0000. Earlier queued jobs from another user have a priority around 0.55 to 0.56.
I have always had reservations turned off with max_reservation=0. I have the default value for max_functional_jobs_to_schedule set to 200. I also just changed maxujobs to 136 from a value of 0.
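For reference, the scheduler settings mentioned so far can be gathered into one summary. The sketch below assumes the scheduler configuration is available as plain "name value" lines, as printed by `qconf -ssconf`. The sample text is hand-written from the values in this post, not captured output, and the helper names are hypothetical.

```python
# Hand-written sample built from the values quoted in this post (an assumption,
# not real command output). Intervals use SGE's H:M:S notation.
SAMPLE = """\
schedule_interval                 0:2:0
maxujobs                          136
max_reservation                   0
max_functional_jobs_to_schedule   200
"""


def parse_sched_conf(text):
    """Parse 'name value' lines into a dict, skipping blank and comment lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def interval_seconds(value):
    """Convert an H:M:S interval string such as '0:2:0' to seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


conf = parse_sched_conf(SAMPLE)
print(interval_seconds(conf["schedule_interval"]), "seconds between scheduling runs")
print("maxujobs:", conf["maxujobs"])
```

Keeping the values in one structure like this makes it easier to record which combination was active when a given behaviour was observed.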
What can I do to optimize the settings for this scenario and get better utilization?