[GE users] Other documentation for Grid Engine 6 queue setup and maintenance?
Marconnet, James E Mr /Computer Sciences Corporation
james.marconnet at smdc.army.mil
Thu Feb 3 17:49:56 GMT 2005
I very much appreciate this list and the available Sun documentation, but I am
looking for a book, website, PowerPoint presentation, or anything else that
explains (step by step?) how in general to set up and maintain (optimize)
queues using N1 Grid Engine 6.
We currently have several hundred Linux nodes of two different CPU
configurations and a dozen scientific users, all wanting to run multiple
batch jobs (anywhere from a single run to a hundred thousand!) with
durations from seconds to weeks. So far we have no complications like usage
limits or usage billing, software license limits, limits on which jobs can
be run on which nodes, checkpointing, or memory or disk space limits.
We users are all friends, and we want to keep it that way. But we all have
our own work to get done. We want to keep things fairly simple by using
several different (primary and secondary) queues that the users would
routinely submit to, rather than requiring a lot of job-specific user input
when submitting each individual job. Ideally you would just pick your queue
based on who you are and whether your job is long- or short-running. And
perhaps there would be a lower-priority queue for jobs you want to run when
you are not up against a deadline.
We have the aforementioned queues initially set up for each of two
organizations (even and odd nodes, by organization) and split into primary
(nice 0) and secondary (nice 20) queues. It runs! We users realize as we
watch it operate (using qstat and qmon) that we need some fine-tuning to
prevent too many jobs from running at a time on individual nodes (although I
find it hard to imagine how we could "hurt" the hardware that way!) and to
prevent long jobs from shutting out short jobs from starting, but we are
not sure exactly what that tuning should be.
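As a rough illustration of the setup described above, one organization's
primary and secondary queues might be written in queue_conf(5) terms along
the following lines. This is only a sketch: the queue names, the @org1_hosts
host group, and the slot count of 2 are made-up placeholders rather than
values from the cluster described here. The two queues are shown together for
comparison, though each would be its own queue definition, and the "#" lines
are annotations that may need to be removed before loading with qconf.

```
# Hypothetical primary queue for organization 1 (all names are placeholders)
qname      org1_primary
hostlist   @org1_hosts
priority   0
slots      2

# Hypothetical secondary queue on the same hosts, run at nice 20
qname      org1_secondary
hostlist   @org1_hosts
priority   20
slots      2
```

Here "priority" is the nice value applied to jobs started from the queue, and
"slots" caps how many jobs the queue will run concurrently on each host, so
it is one of the settings that bears on the too-many-jobs-per-node problem.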
I am especially struggling to understand the interactions between
"nice", "priorities", "load_thresholds", "subordinate list", and
"over-subscribing". These are all in the Sun documentation and have been
discussed here at length, but so far I've not been able to put it all
together into something coherent.
And I keep seeing references to things like the sched_conf(5) manpage, but
have no idea what they are talking about or where to find this, much less how to