[GE users] a few questions...
justin.ottley at gmail.com
Fri Jan 29 16:10:25 GMT 2010
Last year I deployed an SGE 6.1u4 cluster for our renderfarm (animation
and visual effects studio). We run array jobs of a bunch of the usual
suspects - maya, houdini, prman, fusion, nuke, shake, etc.
> hello all,
> this is my first post to this list. i've just installed sge6.2u5 - which i'll be using to calculate simulations, render animation and other animation related batch jobs. so far so good - sge may be ugly on the (gui) surface - but i do appreciate the beauty and power beneath. ie, the parallel environment is great for controlling cpu slot allocation, as are reporting options and user/queue config options. the choice is almost daunting at first!
> i've a few simple questions if thats ok:
> 1/ is it possible to restart a task of an array job (ie, a render frame)? ie, if a machine bugs out - i just want to reassign the failed frame again rather than rerun the complete job. so far qresub only works at the job granularity from what i can see...
Yes, take a look at qmod, particularly qmod -rj (or -r or -rq), which
reschedules running jobs.
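
As a rough sketch of how a render wrangler script might requeue a single failed frame, the helper below builds a qmod -rj command for one task of an array job. It assumes qmod accepts the "job_id.task_id" form to address a single task, and, as I understand it, the job or its queue needs the rerun flag set for the reschedule to take effect. The function names and the dry_run switch are made up for illustration.

```python
import subprocess


def reschedule_cmd(job_id, task_id=None):
    """Build the argv for rescheduling a job, or one task of an array job."""
    # Assumption: qmod addresses a single array task as "job_id.task_id".
    target = str(job_id) if task_id is None else f"{job_id}.{task_id}"
    return ["qmod", "-rj", target]


def reschedule(job_id, task_id=None, dry_run=True):
    """Print the command (dry run) or actually run it via subprocess."""
    cmd = reschedule_cmd(job_id, task_id)
    if dry_run:
        print(" ".join(cmd))
    else:
        # Raises CalledProcessError if qmod exits with a non-zero status.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Requeue frame 57 of (hypothetical) array job 1234 without touching SGE.
    reschedule(1234, 57, dry_run=True)
```

Keeping the command construction separate from the subprocess call makes it easy to check what would be sent to the qmaster before running it for real.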
> 2/ are there any mature gui's/front ends that i can leverage? i see that there's a xml output and a java gui (xml-qstat). before i install any more new software - it would be great if anyone could give me a quick heads up on status.
I wrote a custom job view GUI for our cluster, since production required
more sophisticated (workflow-specific) features than I found with qmon.
I'd be interested to hear about anything you find though!
> 3/ i've also read with great interest at ganglia and other monitoring software. again - are there any up-to date summaries? i'm even thinking maybe i could use this as a first stab at job monitoring.. (as well as grid health monitor)
We use ARCo here (although ganglia looks very nice), and we are pretty
happy with it so far. We use it for data and graphs for frame render
times, license usage, slot usage, blade health, server load, etc. They
are all handcrafted queries since we have a particular workflow for our
jobs (just some simple tricks to get job data into the database), but
nothing really crazy. There are a couple of UI things I would like to
improve, but all in all it's been good.
> 4/ are there many people using sge within the animation context? if so, are there any specific mailing lists that i should be on?
I've seen one or two people on this list (besides you), but I'm not sure.
If you do find any specific mailing lists, I'd be interested to know!
> i'm also keen to hear any good/bad experiences before i leap in with both feet first.
As for the rest of our installation, roughly:
- SGE 6.1u4
- BDB RPC
- local execd spool
- ~800 CPUs up (Linux, OS X, Windows)
Checking some stats now, we run somewhere between 10,000 and 25,000 array
jobs a day.
Our qmaster server is not the "latest" hardware by any means, and runs
at < 0.5 np_load_avg, same with our ARco server. Our BDB server load is
I chose BDB-RPC for qmaster failover, which has come in handy in the
past. Of course, the BDB server as a single point of failure isn't ideal.
I had a prototype solution for BDB replication/failover but never saw it
through far enough to know if it worked in practice (so for now I just
protect the BDB server from the outside world).
I have a few small peeves about the XML output: for example, in 6.1u4,
qstat -j doesn't give you a state code for the job (r, Rr, qw, hqw, etc.),
and the XML is invalid in some edge cases (later versions are better, afaik).
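
For a custom job view, one possible workaround is to read the state code from the job list output (qstat -xml) rather than from qstat -j. The sketch below parses a small embedded sample instead of live output. The element names (job_list, JB_job_number, tasks, state) are what I recall of that output and should be treated as an assumption to check against your version. The job data is made up, and malformed XML is simply reported as an empty list.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like job-list XML output; the values are invented.
SAMPLE = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>1234</JB_job_number>
      <JB_name>render_shot42</JB_name>
      <state>r</state>
      <tasks>7</tasks>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>1235</JB_job_number>
      <JB_name>sim_cloth</JB_name>
      <state>hqw</state>
      <tasks>1-100:1</tasks>
    </job_list>
  </job_info>
</job_info>
"""


def job_states(xml_text):
    """Return (job number, tasks, state code) tuples, or [] if the XML is invalid."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Guard against malformed output rather than crashing the GUI.
        return []
    return [
        (job.findtext("JB_job_number"), job.findtext("tasks"), job.findtext("state"))
        for job in root.iter("job_list")
    ]


if __name__ == "__main__":
    for number, tasks, state in job_states(SAMPLE):
        print(number, tasks, state)
```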
> many thanks in advance. regards,