[GE users] a few questions...

juby justin.ottley at gmail.com
Fri Jan 29 20:22:05 GMT 2010

Hey -

reuti wrote:
> Hi,
> Am 29.01.2010 um 17:10 schrieb juby:
>> Hi Paul,
>> Last year I deployed an SGE 6.1u4 cluster for our renderfarm (animation
>> and visual effects studio). We run array jobs for a bunch of the usual
>> suspects - maya, houdini, prman, fusion, nuke, shake, etc.
>> paul_simpson wrote:
>>> hello all,
>>> this is my first post to this list.  i've just installed sge6.2u5  
>>> - which i'll be using to calculate simulations, render animation  
>>> and other animation related batch jobs.  so far so good - sge may  
>>> be ugly on the (gui) surface - but i do appreciate the beauty and  
>>> power beneath.  ie, the parallel environment is great for  
>>> controlling cpu slot allocation, as are reporting options and user/ 
>>> queue config options.  the choice is almost daunting at first!
>>> i've a few simple questions if that's ok:
>>> 1/ is it possible to restart a task of an array job (ie, a render  
>>> frame)?  ie, if a machine bugs out - i just want to reassign the  
>>> failed frame again rather than rerun the complete job.  so far  
>>> qresub only works at the job granularity from what i can see...
>> yes, take a look at qmod.
>> Particularly, qmod -rj (or -r or -rq).
> yep, but only while the task is still in the system. Once it has left,
> it can't be rescheduled this way, even though other tasks of the array
> job are still in the system as a template.
Yes, that's a good point, thanks for clarifying that :)
Also, I suppose it's worth mentioning that reschedule_unknown (the 
configuration setting that lets jobs on hosts in an unknown state be 
rescheduled) is another potential option.
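
For what it's worth, here is a rough sketch of how rescheduling a single
array task could be scripted around qmod -rj. The helper name and the
job/task numbers are made up for illustration, and it assumes your qmod
accepts the job_id.task_id form. The command is only built here, not run.

```python
import shlex


def build_reschedule_cmd(job_id, task_id=None):
    """Build (but do not run) a qmod command that reschedules a job.

    Assumption: qmod accepts "job_id.task_id" to address one array task.
    build_reschedule_cmd is a hypothetical helper, not part of SGE.
    """
    if job_id <= 0:
        raise ValueError("job_id must be positive")
    target = str(job_id)
    if task_id is not None:
        if task_id <= 0:
            raise ValueError("task_id must be positive")
        target = f"{job_id}.{task_id}"
    return ["qmod", "-rj", target]


if __name__ == "__main__":
    # Example: frame 42 of (made-up) array job 1234 died on a bad blade.
    cmd = build_reschedule_cmd(1234, 42)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

As Reuti points out, this can only work while the task is still in the
system.
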
> I think you could enter an RFE to allow the -t option to qresub. If
> no -t option is specified, use the one which was used in the original
> qsub, otherwise the specified one. So, instead of appending the
> task id to the job number, you would need to apply -t with the correct
> number.
>>> 2/ are there any mature guis/front ends that i can leverage?  i  
>>> see that there's an xml output and a java gui (xml-qstat).  before  
>>> i install any more new software - it would be great if anyone  
>>> could give me a quick heads up on status.
>> I wrote a custom job view GUI for our cluster, since production
>> required more sophisticated (workflow specific) features than I found
>> with qmon or xml-qstat.
>> I'd be interested to hear about anything you find though!
> Which information do you need from the GUIs/front ends in detail?
> -- Reuti
Well, IMHO it's a matter of both the information users need and the 
presentation of that information.

Information wise, the gui shows workflow-specific information that jobs 
carry, such as "shot id", "job type", "frame range", "order", "scene", 
"task" (etc.), which is extracted from the job and presented in columns. 
Then there is progress information that is rendered as job progress 
bars, and other frame information that shows which frame range a job 
(or job-task) is running. Then there is stuff like queue time, run time 
and priority (a workflow-specific interpretation of the qsub -p value). 
There is colour-coding for job states (running, rerunning, queued, 
delayed, error, killed, completed, etc).
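
To give a rough idea of the colour-coding part, here is a minimal sketch
that maps qstat state codes to a label and a colour. The table, the
colours and the function name are all assumptions for illustration (in
particular treating hqw as "delayed"), not the actual code of our gui.
"killed" and "completed" are left out of the sketch.

```python
# Hypothetical mapping from qstat state codes to (label, colour).
STATE_STYLES = {
    "r": ("running", "green"),
    "Rr": ("rerunning", "cyan"),
    "qw": ("queued", "grey"),
    "hqw": ("delayed", "yellow"),  # assumption: a held job shows as delayed
    "Eqw": ("error", "red"),
}


def style_for_state(code):
    """Return (label, colour) for a state code, with simple fallbacks."""
    if code in STATE_STYLES:
        return STATE_STYLES[code]
    if "E" in code:
        # Assumption: any code containing E is shown as an error.
        return ("error", "red")
    return ("unknown", "white")


if __name__ == "__main__":
    for code in ("r", "Rr", "qw", "hqw", "Eqw"):
        label, colour = style_for_state(code)
        print(f"{code:>4} -> {label} ({colour})")
```
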

Another very important part of the gui is transparent use of the 
database behind ARCo, for example for job history (the jobs themselves, 
paths to logs, job run times, etc) and user history.

Another critical part is the display of job logs. We have a relatively 
sophisticated logging system that annotates the job output, which allows 
detection of known problems in a render and differentiation of 
"important" output vs "debug" output vs <whatever> output. So via the 
gui, the job logs are filterable and paginated, known errors are 
highlighted, and logs can be streamed in real time while the job is 
running (w/o NFS usage). These features are very useful for debugging 
renders (but again are site specific).
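
As a very simplified illustration of the log annotation idea, the sketch
below tags each log line with a level, filters by level and paginates
the result. The patterns, level names and function names are invented
for the example and are not our production rules. Streaming is not
covered.

```python
import re

# Hypothetical annotation rules: (level, pattern), checked in order.
RULES = [
    ("error", re.compile(r"license|segmentation fault|out of memory", re.I)),
    ("debug", re.compile(r"^\s*DEBUG\b")),
]


def annotate(line):
    """Return the level of the first matching rule, else "info"."""
    for level, pattern in RULES:
        if pattern.search(line):
            return level
    return "info"


def filter_log(lines, levels):
    """Keep only lines whose level is in `levels`, as (level, line) pairs."""
    annotated = ((annotate(line), line) for line in lines)
    return [(level, line) for level, line in annotated if level in levels]


def paginate(items, page, per_page=50):
    """Return one zero-indexed page of items."""
    start = page * per_page
    return items[start:start + per_page]


if __name__ == "__main__":
    log = [
        "DEBUG loading scene",
        "Rendering frame 12",
        "Error: could not get license",
    ]
    for level, line in paginate(filter_log(log, {"error", "info"}), 0):
        print(f"[{level}] {line}")
```
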

Presentation wise, I'm not sure I can describe it in detail at the 
moment, but there are some sophisticated widgets, such as a separate 
torrent-style progress summary for the whole array job (mouse 
sensitive.. it's pretty... artists like it..). Without something like a 
screenshot it might not be easy to see how it differs from something 
like xml-qstat. I think it's hard to compare a workflow-specific 
solution (intended for a certain kind of user) with something intended 
to be workflow independent (like xml-qstat, which is great in itself, 
and qmon, which we still use heavily for maintenance and 
administration), since they are solving different problems and have 
different sets of requirements.


>>> 3/ i've also read with great interest about ganglia and other  
>>> monitoring software.  again - are there any up-to-date summaries?   
>>> i'm even thinking maybe i could use this as a first stab at job  
>>> monitoring.. (as well as grid health monitor)
>> We use ARCo here (although ganglia looks very nice), and we are pretty
>> happy with it so far. We use it for data and graphs for frame render
>> times, license usage, slot usage, blade health, server load, etc. They
>> are all handcrafted queries since we have a particular workflow for
>> our jobs (just some simple tricks to get job data into the database)
>> but nothing really crazy. There are a couple of UI things I would like
>> to improve, but all in all it's been good.
>>> 4/ are there many people using sge within the animation context?   
>>> if so, are there any specific mailing lists that i should be on?
>> I've seen one or two people on this list (besides you), but I'm not
>> sure.. if you do find any specific mailing lists I'd be interested to
>> know!
>>> i'm also keen to hear any good/bad experiences before i leap in  
>>> with both feet first.
>> As for the rest of our installation, roughly:
>> - SGE 6.1u4
>> - BDB RPC
>> - ARCo
>> - local execd spool
>> ~ 800 CPUs up (linux, OS X, windows)
>> Checking some stats now, we run somewhere between 10,000 and 25,000
>> array jobs a day.
>> Our qmaster server is not the "latest" hardware by any means, and runs
>> at < 0.5 np_load_avg, same with our ARCo server. Our BDB server load
>> is negligible.
>> I chose BDB-RPC for qmaster failover, which has come in handy in the
>> past. Of course the BDB single point of failure isn't ideal.. I had a
>> prototype solution for BDB replication/failover but never saw it
>> through far enough to know if it worked in practice (so for now I just
>> protect the bdb server from the outside world).
>> I have a few small peeves about the XML output: for example, the 6.1u4
>> qstat -j doesn't give you a state code for the job (r, Rr, qw, hqw,
>> etc), and there is invalid XML in some edge cases (later versions are
>> better afaik)...
>> -justin
>>> many thanks in advance.  regards,
>>> paul
>>> ------------------------------------------------------
>>> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241633
>>> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

