[GE users] a few questions...

reuti reuti at staff.uni-marburg.de
Fri Jan 29 17:56:43 GMT 2010


Hi,

On 29.01.2010 at 17:10, juby wrote:

> Hi Paul,
>
> Last year I deployed an SGE 6.1u4 cluster for our renderfarm (animation
> and visual effects studio). We run array jobs of a bunch of the usual
> suspects - maya, houdini, prman, fusion, nuke, shake, etc.
>
> paul_simpson wrote:
>> hello all,
>>
>> this is my first post to this list.  i've just installed sge6.2u5  
>> - which i'll be using to calculate simulations, render animation  
>> and other animation related batch jobs.  so far so good - sge may  
>> be ugly on the (gui) surface - but i do appreciate the beauty and  
>> power beneath.  ie, the parallel environment is great for  
>> controlling cpu slot allocation, as are reporting options and user/ 
>> queue config options.  the choice is almost daunting at first!
>>
>> i've a few simple questions if that's ok:
>> 1/ is it possible to restart a task of an array job (ie, a render  
>> frame)?  ie, if a machine bugs out - i just want to reassign the  
>> failed frame again rather than rerun the complete job.  so far  
>> qresub only works at the job granularity from what i can see...
>>
> yes, take a look at qmod.
> Particularly, qmod -rj (or -r or -rq).

Yes, but only while the task is still in the system. Once it has left,
it can't be rescheduled this way, even though other tasks of the same
array job are still in the system as a template.
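
For a task that is still in the system, the reschedule call can be scripted. The following is only a minimal sketch: it assumes that qmod -rj accepts a target written as <job_id>.<task_id> on your version (please check qmod(1) first), and the function names are made up for illustration.

```python
import shutil
import subprocess


def reschedule_command(job_id, task_id):
    """Build the qmod call that asks SGE to reschedule one array task."""
    if job_id < 1 or task_id < 1:
        raise ValueError("job and task ids must be positive integers")
    # Assumption: a single task is addressed as <job_id>.<task_id>.
    return ["qmod", "-rj", "{}.{}".format(job_id, task_id)]


def reschedule_task(job_id, task_id):
    """Run the command; return True only if qmod exited with status 0."""
    if shutil.which("qmod") is None:
        # Not on a submit/admin host, or SGE is not in PATH.
        return False
    result = subprocess.run(
        reschedule_command(job_id, task_id),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0
```

A call such as reschedule_task(4711, 12) would then request a rerun of frame 12 of job 4711, provided that task has not yet left the system.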

I think you could enter an RFE to allow the -t option for qresub. If
no -t option is specified, it would use the one given in the original
qsub, otherwise the specified one. So, instead of appending the task
id to the job number, you would pass -t with the correct number.


>> 2/ are there any mature gui's/front ends that i can leverage?  i  
>> see that there's a xml output and a java gui (xml-qstat).  before  
>> i install any more new software - it would be great if anyone  
>> could give me a quick heads up on status.
>>
> I wrote a custom job view GUI for our cluster, since production required
> more sophisticated (workflow-specific) features than I found with qmon
> or xml-qstat.
> I'd be interested to hear about anything you find though!

What information exactly do you need from the GUIs/front ends?

-- Reuti


>> 3/ i've also looked with great interest at ganglia and other
>> monitoring software.  again - are there any up-to-date summaries?
>> i'm even thinking maybe i could use this as a first stab at job
>> monitoring.. (as well as grid health monitor)
>>
> We use ARCo here (although ganglia looks very nice), and we are pretty
> happy with it so far. We use it for data and graphs for frame render
> times, license usage, slot usage, blade health, server load, etc. They
> are all handcrafted queries since we have a particular workflow for our
> jobs (just some simple tricks to get job data into the database) but
> nothing really crazy. There are a couple of UI things I would like to
> improve, but all in all it's been good.
>> 4/ are there many people using sge within the animation context?   
>> if so, are there any specific mailing lists that i should be on?
> I've seen one or two people on this list (besides you), but I'm not sure..
> If you do find any specific mailing lists I'd be interested to know!
>> i'm also keen to hear any good/bad experiences before i leap in  
>> with both feet first.
>>
> As for the rest of our installation, roughly:
> - SGE 6.1u4
> - BDB RPC
> - ARCo
> - local execd spool
> ~ 800 CPUs up (linux, OS X, windows)
>
> Checking some stats now, we run somewhere between 10,000 and 25,000
> array jobs a day.
>
> Our qmaster server is not the "latest" hardware by any means, and runs
> at < 0.5 np_load_avg, same with our ARCo server. Our BDB server load is
> negligible.
> I chose BDB-RPC for qmaster failover, which has come in handy in the
> past. Of course the BDB single point of failure isn't ideal.. I had a
> prototype solution for BDB replication/failover but never saw it through
> far enough to know if it worked in practice (so for now I just protect
> the BDB server from the outside world).
>
> I have a few small peeves about the XML output: for example, in 6.1u4
> qstat -j doesn't give you a state code for the job (r, Rr, qw, hqw, etc.),
> and there is invalid XML in some edge cases (later versions are better
> afaik)...
>
> -justin
>
>> many thanks in advance.  regards,
>>
>> paul

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=241775

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


