[GE users] Any job suspension gotchas?

Rayson Ho raysonho at eseenet.com
Fri Apr 22 23:34:45 BST 2005


> I tried it on a test queue and got bizarre statuses like Tr.

Should be fine. T => Threshold, i.e. the job was suspended because a
suspend threshold of its queue was exceeded (see qstat(1)).
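
The state column in qstat is a string of one-letter flags, so "Tr" reads as
two flags rather than one odd status. A minimal sketch of decoding it,
assuming the letter meanings as I recall them from qstat(1) (check your own
man page); the names STATE_LETTERS and decode_state are made up for this
illustration and are not part of SGE:

```python
# Letter meanings as recalled from qstat(1); an assumption, so verify them
# against the man page shipped with your SGE version.
STATE_LETTERS = {
    "d": "deletion",
    "E": "error",
    "h": "hold",
    "r": "running",
    "R": "restarted",
    "s": "suspended",
    "S": "suspended (queue suspended)",
    "t": "transferring",
    "T": "threshold",
    "w": "waiting",
    "q": "queued",
}


def decode_state(state):
    """Split a qstat state string such as 'Tr' into per-letter meanings."""
    return [STATE_LETTERS.get(ch, "unknown (%s)" % ch) for ch in state]


if __name__ == "__main__":
    print(decode_state("Tr"))
```

Letters that are not in the table come back as "unknown (...)" instead of
raising an error, since newer releases may define more states.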

There are howtos on setting up share trees:

http://gridengine.sunsource.net/howto/geee.html
http://www.sun.com/blueprints/0703/817-3179.pdf

And remember that as of 6.0 the "Enterprise Edition" is merged into the
normal SGE mode, so the SGEEE 5.3 docs should apply to SGE 6.0.

Rayson


> The jobs turned weird colors in qmon.
>I could kill some jobs, expecting the others to be assigned as I expected
>them to be. No such luck. For a while it looked like qmaster had died. Then
>it came back to life. Live and learn!
> 
>Clearly I need to dive into share-trees, etc. etc. I was hoping to not have
>to do that, but sometimes one must....
> 
>Jim
>
>  _____  
>
>From: Marconnet, James E Mr /Computer Sciences Corporation
>[mailto:james.marconnet at smdc.army.mil] 
>Sent: Thursday, April 21, 2005 1:44 PM
>To: users at gridengine.sunsource.net
>Subject: [GE users] Any job suspension gotchas?
>
>
>
>Using 6.0u3. Many similar nodes in two CPU speeds, some single-processor and
>some dual-processor hyperthreaded; several different amounts of RAM; many
>users. No parallel jobs, no checkpointing, no special software licenses, or
>other special requirements that I'm aware of. Using queue subordination to
>prevent node oversubscribing.
>
>I had resisted using job suspension, but am now thinking this may be a good
>tool to permit better overall utilization of our cluster while still giving
>higher priority users a way for their jobs to start right away instead of
>having to wait for someone else's previously-started but lower priority jobs
>to finish.
>
>What we've been doing in a multiple group multiple deadlines crunch-time
>lately is to dedicate certain nodes to specific groups. That gave them
>instant access to all "their" nodes all the time, but it basically
>guaranteed that half or more of our nodes were sitting idle most of the time
>when another user running multiple jobs might have been able to use those
>nodes if we were using job suspension. Bummer.
>
>Since I don't know which parts of SGE 6.0u3 are working great and which are
>problematic, I thought it worth asking if this part of SGE works well, or if
>it causes problems.
>
>I understand that job suspension does not free up the memory, but from what
>I see so far, we have enough memory on each node for that to not be a
>particular problem if just one job gets suspended. Or SGE appears to have
>the flexibility in the settings to control which nodes with what RAM suspend
>jobs or not.
>
>Perhaps data latency is a problem, if suspended jobs have files open for
>writing for a long time while they are suspended? 
>
>Anything special to be on the lookout for or to consider before implementing
>job suspension?