[GE users] Any job suspension gotchas?

Marconnet, James E Mr /Computer Sciences Corporation james.marconnet at smdc.army.mil
Fri Apr 22 22:19:29 BST 2005

I love the silence you hear in this users group when you say or ask
something completely ignorant, like I just did.
What I was thinking of would not work at all for us. I tried it on a test
queue and got bizarre statuses like Tr. The jobs turned weird colors in qmon.
I could kill some jobs, but the others were not then assigned the way I
expected. No such luck. For a while it looked like qmaster had died. Then
it came back to life. Live and learn!
Clearly I need to dive into share-trees and the like. I was hoping not to
have to do that, but sometimes one must.


From: Marconnet, James E Mr /Computer Sciences Corporation
[mailto:james.marconnet at smdc.army.mil] 
Sent: Thursday, April 21, 2005 1:44 PM
To: users at gridengine.sunsource.net
Subject: [GE users] Any job suspension gotchas?

Using 6.0u3. Many similar nodes in two CPU speeds, some single-processor and
some dual-processor hyperthreaded; several different amounts of RAM; many
users. No parallel jobs, no checkpointing, no special software licenses, or
other special requirements that I'm aware of. Using queue subordination to
prevent oversubscribing nodes.

I had resisted using job suspension, but am now thinking it may be a good
tool for getting better overall utilization of our cluster while still giving
higher-priority users a way for their jobs to start right away instead of
having to wait for someone else's previously started but lower-priority jobs
to finish.

What we've been doing lately, in a crunch time with multiple groups and
multiple deadlines, is to dedicate certain nodes to specific groups. That
gave them instant access to all "their" nodes all the time, but it basically
guaranteed that half or more of our nodes were sitting idle most of the time,
when another user running multiple jobs might have been able to use those
nodes if we were using job suspension. Bummer.

Since I don't know which parts of SGE 6.0u3 are working great and which are
problematic, I thought it worth asking if this part of SGE works well, or if
it causes problems. 

I understand that job suspension does not free up the memory, but from what
I see so far, we have enough memory on each node for that not to be a
particular problem if just one job gets suspended. Failing that, SGE appears
to have the flexibility in its settings to control which nodes, with what
RAM, suspend jobs or not.
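
A minimal sketch of why a suspended job keeps its memory, assuming the
suspension is delivered as a plain SIGSTOP (which I believe is the Grid
Engine default, but check your queue configuration). This is not SGE code;
the function name and the 10 MB child are made up for illustration, and it
only runs on a POSIX system. The stopped process stays in the process table
with everything it had allocated, and picks up where it left off on SIGCONT.

```python
import signal
import subprocess
import sys
import time


def suspend_and_resume(pause_seconds=0.2):
    """Stop a child process with SIGSTOP, then continue it with SIGCONT.

    Returns (alive_while_stopped, exit_code_after_resume).
    Hypothetical helper for illustration only.
    """
    # The child allocates about 10 MB, sleeps briefly, then exits normally.
    child_code = "import time; data = bytearray(10_000_000); time.sleep(1)"
    child = subprocess.Popen([sys.executable, "-c", child_code])
    try:
        time.sleep(pause_seconds)            # give the child time to start
        child.send_signal(signal.SIGSTOP)    # suspend; the process is not torn down
        time.sleep(pause_seconds)
        # poll() returns None while the process still exists, stopped or not.
        # The kernel does not release a stopped process's memory.
        alive_while_stopped = child.poll() is None
        child.send_signal(signal.SIGCONT)    # resume from where it stopped
        exit_code = child.wait(timeout=30)
    finally:
        if child.poll() is None:
            child.kill()                     # clean up if something went wrong
    return alive_while_stopped, exit_code


if __name__ == "__main__":
    print(suspend_and_resume())
```

The same reasoning suggests that anything else the process holds, such as
open files, also stays held while it is stopped.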

Perhaps data latency is a problem, if suspended jobs have files open for
writing for a long time while they are suspended? 

Anything special to be on the lookout for or to consider before implementing
job suspension? 

Thanks in advance! 
Jim Marconnet 
