[GE users] Any job suspension gotchas?

Marconnet, James E Mr /Computer Sciences Corporation james.marconnet at smdc.army.mil
Thu Apr 21 19:44:15 BST 2005

We're using SGE 6.0u3. Many similar nodes in two CPU speeds, some
single-processor and some dual-processor hyperthreaded; several different
amounts of RAM; many users. No parallel jobs, no checkpointing, no special
software licenses, and no other special requirements that I'm aware of. We
use queue subordination to prevent node oversubscription.

I had resisted using job suspension, but am now thinking it may be a good
tool to permit better overall utilization of our cluster while still giving
higher-priority users a way for their jobs to start right away instead of
having to wait for someone else's previously started but lower-priority jobs
to finish.

What we've been doing lately, during a crunch time with multiple groups and
multiple deadlines, is dedicating certain nodes to specific groups. That gave
each group instant access to all "their" nodes all the time, but it basically
guaranteed that half or more of our nodes were sitting idle most of the time,
when another user running multiple jobs might have been able to use those
nodes if we were using job suspension. Bummer.

Since I don't know which parts of SGE 6.0u3 are working great and which are
problematic, I thought it worth asking if this part of SGE works well, or if
it causes problems. 

I understand that job suspension does not free up the memory held by the
suspended job, but from what I see so far, we have enough memory on each node
for that not to be a particular problem if just one job gets suspended.
Failing that, SGE appears to have enough flexibility in its settings to
control which nodes, with what amount of RAM, suspend jobs or not.
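
To make the memory reasoning concrete, here is a rough sketch of the check I
have in mind: a node is a candidate for suspension only if its RAM can hold
the suspended job and the incoming job at the same time. The node names, RAM
sizes, job sizes, and the 1 GB operating-system reserve are all made-up
illustrative values, and the function is hypothetical; it is not part of SGE.

```python
def can_host_suspension(node_ram_gb, suspended_job_gb, incoming_job_gb,
                        os_reserve_gb=1.0):
    """Return True if a node can keep one suspended job resident in memory
    while running one incoming job (all figures are assumptions, in GB)."""
    needed = suspended_job_gb + incoming_job_gb + os_reserve_gb
    return node_ram_gb >= needed


# Hypothetical inventory: node name -> installed RAM in GB.
nodes = {"node01": 2.0, "node02": 4.0, "node03": 8.0}

# Assume a suspended job and an incoming job of 1.5 GB each.
eligible = sorted(
    name for name, ram in nodes.items()
    if can_host_suspension(ram, suspended_job_gb=1.5, incoming_job_gb=1.5)
)

if __name__ == "__main__":
    print(eligible)
```

Nodes that fail the check would simply be left out of whatever queue setup
does the suspending.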

Perhaps data latency is a problem, if suspended jobs have files open for
writing for a long time while they are suspended? 
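
The open-file concern can be tried outside SGE with a small sketch. It
assumes that suspension amounts to stopping the job's process with SIGSTOP
and resuming it with SIGCONT, which I believe is SGE's default behavior but
have not verified on 6.0u3. The child script, the file name, and the function
name are hypothetical, and this runs on POSIX systems only.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time


def suspend_resume_demo():
    """Stop and continue a child that holds a file open for writing."""
    child_code = (
        "import sys, time\n"
        "f = open(sys.argv[1], 'w')\n"
        "f.write('before\\n'); f.flush()\n"
        "time.sleep(60)\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.txt")
        proc = subprocess.Popen([sys.executable, "-c", child_code, path])
        try:
            # Wait (up to 10 s) until the child has flushed its first line.
            deadline = time.time() + 10
            while time.time() < deadline:
                if os.path.exists(path) and os.path.getsize(path) > 0:
                    break
                time.sleep(0.05)

            # "Suspend" the job: the process is stopped, not terminated.
            os.kill(proc.pid, signal.SIGSTOP)
            _, status = os.waitpid(proc.pid, os.WUNTRACED)
            stopped = os.WIFSTOPPED(status)

            # The process still exists, so its memory and open file remain.
            alive_while_stopped = proc.poll() is None
            with open(path) as fh:
                flushed_data = fh.read()

            # "Unsuspend" the job.
            os.kill(proc.pid, signal.SIGCONT)
        finally:
            proc.terminate()
            proc.wait()
    return {
        "stopped": stopped,
        "alive_while_stopped": alive_while_stopped,
        "flushed_data": flushed_data,
    }


if __name__ == "__main__":
    print(suspend_resume_demo())
```

Data flushed before the stop is visible to other readers, but anything still
sitting in the job's own buffers would stay there until the job is resumed,
which is the kind of delay I am wondering about.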

Anything special to be on the lookout for or to consider before implementing
job suspension?

Thanks in advance!
Jim Marconnet
