[GE users] Parallel jobs don't terminate

mlelstv mlelstv at serpens.de
Tue Jul 28 21:34:48 BST 2009


On Tue, Jul 28, 2009 at 12:42:01PM +0100, markhewitt wrote:
> I have a problem with users running MPI jobs. Basically everything 
> starts up ok. But for some reason when a job is terminated from SGE 
> (either reaches maximum wallclock time or a user issues qdel). Then it 
> removes the job from the list in SGE but the processes remain running on 
> the nodes. Meaning they quickly become overloaded with orphan processes.
> 
> Any ideas what could be going wrong here?

That is exactly what happens: SGE kills only the job script.
Processes spawned from it, in particular those on other hosts, are
not affected. There are two solutions to the problem: tight integration
and the notify mechanism.

With tight integration, SGE itself spawns the processes on all hosts
of a job and keeps track of them, so it can also kill them.
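
To make the idea concrete, here is a rough Python sketch of what a
tightly integrated launcher does: it reads the host list SGE writes to
the file named by $PE_HOSTFILE and starts each remote task with
"qrsh -inherit" instead of plain rsh/ssh, so the task runs under SGE's
control. The helper names (parse_pe_hostfile, qrsh_commands) and the
sample host names are made up for illustration, and this assumes the
parallel environment has control_slaves set to TRUE. An MPI built with
SGE support normally does this for you.

```python
def parse_pe_hostfile(text):
    """Parse $PE_HOSTFILE content into (host, slots) pairs.

    Each line is expected to look like: host slots queue processor_range
    """
    hosts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        hosts.append((fields[0], int(fields[1])))
    return hosts


def qrsh_commands(hosts, cmd):
    """Build one 'qrsh -inherit <host> <cmd...>' argument list per host.

    Illustration only: a real launcher also decides how many ranks to
    start per host from the slot count.
    """
    return [["qrsh", "-inherit", host] + list(cmd) for host, _slots in hosts]


# Hypothetical hostfile content for a 6-slot job on two nodes.
sample = "node01 4 all.q@node01 UNDEFINED\nnode02 2 all.q@node02 UNDEFINED\n"
hosts = parse_pe_hostfile(sample)
commands = qrsh_commands(hosts, ["./solver", "--input", "case.dat"])
```

Because every remote task is started through qrsh, SGE knows about it
and can signal it when the job is deleted.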

With notification, the job script isn't killed directly. It can
catch the termination event and tell the application to shut down
gracefully.
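
As a minimal sketch of the notify approach, assuming the job is
submitted with "qsub -notify" (SGE then sends SIGUSR2 some time before
the final SIGKILL): the wrapper below starts the real command, and when
SIGUSR2 arrives it sends SIGTERM to that command, escalating to SIGKILL
after a grace period. The function name run_with_notify and the
10-second default are my own choices, not anything SGE prescribes, and
whether the remote ranks actually exit depends on your MPI launcher
passing the shutdown on to them.

```python
import signal
import subprocess
import sys
import time


def run_with_notify(cmd, grace_seconds=10.0):
    """Run cmd and shut it down gracefully when SIGUSR2 arrives."""
    child = subprocess.Popen(cmd)
    notified = []

    def on_notify(signum, frame):
        # Ask the application (e.g. mpirun) to shut down.
        notified.append(signum)
        child.terminate()

    previous = signal.signal(signal.SIGUSR2, on_notify)
    deadline = None
    try:
        while True:
            try:
                # Poll so the grace period can be enforced after a notify.
                return child.wait(timeout=0.2)
            except subprocess.TimeoutExpired:
                if notified and deadline is None:
                    deadline = time.monotonic() + grace_seconds
                if deadline is not None and time.monotonic() > deadline:
                    child.kill()  # grace period expired
    finally:
        signal.signal(signal.SIGUSR2, previous)


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_notify(sys.argv[1:]))
```

The delay between SIGUSR2 and SIGKILL comes from the queue's notify
setting, so the grace period used here should be shorter than that.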

Neither is a simple switch: your job script and possibly your MPI
installation need to be tailored to it.


Greetings,
-- 
                                Michael van Elst
Internet: mlelstv at serpens.de
                                "A potential Snark may lurk in every tree."

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=209951
