[GE users] Suspend/Resume with MPICH-GM

Andreas Haas Andreas.Haas at Sun.COM
Wed Mar 29 10:35:54 BST 2006

Hi Andrew,

As Reuti pointed out, task-based suspension is not supported for
parallel jobs. You can, however, suspend the job as a whole through
the suspend_method/resume_method settings in queue_conf(5), which are
run on the job's master node. So if you can somehow halt and continue
the Fluent run, e.g. by creating and removing some special file or
something similar, you could implement job-based suspension.
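
A minimal sketch of the flag-file idea, for illustration only. It
assumes the application (or a wrapper around it) polls for the flag
file and pauses while the file exists; whether Fluent can be made to
do that is not confirmed here. The script name, the file naming scheme
and the SUSPEND_FLAG_DIR variable are hypothetical, and the directory
would have to be visible on every node that needs to see the flag.

```python
#!/usr/bin/env python3
"""Hypothetical helper for suspend_method / resume_method (queue_conf(5))."""
import os
import sys

# Assumption: the job polls this directory for the flag file.
# SUSPEND_FLAG_DIR is a made-up variable name, not an SGE setting.
FLAG_DIR = os.environ.get("SUSPEND_FLAG_DIR", "/tmp")


def flag_path(job_id, flag_dir=FLAG_DIR):
    """Return the path of the flag file for one job."""
    return os.path.join(flag_dir, "suspend.%s" % job_id)


def suspend(job_id, flag_dir=FLAG_DIR):
    """Create the flag file; the job is expected to pause when it sees it."""
    path = flag_path(job_id, flag_dir)
    with open(path, "w"):
        pass  # an empty file is enough, only its existence matters
    return path


def resume(job_id, flag_dir=FLAG_DIR):
    """Remove the flag file; a missing file is not treated as an error."""
    try:
        os.remove(flag_path(job_id, flag_dir))
    except FileNotFoundError:
        pass


ACTIONS = {"suspend": suspend, "resume": resume}

# Invoked as: <script> suspend|resume <job_id>
if __name__ == "__main__" and len(sys.argv) == 3 and sys.argv[1] in ACTIONS:
    ACTIONS[sys.argv[1]](sys.argv[2])
```

In the queue configuration this could be referenced along the lines of
"suspend_method /path/to/sge_flag.py suspend $job_id" and
"resume_method /path/to/sge_flag.py resume $job_id", where the path is
a placeholder and $job_id is one of the pseudo variables described in
queue_conf(5).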


On Wed, 29 Mar 2006, Andrew Beresford wrote:

> Hello,
> I'm having a problem with jobs running in our MPICH-GM PE.
> When I issue a qmod -sj <blah> to grid engine nothing seems to happen.
> This only seems to affect our MPICH PE, the jobs running under OpenMP
> seem to be fine.
> Here's an example of the pstree of the processes running on the workers:
> |-scsi_eh_0
> |-sge_execd---sge_shepherd---rshd---qrsh_starter---bash---fluent-run-mep0---fluent---fluent_gmpi.6.2
> If I try to stop the job running fluent_gmpi.6.2 by using qmod -sj,
> nothing happens.
> If I try to send a SIGSTOP to the bash process under qrsh_starter, again
> nothing happens.
> It only suspends if I send a SIGSTOP to the "fluent_gmpi.6.2".
> I'm unsure how SGE suspends processes. Does it just send a SIGSTOP to
> the single process at the top, or does it traverse the process tree and
> send SIGSTOP to all processes underneath qrsh_starter?
> Is there anything I can do to fix this?
> Cheers,
> Andrew
