[GE users] Suspend/Resume with MPICH-GM

Andrew Beresford a.j.beresford at sheffield.ac.uk
Wed Mar 29 10:04:06 BST 2006



Hello,

I'm having a problem with jobs running in our MPICH-GM PE.

When I issue qmod -sj <job_id> to Grid Engine, nothing seems to happen.
This only seems to affect our MPICH PE; jobs running under OpenMP
seem to be fine.

Here's an example of the pstree output for the processes running on the workers:

|-scsi_eh_0
|-sge_execd---sge_shepherd---rshd---qrsh_starter---bash---fluent-run-mep0---fluent---fluent_gmpi.6.2


If I try to stop the job running fluent_gmpi.6.2 by using qmod -sj,
nothing happens.

If I try to send a SIGSTOP to the bash process under qrsh_starter, again
nothing happens.

The job only suspends if I send a SIGSTOP directly to the
fluent_gmpi.6.2 process.

I'm unsure how SGE suspends processes. Does it just send a SIGSTOP to
the single process at the top, or does it traverse the process tree and
send SIGSTOP to all processes underneath qrsh_starter?
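
To make the second possibility concrete, here is a rough Python sketch of
what I mean by traversing the process tree. It is not a description of how
SGE actually behaves (that is exactly what I don't know). The function
names read_parent_map, descendants and signal_tree are made up for
illustration, and it assumes a ps that accepts -e and -o pid= / -o ppid=.

```python
import os
import signal
import subprocess


def read_parent_map():
    """Return {pid: ppid} for every process, parsed from ps output."""
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid="],
        capture_output=True, text=True, check=True,
    ).stdout
    parents = {}
    for line in out.splitlines():
        fields = line.split()
        if len(fields) == 2:
            parents[int(fields[0])] = int(fields[1])
    return parents


def descendants(parents, root):
    """Return root plus every process below it, parents before children."""
    children = {}
    for pid, ppid in parents.items():
        children.setdefault(ppid, []).append(pid)
    ordered, queue, seen = [], [root], set()
    while queue:
        pid = queue.pop(0)
        if pid in seen:
            continue  # guard against a pid listed as its own parent
        seen.add(pid)
        ordered.append(pid)
        queue.extend(sorted(children.get(pid, [])))
    return ordered


def signal_tree(root, sig=signal.SIGSTOP):
    """Send sig to root and to all of its descendants (hypothetical helper)."""
    for pid in descendants(read_parent_map(), root):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass  # process exited between the snapshot and the kill
```

With something like this, signal_tree(<pid of qrsh_starter>) would stop
everything below qrsh_starter, including fluent_gmpi.6.2, and calling it
again with signal.SIGCONT would resume the same set of processes.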

Is there anything I can do to fix this?

Cheers,

Andrew




