[GE users] qrsh_starter/shepherd problems?

Rob Naccarato rob.naccarato at oicr.on.ca
Mon Nov 24 12:58:49 GMT 2008


On Sun, Nov 23, 2008 at 12:22:15PM -0500, reuti wrote:
> Hi,
> 
> On 21.11.2008 at 16:13, Rob Naccarato wrote:
> 
> > Hi there.
> >
> > We seem to be having an issue with jobs submitted via qmake -> qrsh.
> > In particular, we are running the Solexa/Illumina Pipeline, if that
> > helps. All machines are 64-bit Debian with a custom 2.6.27 kernel.
> >
> > When qmake runs a qrsh, the job goes to another execution node as
> > expected and runs in a process tree like:
> >
> > root      2941  0.0  0.0  86444   3036 ?  Sl  09:11  0:00  \_ sge_shepherd-12040 -bg
> > usera     2942  0.0  0.0   5448    616 ?  Ss  09:11  0:00      \_ /oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/default/spool/cn3-83/active_jobs/12040.1 noshell
> > usera     2952  0.0  0.0   5744   1236 ?  S   09:11  0:00          \_ /bin/sh -c /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt > Matrix/s_7_0004_02_mat.txt.tmp && mv Matrix/s_7_0004_02_mat.txt.t
> > usera     2953  2.8  2.1 361492 352356 ?  D   09:11  0:03              \_ /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt
> >
> >
> > After some time, the job completes, but sge_shepherd does not exit.
> > The process tree then looks like:
> >
> > root     28034  0.5  0.0  70396  3776 ?  Sl  Nov20  5:23 /oicr/cluster/sge6.2/bin/lx24-amd64/sge_execd
> > root     28931  0.0  0.0  86380  3072 ?  Sl  Nov20  0:01  \_ sge_shepherd-10318 -bg
> > root      2941  0.0  0.0  86380  3072 ?  Sl  09:11  0:00  \_ sge_shepherd-12040 -bg
> >
> > The shepherd never exits. Running an strace on a shepherd reveals:
> >
> > Process 2941 attached - interrupt to quit
> > futex(0x7fff6ebe1470, FUTEX_WAIT, 3, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> > futex(0x7fff6ebe1470, FUTEX_WAIT, 1, NULL
> >
> >
> > Also, here's the end of the trace file for the shepherd:
> >
> >
> > 11/21/2008 09:11:49 [1084:2942]: args[0] = "/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter"
> > 11/21/2008 09:11:49 [1084:2942]: args[1] = "/var/spool/sge/default/spool/cn3-83/active_jobs/12040.1"
> > 11/21/2008 09:11:49 [1084:2942]: args[2] = "noshell"
> > 11/21/2008 09:11:49 [1084:2942]: execvp(/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter, ...);
> > 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: closing pipe to child
> > 11/21/2008 09:14:21 [0:2941]: wait3 returned 2942 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> > 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: append_to_buf() returned -1, errno 0, Success -> exiting
> >
> > If I restart sge_execd on the execution node, SGE considers the job
> > failed.
> 
> Was there any process jumping out of the process tree, i.e. via a fork
> or `&`, and hanging at the end of the process table?

No. The processes did not appear to detach from the calling shepherd.
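
(For anyone who hits the same hang: a blocked shepherd can be inspected in place without killing it. A rough sketch using standard Linux tooling, not anything SGE-specific; the PID 2941 is the example from the strace above and is obviously site-specific:)

```shell
# Assumed PID of the hung shepherd (taken from the strace above); adjust.
PID=2941

# Confirm the process is asleep in a futex wait inside the kernel:
grep '^State' /proc/$PID/status
cat /proc/$PID/wchan; echo

# Dump all userspace thread stacks non-destructively:
gdb -batch -p $PID -ex 'thread apply all bt'
```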

I managed to work around the problem by using ssh as our transport; with
ssh, the shepherds terminate properly.
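
(For reference, switching the qrsh transport to ssh is done through the interactive-job parameters in the global cluster configuration. A minimal sketch, assuming standard Debian paths for ssh/sshd; adjust for your site:)

```shell
# Edit the global configuration (opens $EDITOR):
#   qconf -mconf
# and replace the "builtin" interactive transport with ssh, e.g.:
#   rsh_command    /usr/bin/ssh
#   rsh_daemon     /usr/sbin/sshd -i
# The qlogin_*/rlogin_* parameters can be pointed at ssh/sshd the same
# way if interactive logins show the same hang.
```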

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89676



