[GE users] qrsh_starter/shepherd problems?

Rob Naccarato rob.naccarato at oicr.on.ca
Fri Nov 21 15:13:13 GMT 2008


Hi there.

We seem to be having an issue with jobs submitted via qmake->qrsh. In
particular, we are running the Solexa/Illumina Pipeline, if that helps. All
machines run 64-bit Debian with a custom 2.6.27 kernel.

When qmake runs a qrsh, the job goes to another execution node as expected,
and runs in a process tree like:

root      2941  0.0  0.0  86444  3036 ?        Sl   09:11   0:00  \_ sge_shepherd-12040 -bg
usera  2942  0.0  0.0   5448   616 ?        Ss   09:11   0:00      \_ /oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/default/spool/cn3-83/active_jobs/12040.1 noshell
usera  2952  0.0  0.0   5744  1236 ?        S    09:11   0:00          \_ /bin/sh -c /oicr/local/analysis/illumina/current/Goat/Matrix/savinio  -z gzip -c 2 s_7_0004_int.txt > Matrix/s_7_0004_02_mat.txt.tmp && mv Matrix/s_7_0004_02_mat.txt.t
usera  2953  2.8  2.1 361492 352356 ?       D    09:11   0:03              \_ /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt


After some time, this job completes, but sge_shepherd does not exit. The
process tree looks like:

root     28034  0.5  0.0  70396  3776 ?        Sl   Nov20   5:23 /oicr/cluster/sge6.2/bin/lx24-amd64/sge_execd
root     28931  0.0  0.0  86380  3072 ?        Sl   Nov20   0:01  \_ sge_shepherd-10318 -bg
root      2941  0.0  0.0  86380  3072 ?        Sl   09:11   0:00  \_ sge_shepherd-12040 -bg
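
A lingering shepherd like these shows up as an sge_shepherd process with no
children left. A minimal Python sketch for flagging them follows. The function
name and the (pid, ppid, command) input format are illustrative assumptions,
not part of SGE, and a childless shepherd is only a hint, since a healthy one
may briefly have no child while starting up or shutting down.

```python
def find_childless_shepherds(processes):
    """Return the PIDs of sge_shepherd processes that have no children.

    `processes` is an iterable of (pid, ppid, command) tuples, for example
    parsed from `ps -eo pid,ppid,args`. This input shape is an assumption
    made for illustration only.
    """
    processes = list(processes)
    # Every PID that appears as somebody's parent has at least one child.
    parents = {ppid for _pid, ppid, _cmd in processes}
    return sorted(
        pid
        for pid, _ppid, cmd in processes
        if cmd.lstrip().startswith("sge_shepherd") and pid not in parents
    )


if __name__ == "__main__":
    # PIDs and commands taken from the listing above; the parent PID of
    # sge_execd (1) is assumed, since the listing does not show it.
    sample = [
        (28034, 1, "/oicr/cluster/sge6.2/bin/lx24-amd64/sge_execd"),
        (28931, 28034, "sge_shepherd-10318 -bg"),
        (2941, 28034, "sge_shepherd-12040 -bg"),
    ]
    print(find_childless_shepherds(sample))
```

Both shepherds in the listing above would be reported, since neither has a
qrsh_starter child any more.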

The shepherd never exits. Attaching strace to one shows it blocked in a futex wait:

Process 2941 attached - interrupt to quit
futex(0x7fff6ebe1470, FUTEX_WAIT, 3, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fff6ebe1470, FUTEX_WAIT, 1, NULL


Also, here's the end of the trace file for the shepherd:


11/21/2008 09:11:49 [1084:2942]: args[0] = "/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter"
11/21/2008 09:11:49 [1084:2942]: args[1] = "/var/spool/sge/default/spool/cn3-83/active_jobs/12040.1"
11/21/2008 09:11:49 [1084:2942]: args[2] = "noshell"
11/21/2008 09:11:49 [1084:2942]: execvp(/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter, ...);
11/21/2008 09:14:21 [0:2941]: pty_to_commlib: closing pipe to child
11/21/2008 09:14:21 [0:2941]: wait3 returned 2942 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
11/21/2008 09:14:21 [0:2941]: pty_to_commlib: append_to_buf() returned -1, errno 0, Success -> exiting

If I restart sge_execd on the execution node, SGE considers the job failed.

Anyone know what's going on here?

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89377