[GE users] qrsh_starter/shepherd problems?

reuti reuti at staff.uni-marburg.de
Sun Nov 23 17:22:15 GMT 2008


Hi,

On 21.11.2008 at 16:13, Rob Naccarato wrote:

> Hi there.
>
> We seem to be having an issue with jobs submitted via qmake->qrsh. In
> particular, we are running the Solexa/Illumina Pipeline, if that helps.
> All machines are 64bit Debian, with custom kernel 2.6.27.
>
> When qmake runs a qrsh, the job goes to another execution node as
> expected, and runs in a process tree like:
>
> root      2941  0.0  0.0  86444  3036 ?        Sl   09:11   0:00  \_ sge_shepherd-12040 -bg
> usera     2942  0.0  0.0   5448   616 ?        Ss   09:11   0:00      \_ /oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/default/spool/cn3-83/active_jobs/12040.1 noshell
> usera     2952  0.0  0.0   5744  1236 ?        S    09:11   0:00          \_ /bin/sh -c /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt > Matrix/s_7_0004_02_mat.txt.tmp && mv Matrix/s_7_0004_02_mat.txt.t
> usera     2953  2.8  2.1 361492 352356 ?       D    09:11   0:03              \_ /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt
>
>
> After some time, this job completes, but sge_shepherd does not exit.
> The process tree looks like:
>
> root     28034  0.5  0.0  70396  3776 ?        Sl   Nov20   5:23 /oicr/cluster/sge6.2/bin/lx24-amd64/sge_execd
> root     28931  0.0  0.0  86380  3072 ?        Sl   Nov20   0:01  \_ sge_shepherd-10318 -bg
> root      2941  0.0  0.0  86380  3072 ?        Sl   09:11   0:00  \_ sge_shepherd-12040 -bg
>
> The shepherd never exits. Running an strace on a shepherd reveals:
>
> Process 2941 attached - interrupt to quit
> futex(0x7fff6ebe1470, FUTEX_WAIT, 3, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> futex(0x7fff6ebe1470, FUTEX_WAIT, 1, NULL
>
>
> Also, here's the end of the trace file for the shepherd:
>
>
> 11/21/2008 09:11:49 [1084:2942]: args[0] = "/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter"
> 11/21/2008 09:11:49 [1084:2942]: args[1] = "/var/spool/sge/default/spool/cn3-83/active_jobs/12040.1"
> 11/21/2008 09:11:49 [1084:2942]: args[2] = "noshell"
> 11/21/2008 09:11:49 [1084:2942]: execvp(/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter, ...);
> 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: closing pipe to child
> 11/21/2008 09:14:21 [0:2941]: wait3 returned 2942 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
> 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: append_to_buf() returned -1, errno 0, Success -> exiting
>
> If I restart sge_execd on the execution node, sge considers the job
> failed.

Was there any process that jumped out of the job's process tree, e.g.  
one started by a fork or a trailing &, and that is still hanging around  
at the end of the process table?
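
One way to look for such a process on the execution node is sketched
below. This is only a rough check, assuming a procps-style `ps` and
assuming that a process which left the job's tree has been re-parented
to init (PID 1). The function names are made up for illustration and
are not part of Grid Engine. Daemons the user started on purpose will
also show up, so the result needs to be read by hand.

```python
import subprocess


def parse_ps(text):
    """Parse the output of `ps -eo pid,ppid,user,args` into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore lines without a command column
        pid, ppid, user, args = parts
        rows.append({"pid": int(pid), "ppid": int(ppid),
                     "user": user, "args": args})
    return rows


def find_reparented(rows, user):
    """Return the user's processes whose parent is init (PID 1).

    Assumption: a process that escaped the shepherd's tree shows up
    with PPID 1 once its original parent has exited.
    """
    return [r for r in rows if r["user"] == user and r["ppid"] == 1]


def list_escaped(user):
    """Run ps on the local node and print candidate escaped processes."""
    out = subprocess.run(["ps", "-eo", "pid,ppid,user,args"],
                         capture_output=True, text=True, check=True).stdout
    for r in find_reparented(parse_ps(out), user):
        print(r["pid"], r["args"])
```

Calling `list_escaped("usera")` on the node while the shepherd is stuck
would print any process of that user that now hangs directly under init.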

-- Reuti


>
> Anyone know what's going on here?
>
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89377
>
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89599

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].
