[GE users] qrsh_starter/shepherd problems?

reuti reuti at staff.uni-marburg.de
Mon Nov 24 13:43:16 GMT 2008


Am 24.11.2008 um 13:58 schrieb Rob Naccarato:

> On Sun, Nov 23, 2008 at 12:22:15PM -0500, reuti wrote:
>> Hi,
>>
>> Am 21.11.2008 um 16:13 schrieb Rob Naccarato:
>>
>>> Hi there.
>>>
>>> We seem to be having an issue with jobs submitted via qmake ->
>>> qrsh. In particular, we are running the Solexa/Illumina Pipeline,
>>> if that helps. All machines are 64-bit Debian, with custom kernel
>>> 2.6.27.
>>>
>>> When qmake runs a qrsh, the job goes to another execution node as
>>> expected, and runs in a process tree like:
>>>
>>> root   2941  0.0  0.0  86444   3036 ?  Sl  09:11  0:00  \_ sge_shepherd-12040 -bg
>>> usera  2942  0.0  0.0   5448    616 ?  Ss  09:11  0:00      \_ /oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/default/spool/cn3-83/active_jobs/12040.1 noshell
>>> usera  2952  0.0  0.0   5744   1236 ?  S   09:11  0:00          \_ /bin/sh -c /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt > Matrix/s_7_0004_02_mat.txt.tmp && mv Matrix/s_7_0004_02_mat.txt.t
>>> usera  2953  2.8  2.1 361492 352356 ?  D   09:11  0:03              \_ /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt
>>>
>>>
>>> After some time, this job completes, but sge_shepherd does not
>>> exit. The process tree looks like:
>>>
>>> root   28034  0.5  0.0  70396  3776 ?  Sl  Nov20  5:23 /oicr/cluster/sge6.2/bin/lx24-amd64/sge_execd
>>> root   28931  0.0  0.0  86380  3072 ?  Sl  Nov20  0:01  \_ sge_shepherd-10318 -bg
>>> root    2941  0.0  0.0  86380  3072 ?  Sl  09:11  0:00  \_ sge_shepherd-12040 -bg
>>>
>>> The shepherd never exits. Running an strace on a shepherd reveals:
>>>
>>> Process 2941 attached - interrupt to quit
>>> futex(0x7fff6ebe1470, FUTEX_WAIT, 3, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>> futex(0x7fff6ebe1470, FUTEX_WAIT, 1, NULL
>>>
>>>
>>> Also, here's the end of the trace file for the shepherd:
>>>
>>>
>>> 11/21/2008 09:11:49 [1084:2942]: args[0] = "/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter"
>>> 11/21/2008 09:11:49 [1084:2942]: args[1] = "/var/spool/sge/default/spool/cn3-83/active_jobs/12040.1"
>>> 11/21/2008 09:11:49 [1084:2942]: args[2] = "noshell"
>>> 11/21/2008 09:11:49 [1084:2942]: execvp(/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter, ...);
>>> 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: closing pipe to child
>>> 11/21/2008 09:14:21 [0:2941]: wait3 returned 2942 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
>>> 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: append_to_buf() returned -1, errno 0, Success -> exiting
>>>
>>> If I restart sge_execd on the execution node, sge considers the job
>>> failed.
>>
>> Was there any process jumping out of the process tree, i.e. via a
>> fork or &, and left hanging at the end of the process table?
>
> No. The processes did not appear to detach from the calling shepherd.
>
> I did manage to get a workaround by using ssh as our transport.
> Under that, the shepherds terminate properly.

Unless you compiled SGE yourself with the flag for tight ssh support,  
you will have no tight integration of the slave tasks. That means: no  
correct accounting of cpu/mem/io.
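
If you want to keep ssh as the transport *and* get tight integration  
back, the usual route is to rebuild from source and then point the rsh  
entries at ssh/sshd. Just a rough sketch -- please verify the aimk  
option and the paths against your source tree and the ssh howto before  
relying on it:

    # source build assumed; /path/to/gridengine/source is wherever you
    # unpacked the sources
    cd /path/to/gridengine/source
    ./aimk -tight-ssh            # compile with tight SSH integration

    # then, in the cluster configuration (qconf -mconf), replace the
    # rsh entries with ssh/sshd:
    rsh_command                  /usr/bin/ssh
    rsh_daemon                   /usr/sbin/sshd -i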

As you are using 6.2, this would mean that the new built-in qrsh  
support is the cause of it. To get to the root of it, can you please  
check whether the traditional startup via rsh works or not, i.e. with  
the cluster configuration set back to:

qlogin_command               /usr/bin/telnet
qlogin_daemon                /usr/sbin/in.telnetd
rlogin_command               /usr/sge/utilbin/lx24-amd64/rlogin
rlogin_daemon                /usr/sbin/in.rlogind
rsh_command                  /usr/sge/utilbin/lx24-amd64/rsh
rsh_daemon                   /usr/sge/utilbin/lx24-amd64/rshd -l
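
A quick way to check what you have now and to switch over for the test  
(just a sketch, assuming the global cluster configuration is the one  
in effect on your nodes):

    qconf -sconf | egrep 'qlogin|rlogin|rsh'   # show the current entries
    qconf -mconf                               # edit the global configuration
                                               # and set the six lines above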

-- Reuti
