[GE users] qrsh_starter/shepherd problems?

pollinger harald.pollinger at sun.com
Mon Nov 24 14:46:57 GMT 2008


Please have a look at 
http://gridengine.sunsource.net/issues/show_bug.cgi?id=2775

It looks like this bug, which can only occur with the "builtin" 
mechanism. That would also explain why switching to ssh as the 
transport avoids the hang.
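
To see whether a cluster is using the "builtin" mechanism at all, one 
could inspect the remote startup parameters in the cluster 
configuration. The following is only a minimal sketch in Python: it 
assumes the configuration text was obtained with "qconf -sconf" and 
that the six parameter names listed in sge_conf(5) are the relevant 
ones. The function name and the sample text are hypothetical and are 
not taken from the affected cluster.

```python
# Remote startup parameters as named in sge_conf(5).
REMOTE_PARAMS = (
    "qlogin_command", "qlogin_daemon",
    "rlogin_command", "rlogin_daemon",
    "rsh_command", "rsh_daemon",
)


def builtin_params(conf_text):
    """Return the remote startup parameters whose value is 'builtin'.

    conf_text is expected to look like the output of `qconf -sconf`:
    one "name value" pair per line, separated by whitespace.
    """
    found = []
    for line in conf_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank lines and names without a value
        name, value = parts
        if name in REMOTE_PARAMS and value.strip().lower() == "builtin":
            found.append(name)
    return found


# Hypothetical sample, for illustration only.
SAMPLE_CONF = """\
execd_spool_dir   /var/spool/sge/default/spool
rsh_command       builtin
rsh_daemon        builtin
rlogin_command    /usr/bin/ssh
rlogin_daemon     /usr/sbin/sshd -i
"""

if __name__ == "__main__":
    print(builtin_params(SAMPLE_CONF))
```

If the returned list is empty, none of these parameters is set to 
"builtin", so this bug should not apply.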

Regards,
Harald

On 11/24/08 13:58, Rob Naccarato wrote:
> On Sun, Nov 23, 2008 at 12:22:15PM -0500, reuti wrote:
>> Hi,
>>
>> On 21.11.2008 at 16:13, Rob Naccarato wrote:
>>
>>> Hi there.
>>>
>>> We seem to be having an issue with jobs submitted via qmake->qrsh. In
>>> particular, we are running the Solexa/Illumina Pipeline, if that helps.
>>> All machines are 64-bit Debian with a custom 2.6.27 kernel.
>>>
>>> When qmake runs a qrsh, the job goes to another execution node as
>>> expected and runs in a process tree like:
>>>
>>> root      2941  0.0  0.0  86444  3036 ?        Sl   09:11   0:00  \_ sge_shepherd-12040 -bg
>>> usera  2942  0.0  0.0   5448   616 ?        Ss   09:11   0:00      \_ /oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter /var/spool/sge/default/spool/cn3-83/active_jobs/12040.1 noshell
>>> usera  2952  0.0  0.0   5744  1236 ?        S    09:11   0:00          \_ /bin/sh -c /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt > Matrix/s_7_0004_02_mat.txt.tmp && mv Matrix/s_7_0004_02_mat.txt.t
>>> usera  2953  2.8  2.1 361492 352356 ?       D    09:11   0:03              \_ /oicr/local/analysis/illumina/current/Goat/Matrix/savinio -z gzip -c 2 s_7_0004_int.txt
>>>
>>>
>>> After some time, this job completes, but sge_shepherd does not exit.
>>> The process tree then looks like:
>>>
>>> root     28034  0.5  0.0  70396  3776 ?        Sl   Nov20   5:23 /oicr/cluster/sge6.2/bin/lx24-amd64/sge_execd
>>> root     28931  0.0  0.0  86380  3072 ?        Sl   Nov20   0:01  \_ sge_shepherd-10318 -bg
>>> root      2941  0.0  0.0  86380  3072 ?        Sl   09:11   0:00  \_ sge_shepherd-12040 -bg
>>>
>>> The shepherd never exits. Running an strace on a shepherd reveals:
>>>
>>> Process 2941 attached - interrupt to quit
>>> futex(0x7fff6ebe1470, FUTEX_WAIT, 3, NULL) = -1 EAGAIN (Resource temporarily unavailable)
>>> futex(0x7fff6ebe1470, FUTEX_WAIT, 1, NULL
>>>
>>>
>>> Also, here's the end of the trace file for the shepherd:
>>>
>>>
>>> 11/21/2008 09:11:49 [1084:2942]: args[0] = "/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter"
>>> 11/21/2008 09:11:49 [1084:2942]: args[1] = "/var/spool/sge/default/spool/cn3-83/active_jobs/12040.1"
>>> 11/21/2008 09:11:49 [1084:2942]: args[2] = "noshell"
>>> 11/21/2008 09:11:49 [1084:2942]: execvp(/oicr/cluster/sge6.2/utilbin/lx24-amd64/qrsh_starter, ...);
>>> 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: closing pipe to child
>>> 11/21/2008 09:14:21 [0:2941]: wait3 returned 2942 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
>>> 11/21/2008 09:14:21 [0:2941]: pty_to_commlib: append_to_buf() returned -1, errno 0, Success -> exiting
>>>
>>> If I restart sge_execd on the execution node, SGE considers the job
>>> failed.
>> Was there any process that jumped out of the process tree (i.e. via a
>> fork or &) and was left hanging at the end of the process table?
> 
> No. The processes did not appear to detach from the calling shepherd.
> 
> I did find a workaround: using ssh as our transport. With ssh, the
> shepherds terminate properly.
> 
> ------------------------------------------------------
> http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89676
> 
> To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


-- 
Sun Microsystems GmbH         Harald Pollinger
Dr.-Leo-Ritter-Str. 7         N1 Grid Engine Engineering
D-93049 Regensburg            Phone: +49 (0)941 3075-209  (x60209)
Germany                       Fax: +49 (0)941 3075-222  (x60222)
http://www.sun.com/gridware
mailto:harald.pollinger at sun.com
Registered office: Sonnenallee 1, D-85551 Kirchheim-Heimstetten
District Court Munich: HRB 161028
Managing Directors: Thomas Schroeder, Wolfgang Engels, Dr. Roland Boemer
Chairman of the Supervisory Board: Martin Haering

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=89705

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


