[GE users] job dies due to "signal PIPE (13)" with exit code 141 -- any suggestions ?

Chris Dagdigian dag at sonsorol.org
Fri Sep 24 16:28:36 BST 2004


This is not an SGE problem -- I'm trying to debug why a job fails. Any 
hints or tips would be appreciated! It has something to do with 
launching a long-running job via a Perl script...

Is there a limit to how long you can keep a Perl pipe open? Does Perl 
impose any limitations that do not apply in a regular user shell 
environment?

Detail:
=======

I have a computational biology application that is failing under SGEE 
5.3 only in one particular usage case (invoked via CGI, using Perl).

The app is a simple but computationally intensive binary that takes in 1 
input/control file and outputs a text file full of analytical data.

I have an input file for this app that takes many hours to complete.

It works perfectly when:

o user invokes app via command-line
o user runs the job via 'qrsh' from the command line
o user runs the job via 'qsub' from the command line

It fails at the 13-hour mark when:

o CGI is used to launch the job via a perl script that calls 'qrsh'

The perl script is doing this:
==============================
map { close $_ } (0..15);
open (STDIN, "/dev/null");
open (STDERR, "> $SCRATCH_DIR/app.err");
open (STDOUT, "> $SCRATCH_DIR/app.out");
$ENV{'SGE_ROOT'}="/common/sge";

my $retcode=system("qrsh -cwd -V -now no -N \"A28\"  \" app -n testfile.in -l app.out \"   ");

my $retcode=system(" sleep 1 ; /common/biotools/bin/LogJob.pl dag A28");

close(STDIN);close(STDOUT);close(STDERR);
================================

The CGI and the Perl script run as the same user (uid/gid) that can 
successfully run the job on the command line via qrsh and qsub.


The failure is one I have not seen before:

qmaster messages file has this to say:
--------------------------------------
Fri Sep 24 00:57:15 2004|qmaster|xserve|W|job 1464.1 failed on host node01.cluster.private  assumedly after job because: job 1464.1 died through signal PIPE (13)


qacct has this to say:
----------------------
failed       100 : assumedly after job
exit_status  141
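
For what it's worth, the exit_status of 141 appears consistent with the 
signal reported by qmaster: by the usual shell convention, a process 
killed by signal N is reported with status 128 + N, and 128 + 13 = 141. 
Below is a minimal sketch of that decoding. It assumes qacct follows the 
same convention, and the decode_status helper is a hypothetical name 
used only for illustration, not part of SGE.

```shell
#!/bin/sh
# Sketch only: decode a shell-style exit status into "signal" or "exit code".
# decode_status is a hypothetical helper, not an SGE or system command.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # By convention, 128 + N means the process was killed by signal N.
        echo "$status: killed by signal $((status - 128))"
    else
        echo "$status: exited with code $status"
    fi
}

# The value reported by qacct for the failed job (13 is SIGPIPE on Linux).
decode_status 141
# A clean exit, for comparison.
decode_status 0
```

So the two reports seem to describe the same event: the job was killed 
by SIGPIPE rather than exiting with an error code of its own.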



