[GE users] job dies due to "signal PIPE (13)" with exit code 141 -- any suggestions ?

Ron Chen ron_chen_123 at yahoo.com
Tue Sep 28 03:42:48 BST 2004


You can try using SSH as the transport protocol for
qrsh. That way you depend less on SGE itself, and you
can then see whether the problem is still reproducible.
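
A rough sketch of what that change usually looks like,
assuming the standard SSH setup for qrsh: the rsh_command
and rsh_daemon entries of the cluster configuration
(edited with "qconf -mconf") are pointed at ssh and sshd.
The binary paths below are assumptions and will differ
from system to system, so check them on your own hosts
and against the sge_conf man page for your release.

```
rsh_command    /usr/bin/ssh
rsh_daemon     /usr/sbin/sshd -i
```

If the job still dies at the same point with SSH as the
transport, the transport is probably not the cause.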

 -Ron



--- Chris Dagdigian <dag at sonsorol.org> wrote:
> This is not a SGE problem -- I'm trying to debug why
> a job fails. Any 
> hints or tips would be appreciated! It has something
> to do with trying 
> to launch a long running job via perl script...
> 
> Is there a limit to how long you can keep an open
> perl pipe? Does perl 
> introduce any sorts of limitations that are not in a
> regular user shell 
> environment?
> 
> Detail:
> =======
> 
> I have a computational biology application that is
> failing under SGEE 
> 5.3 only during a particular usage case (invoked via
> CGI, using perl).
> 
> The app is a simple but computationally intensive
> binary that takes in 1 
> input/control file and outputs a text file full of
> analytical data.
> 
> I have an input file for this app that takes many
> hours to complete.
> 
> It works perfectly when:
> 
> o user invokes app via command-line
> o user runs the job via 'qrsh' from the commandline
> o user runs the job via "qsub" from the commandline
> 
> It fails at the 13-hour mark when:
> 
> o CGI is used to launch the job via a perl script
> that calls 'qrsh'
> 
> The perl script is doing this:
> ==============================
> map { close $_ } (0..15);
> open (STDIN, "/dev/null");
> open (STDERR, "> $SCRATCH_DIR/app.err");
> open (STDOUT, "> $SCRATCH_DIR/app.out");
> $ENV{'SGE_ROOT'}="/common/sge";
> 
> my $retcode=system("qrsh -cwd -V -now no -N \"A28\" \" app -n testfile.in -l app.out \"   ");
> 
> my $retcode=system(" sleep 1 ; /common/biotools/bin/LogJob.pl dag A28");
> 
> close(STDIN);close(STDOUT);close(STDERR);
> ================================
> 
> The CGI and perl script is running as the same user
> uid/gid that can 
> successfully run the job on the command-line via
> qrsh and qsub.
> 
> 
> The failure is one I have not seen before:
> 
> qmaster messages file has this to say:
> --------------------------------------
> Fri Sep 24 00:57:15 2004|qmaster|xserve|W|job 1464.1 failed on host node01.cluster.private assumedly after job because: job 1464.1 died through signal PIPE (13)
> 
> 
> qacct has this to say:
> ----------------------
> failed       100 : assumedly after job
> exit_status  141
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail:
> users-help at gridengine.sunsource.net
> 
> 
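
On the exit status quoted above: 141 is the usual shell
encoding of death by signal, 128 plus the signal number,
and 141 - 128 = 13 is SIGPIPE, which matches the qmaster
message. A minimal sketch of that arithmetic in plain sh
(the variable names are only illustrative):

```shell
#!/bin/sh
# Decode a shell-style exit status into a signal number and name.
status=141                        # exit_status as reported by qacct
signum=0
signame=none
if [ "$status" -gt 128 ]; then
    signum=$((status - 128))      # 141 - 128 = 13
    signame=$(kill -l "$signum")  # POSIX kill -l maps a number to a name
fi
echo "exit status $status -> signal $signame ($signum)"
```

So the job most likely wrote to a pipe or socket whose
reading end had already gone away.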

