[GE users] Shepherd errors

Schenker, Martin MSchenker at illumina.com
Fri Jun 29 09:24:39 BST 2007


Are you using a central spool dir or a local one (on each node)? We had similar random occurrences on a Lustre/SFS system, but switching the spooling to a local dir (/var/spool/sge) seems to have cured this very odd behaviour...

Best, Martin



-----Original Message-----
From: Heywood, Todd [mailto:heywood at cshl.edu]
Sent: 28 June 2007 19:21
To: users at gridengine.sunsource.net
Subject: [GE users] Shepherd errors


I have a recurrent error which is driving me nuts. It occurs for a pipeline
application which submits thousands of jobs for over a 6 hour period.
Sometimes the pipeline finishes fine, and other times it stops with this
error:

In stderr:

error: cannot get connection to "shepherd" at host "blade15"

In email sent to the SGE admin user:

failed before job:06/27/2007 02:57:34 [0:23701]: can't open file
/tmp/1738723.1.solexa.q/pid.339.blade15: No such file

In /var/spool/sge/blade15/messages:

06/27/2007 02:57:34|execd|blade15|E|slave shepherd of job 1738723.1 exited
with exit status = 11

This doesn't just happen with blade15, but with some random node. Further, this
node has been happily processing pipeline jobs for hours up until this
failure.
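
One way to check whether these failures really are spread randomly across nodes is to tally the shepherd exit entries from each node's messages file. The following is only a minimal sketch: it assumes the pipe-delimited layout seen in the entry quoted above (timestamp|daemon|host|level|text), that each entry sits on a single physical line in the file (the quoted one is wrapped by the mail client), and that the message wording matches the one shown here. The function name and the second sample line are hypothetical.

```python
import re
from collections import Counter

# Wording taken from the single entry quoted in this thread; other
# Grid Engine versions may phrase it differently (assumption).
SHEPHERD_EXIT = re.compile(r"shepherd of job (\S+) exited with exit status = (\d+)")

def tally_shepherd_exits(lines):
    """Count (host, exit_status) pairs among error-level entries."""
    counts = Counter()
    for line in lines:
        # Split into at most 5 fields so '|' inside the text is preserved.
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not a pipe-delimited messages entry
        _timestamp, _daemon, host, level, text = fields
        if level != "E":
            continue  # only error-level entries are of interest here
        match = SHEPHERD_EXIT.search(text)
        if match:
            counts[(host, int(match.group(2)))] += 1
    return counts

# First line is the entry from this thread; the second is a made-up
# informational entry used only to show that it is skipped.
sample = [
    "06/27/2007 02:57:34|execd|blade15|E|slave shepherd of job 1738723.1 exited with exit status = 11",
    "06/27/2007 02:58:01|execd|blade15|I|hypothetical informational entry",
]
print(tally_shepherd_exits(sample))
```

In practice the lines would come from opening each node's file (for example /var/spool/sge/blade15/messages) and passing the file object to the function, then summing the counters across nodes.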

Any ideas on how to diagnose this further?

Thanks,

Todd

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




