[GE users] Erroneous job execution

John Saalwaechter johnsaalwaechter at yahoo.com
Tue Apr 18 14:54:42 BST 2006


I just encountered the same error in SGE, and the root cause
was somewhat obscure.  Perhaps you have the same issue.  For me
it was the fact that a few nodes in a cluster inadvertently
used NFS soft mounts instead of hard mounts.  This included
mounting $SGE_ROOT as a soft mount.

Depending on your OS, you could check the output of "mount" or
grep through /proc/mounts to see if you have soft-mounted NFS
filesystems.
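
A minimal sketch of that check in Python, assuming Linux-style
/proc/mounts lines (device, mount point, filesystem type, options).
The function name find_soft_nfs_mounts and the sample host and paths
are made up for illustration; they are not part of SGE.

```python
def find_soft_nfs_mounts(mounts_text):
    """Return (mount_point, options) for NFS entries mounted with 'soft'.

    mounts_text is the content of /proc/mounts (or text in the same
    whitespace-separated layout).
    """
    soft = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype not in ("nfs", "nfs4"):
            continue  # only NFS client mounts are of interest
        if "soft" in options.split(","):
            soft.append((mount_point, options))
    return soft


# Hypothetical sample input; real hosts and paths will differ.
SAMPLE = (
    "fileserver:/export/sge /opt/sge nfs rw,soft,intr 0 0\n"
    "fileserver:/export/home /home nfs rw,hard,intr 0 0\n"
    "/dev/sda1 / ext3 rw 0 0\n"
)

for mount_point, options in find_soft_nfs_mounts(SAMPLE):
    print(mount_point, options)
```

On a Linux node you would pass in the contents of /proc/mounts
instead of SAMPLE. Whether the "soft" option is listed explicitly can
depend on the kernel and NFS client version, so treat an empty result
as a hint rather than proof.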

The problem with soft mounts in a cluster is that after the
timeout period, soft mounts return I/O errors to the
application instead of retrying.  If $SGE_ROOT is soft-mounted
and the bandwidth to that NFS mount becomes a bottleneck, SGE
will get various I/O failures.

Hope this helps.

By the way, we mount everything both hard and interruptible
(i.e. "hard,intr").

On Sat, 15 Apr 2006, Hairul Ikmal Mohamad Fuzi wrote:
>Hi Andreas,
>
>Thanks for the reply.
>
>Sorry to say that I'm not very clear about this shepherd thingy.
>I would appreciate it if somebody could explain to me what the
>shepherd is in SGE terms and what it generally does in SGE.
>
>Regarding your suggestions,
>3) job's active directory : is it the directory where the users put
>their job scripts or is it somewhere in the spool directory ?
>5) How do I start/What command should I use to start the shepherd
>using user 'root' ?
>
>And..I'm just wondering .. is this a software/config based error or is
>there any possibility that this kind of error is caused by hardware
>failure?
>
>Just FYI, I'm using SGE (v6.something) which comes together with Rocks
>4.1 Linux Cluster Distribution.
>
>
>Thanks again!
>
>On 4/12/06, Andreas Haas <Andreas.Haas at sun.com> wrote:
>> Hi Ikmal,
>>
>> it tells you shepherd "failed before writing exit_status".
>> This could mean there was an error condition shepherd could
>> not handle. From shepherd's trace file output I can't assess
>> what might have caused this.


--
johnsaalwaechter at yahoo.com
