[GE users] Erroneous job execution

Hairul Ikmal Mohamad Fuzi hairul.ikmal at gmail.com
Thu Apr 20 15:43:32 BST 2006


Thanks for the tips John.

I guess this solves the problem.

.ikmal

On 4/18/06, John Saalwaechter <johnsaalwaechter at yahoo.com> wrote:
> I just encountered the same error in SGE, and the root cause
> was somewhat obscure.  Perhaps you have the same issue.  For me
> it was the fact that a few nodes in a cluster inadvertently
> used NFS soft mounts instead of hard mounts.  This included
> mounting $SGE_ROOT as a soft mount.
>
> Depending on your OS, you could check the output of "mount" or
> grep through /proc/mounts to see if you have soft-mounted NFS
> filesystems.
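
The check described above can be scripted. This is only a minimal sketch, assuming a Linux node where /proc/mounts lists each mount as "device mountpoint fstype options ..." and reports "soft" among the options of a soft-mounted NFS filesystem. The function name find_soft_nfs_mounts is made up for illustration; it is not part of SGE or any other tool.

```python
def find_soft_nfs_mounts(mounts_text):
    """Return (mount_point, fstype, options) for each soft-mounted NFS entry.

    mounts_text is the content of /proc/mounts (or any text in the same
    whitespace-separated layout: device, mount point, fstype, options, ...).
    """
    soft = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype not in ("nfs", "nfs4"):
            continue  # only NFS client mounts are of interest here
        # Options are comma-separated; match "soft" as a whole option.
        if "soft" in options.split(","):
            soft.append((mount_point, fstype, options))
    return soft
```

On a node you could call it as find_soft_nfs_mounts(open("/proc/mounts").read()) and look for the $SGE_ROOT mount point in the result. An empty list means no soft-mounted NFS filesystem was found in that text.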
>
> The problem with soft mounts in a cluster is that after the
> timeout period, soft mounts return I/O errors to the
> application.  If you create a bottleneck on the bandwidth
> to the $SGE_ROOT NFS mount and it's soft-mounted, SGE will
> get various I/O failures.
>
> Hope this helps.
>
> By the way, we mount everything both hard and interruptible
> (i.e. "hard,intr").
>
> On Sat, 15 Apr 2006, Hairul Ikmal Mohamad Fuzi wrote:
> >Hi Andreas,
> >
> >Thanks for the reply.
> >
> >Sorry to say that I'm not very clear about this shepherd thing.
> >I would appreciate it if somebody could explain what the shepherd
> >is in SGE terms and what it generally does in SGE.
> >
> >Regarding your suggestions,
> >3) Job's active directory: is it the directory where users put
> >their job scripts, or is it somewhere in the spool directory?
> >5) How do I start the shepherd as user 'root'? What command
> >should I use?
> >
> >And I'm just wondering: is this a software/config error, or is
> >there any possibility that this kind of error is caused by a
> >hardware failure?
> >
> >Just FYI, I'm using SGE (v6.something) which comes together with Rocks
> >4.1 Linux Cluster Distribution.
> >
> >
> >Thanks again!
> >
> >On 4/12/06, Andreas Haas <Andreas.Haas at sun.com> wrote:
> >> Hi Ikmal,
> >>
> >> it tells you shepherd "failed before writing exit_status".
> >> This could mean there was an error condition shepherd could
> >> not handle. From shepherd's trace file output I can't assess
> >> what might have caused this.
>
>
> --
> johnsaalwaechter at yahoo.com
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>




