[GE users] bug? with -o /nfs/path

Rayson Ho rayrayson at gmail.com
Thu Jun 14 00:48:25 BST 2007


Good suggestion!

And, to make the shepherd (and also execd) code more robust, we could
add retry logic to the file routines. The first step may be to cache
which directories are NFS mounted (using fstype.c to find out the FS
type). When a file operation fails, look in the cache to see whether
the file lives under one of the NFS dirs and, if so, relax the error
reporting/failure handling a bit for NFS or any other
distributed/networked filesystem.
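
A minimal sketch of what such a cache could look like, assuming a fixed
table of directory prefixes. The names net_dir_cache_add() and
path_is_on_cached_net_dir() are made up for illustration and are not
existing SGE functions; the actual FS type detection (fstype.c) is left
out, so the caller is assumed to add only directories it has already
identified as network mounts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_NET_DIRS 16   /* arbitrary limits chosen for this sketch */
#define MAX_DIR_LEN  256

static char   net_dirs[MAX_NET_DIRS][MAX_DIR_LEN];
static size_t net_dir_count = 0;

/* Hypothetical helper: remember a directory that was detected as
 * network-mounted. Returns false if the table is full or the name
 * does not fit. */
static bool net_dir_cache_add(const char *dir)
{
    assert(dir != NULL);
    size_t len = strlen(dir);

    /* Drop trailing slashes, but keep a lone "/" */
    while (len > 1 && dir[len - 1] == '/') {
        len--;
    }
    if (len == 0 || len >= MAX_DIR_LEN || net_dir_count >= MAX_NET_DIRS) {
        return false;
    }
    memcpy(net_dirs[net_dir_count], dir, len);
    net_dirs[net_dir_count][len] = '\0';
    net_dir_count++;
    return true;
}

/* Hypothetical helper: true if 'path' is a cached directory or lies
 * below one. The match must end on a path component boundary, so
 * "/nfs" does not match "/nfsfoo". */
static bool path_is_on_cached_net_dir(const char *path)
{
    assert(path != NULL);
    for (size_t i = 0; i < net_dir_count; i++) {
        size_t len = strlen(net_dirs[i]);

        if (strncmp(path, net_dirs[i], len) != 0) {
            continue;
        }
        if (path[len] == '\0' || path[len] == '/' || len == 1) {
            return true;
        }
    }
    return false;
}
```

A plain string prefix match like this does not resolve symlinks or
relative paths, so a real version would probably want to normalize the
path first.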

This could also handle the random failures with other network
filesystems that were reported a few weeks earlier!
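
The retry part could be sketched roughly as below. This is only an
illustration: stat_with_retry() and is_transient_errno() are
hypothetical names, plain stat() stands in for SGE_STAT, and the list
of errno values treated as transient is a guess, since we have not yet
confirmed which errno a busy NFS server actually produces.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical: errno values we assume may be temporary on a network
 * FS. ENOENT is deliberately not in the list. */
static bool is_transient_errno(int err)
{
    return err == ESTALE || err == ETIMEDOUT || err == EIO ||
           err == EINTR || err == EAGAIN;
}

/* Hypothetical wrapper around stat(). Retries only when the caller
 * says the path is on a network FS and the error looks transient.
 * Returns 0 on success, -1 on failure with errno from the last try. */
static int stat_with_retry(const char *path, struct stat *st,
                           bool on_network_fs, int max_tries,
                           long delay_ms)
{
    assert(path != NULL && st != NULL);
    int rc = -1;

    for (int attempt = 1; attempt <= max_tries; attempt++) {
        rc = stat(path, st);
        if (rc == 0) {
            return 0;
        }
        if (!on_network_fs || !is_transient_errno(errno)) {
            break;   /* a hard error: report it right away */
        }
        if (attempt < max_tries) {
            int saved_errno = errno;
            struct timespec delay = {
                .tv_sec  = delay_ms / 1000,
                .tv_nsec = (delay_ms % 1000) * 1000000L
            };
            nanosleep(&delay, NULL);
            errno = saved_errno;
        }
    }
    return rc;
}
```

With something like this, build_path() could call
stat_with_retry(base, &statbuf, on_nfs, 3, 200) and still treat ENOENT
as "file to be created", while a timeout on a loaded NFS server gets a
few more chances before the job fails.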

Rayson




On 6/13/07, Daniel Templeton <Dan.Templeton at sun.com> wrote:
> Actually, it's the shepherd, so instead of fprintf(), you can use
> shepherd_trace() and it will be routed to the trace file in the job
> directory.
>
> Daniel
>
> Rayson Ho wrote:
> > Indeed.
> >
> > The shepherd is doing something very similar to what you described below. In
> > source file daemons/shepherd/builtin_starter.c, function build_path():
> >
> >   /* Try to get information about 'base' */
> >   if( SGE_STAT(base, &statbuf)) {
> >      /* An error occured */
> >      if (errno != ENOENT) {
> >
> >      ...
> >      ...
> >      }
> >      return base; /* does not exist - must be path of file to be created */
> >   }
> >
> > I don't have a really busy NFS environment to test the errno value for
> > NFS read (is it really ENOENT??), but I assume if we know the value,
> > we can easily add a retry loop in the code above.
> >
> > Since you know C, can you try adding debug fprintf()s to the code
> > above? Redirect the fprintf output to a file, though, since there is
> > no stdout/stderr for daemon processes.
> >
> > Rayson
> >
> >
> > On 6/13/07, Christopher McCrory <chrismcc at pricegrabber.com> wrote:
> >> I suspect sge is doing something like:
> >>
> >> if "stat $log-dir" ==  a directory
> >>  then treat as a directory
> >> else
> >>  treat as a file
> >>
> >> without checking for other conditions or retrying a timeout failure.
> >>
> >> NOTE:  I haven't actually looked at the code, nor am I a very good C
> >> programmer, so my analysis could be either wrong, or very wrong.  :)
> >>
> >>
> >> questions:
> >>
> >> Does this sound like an SGE bug?
> >>
> >> Would adding a trailing slash ( qsub -o /nfs/dir/ ) force sge to
> >> treat the argument as a directory and not a file?
> >>
> >>
> >> thanks
> >>
> >> p.s. SGE has "just worked" for us for the past several years hence my
> >> not posting to this list recently.
> >>
> >>
> >>
> >> --
> >> Christopher McCrory
> >>  "The guy that keeps the servers running"
> >>
> >> To the optimist, the glass is half full.
> >> To the pessimist, the glass is half empty.
> >> To the engineer, the glass is twice as big as it needs to be.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> >> For additional commands, e-mail: users-help at gridengine.sunsource.net
> >>
> >>
> >
>
