[GE users] bug? with -o /nfs/path

tmac tmacmd at gmail.com
Sun Jun 17 21:14:11 BST 2007



Well, we saw something very similar:

Problem #1: the automounter. There is an edge case where, just as the
automounter is about to unmount a filesystem, a process tries to use it in the
same split second. The application fails, but if it is retried immediately, it
works. The workaround is a bad hack: either modify /etc/sysconfig/autofs to
pass "--timeout=0", or put "--timeout=0" after each line in your autofs table
that you do not want to have problems with. A timeout of 0 stops the
automounter from expiring idle mounts.

This is supposed to be fixed in 2.6.9-32.
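
Since a second attempt succeeds, a retry wrapper is a less invasive stopgap
than disabling the timeout. What follows is only a sketch: "retry" is a
hypothetical helper name, not part of autofs or Grid Engine, and the attempt
count and delay are arbitrary values you would need to tune.

```shell
# retry ATTEMPTS DELAY CMD...
# Run CMD until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if every try failed.
# Hypothetical helper for illustration only.
retry() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -ge "$attempts" ]; then
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}

# Example (path is illustrative): touch the automounted directory,
# giving the automounter up to three chances to remount it.
# retry 3 1 ls /nfs/log/sge-logs >/dev/null
```

This only papers over the race; it does not stop the automounter from
unmounting at the wrong moment.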

Problem #2: running out of memory (real + swap) causes very similar symptoms.
The code was Perl and it ran system commands through backticks. I increased
swap space and the problem, as far as I know, has not returned.
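
If you want to rule this out on your own exec hosts, a rough check of free
real + swap before launching jobs might look like the sketch below. It
assumes a Linux /proc/meminfo; "free_kb" and "have_headroom" are hypothetical
names, and MemFree ignores reclaimable cache, so treat the number as a
pessimistic estimate.

```shell
# free_kb [FILE]
# Print MemFree + SwapFree in kB, read from a /proc/meminfo-style file.
# FILE defaults to /proc/meminfo. Hypothetical helper for illustration.
free_kb() {
    awk '$1 == "MemFree:" || $1 == "SwapFree:" { sum += $2 }
         END { printf "%.0f\n", sum }' "${1:-/proc/meminfo}"
}

# have_headroom MIN_KB [FILE]
# Succeed if at least MIN_KB of real + swap memory is free.
have_headroom() {
    [ "$(free_kb "$2")" -ge "$1" ]
}

# Example: warn if less than 512 MB is free (threshold is arbitrary).
# have_headroom 524288 || echo "low on memory + swap" >&2
```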


On 6/14/07, Olle Liljenzin <olle.liljenzin at jeppesen.com> wrote:
>
> It could be worth mentioning that we had a lot of automount failures and
> other nfs problems in the past caused by rsh putting low port numbers in
> TIME_WAIT state. It happened when we were running a number of short
> commands with qrsh, e.g. in a qmake, and used rsh as the underlying
> protocol for qrsh. When all ports below 1024 are in use or in TIME_WAIT,
> many system functions like nfs mounting will fail without relevant error
> messages. Changing to ssh fixed this for us.
>
> /Olle
>
> Christopher McCrory wrote:
> > Hello...
> >
> >
> > We have a SGE install that has been running fine for a couple years with
> > a combination of RHEL3 x86 and RHEL5 x86_64 servers running sge 6.0u6 (sge
> > master) or 6.0u7 (exec hosts).  The exec hosts nfs auto mount all the data
> > they crunch.
> >
> > I recently found out about a bug we have periodically (maybe twice per
> > month) run into for a while.  We have a parent process that kicks off a good
> > number of jobs, something like:
> >
> > log_dir="/nfs/log/sge-logs/$(date +%Y-%m-%d-%H-%M)"  # no trailing slash
> > mkdir "$log_dir"
> > # within loop
> > qsub $other_args -o "$log_dir"
> > ...
> >
> >
> > the error we get from the error email is something like:
> > failed opening input/output file:06/13/2007 02:00:33 [600:3508]: error:
> > can't open output file "/nfs/log/sge-logs/2007-06-13-02
> >
> > <snip>
> >
> > 06/13/2007 02:00:33 [600:3508]: error: can't open output file
> > "/nfs/log/sge-logs/2007-06-13-02": Is a directory
> >
> >
> >
> > There could be some other issue on our scripting side, but I suspect the
> > problem is that all the exec hosts are trying to nfs mount /nfs/log/ at
> > the same time while the nfs server is somewhat busy.  This causes the
> > log-dir to "not quite be there" for a few seconds.
> >
> > I suspect sge is doing something like:
> >
> > if "stat $log-dir" ==  a directory
> >  then treat as a directory
> > else
> >  treat as a file
> >
> > without checking for other conditions or retrying a timeout failure.
> >
> > NOTE:  I haven't actually looked at the code, nor am I a very good C
> > programmer, so my analysis could be either wrong, or very wrong.  :)
> >
> >
> > questions:
> >
> > Does this sound like a SGE bug?
> >
> > Would adding a trailing slash ( qsub -o /nfs/dir/ ) force sge to treat
> > the argument as a directory and not a file?
> >
> >
> > thanks
> >
> > p.s. SGE has "just worked" for us for the past several years hence my
> > not posting to this list recently.
> >
> >
> >
> >
>
>
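
To make Christopher's suspicion above concrete: a check that tells "not there
yet" apart from "is a file", and retries before giving up, might look like
the sketch below. This is not taken from the SGE source, which I have not
read; "classify_output" is a hypothetical name, and the retry count and delay
are arbitrary defaults.

```shell
# classify_output PATH [TRIES] [DELAY]
# Print "directory" or "file" once PATH can be seen, retrying up to
# TRIES times with DELAY seconds between attempts. Print "missing" and
# return 1 if PATH never shows up. Hypothetical helper, not SGE code.
classify_output() {
    path=$1
    tries=${2:-5}
    delay=${3:-1}
    n=0
    while [ "$n" -lt "$tries" ]; do
        if [ -d "$path" ]; then
            echo directory
            return 0
        elif [ -e "$path" ]; then
            echo file
            return 0
        fi
        n=$((n + 1))
        if [ "$n" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    echo missing
    return 1
}
```

The point of the third outcome is that a path which cannot be seen yet is not
silently treated as a plain file.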


-- 
--tmac

RedHat Certified Engineer


