[GE users] bug? with -o /nfs/path

Olle Liljenzin olle.liljenzin at jeppesen.com
Thu Jun 14 13:23:44 BST 2007



It may be worth mentioning that we had a lot of automount failures and
other NFS problems in the past, caused by rsh leaving low port numbers in
TIME_WAIT state. It happened when we ran a number of short commands with
qrsh, e.g. in a qmake, with rsh as the underlying protocol for qrsh. When
all ports below 1024 are in use or in TIME_WAIT, many system functions
such as NFS mounting fail without relevant error messages. Changing to
ssh fixed this for us.
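
One way to check for this condition on a Linux host is to count how many
distinct ports below 1024 currently have a socket in TIME_WAIT. The sketch
below is only an illustration, not something we ran at the time: it assumes
the Linux /proc/net/tcp text format (where state 06 means TIME_WAIT), and
the function name is made up for this example.

```python
def count_low_ports_in_time_wait(proc_net_tcp_text, limit=1024):
    """Count distinct local ports below `limit` with a socket in TIME_WAIT.

    `proc_net_tcp_text` is the text content of /proc/net/tcp (Linux only).
    """
    time_wait = "06"  # state code used for TIME_WAIT in /proc/net/tcp
    ports = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed rows
        local_address, state = fields[1], fields[3]
        # local_address looks like "0100007F:03FF" (hex address, hex port)
        port = int(local_address.rsplit(":", 1)[1], 16)
        if state == time_wait and port < limit:
            ports.add(port)
    return len(ports)
```

Calling it with the output of open("/proc/net/tcp").read() on an exec host
while a qmake is running would show whether the count gets close to the
number of reserved ports.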

/Olle

Christopher McCrory wrote:
> Hello...
>
>
> We have a SGE install that has been running fine for a couple years with a combination of RHEL3 x86 and RHEL5 x86_64 servers running sge 6.0u6 (sge master) or 6.0u7 (exec hosts).  The exec hosts nfs auto mount all the data they crunch.
>
> I recently found out about a bug we have periodically (maybe twice per month) run into for a while.  We have a parent process that kicks off a good number of jobs, something like:
>
> log-dir="/nfs/log/sge-logs/$YYYY-MM-DD-HH-MM"  # no trailing slash
> mkdir "/nfs/log/sge-logs/$YYYY-MM-DD-HH-MM"
> # within loop
> qsub $other-args -o $log-dir
> ...
>
>
> the error we get from the error email is something like:
> failed opening input/output file:06/13/2007 02:00:33 [600:3508]: error:
> can't open output file "/nfs/log/sge-logs/2007-06-13-02 
>
> <snip>
>
> 06/13/2007 02:00:33 [600:3508]: error: can't open output file
> "/nfs/log/sge-logs/2007-06-13-02": Is a directory 
>
>
>
> There could be some other issue on our scripting side, but I suspect the problem is that all the exec hosts are trying to NFS mount /nfs/log/ at the same time while the NFS server is somewhat busy.  This causes the log-dir to "not quite be there" for a few seconds.
>
> I suspect sge is doing something like:
>
> if "stat $log-dir" ==  a directory
>  then treat as a directory
> else
>  treat as a file
>
> without checking for other conditions or retrying a timeout failure.
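
For illustration, the "retry before deciding" behaviour described above
could look roughly like the sketch below. It is not taken from the SGE
sources and says nothing about what SGE actually does. The function name,
the retry count and the delay are all assumptions chosen for this example.

```python
import os
import stat
import time


def classify_output_path(path, retries=5, delay=1.0,
                         stat_fn=os.stat, sleep_fn=time.sleep):
    """Return 'directory', 'file' or 'missing' for `path`.

    A failed stat is retried up to `retries` times, so that a slow
    automount does not immediately turn a directory into "not there".
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            mode = stat_fn(path).st_mode
        except OSError as exc:
            last_error = exc
            if attempt < retries - 1:
                sleep_fn(delay)  # give the mount time to appear
            continue
        return "directory" if stat.S_ISDIR(mode) else "file"
    # Every attempt failed: only a consistent "no such file" counts as missing.
    if isinstance(last_error, FileNotFoundError):
        return "missing"
    raise last_error
```

The stat_fn and sleep_fn parameters exist only so the retry logic can be
exercised without a real NFS mount.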
>
> NOTE:  I haven't actually looked at the code, nor am I a very good C programmer, so my analysis could be either wrong, or very wrong.  :)
>
>
> questions:
>
> Does this sound like a SGE bug?
>
> Would adding a trailing slash ( qsub -o /nfs/dir/ ) force sge to treat the argument as a directory and not a file?
>
>
> thanks
>
> p.s. SGE has "just worked" for us for the past several years hence my not posting to this list recently.
>
>
>
>   
