[GE users] bug? with -o /nfs/path
chrismcc at pricegrabber.com
Wed Jun 13 23:53:45 BST 2007
We have an SGE install that has been running fine for a couple of years on a combination of RHEL3 x86 and RHEL5 x86_64 servers, running SGE 6.0u6 (SGE master) or 6.0u7 (exec hosts). The exec hosts NFS-automount all the data they crunch.
I recently found out about a bug that we have been running into periodically (maybe twice per month) for a while. We have a parent process that kicks off a good number of jobs, something like:
log_dir="/nfs/log/sge-logs/$timestamp"   # $timestamp is YYYY-MM-DD-HH-MM; no trailing slash
# within loop
qsub $other_args -o "$log_dir"
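A minimal bash sketch of the submitting parent process described above. The function name `submit_jobs`, the `QSUB` override (so it can be dry-run with a stub instead of a real cluster), passing the job scripts as arguments, and the `date +%Y-%m-%d-%H-%M` timestamp are assumptions for illustration, not the actual script. Creating the directory with `mkdir -p` on the submit side does not by itself guarantee that every exec host's automounter can see it at the moment a job starts.

```shell
#!/bin/bash
# Hypothetical sketch of the parent process: one timestamped log directory
# per run, and every submitted job sends its stdout there via -o.
submit_jobs() {
    local log_root=$1
    shift
    # QSUB is a hypothetical override (e.g. a stub function); defaults to qsub
    local qsub_cmd=${QSUB:-qsub}
    local log_dir
    # Assumed timestamp format YYYY-MM-DD-HH-MM; no trailing slash
    log_dir="$log_root/$(date +%Y-%m-%d-%H-%M)"
    mkdir -p "$log_dir" || return 1
    local script
    for script in "$@"; do
        # All jobs from this run share the same -o directory
        "$qsub_cmd" -o "$log_dir" "$script"
    done
}
```

For example, `submit_jobs /nfs/log/sge-logs job1.sh job2.sh` would submit both scripts with the same `-o` directory.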
The error we get in the failure email is something like:
failed opening input/output file:06/13/2007 02:00:33 [600:3508]: error: can't open output file "/nfs/log/sge-logs/2007-06-13-02
06/13/2007 02:00:33 [600:3508]: error: can't open output file "/nfs/log/sge-logs/2007-06-13-02": Is a directory
There could be some other issue on our scripting side, but I suspect the problem is that all the exec hosts are trying to NFS-mount /nfs/log/ at the same time while the NFS server is somewhat busy. This causes the log directory to "not quite be there" for a few seconds.
I suspect SGE is doing something like:
    if stat() on the -o path says it is a directory
    then treat it as a directory
    else treat it as a file
without checking for other conditions or retrying after a timeout failure.
NOTE: I haven't actually looked at the code, nor am I a very good C programmer, so my analysis could be either wrong, or very wrong. :)
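To make the suspected logic concrete, here is a hypothetical bash sketch (not SGE's code) of the directory-versus-file decision, with the retry that seems to be missing. The function name `resolve_output_path` and its retry-count and delay parameters are illustrative assumptions; for the directory case it assumes the default `<job_name>.o<job_id>` output file name that the qsub man page describes for a directory given to `-o`.

```shell
#!/bin/bash
# Hypothetical sketch, NOT SGE source: decide whether an -o argument is a
# directory, retrying so a slow automount has a chance to appear before
# falling back to treating the argument as a plain file path.
resolve_output_path() {
    local path=$1 job_name=$2 job_id=$3
    local tries=${4:-3}   # hypothetical number of checks
    local delay=${5:-1}   # hypothetical seconds between checks
    local i=0
    while [ "$i" -lt "$tries" ]; do
        if [ -d "$path" ]; then
            # Directory: assume the default <job_name>.o<job_id> file inside it
            printf '%s/%s.o%s\n' "$path" "$job_name" "$job_id"
            return 0
        fi
        i=$((i + 1))
        # Wait before the next check, but not after the last one
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
    done
    # Never looked like a directory: treat the argument as the output file itself
    printf '%s\n' "$path"
}
```

With the defaults, a directory that becomes visible within about two seconds is still recognized as a directory; the trade-off is that a path that really is meant as a new file costs about two seconds of waiting before it is used.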
Does this sound like an SGE bug?
Would adding a trailing slash (qsub -o /nfs/dir/) force SGE to treat the argument as a directory rather than a file?
P.S. SGE has "just worked" for us for the past several years, hence my not posting to this list recently.
"The guy that keeps the servers running"
To the optimist, the glass is half full.
To the pessimist, the glass is half empty.
To the engineer, the glass is twice as big as it needs to be.