[GE users] bug? with -o /nfs/path

Daniel Templeton Dan.Templeton at Sun.COM
Thu Jun 14 00:19:49 BST 2007


Chris,

Actually, your pseudo-code is correct.  In the son() function in 
builtin_starter.c, the shepherd checks whether the path is a file.  If 
it is, the shepherd tries to open it.  If not, it creates a file named 
after the job id in that directory.  I doubt the trailing slash will have 
much effect.  Whether the path is a file is decided by the stat() system 
call, whose documentation makes no mention of the trailing slash.
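
To make that decision concrete, here is a minimal sketch of the kind of
check described above.  It is not the actual son() code: the helper name
resolve_output_path() and the "job.o<id>" file name are assumptions made
up for illustration.  The point is that a stat() which fails, or which
does not report a directory, sends the path down the "plain file" branch,
and a later open() on what is really a directory would then fail with
"Is a directory".

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper, NOT the real Grid Engine code: decide where a
 * job's output should go.  Writes the chosen path into buf.
 * Returns 0 on success, -1 if the result does not fit in buf. */
int resolve_output_path(const char *path, unsigned long job_id,
                        char *buf, size_t buflen)
{
    struct stat st;
    int n;

    if (stat(path, &st) == 0 && S_ISDIR(st.st_mode)) {
        /* Directory: build a per-job file name inside it.
         * The "job.o<id>" format is an assumption for this sketch. */
        n = snprintf(buf, buflen, "%s/job.o%lu", path, job_id);
    } else {
        /* Anything else, including a failed stat(): use the path
         * itself as the output file name. */
        n = snprintf(buf, buflen, "%s", path);
    }

    /* snprintf reports the length it wanted; reject truncated results. */
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

With this sketch, a path that stat() reports as a directory yields a
per-job file inside it, while a path that cannot be stat()ed at that
moment is returned unchanged and treated as a file name.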

It appears that the NFS server first reports the path as a file and 
then later as a directory.  That might be a symptom of the server being 
overloaded, or it could be an issue with SGE.  I didn't check out the u6 
source base to confirm how things work there, but a large number of 
issues have been fixed in the last 4 updates, so I can't swear 
that it's not a Grid Engine issue.
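
If the path really does flip from "not a directory" to "directory" within
a few seconds, one possible way to ride that out is to retry the stat()
a few times before giving up on the directory case.  The sketch below is
only an illustration of that idea, not something Grid Engine is known to
do; the function name is_dir_with_retry() and its parameters are
hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: call stat() up to `attempts` times, sleeping
 * delay_ms milliseconds between tries, until the path reports as a
 * directory.
 * Returns  1 if some attempt saw a directory,
 *          0 if the last stat() succeeded but was not a directory,
 *         -1 if the last stat() failed (or attempts <= 0). */
int is_dir_with_retry(const char *path, int attempts, long delay_ms)
{
    struct stat st;
    struct timespec delay;
    int i, rc = -1;

    delay.tv_sec = delay_ms / 1000;
    delay.tv_nsec = (delay_ms % 1000) * 1000000L;

    for (i = 0; i < attempts; i++) {
        rc = stat(path, &st);
        if (rc == 0 && S_ISDIR(st.st_mode))
            return 1;
        /* Pause before the next try, but not after the last one. */
        if (i + 1 < attempts)
            nanosleep(&delay, NULL);
    }
    return rc == 0 ? 0 : -1;
}
```

A caller could, for example, try is_dir_with_retry(path, 5, 500) and only
fall back to treating the path as a file when the result is not 1.  Running
the same check by hand on an exec host is also a cheap way to see whether
the NFS mount behaves this way at all.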

Daniel

Christopher McCrory wrote:
> Hello...
>
>
> We have a SGE install that has been running fine for a couple years with a combination of RHEL3 x86 and RHEL5 x86_64 servers running sge 6.0u6 (sge master) or 6.0u7 (exec hosts).  The exec hosts nfs auto mount all the data they crunch.
>
> I recently found out about a bug we have periodically (maybe twice per month) run into for a while.  We have a parent process that kicks off a good number of jobs, something like:
>
> log-dir="/nfs/log/sge-logs/$YYYY-MM-DD-HH-MM"  # no trailing slash
> mkdir "/nfs/log/sge-logs/$YYYY-MM-DD-HH-MM"
> # within loop
> qsub $other-args -o $log-dir
> ...
>
>
> the error we get from the error email is something like:
> failed opening input/output file:06/13/2007 02:00:33 [600:3508]: error:
> can't open output file "/nfs/log/sge-logs/2007-06-13-02 
>
> <snip>
>
> 06/13/2007 02:00:33 [600:3508]: error: can't open output file
> "/nfs/log/sge-logs/2007-06-13-02": Is a directory 
>
>
>
> There could be some other issue on our scripting side, but I suspect the problem is that all the exec hosts are trying to nfs mount /nfs/log/ at the same time while the nfs server is somewhat busy.  This causes the log-dir to "not quite be there" for a few seconds.
>
> I suspect sge is doing something like:
>
> if "stat $log-dir" ==  a directory
>  then treat as a directory
> else
>  treat as a file
>
> without checking for other conditions or retrying a timeout failure.
>
> NOTE:  I haven't actually looked at the code, nor am I a very good C programmer, so my analysis could be either wrong, or very wrong.  :)
>
>
> questions:
>
> Does this sound like a SGE bug?
>
> Would adding a trailing slash ( qsub -o /nfs/dir/ ) force sge to treat the argument as a directory and not a file?
>
>
> thanks
>
> p.s. SGE has "just worked" for us for the past several years hence my not posting to this list recently.
>
>
>
>   
