[GE users] TMPDIR issues

Brian Smith brs at usf.edu
Mon Apr 28 16:59:46 BST 2008


Hi Craig,

Yeah, I definitely made sure that his script and all other supporting 
scripts weren't blowing the directory out.

I think I figured out the problem... For years I've pointed tmpdir in 
queue_conf to an NFS location, and since this worked correctly I never 
suspected that it might be the problem.  It was nice to have because 
it gave me a one-stop place for all of the temporary directories in case 
I had to troubleshoot something.  I pored over the documentation to see 
if there were any suggestions about where to point tmpdir, and it 
mentioned nothing about NFS vs. local disk.  A long time ago someone 
suggested that I use a shared NFS directory, and so I went with it.

These problems really only started popping up once we moved from nfsd on 
Linux to nfsd on Solaris 10 with ZFS.  On any installation where a Linux 
box with ext3 is the NFS server for SGE_ROOT and tmpdir, everything 
works fine: the directory gets created, qrsh_client_cache is created 
correctly, and so parallel jobs execute properly.  Now, with NFS on top 
of ZFS on a Solaris 10 server, the strange behavior described in 
bug #2393 started occurring.  It seems like it could be some sort of 
weird effect on an async mount where one nfsd thread is aware of the 
latest commit and another active thread might not be (that would be 
very, very bad)...  I'm still scratching my head trying to figure out 
what the differences might be in the NFS implementations (there are 
many) and in the file system interaction between ZFS/NFS and ext3/NFS 
(there are definitely differences here as well).

I've been going over the source for execd to see if I can't piece 
together more information from that.  It looks simple enough so I'm 
having some trouble figuring out just how NFS might cause something like 
what I'm experiencing.  I'm going to see if switching to a local disk 
tmpdir resolves my issues.
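
Before switching, it can help to confirm which file system actually backs the tmpdir path on each execution host. The following is only a minimal sketch, assuming a Linux host whose mount table uses the /proc/mounts layout (device, mount point, type, options); the server name "fileserver", the sample table, and the set of types treated as network mounts are made up for illustration.

```python
import os

# File system types treated as network mounts here (an assumption, not exhaustive).
NETWORK_FS_TYPES = {"nfs", "nfs4"}

# Hypothetical mount table in /proc/mounts layout; "fileserver" is a made-up host.
SAMPLE = """\
/dev/sda1 / ext3 rw 0 0
fileserver:/export/sge /opt/sge nfs rw,hard,intr 0 0
"""


def fs_type_for(path, mounts_text):
    """Return (mount_point, fs_type) for the longest mount point containing path."""
    path = os.path.normpath(path)
    best = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        # Compare with trailing slashes so /opt/sge does not match /opt/sgextra.
        prefix = mount_point.rstrip("/") + "/"
        if (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point win.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_network_path(path, mounts_text):
    """True if path lives on one of the network file system types above."""
    found = fs_type_for(path, mounts_text)
    return found is not None and found[1] in NETWORK_FS_TYPES
```

On a real Linux host one would pass open("/proc/mounts").read() as mounts_text and the queue's tmpdir value as path; a result of False for a directory such as /tmp suggests it is on local disk.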

Regards,
-Brian

Craig Tierney wrote:
> Brian Smith wrote:
>> Hi all,
>>
>> I suppose that for a parallel job, the execd of the MASTER "node" is 
>> responsible for removing the TMPDIR of the job after it is done 
>> executing.  Any idea what would cause that directory to not get 
>> created or to not be there when execd goes to clear it out?  I get a 
>> lot of these sorts of messages and I think it might have something to 
>> do with my bug #2393
>>
>> 04/27/2008 10:03:49|execd|rcn-ib-0002|E|recursive rmdir(/opt/sge/tmp/39038.1.rcnib.q): opendir(/opt/sge/tmp/39038.1.rcnib.q) failed: No such file or directory
>>
>
> Did you verify that job 39038 (or the other jobs with this problem)
> isn't causing the problem?  Every once in a while, when a new user comes
> to our system, their scripts are set up to create and clean up a
> directory which they name $TMPDIR.  This generally causes confusion
> between their job and the system.  If your user has a script like this,
> and is cleaning up $TMPDIR at the end of their script, Gridengine may
> return an error message like the one you are seeing.
>
> Craig
>
>
>
>
>> Thanks,
>> -Brian
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>>
>
>
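
To see how widespread the failed cleanups are, the execd error quoted above can be pulled out of the messages files and grouped by host. This is a rough sketch only: the five pipe-separated fields (timestamp, daemon, host, level, text) and the "<job>.<task>.<queue>" directory name are inferred from the single log line in this thread, and the function names are made up for illustration.

```python
import re
from collections import Counter

# The one log line quoted in this thread, joined onto a single line.
SAMPLE_LINE = (
    "04/27/2008 10:03:49|execd|rcn-ib-0002|E|recursive "
    "rmdir(/opt/sge/tmp/39038.1.rcnib.q): "
    "opendir(/opt/sge/tmp/39038.1.rcnib.q) failed: No such file or directory"
)

# Assumed directory naming "<job>.<task>.<queue>", inferred from the sample line.
RMDIR_RE = re.compile(
    r"recursive rmdir\((?P<path>[^)]*/(?P<job>\d+)\.(?P<task>\d+)\.(?P<queue>[^/)]+))\)"
)


def parse_message(line):
    """Split one log line into its five pipe-separated fields, or return None."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    keys = ("timestamp", "daemon", "host", "level", "text")
    return dict(zip(keys, parts))


def failed_rmdirs(lines):
    """Yield (host, job_id, path) for each failed recursive rmdir error."""
    for line in lines:
        msg = parse_message(line)
        if msg is None or msg["level"] != "E":
            continue
        match = RMDIR_RE.search(msg["text"])
        if match and "No such file or directory" in msg["text"]:
            yield msg["host"], int(match.group("job")), match.group("path")


def count_by_host(lines):
    """Count failed rmdir errors per execution host."""
    return Counter(host for host, _job, _path in failed_rmdirs(lines))
```

Feeding the lines of each execution host's messages file to count_by_host would show whether the errors cluster on particular nodes or appear everywhere the NFS-backed tmpdir is mounted.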

