[GE users] sge5.3 - delay between dispatching?
Dan.Gruhn at Group-W-Inc.com
Fri Mar 11 13:36:37 GMT 2005
I have the same type of problem. I use an NFS-compatible mutual
exclusion technique that relies on creating a hard link to a file. Only
the first job to attempt the link succeeds; all others fail and retry.
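# Assumes 'lockfile' already exists in the shared directory.
ln lockfile lockfile.lock >/dev/null 2>&1
result=$?
while [ $result -ne 0 ]; do
    sleep 1    # back off briefly, then retry the link
    ln lockfile lockfile.lock >/dev/null 2>&1
    result=$?
done
# Do your file creation stuff here ...
rm -f lockfile.lock    # release the lock; keep 'lockfile' for other jobs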
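As one example of the "file creation stuff", here is a rough sketch of how a job could claim the next free results file name while holding the lock. The function names (acquire_lock, release_lock, claim_results_file) and the results.N naming scheme are made up for illustration; they are not part of Grid Engine or spider. The sketch assumes 'lockfile' already exists in the shared directory and leaves it in place so that waiting jobs can still link to it.

```shell
#!/bin/sh
# Sketch only: all names here are hypothetical.

# Take the lock by hard-linking 'lockfile' to 'lockfile.lock' in directory $1.
# The link can only be created once, so only one job gets past this loop.
acquire_lock() {
    until ln "$1/lockfile" "$1/lockfile.lock" 2>/dev/null; do
        sleep 1    # someone else holds the lock; wait and retry
    done
}

# Release the lock; 'lockfile' itself stays for the other jobs.
release_lock() {
    rm -f "$1/lockfile.lock"
}

# Create the first results.N (N = 1, 2, ...) that does not exist yet
# in directory $1 and print its path.
claim_results_file() {
    dir=$1
    n=1
    acquire_lock "$dir"
    while [ -e "$dir/results.$n" ]; do
        n=$((n + 1))
    done
    : > "$dir/results.$n"    # create the file while still holding the lock
    release_lock "$dir"
    echo "$dir/results.$n"
}
```

A job would then call something like out=$(claim_results_file "$shared_dir"), where shared_dir is whatever NFS directory holds the results, and write to "$out". One caveat: if a job dies while holding the lock, lockfile.lock stays behind and has to be removed by hand.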
On Fri, 2005-03-11 at 07:27, Justus Loerke wrote:
> I'm looking for a way to set a defined time delay (say 5 or 10 secs)
> between the dispatching of subsequent jobs to execution hosts. Sorry if
> I'm posting to the group with this, but I didn't find anything in the
> archive or the docs.
> I'm having some problems with dispatching and the NFS system: if 2 (or
> more) jobs are dispatched and started on different execution hosts at
> the same time, these jobs (namely spider) will try to open a results
> file with the same name on the NFS shared directory; this will crash all
> jobs but one with a 'stale NFS handle' error. If a filename was in use,
> the jobs would just use the next free name, but the real problem is that
> several jobs try to create the same file _at the same time_. Use of
> local (execution host) disks for the destination of the results file is
> something we're checking right now, but it clutters up local disks,
> distributes debug information over too many hosts and is just not
> elegant, you know? :)
> So is there a way to reconfigure the scheduler to wait a defined time
> interval between the dispatching of jobs? This would solve my problem,
> since newly dispatched jobs would try opening results files in intervals
> of 5 or 10 secs and would then use different file names.
> Thanks, Justus.