[GE users] Spurious errors with checkpointing
dieter_ruppert at siemens.com
Wed Dec 5 10:55:58 GMT 2007
We use Grid Engine 6.0u6 on Solaris 10/SPARC and occasionally get
the error "can't stat ... as stdout_path: Permission denied"
from jobs that are being suspended in subordinate queues.
The setup is as follows: we have two queues per host, one for
potentially long-running jobs (t) and one for "immediate" jobs (b), with
t subordinate to b. Thus, when a job in b starts, a job in t
is migrated to another, usually less powerful, machine by a
checkpointing environment attached to t, which sends kill -USR1 to
the job's process group.
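
For illustration only, the signalling step can be sketched in Python as
below. This is not our actual checkpointing script; the function name
signal_job_group is made up for this example, and it only shows what
"kill -USR1 to the process group" amounts to at the system-call level.

```python
import os
import signal


def signal_job_group(pgid: int, sig: int = signal.SIGUSR1) -> bool:
    """Send `sig` to every process in process group `pgid`.

    Hypothetical helper for illustration; equivalent in effect to
    `kill -USR1 -- -<pgid>` from a shell. Returns False if the
    process group no longer exists.
    """
    try:
        os.killpg(pgid, sig)
        return True
    except ProcessLookupError:
        # The job (and its whole group) has already exited.
        return False
```

A job that wants to checkpoint itself would install a SIGUSR1 handler;
without one, the default action for SIGUSR1 terminates the process.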
Usually all this works as it should, but from time to time a job in t
goes into an error state with the above message. The stdout_path
is, of course, owned and writable by the user of the job.
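
One check that may help narrow this down: stat() can also fail with
"Permission denied" when a directory somewhere above the file is not
searchable, even if the file itself is owned and writable by the user.
The following is a rough sketch of such a check, to be run as the job
owner on the execution host; the function name first_unsearchable_dir
is made up for this example, and this is only a guess at one possible
cause, not a confirmed diagnosis.

```python
import os
from pathlib import Path


def first_unsearchable_dir(path):
    """Return the first ancestor directory of `path` (starting from the
    root) that the current user cannot search, or None if all of them
    are searchable. Hypothetical diagnostic helper.
    """
    target = Path(path).absolute()
    # Path.parents runs from the nearest directory up to the root,
    # so reverse it to walk from the root downwards.
    for directory in list(target.parents)[::-1]:
        if not os.access(directory, os.X_OK):
            return directory
    return None
```

Note that os.access() tests against the real user ID, so the check
needs to run under the same account as the failing job.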