[GE users] SGE 6 - queues entering error state
Bevan C. Bennett
bevan at fulcrummicro.com
Thu Aug 10 21:32:26 BST 2006
> Am 10.08.2006 um 20:24 schrieb Bevan C. Bennett:
>>> can you please post your queue, sge and exechost configuration.
>> Which parts of it?
> The first few lines from the SGE conf, where the spool directories are
> defined, and maybe you have a local configuration for some of your hosts
> (qconf -sconfl)? And prolog/epilog in any of them?
It's pretty basic...
[bevan at alexander ~]$ qconf -sconf
rsh_daemon /usr/sbin/sshd-grid -i
rlogin_daemon /usr/sbin/sshd-grid -i
All the local configurations are empty.
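For a request like the one quoted above, it can be quicker to post only the parameters that were asked about. A minimal sketch, assuming `qconf -sconf` prints one `<parameter> <value>` pair per line, as in the excerpt above; `show_spool_params` is a hypothetical helper, not part of Grid Engine:

```shell
#!/bin/sh
# Print only the execd spool directory and prolog/epilog settings from
# "qconf -sconf"-style output read on stdin.
# Assumption: each line is "<parameter> <value ...>".
# show_spool_params is a hypothetical helper, not a Grid Engine command.
show_spool_params() {
  awk '$1 ~ /^(execd_spool_dir|prolog|epilog)$/ { print }'
}

# Usage on a host with SGE installed (global, then each local configuration):
#   qconf -sconf | show_spool_params
#   for h in $(qconf -sconfl); do echo "== $h"; qconf -sconf "$h" | show_spool_params; done
```

Queue-level prolog/epilog settings live in the queue configuration (`qconf -sq <queue>`) and would need to be checked separately.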
>>> The only thing I see from this is that the "pid" file doesn't belong in
>>> /scratch/2313.1.all.q/pid, but in
>> I know; for all my correctly running jobs, that is what happens.
>> Could the user be accidentally setting some environment variable that points SGE
>> to $TMPDIR instead of the spool directory?
> Was it an interactive qlogin/qrsh, or a batch qsub/qrsh? Do you see this
> happen only on certain hosts? Is there any prolog/epilog, either global
> or for a queue/host?
I'm seeing it happen for certain users. It looks like the jobs were interactive;
I'm trying to convince those users to make them more batch-friendly.
No prologs/epilogs anywhere.
For at least one of these users, the job puts the queue instance into error
state, gets re-queued, puts the next queue instance into error state, gets
re-queued again, and so on, until all my queue instances are in error state
and the system is locked down.
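When a job walks through the cluster like this, the first step is usually to see which queue instances are currently in error state. A minimal sketch, assuming `qstat -f` queue lines have the form `<queue>@<host> <qtype> <used/tot> <load_avg> <arch> [<states>]`, so only lines with six fields carry a states column, and `E` in that column means error; `list_error_queues` is a hypothetical helper:

```shell
#!/bin/sh
# List queue instances whose states column contains "E" (error)
# from "qstat -f"-style output read on stdin.
# Assumption: queue lines have six fields when a states column is present;
# header, separator and job lines are skipped because their first field
# contains no "@" or their field count differs.
# list_error_queues is a hypothetical helper, not a Grid Engine command.
list_error_queues() {
  awk 'NF == 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Usage on a host with SGE installed:
#   qstat -f | list_error_queues
# After the offending job has been dealt with, each listed instance can be
# cleared with qmod (check qmod(1) for your release: -c, or -cq in later ones):
#   qstat -f | list_error_queues | xargs -n 1 qmod -c
```

Clearing the error state without first removing or fixing the job that triggered it will likely just restart the cascade.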