[GE users] SGE 6 - queues entering error state

Bevan C. Bennett bevan at fulcrummicro.com
Thu Aug 10 21:32:26 BST 2006

Reuti wrote:
> On 10.08.2006 at 20:24, Bevan C. Bennett wrote:
>>> can you please post your queue, sge and exechost configuration.
>> Which parts of it?
> The first few lines from the SGE conf, where the spool directories are
> defined, and maybe you have a local configuration for some of your hosts
> (qconf -sconfl)? And prolog/epilog in any of them?

It's pretty basic...
[bevan@alexander ~]$ qconf -sconf
execd_spool_dir              /usr/local/grid-6.0/default/spool
mailer                       /bin/mail
xterm                        /usr/bin/X11/xterm
load_sensor                  none
prolog                       none
epilog                       none
shell_start_mode             posix_compliant
login_shells                 sh,ksh,csh,tcsh
delegated_file_staging       false
reprioritize                 false
rsh_daemon                   /usr/sbin/sshd-grid -i
rsh_command                  /usr/bin/ssh
rlogin_daemon                /usr/sbin/sshd-grid -i
rlogin_command               /usr/bin/ssh

All the local configurations are empty.

>>> The only thing I see from this is that the "pid" doesn't belong in
>>> /scratch/2313.1.all.q/pid, but in
>>> /mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1/pid.
>> I know... for all my correctly running jobs this is what happens.
>> Could the user be accidentally setting some environment variable that
>> points SGE to $TMPDIR instead of the spool directory?
> Was it an interactive qlogin/qrsh, or a batch qsub/qrsh? Do you see this
> happen only on certain hosts? Is there any prolog/epilog, either global
> or for a queue/host?

I'm seeing it happen for certain users. It looks like it was interactive, but
I'm trying to convince them to make these more batch friendly.
No prolog/epilogs anywhere.

For at least one of these users, the job sets the queue to error, gets
re-queued, sets the next queue to error, gets re-queued, etc... until all my
queue instances are in error state and the system is locked down.
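
To watch this cascade as it happens, the queue instances currently in error
state can be pulled out of the `qstat -f` listing by looking for an E in the
states column. This is only a minimal sketch: it assumes the usual six-column
`qstat -f` layout (queuename, qtype, used/tot., load_avg, arch, states), and the
function name list_error_queues is made up for illustration.

```shell
# Print the queue instances whose states column contains E (error).
# Reads `qstat -f` output on stdin.
# Assumption: queue lines look like
#   all.q@cobalt  BIP  0/2  0.00  lx24-amd64  E
# and have no sixth field when the queue instance is in a normal state.
list_error_queues() {
    awk 'NF == 6 && $1 ~ /@/ && $6 ~ /E/ { print $1 }'
}

# Usage on a live cluster (not run here):
#   qstat -f | list_error_queues
#   qmod -c all.q@cobalt    # clear the error state once the cause is fixed
```

Clearing the error state with qmod -c only helps once the offending job is held
or deleted; otherwise it will be rescheduled and put the next queue instance
into error again.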
