[GE users] SGE 6 - queues entering error state

Reuti reuti at staff.uni-marburg.de
Thu Aug 10 22:42:59 BST 2006


On 10.08.2006 at 22:32, Bevan C. Bennett wrote:

> Reuti wrote:
>> On 10.08.2006 at 20:24, Bevan C. Bennett wrote:
>>
>>>
>>>> can you please post your queue, sge and exechost configuration.
>>>
>>> Which parts of it?
>>
>> The first few lines from the SGE conf, where the spool directories are
>> defined, and maybe you have a local configuration for some of your hosts
>> (qconf -sconfl)? And is there a prolog/epilog in any of them?
>
> It's pretty basic...
> [bevan@alexander ~]$ qconf -sconf
> global:
> execd_spool_dir              /usr/local/grid-6.0/default/spool

But this differs from the spool path mentioned below:

/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1/pid
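
A minimal sketch of that comparison, assuming the output of `qconf -sconf` has been captured as text. The helper names `parse_sge_conf` and `under_spool_dir` are made up for illustration, and the check is purely textual: it does not resolve symlinks, so a spool directory reached through a link would still be reported as a mismatch.

```python
import posixpath


def parse_sge_conf(text):
    """Parse 'name value' lines as printed by `qconf -sconf` into a dict."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Lines with a single token, such as "global:", carry no value.
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def under_spool_dir(conf, path):
    """Return True if path lies inside the configured execd_spool_dir."""
    spool = posixpath.normpath(conf["execd_spool_dir"])
    path = posixpath.normpath(path)
    return path == spool or path.startswith(spool + "/")
```

With the values quoted in this thread, `under_spool_dir` would return False for the active_jobs path above, which is the discrepancy in question.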

> mailer                       /bin/mail
> xterm                        /usr/bin/X11/xterm
> load_sensor                  none
> prolog                       none
> epilog                       none
> shell_start_mode             posix_compliant

Often "unix_behavior" is easier for the users, but that depends of course
on your needs.

> login_shells                 sh,ksh,csh,tcsh
> ...
> delegated_file_staging       false
> reprioritize                 false
> rsh_daemon                   /usr/sbin/sshd-grid -i
> rsh_command                  /usr/bin/ssh
> rlogin_daemon                /usr/sbin/sshd-grid -i
> rlogin_command               /usr/bin/ssh

For qlogin, did you also define the wrapper as described in the Howto?

> All the local configurations are empty.
>
>>
>>
>>>> The only thing I see from this is that the "pid" doesn't belong in
>>>> /scratch/2313.1.all.q/pid, but in
>>>> /mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1/pid.
>>>
>>> I know... for all my correctly running jobs this is what happens.
>>> Could the user be accidentally setting some environment variable that
>>> points SGE to $TMPDIR instead of the spool directory?
>>
>> Was it an interactive qlogin/qrsh, or a batch qsub/qrsh? Do you see this
>> happen only on certain hosts? Is there any prolog/epilog, either global
>> or for a queue/host?
>
> I'm seeing it happen for certain users. It looks like it was interactive, but
> I'm trying to convince them to make these more batch friendly.
> No prolog/epilogs anywhere.
>
> For at least one of these users, the job sets the queue to error, gets
> re-queued, sets the next queue to error, gets re-queued, etc... until all my
> queue instances are in error state and the system is locked down.
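
To see how far such a cascade has spread, the queue instances in error state could be listed from captured `qstat -f` output. This is only a sketch under an assumption: that each queue instance line has the columns queuename, qtype, used/tot., load_avg, arch and an optional states column, with "E" marking the error state. The function name `queues_in_error` is made up for illustration, and the column layout may differ between SGE versions.

```python
def queues_in_error(qstat_f_text):
    """Return queue instance names whose states column contains 'E'.

    Assumes the states column is the sixth whitespace-separated field
    and is simply absent when a queue instance has no special state.
    """
    errored = []
    for line in qstat_f_text.splitlines():
        fields = line.split()
        # Queue instance lines start with "queue@host"; header, separator
        # and job lines do not have an "@" in their first field.
        if len(fields) >= 6 and "@" in fields[0] and "E" in fields[5]:
            errored.append(fields[0])
    return errored
```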

Is this user known on all nodes with the same UID?
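
One way to check this: collect the numeric UID per node (for example from `id -u <user>` run on each host) and compare the values. The sketch below covers only the comparison step; the function name `inconsistent_uids` and the host-to-UID mapping passed to it are illustrative assumptions, not something SGE provides.

```python
def inconsistent_uids(uid_by_host):
    """Group hosts by UID; return {} when every host reports the same UID."""
    groups = {}
    for host, uid in uid_by_host.items():
        groups.setdefault(uid, []).append(host)
    # More than one distinct UID means the account differs between nodes.
    return groups if len(groups) > 1 else {}
```

An empty result means the account is consistent; otherwise the result shows which hosts disagree.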

-- Reuti
