[GE users] SGE 6 - queues entering error state

Reuti reuti at staff.uni-marburg.de
Wed Aug 9 22:05:58 BST 2006


Hi,

On 09.08.2006 at 18:57, Bevan C. Bennett wrote:

> We've been running a SGE5 installation for a number of years, but I'm
> trying to bring up and test an SGE6 installation so we can eventually
> migrate over.
>
> Early testing has been having a disturbingly large quantity of queue
> instances set to the error state due to some mysterious issues that
> don't appear to be related to the job itself.
>
> /scratch is a world writable directory present on all nodes that we
> use as tmpdir. Does anyone know what might be actually causing these
> error messages? Even a pointer for where to investigate next would be
> helpful, as I can't go live until this is somehow resolved.
>
> --------------------------------------------------------
>
> Job 2313 caused action: Queue "all.q@cobalt" set to ERROR
>  User        = user
>  Queue       = all.q@cobalt
>  Host        = cobalt
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before job:08/08/2006 19:54:58 [0:21070]: cant open file /scratch/2313.1.all.q/pid: No such file or directory
> Shepherd trace:
> 08/08/2006 19:54:57 [5143:21070]: shepherd called with uid = 0, euid = 5143
> 08/08/2006 19:54:57 [5143:21070]: starting up maintrunk
> 08/08/2006 19:54:57 [5143:21070]: setpgid(21070, 21070) returned 0
> 08/08/2006 19:54:57 [5143:21070]: no prolog script to start
> 08/08/2006 19:54:57 [5143:21071]: processing qlogin job
> 08/08/2006 19:54:57 [5143:21071]: pid=21071 pgrp=21071 sid=21071 old pgrp=21070 getlogin()=<no login set>
> 08/08/2006 19:54:57 [5143:21071]: reading passwd information for user 'root'
> 08/08/2006 19:54:57 [5143:21070]: forked "job" with pid 21071
> 08/08/2006 19:54:57 [5143:21071]: setosjobid: uid = 0, euid = 5143
> 08/08/2006 19:54:57 [5143:21070]: child: job - pid: 21071
> 08/08/2006 19:54:57 [5143:21071]: setting limits
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_CPU setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_FSIZE setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_DATA setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_STACK setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_CORE setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: RLIMIT_RSS setting: (soft 4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
> 08/08/2006 19:54:57 [5143:21071]: setting environment
> 08/08/2006 19:54:57 [5143:21071]: Initializing error file
> 08/08/2006 19:54:57 [5143:21071]: switching to intermediate/target user
> 08/08/2006 19:54:57 [9140:21071]: closing all filedescriptors
> 08/08/2006 19:54:57 [9140:21071]: further messages are in "error" and "trace"
> 08/08/2006 19:54:57 [0:21071]: now running with uid=0, euid=0
> 08/08/2006 19:54:57 [0:21071]: start qlogin
> 08/08/2006 19:54:57 [0:21071]: calling qlogin_starter(/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1, /usr/sbin/sshd-grid -i);
> 08/08/2006 19:54:57 [0:21071]: uid = 0, euid = 0, gid = 0, egid = 0
> 08/08/2006 19:54:57 [0:21071]: using sfd 1
> 08/08/2006 19:54:57 [0:21071]: bound to port 51420
> 08/08/2006 19:54:57 [0:21071]: write_to_qrsh - data = 0:51420:/usr/local/grid-6.0/utilbin/lx26-x86:/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1:cobalt
> 08/08/2006 19:54:57 [0:21071]: write_to_qrsh - address = darius:53184
> 08/08/2006 19:54:57 [0:21071]: write_to_qrsh - host = darius, port = 53184
> 08/08/2006 19:54:57 [0:21071]: waiting for connection.
> 08/08/2006 19:54:57 [0:21071]: accepted connection on fd 2
> 08/08/2006 19:54:57 [0:21071]: daemon to start: |/usr/sbin/sshd-grid -i|
> 08/08/2006 19:54:58 [5143:21070]: wait3 returned 21071 (status: 0; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 0)
> 08/08/2006 19:54:58 [5143:21070]: job exited with exit status 0
> 08/08/2006 19:54:58 [5143:21070]: reaped "job" with pid 21071
> 08/08/2006 19:54:58 [5143:21070]: job exited not due to signal
> 08/08/2006 19:54:58 [5143:21070]: job exited with status 0
> 08/08/2006 19:54:58 [0:21070]: cant open file /scratch/2313.1.all.q/pid: No such file or directory
> 08/08/2006 19:54:58 [0:21070]: write_to_qrsh - data = 1:cant open file /scratch/2313.1.all.q/pid: No such file or directory
> 08/08/2006 19:54:58 [0:21070]: write_to_qrsh - address = darius:53184
> 08/08/2006 19:54:58 [0:21070]: write_to_qrsh - host = darius, port = 53184
>
> Shepherd error:
> 08/08/2006 19:54:58 [0:21070]: cant open file /scratch/2313.1.all.q/pid: No such file or directory
>
> Shepherd pe_hostfile:
> cobalt.internal.avlsi.com 1 all.q@cobalt <NULL>

Can you please post your queue, SGE, and exechost configuration?
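
For reference, a minimal sketch of gathering those three configurations in one go. It relies on the standard `qconf -sq`, `qconf -sconf` and `qconf -se` options; the helper names `config_commands` and `collect` are made up for illustration, and the queue and host names are simply the ones from the report above.

```python
import shutil
import subprocess


def config_commands(queue, host):
    """Return the qconf invocations for queue, global (sge) and exec host config."""
    return [
        ["qconf", "-sq", queue],  # queue configuration
        ["qconf", "-sconf"],      # global cluster configuration
        ["qconf", "-se", host],   # execution host configuration
    ]


def collect(queue, host):
    """Run each command and return its output keyed by the command line.

    Returns an empty dict when qconf is not on PATH (assumption: the SGE
    environment has been sourced on a submit or admin host).
    """
    if shutil.which("qconf") is None:
        return {}
    outputs = {}
    for cmd in config_commands(queue, host):
        result = subprocess.run(cmd, capture_output=True, text=True)
        outputs[" ".join(cmd)] = result.stdout or result.stderr
    return outputs


if __name__ == "__main__":
    for cmd, text in collect("all.q", "cobalt").items():
        print("### " + cmd)
        print(text)
```

Run on a host where the SGE settings file has been sourced, this prints the three configurations under a heading each, ready to paste into a reply.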

The only thing I can see from this is that the "pid" file doesn't
belong in /scratch/2313.1.all.q/pid, but in
/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1/pid.
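
To make that comparison repeatable for other failed jobs, here is a rough sketch that pulls both paths out of a shepherd trace. It assumes the trace lines have been rejoined (the mail wrapping removed) and that the messages look like the two quoted above; the function name `pid_file_mismatch` and the regular expressions are my own, not anything SGE provides.

```python
import re

# Two lines taken from the trace above, with the mail line wrapping undone.
TRACE = """\
08/08/2006 19:54:57 [0:21071]: calling qlogin_starter(/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1, /usr/sbin/sshd-grid -i);
08/08/2006 19:54:58 [0:21070]: cant open file /scratch/2313.1.all.q/pid: No such file or directory
"""


def pid_file_mismatch(trace):
    """Return (expected, actual) pid file paths if they differ, else None.

    expected: <active_jobs dir passed to qlogin_starter>/pid
    actual:   the path from the "cant open file .../pid" message
    """
    spool = re.search(r"qlogin_starter\((\S+?),", trace)
    failed = re.search(r"cant open file (\S+/pid):", trace)
    if not spool or not failed:
        return None
    expected = spool.group(1).rstrip("/") + "/pid"
    actual = failed.group(1)
    return None if actual == expected else (expected, actual)


if __name__ == "__main__":
    print(pid_file_mismatch(TRACE))
```

For the trace in this thread it reports the active_jobs path as expected and the /scratch path as the one the shepherd actually tried to open.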

Just out of curiosity: what type of jobs are you running that you put
$TMPDIR on shared space? Many jobs benefit from a local $TMPDIR.
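
If it is unclear whether /scratch is node-local or network-mounted on a given host, the filesystem type of its mount point tells you. A small sketch, assuming a Linux mount table in /proc/mounts format; the sample table, the fileserver name and the list of network filesystem types are hypothetical and only there to make the example self-contained.

```python
# Filesystem types treated as "shared" here (an assumption; extend as needed).
NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs", "afs", "lustre", "gpfs"}

# Hypothetical mount table in /proc/mounts format.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext3 rw 0 0
fileserver:/export/scratch /scratch nfs rw 0 0
"""


def mount_for(path, mounts_text):
    """Return (mount_point, fs_type) of the longest mount point containing path."""
    best = None
    probe = path.rstrip("/") + "/"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if probe.startswith(prefix):
            if best is None or len(mount_point) > len(best[0]):
                best = (mount_point, fs_type)
    return best


def is_network_tmpdir(path, mounts_text):
    """True if path lives on one of the filesystem types in NETWORK_FS."""
    found = mount_for(path, mounts_text)
    return found is not None and found[1] in NETWORK_FS


if __name__ == "__main__":
    # On a real node, read the live table instead of the sample:
    # mounts_text = open("/proc/mounts").read()
    print(is_network_tmpdir("/scratch/2313.1.all.q", SAMPLE_MOUNTS))
```

With the sample table the job's $TMPDIR under /scratch is flagged as network-mounted, while a path on the root filesystem is not.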

-- Reuti
