[GE users] SGE 6 - queues entering error state

Bevan C. Bennett bevan at fulcrummicro.com
Wed Aug 9 17:57:16 BST 2006



We've been running an SGE5 installation for a number of years, but I'm trying to
bring up and test an SGE6 installation so we can eventually migrate over.

In early testing, a disturbingly large number of queue instances have been set
to the error state by failures that don't appear to be related to the jobs
themselves.

/scratch is a world-writable directory, present on all nodes, that we use as
tmpdir. Does anyone know what might actually be causing these error messages?
Even a pointer to where to investigate next would be helpful, as I can't go
live until this is resolved. One such report, from a qlogin job, follows.
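
For anyone wanting to rule out the tmpdir itself, here is a minimal sketch of a per-node check, assuming Python is available on the execution hosts. The function name check_tmpdir is hypothetical and not part of SGE. It only confirms that the directory exists, is world writable with the sticky bit set, and accepts a new subdirectory. It does not reproduce what the shepherd actually does.

```python
import os
import stat
import tempfile


def check_tmpdir(path):
    """Return a list of problems found with a candidate tmpdir (empty if none)."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return ["cannot stat %s: %s" % (path, exc)]
    if not stat.S_ISDIR(st.st_mode):
        return ["%s is not a directory" % path]

    problems = []
    if not st.st_mode & stat.S_IWOTH:
        problems.append("%s is not world writable" % path)
    if not st.st_mode & stat.S_ISVTX:
        problems.append("%s has no sticky bit" % path)

    # Create and remove a throwaway subdirectory, roughly as a per-job
    # directory would be created (an assumption about the failure mode).
    try:
        probe = tempfile.mkdtemp(prefix="probe.", dir=path)
        os.rmdir(probe)
    except OSError as exc:
        problems.append("cannot create a subdirectory in %s: %s" % (path, exc))
    return problems


if __name__ == "__main__":
    for problem in check_tmpdir("/scratch"):
        print(problem)
```

Run on each node as the job owner and as the SGE admin user. No output means none of these basic checks failed.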

--------------------------------------------------------

Job 2313 caused action: Queue "all.q@cobalt" set to ERROR
 User        = user
 Queue       = all.q@cobalt
 Host        = cobalt
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:08/08/2006 19:54:58 [0:21070]: cant open file
/scratch/2313.1.all.q/pid: No such file or directory
Shepherd trace:
08/08/2006 19:54:57 [5143:21070]: shepherd called with uid = 0, euid = 5143
08/08/2006 19:54:57 [5143:21070]: starting up maintrunk
08/08/2006 19:54:57 [5143:21070]: setpgid(21070, 21070) returned 0
08/08/2006 19:54:57 [5143:21070]: no prolog script to start
08/08/2006 19:54:57 [5143:21071]: processing qlogin job
08/08/2006 19:54:57 [5143:21071]: pid=21071 pgrp=21071 sid=21071 old pgrp=21070
getlogin()=<no login set>
08/08/2006 19:54:57 [5143:21071]: reading passwd information for user 'root'
08/08/2006 19:54:57 [5143:21070]: forked "job" with pid 21071
08/08/2006 19:54:57 [5143:21071]: setosjobid: uid = 0, euid = 5143
08/08/2006 19:54:57 [5143:21070]: child: job - pid: 21071
08/08/2006 19:54:57 [5143:21071]: setting limits
08/08/2006 19:54:57 [5143:21071]: RLIMIT_CPU setting: (soft 4294967295 hard
4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: RLIMIT_FSIZE setting: (soft 4294967295 hard
4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: RLIMIT_DATA setting: (soft 4294967295 hard
4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: RLIMIT_STACK setting: (soft 4294967295 hard
4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: RLIMIT_CORE setting: (soft 4294967295 hard
4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
4294967295 hard 4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: RLIMIT_RSS setting: (soft 4294967295 hard
4294967295) resulting: (soft 4294967295 hard 4294967295)
08/08/2006 19:54:57 [5143:21071]: setting environment
08/08/2006 19:54:57 [5143:21071]: Initializing error file
08/08/2006 19:54:57 [5143:21071]: switching to intermediate/target user
08/08/2006 19:54:57 [9140:21071]: closing all filedescriptors
08/08/2006 19:54:57 [9140:21071]: further messages are in "error" and "trace"
08/08/2006 19:54:57 [0:21071]: now running with uid=0, euid=0
08/08/2006 19:54:57 [0:21071]: start qlogin
08/08/2006 19:54:57 [0:21071]: calling
qlogin_starter(/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1,
/usr/sbin/sshd-grid -i);
08/08/2006 19:54:57 [0:21071]: uid = 0, euid = 0, gid = 0, egid = 0
08/08/2006 19:54:57 [0:21071]: using sfd 1
08/08/2006 19:54:57 [0:21071]: bound to port 51420
08/08/2006 19:54:57 [0:21071]: write_to_qrsh - data =
0:51420:/usr/local/grid-6.0/utilbin/lx26-x86:/mnt/local/common/grid-test/default/spool/cobalt/active_jobs/2313.1:cobalt
08/08/2006 19:54:57 [0:21071]: write_to_qrsh - address = darius:53184
08/08/2006 19:54:57 [0:21071]: write_to_qrsh - host = darius, port = 53184
08/08/2006 19:54:57 [0:21071]: waiting for connection.
08/08/2006 19:54:57 [0:21071]: accepted connection on fd 2
08/08/2006 19:54:57 [0:21071]: daemon to start: |/usr/sbin/sshd-grid -i|
08/08/2006 19:54:58 [5143:21070]: wait3 returned 21071 (status: 0; WIFSIGNALED:
0,  WIFEXITED: 1, WEXITSTATUS: 0)
08/08/2006 19:54:58 [5143:21070]: job exited with exit status 0
08/08/2006 19:54:58 [5143:21070]: reaped "job" with pid 21071
08/08/2006 19:54:58 [5143:21070]: job exited not due to signal
08/08/2006 19:54:58 [5143:21070]: job exited with status 0
08/08/2006 19:54:58 [0:21070]: cant open file /scratch/2313.1.all.q/pid: No such
file or directory
08/08/2006 19:54:58 [0:21070]: write_to_qrsh - data = 1:cant open file
/scratch/2313.1.all.q/pid: No such file or directory
08/08/2006 19:54:58 [0:21070]: write_to_qrsh - address = darius:53184
08/08/2006 19:54:58 [0:21070]: write_to_qrsh - host = darius, port = 53184

Shepherd error:
08/08/2006 19:54:58 [0:21070]: cant open file /scratch/2313.1.all.q/pid: No such
file or directory

Shepherd pe_hostfile:
cobalt.internal.avlsi.com 1 all.q@cobalt <NULL>
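
Since the trace above was wrapped by the mailer, a small sketch for rejoining the wrapped lines and pulling out the failing entries may help when comparing several of these reports. It assumes every entry starts with a "MM/DD/YYYY HH:MM:SS [n:pid]:" prefix, as in this trace. The first number in brackets appears to be the effective uid, judging by the first trace line, but that is an inference. The names parse_trace and failures are hypothetical helpers, not SGE tools.

```python
import re

# Matches the "MM/DD/YYYY HH:MM:SS [n:pid]: message" prefix seen in the trace.
ENTRY = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)


def parse_trace(text):
    """Return a list of (timestamp, euid, pid, message), rejoining wrapped lines."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            stamp, euid, pid, message = match.groups()
            entries.append([stamp, int(euid), int(pid), message])
        elif entries and line.strip():
            # A line without a timestamp is treated as a wrapped continuation.
            entries[-1][3] += " " + line.strip()
    return [tuple(entry) for entry in entries]


def failures(entries, needle="No such file or directory"):
    """Return the entries whose message contains the given error text."""
    return [entry for entry in entries if needle in entry[3]]
```

Applied to the trace above, failures(parse_trace(text)) should return only the entries logged by pid 21070 at 19:54:58, after the child (pid 21071) had already exited with status 0.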
