[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Bevan C. Bennett bevan at fulcrummicro.com
Tue Jul 15 00:45:50 BST 2008




I'm seeing a lot of intermittent job failures on my grid (GE 6.1u2). They
always look like the report below, and each one leaves the host that tried
to run the job in error state. Left alone, they'll put the whole system in
error state and bring everything to a crashing halt.

Does anyone have any idea what might cause these?

Has anyone seen similar issues?

We've been band-aiding the situation by scanning the logs and clearing
the error states, but that's not sustainable.
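
A minimal sketch of such a workaround, assuming the reports look like the
one quoted in this message. The function names are hypothetical, and the
use of `qmod -cq` to clear a queue instance's error state is an assumption
to check against the qmod man page for the installed version. The sketch
only builds the commands; it does not run them.

```python
import re

# Matches the "Queue = ..." line of a job-error report. The archived copy
# shows "all.q at host"; a live report presumably uses "all.q@host", so
# both spellings are accepted (assumption).
QUEUE_LINE = re.compile(
    r"^\s*Queue\s*=\s*([^\s@]+)(?:\s+at\s+|@)(\S+)\s*$", re.MULTILINE
)


def errored_queue_instances(report_text):
    """Return the sorted, de-duplicated queue instances named in the reports."""
    return sorted({f"{queue}@{host}" for queue, host in QUEUE_LINE.findall(report_text)})


def clear_commands(queue_instances):
    """Build (but do not run) one clear command per queue instance.

    `qmod -cq` is assumed to be the right switch for this version.
    """
    return [["qmod", "-cq", instance] for instance in queue_instances]


if __name__ == "__main__":
    report = (
        "Job 1157200 caused action: Job 1157200 set to ERROR\n"
        " Queue       = all.q at ytterbium.internal.avlsi.com\n"
        " Host        = ytterbium.internal.avlsi.com\n"
    )
    for command in clear_commands(errored_queue_instances(report)):
        print(" ".join(command))
```

Printing the commands rather than executing them keeps a person in the
loop, which seems prudent while the underlying cause is still unknown.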

They all say roughly the same thing:

Job 1157200 caused action: Job 1157200 set to ERROR
 User        = rozdag
 Queue       = all.q at ytterbium.internal.avlsi.com
 Host        = ytterbium.internal.avlsi.com
 Start Time  = <unknown>
 End Time    = <unknown>
failed before job:04/23/2008 14:08:01 [0:10801]: can't open file
job_pid: Permission denied
Shepherd trace:
04/23/2008 14:08:01 [5143:10800]: shepherd called with uid = 0, euid = 5143
04/23/2008 14:08:01 [5143:10800]: starting up 6.1u2
04/23/2008 14:08:01 [5143:10800]: setpgid(10800, 10800) returned 0
04/23/2008 14:08:01 [5143:10800]: no prolog script to start
04/23/2008 14:08:01 [5143:10801]: processing qlogin job
04/23/2008 14:08:01 [5143:10801]: pid=10801 pgrp=10801 sid=10801 old
pgrp=10800 getlogin()=<no login set>
04/23/2008 14:08:01 [5143:10801]: reading passwd information for user 'root'
04/23/2008 14:08:01 [5143:10801]: setosjobid: uid = 0, euid = 5143
04/23/2008 14:08:01 [5143:10800]: forked "job" with pid 10801
04/23/2008 14:08:01 [5143:10801]: setting limits
04/23/2008 14:08:01 [5143:10801]: RLIMIT_CPU setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: RLIMIT_FSIZE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: RLIMIT_DATA setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: RLIMIT_STACK setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: RLIMIT_CORE setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: RLIMIT_RSS setting: (soft
18446744073709551615 hard 18446744073709551615) resulting: (soft
18446744073709551615 hard 18446744073709551615)
04/23/2008 14:08:01 [5143:10801]: setting environment
04/23/2008 14:08:01 [5143:10801]: Initializing error file
04/23/2008 14:08:01 [5143:10800]: child: job - pid: 10801
04/23/2008 14:08:01 [5143:10801]: switching to intermediate/target user
04/23/2008 14:08:01 [9086:10801]: closing all filedescriptors
04/23/2008 14:08:01 [9086:10801]: further messages are in "error" and
"trace"
04/23/2008 14:08:01 [0:10801]: now running with uid=0, euid=0
04/23/2008 14:08:01 [0:10801]: start qlogin
04/23/2008 14:08:01 [0:10801]: calling
qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/ytterbium/active_jobs/1157200.1,
/usr/sbin/sshd-grid -i);
04/23/2008 14:08:01 [0:10801]: uid = 0, euid = 0, gid = 0, egid = 0
04/23/2008 14:08:01 [0:10801]: using sfd 1
04/23/2008 14:08:01 [0:10801]: bound to port 52161
04/23/2008 14:08:01 [0:10801]: write_to_qrsh - data =
0:52161:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/ytterbium/active_jobs/1157200.1:ytterbium.internal.avlsi.com
04/23/2008 14:08:01 [0:10801]: write_to_qrsh - address = niobium:55167
04/23/2008 14:08:01 [0:10801]: write_to_qrsh - host = niobium, port = 55167
04/23/2008 14:08:01 [0:10801]: error connecting stream socket:
Connection refused
04/23/2008 14:08:01 [0:10801]: communication with qrsh failed
04/23/2008 14:08:01 [0:10801]: forked "job" with pid 0
04/23/2008 14:08:01 [0:10801]: can't open file job_pid: Permission denied
04/23/2008 14:08:01 [0:10801]: write_to_qrsh - data = 1:can't open file
job_pid: Permission denied
04/23/2008 14:08:01 [0:10801]: write_to_qrsh - address = niobium
04/23/2008 14:08:01 [0:10801]: illegal value for qrsh_control_port:
"niobium". Should be host:port
04/23/2008 14:08:01 [5143:10800]: wait3 returned 10801 (status: 2816;
WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 11)
04/23/2008 14:08:01 [5143:10800]: job exited with exit status 11
04/23/2008 14:08:01 [5143:10800]: reaped "job" with pid 10801
04/23/2008 14:08:01 [5143:10800]: job exited not due to signal
04/23/2008 14:08:01 [5143:10800]: job exited with status 11
04/23/2008 14:08:01 [0:10800]: can't open file
/scratch/1157200.1.all.q/pid: No such file or directory
04/23/2008 14:08:01 [0:10800]: write_to_qrsh - data = 1:can't open file
/scratch/1157200.1.all.q/pid: No such file or directory
04/23/2008 14:08:01 [0:10800]: write_to_qrsh - address = niobium:55167
04/23/2008 14:08:01 [0:10800]: write_to_qrsh - host = niobium, port = 55167
04/23/2008 14:08:01 [0:10800]: error connecting stream socket:
Connection refused

Shepherd error:
04/23/2008 14:08:01 [0:10801]: can't open file job_pid: Permission denied
04/23/2008 14:08:01 [0:10800]: can't open file
/scratch/1157200.1.all.q/pid: No such file or directory

Shepherd pe_hostfile:
ytterbium.internal.avlsi.com 1 all.q at ytterbium.internal.avlsi.com <NULL>
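
When comparing many of these reports, it may help to pull out the first
failing step of each shepherd trace automatically. In the trace above that
step is the refused connection back to niobium:55167; the job_pid and pid
errors come after it. The following is a rough sketch, not a Grid Engine
tool: the function names and the list of failure phrases are assumptions,
and the first number in the brackets is treated as the effective uid only
because that is what the trace appears to show.

```python
import re

# One trace entry: "MM/DD/YYYY HH:MM:SS [euid:pid]: message"
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$")

# Phrases taken from the failing lines of the trace in this message.
# A bare "error" is deliberately not used, since harmless lines such as
# "Initializing error file" contain it.
FAILURE_HINTS = ("error connecting", "failed", "can't ", "illegal value")


def parse_trace(text):
    """Parse trace text into entries, re-joining lines wrapped by the mailer."""
    entries = []
    open_entry = None
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            stamp, euid, pid, msg = match.groups()
            open_entry = {"time": stamp, "euid": int(euid), "pid": int(pid), "msg": msg}
            entries.append(open_entry)
        elif line.strip() and open_entry is not None:
            # Continuation of a wrapped entry.
            open_entry["msg"] += " " + line.strip()
        else:
            # A blank line or a section heading ends the current entry.
            open_entry = None
    return entries


def first_failure(entries):
    """Return the first entry whose message contains a failure phrase, or None."""
    for entry in entries:
        message = entry["msg"].lower()
        if any(hint in message for hint in FAILURE_HINTS):
            return entry
    return None
```

Calling first_failure(parse_trace(report_text)) on each saved report and
tallying the resulting messages would show whether every failure starts
with the same refused connection or whether there are several patterns.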
