[GE users] LOTS of GE 6.1u2: Job 1157200 failed

Reuti reuti@staff.uni-marburg.de
Tue Jul 15 08:36:21 BST 2008


Hi,

Am 15.07.2008 um 01:45 schrieb Bevan C. Bennett:

>
> I'm seeing a lot of intermittent failures on my grid. They always look
> like the following, and they end up putting the host that tried to run
> them into an error state. Left alone, they'll mark the whole system as
> in error and bring everything to a crashing halt.
>
> Does anyone have any idea what might cause these?
>
> Has anyone seen similar issues?
>
> We've been band-aiding the situation by scanning the logs and clearing
> the error states, but that's not maintainable.
>
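Side note: until the cause is found, the scanning and clearing can at
least be scripted rather than done by hand. A minimal sketch, assuming
6.x qstat/qmod syntax (this clears every queue instance at once; use a
specific queue@host instead of '*' to be more careful):

    # show queue instances in error state, with the recorded reason
    qstat -f -qs E -explain E

    # clear the error state on all queue instances
    qmod -c '*'

This of course only treats the symptom.
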
> They all say roughly the same thing:
>
> Job 1157200 caused action: Job 1157200 set to ERROR
>  User        = rozdag
>  Queue       = all.q@ytterbium.internal.avlsi.com
>  Host        = ytterbium.internal.avlsi.com
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before job:04/23/2008 14:08:01 [0:10801]: can't open file job_pid: Permission denied
> Shepherd trace:
> 04/23/2008 14:08:01 [5143:10800]: shepherd called with uid = 0, euid = 5143
> 04/23/2008 14:08:01 [5143:10800]: starting up 6.1u2
> 04/23/2008 14:08:01 [5143:10800]: setpgid(10800, 10800) returned 0
> 04/23/2008 14:08:01 [5143:10800]: no prolog script to start
> 04/23/2008 14:08:01 [5143:10801]: processing qlogin job
> 04/23/2008 14:08:01 [5143:10801]: pid=10801 pgrp=10801 sid=10801 old pgrp=10800 getlogin()=<no login set>
> 04/23/2008 14:08:01 [5143:10801]: reading passwd information for user 'root'
> 04/23/2008 14:08:01 [5143:10801]: setosjobid: uid = 0, euid = 5143
> 04/23/2008 14:08:01 [5143:10800]: forked "job" with pid 10801
> 04/23/2008 14:08:01 [5143:10801]: setting limits
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_CPU setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_FSIZE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_DATA setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_STACK setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_CORE setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: RLIMIT_RSS setting: (soft 18446744073709551615 hard 18446744073709551615) resulting: (soft 18446744073709551615 hard 18446744073709551615)
> 04/23/2008 14:08:01 [5143:10801]: setting environment
> 04/23/2008 14:08:01 [5143:10801]: Initializing error file
> 04/23/2008 14:08:01 [5143:10800]: child: job - pid: 10801
> 04/23/2008 14:08:01 [5143:10801]: switching to intermediate/target user
> 04/23/2008 14:08:01 [9086:10801]: closing all filedescriptors
> 04/23/2008 14:08:01 [9086:10801]: further messages are in "error" and "trace"
> 04/23/2008 14:08:01 [0:10801]: now running with uid=0, euid=0
> 04/23/2008 14:08:01 [0:10801]: start qlogin
> 04/23/2008 14:08:01 [0:10801]: calling qlogin_starter(/mnt/fulcrum/local/common/grid-6.0/default/spool/ytterbium/active_jobs/1157200.1, /usr/sbin/sshd-grid -i);
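As an aside: the sshd-grid at the end of that call is taken from the
qlogin_daemon parameter of the configuration, so it may be worth
verifying that it is set consistently for all hosts. Roughly, assuming
the ssh-based qlogin setup from the howtos (the wrapper path is only a
placeholder):

    # inspect the qlogin wiring in the cluster configuration
    qconf -sconf | grep -i qlogin
    # expected along the lines of:
    #   qlogin_daemon    /usr/sbin/sshd-grid -i
    #   qlogin_command   /path/to/qlogin_wrapper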

Are the spool directories shared via NFS or similar, and could this be
a timing problem? Can you keep them on local disk instead, e.g. in
/var/spool/sge?

http://gridengine.sunsource.net/howto/nfsreduce.html
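For a quick test, the execd spool can be moved to local disk per host.
A rough sketch, assuming a standard 6.x installation with admin user
"sgeadmin" (both the user and the path are only examples; see
execd_spool_dir in sge_conf(5)):

    # on the execution host: create a local spool directory
    # owned by the Grid Engine admin user
    mkdir -p /var/spool/sge
    chown sgeadmin /var/spool/sge

    # add or change execd_spool_dir in the host-local configuration
    # (opens an editor)
    qconf -mconf ytterbium.internal.avlsi.com

    # then restart the sge_execd on that host so it spools
    # under the new location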

-- Reuti


> 04/23/2008 14:08:01 [0:10801]: uid = 0, euid = 0, gid = 0, egid = 0
> 04/23/2008 14:08:01 [0:10801]: using sfd 1
> 04/23/2008 14:08:01 [0:10801]: bound to port 52161
> 04/23/2008 14:08:01 [0:10801]: write_to_qrsh - data = 0:52161:/usr/local/grid-6.0/utilbin/lx24-amd64:/mnt/fulcrum/local/common/grid-6.0/default/spool/ytterbium/active_jobs/1157200.1:ytterbium.internal.avlsi.com
> 04/23/2008 14:08:01 [0:10801]: write_to_qrsh - address = niobium:55167
> 04/23/2008 14:08:01 [0:10801]: write_to_qrsh - host = niobium, port = 55167
> 04/23/2008 14:08:01 [0:10801]: error connecting stream socket: Connection refused
> 04/23/2008 14:08:01 [0:10801]: communication with qrsh failed
> 04/23/2008 14:08:01 [0:10801]: forked "job" with pid 0
> 04/23/2008 14:08:01 [0:10801]: can't open file job_pid: Permission denied
> 04/23/2008 14:08:01 [0:10801]: write_to_qrsh - data = 1:can't open file job_pid: Permission denied
> 04/23/2008 14:08:01 [0:10801]: write_to_qrsh - address = niobium
> 04/23/2008 14:08:01 [0:10801]: illegal value for qrsh_control_port: "niobium". Should be host:port
> 04/23/2008 14:08:01 [5143:10800]: wait3 returned 10801 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
> 04/23/2008 14:08:01 [5143:10800]: job exited with exit status 11
> 04/23/2008 14:08:01 [5143:10800]: reaped "job" with pid 10801
> 04/23/2008 14:08:01 [5143:10800]: job exited not due to signal
> 04/23/2008 14:08:01 [5143:10800]: job exited with status 11
> 04/23/2008 14:08:01 [0:10800]: can't open file /scratch/1157200.1.all.q/pid: No such file or directory
> 04/23/2008 14:08:01 [0:10800]: write_to_qrsh - data = 1:can't open file /scratch/1157200.1.all.q/pid: No such file or directory
> 04/23/2008 14:08:01 [0:10800]: write_to_qrsh - address = niobium:55167
> 04/23/2008 14:08:01 [0:10800]: write_to_qrsh - host = niobium, port = 55167
> 04/23/2008 14:08:01 [0:10800]: error connecting stream socket: Connection refused
>
> Shepherd error:
> 04/23/2008 14:08:01 [0:10801]: can't open file job_pid: Permission denied
> 04/23/2008 14:08:01 [0:10800]: can't open file /scratch/1157200.1.all.q/pid: No such file or directory
>
> Shepherd pe_hostfile:
> ytterbium.internal.avlsi.com 1 all.q@ytterbium.internal.avlsi.com <NULL>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@gridengine.sunsource.net
For additional commands, e-mail: users-help@gridengine.sunsource.net



