[GE users] sheperd error can't stat()

McCalla, Mac macmccalla at hess.com
Tue Dec 11 19:49:59 GMT 2007


Hello Tim,  (and Dan)

We have recently seen job failures like this.  In our case the cause of
the "not accessible from the execution host" condition turned out to be
on the NFS server side (fixing it required the storage vendor), not on
the execution host.  Just something else to keep in mind.

mac mccalla  

-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Tuesday, December 11, 2007 11:05 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] sheperd error can't stat()

Tim,

The job is being submitted with /proj/nhovxs/nhovx0n/caffeine/nhs as the
output path, but that path is not accessible from the execution host for
some reason.  What happens if you log into that machine as the
submitting user and try to access /proj/nhovxs/nhovx0n/caffeine/nhs?
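
A minimal sketch of that check, assuming Python is available on the
execution host and that it is run as the submitting user (not root).
The helper name first_inaccessible is made up for illustration; it
calls stat() on each prefix of the path and reports the first one that
fails, which can help tell a bad directory permission apart from a
mount that is missing altogether.

```python
import os


def first_inaccessible(path):
    """Return (prefix, reason) for the first prefix of `path` that
    stat() rejects, or None if every component can be stat()ed.

    Hypothetical helper, not part of Grid Engine.
    """
    current = os.sep
    for part in os.path.abspath(path).split(os.sep):
        if not part:
            continue  # skip the empty string before the leading separator
        current = os.path.join(current, part)
        try:
            os.stat(current)
        except OSError as exc:
            return current, exc.strerror
    return None


# Path taken from the error message in this thread.
print(first_inaccessible("/proj/nhovxs/nhovx0n/caffeine/nhs"))
```

If this prints None for the submitting user on the execution host, the
path is reachable at that moment and the failure may be intermittent or
server-side.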

Daniel

Tim Fennell wrote:
> Hi all,
>  I've been digging through the archives looking for an answer to the
> error messages I've been receiving but have not found a solution, so
> hopefully someone can help.  Here are the errors that were mailed to
> me.  Thanks in advance.
>
> Job 9569 caused action: Job 9569 set to ERROR
> User        = nhovx0n
> Queue       = dual1.q at cilea10.channing.harvard.edu
> Host        = cilea10.channing.harvard.edu
> Start Time  = <unknown>
> End Time    = <unknown>
> failed opening input/output file:12/10/2007 22:30:20 [11035:6840]: 
> can't stat() "/proj/nhovxs/nhovx0n/caffeine/nhs" as stdout_path: P 
> Shepherd trace:
> 12/10/2007 22:30:20 [16:6839]: shepherd called with uid = 0, euid = 16
> 12/10/2007 22:30:20 [16:6839]: starting up 6.0u9
> 12/10/2007 22:30:20 [16:6839]: setpgid(6839, 6839) returned 0
> 12/10/2007 22:30:20 [16:6839]: no prolog script to start
> 12/10/2007 22:30:20 [16:6839]: forked "job" with pid 6840
> 12/10/2007 22:30:20 [16:6840]: pid=6840 pgrp=6840 sid=6840 old
> pgrp=6839 getlogin()=<no login set>
> 12/10/2007 22:30:20 [16:6840]: reading passwd information for user 
> 'nhovx0n'
> 12/10/2007 22:30:20 [16:6840]: setosjobid: uid = 0, euid = 16
> 12/10/2007 22:30:20 [16:6839]: child: job - pid: 6840
> 12/10/2007 22:30:20 [16:6840]: setting limits
> 12/10/2007 22:30:20 [16:6840]: RLIMIT_CPU setting: (soft
> 18446744073709551613 hard 18446744073709551613) resulting: (soft
> 18446744073709551613 hard 18446744073709551613)
> 12/10/2007 22:30:20 [16:6840]: RLIMIT_FSIZE setting: (soft 13958643712
> hard 13958643712) resulting: (soft 13958643712 hard 13958643712)
> 12/10/2007 22:30:20 [16:6840]: RLIMIT_DATA setting: (soft
> 18446744073709551613 hard 18446744073709551613) resulting: (soft
> 18446744073709551613 hard 18446744073709551613)
> 12/10/2007 22:30:20 [16:6840]: RLIMIT_STACK setting: (soft
> 18446744073709551613 hard 18446744073709551613) resulting: (soft
> 18446744073709551613 hard 18446744073709551613)
> 12/10/2007 22:30:20 [16:6840]: RLIMIT_CORE setting: (soft
> 18446744073709551613 hard 18446744073709551613) resulting: (soft
> 18446744073709551613 hard 18446744073709551613)
> 12/10/2007 22:30:20 [16:6840]: RLIMIT_VMEM setting: (soft
> 18446744073709551613 hard 18446744073709551613) resulting: (soft
> 18446744073709551613 hard 18446744073709551613)
> 12/10/2007 22:30:20 [16:6840]: setting environment
> 12/10/2007 22:30:20 [16:6840]: Initializing error file
> 12/10/2007 22:30:20 [16:6840]: switching to intermediate/target user
> 12/10/2007 22:30:20 [11035:6840]: closing all filedescriptors
> 12/10/2007 22:30:20 [11035:6840]: further messages are in "error" and 
> "trace"
> 12/10/2007 22:30:20 [11035:6840]: can't stat()
> "/proj/nhovxs/nhovx0n/caffeine/nhs" as stdout_path: Permission denied
> KRB5CCNAME=none uid=11035 gid=671 671 20096
> 12/10/2007 22:30:20 [16:6839]: wait3 returned 6840 (status: 6656;
> WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 26)
> 12/10/2007 22:30:20 [16:6839]: job exited with exit status 26
> 12/10/2007 22:30:20 [16:6839]: reaped "job" with pid 6840
> 12/10/2007 22:30:20 [16:6839]: job exited not due to signal
> 12/10/2007 22:30:20 [16:6839]: job exited with status 26
> 12/10/2007 22:30:20 [16:6839]: now sending signal KILL to pid -6840
> 12/10/2007 22:30:20 [16:6839]: no tasker to notify
> 12/10/2007 22:30:20 [16:6839]: failed starting job
> 12/10/2007 22:30:20 [16:6839]: no epilog script to start
>
> Shepherd error:
> 12/10/2007 22:30:20 [11035:6840]: can't stat()
> "/proj/nhovxs/nhovx0n/caffeine/nhs" as stdout_path: Permission denied
> KRB5CCNAME=none uid=11035 gid=671 671 20096
>
> Shepherd pe_hostfile:
> cilea10.channing.harvard.edu 1 dual1.q at cilea10.channing.harvard.edu
> <NULL>
>
>
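
For reading the "wait3 returned 6840 (status: 6656; ...)" line in the
trace above: the raw status and the reported exit status 26 are the same
value in two forms.  A small sketch, assuming the conventional Linux
encoding of wait statuses (exit code in the high byte, terminating
signal in the low seven bits); decode_wait_status is a made-up name,
not a Grid Engine function.

```python
def decode_wait_status(status):
    """Split a raw wait status into its parts.

    Assumes the conventional Linux layout: low 7 bits hold the
    terminating signal (0 means a normal exit, 0x7F means stopped),
    and the next byte holds the exit code.
    """
    low7 = status & 0x7F
    exited = low7 == 0
    signaled = low7 not in (0, 0x7F)
    return {
        "exited": exited,
        "signaled": signaled,
        "exit_status": (status >> 8) & 0xFF if exited else None,
        "signal": low7 if signaled else None,
    }


# 6656 == 26 * 256, so the child exited normally with status 26.
print(decode_wait_status(6656))
```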

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net
