[GE users] Erroneous job execution

Andreas Haas Andreas.Haas at Sun.COM
Wed Apr 12 12:41:28 BST 2006


Hi Ikmal,

The mail tells you that the shepherd "failed before writing exit_status".
This could mean the shepherd hit an error condition it could
not handle. From the shepherd's trace file output I can't tell
what might have caused it.
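
To make a long trace easier to read, it can help to reduce it to the last
message each process logged. The following is only a rough sketch, not part
of Grid Engine: the function name and the regular expression are my own
assumptions, and the line shape "MM/DD/YYYY HH:MM:SS [uid:pid]: message" is
taken purely from the trace quoted in this thread.

```python
import re

# Assumed line shape, inferred from the quoted trace only:
#   MM/DD/YYYY HH:MM:SS [uid:pid]: message
TRACE_LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \[(\d+):(\d+)\]: (.*)$"
)


def last_message_per_pid(trace_text):
    """Return a dict mapping each pid to the last message it logged.

    Leading "> " quote markers from mail replies are stripped, and a line
    that does not start with a timestamp is treated as the wrapped
    continuation of the previous message.
    """
    last = {}
    current_pid = None
    for raw in trace_text.splitlines():
        line = raw.lstrip("> ").strip()
        if not line:
            continue
        match = TRACE_LINE.match(line)
        if match:
            current_pid = int(match.group(3))
            last[current_pid] = match.group(4)
        elif current_pid is not None:
            # Wrapped continuation of the previous trace message.
            last[current_pid] += " " + line
    return last
```

Applied to the trace below, this would report the execvp() call as the last
message of pid 1217 and "child: job - pid: 1217" as the last message of the
parent, pid 1214.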

But you can do the following to figure out what happens:

(1) Use the 'KEEP_ACTIVE' execd_params setting in sge_conf(5)
    for that particular host to prevent the 'active_jobs'
    directory from being removed after the job has run.
(2) Shut down the execd on that host using qconf -ke <host>.
(3) As user "root", change into the job's active directory.

(4) Make sure the directory contains only those files
    which are written by execd before it launches the shepherd.
    The files are: "config", "environment" and "pe_hostfile".
(5) Start the shepherd binary as user "root" like execd does.
(6) Wait and see what happens
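
Step (4) can be checked with a small helper before each attempt. This is
just a sketch under my own assumptions: the function names are hypothetical,
and the only thing taken from the steps above is the list of three file
names. It defaults to a dry run so nothing is deleted by accident.

```python
import os

# The files execd writes before launching the shepherd, per step (4).
EXPECTED = {"config", "environment", "pe_hostfile"}


def split_job_dir(job_dir):
    """Return (missing, extra) for the given active job directory.

    missing: expected files that are absent.
    extra:   directory entries that are not in the expected set.
    """
    present = set(os.listdir(job_dir))
    return sorted(EXPECTED - present), sorted(present - EXPECTED)


def remove_extra(job_dir, dry_run=True):
    """Remove regular files that are not in EXPECTED.

    With dry_run=True (the default) nothing is deleted; the names that
    would be removed are returned either way. Subdirectories are left
    untouched.
    """
    _, extra = split_job_dir(job_dir)
    removed = []
    for name in extra:
        path = os.path.join(job_dir, name)
        if os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            removed.append(name)
    return removed
```

It is probably wise to copy the whole directory somewhere safe first, so
that the "trace" and "error" files from the failed run are not lost.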

Note that you can also start the shepherd under the control of dbx/gdb
or truss/strace. You can repeat this if you start over at (4).
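
For the strace case, a wrapper along these lines might do. Treat it as a
sketch: the function names and the log file name are made up, the path to
the shepherd binary depends on your installation and must be supplied by
you, and I am assuming here that the shepherd is started without arguments
from inside the job's active directory. The strace options used are -f
(follow forked children, which matters because the shepherd forks the job)
and -o (write the trace to a file).

```python
import subprocess


def traced_shepherd_command(shepherd_path, log_file="shepherd.strace"):
    """Build the argv that runs the shepherd under strace.

    shepherd_path is whatever path the shepherd binary has on your
    installation; it is not guessed here.
    """
    return ["strace", "-f", "-o", log_file, shepherd_path]


def run_in_job_dir(argv, job_dir):
    """Run argv with job_dir as working directory; return the exit status."""
    return subprocess.run(argv, cwd=job_dir).returncode
```

Run it as root, as in step (5), and look at the end of the strace log for
the last system calls made before the shepherd exits.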

Keep in mind to switch 'KEEP_ACTIVE' off once you're done
with your diagnosis!

Regards,
Andreas

On Wed, 12 Apr 2006, Hairul Ikmal Mohamad Fuzi wrote:

> Hi everyone,
>
> We have been running a program called MCNP (Monte Carlo N-Particle)
> through SGE for quite some time. Lately, execution through SGE has
> been erroneous. Does anyone have any idea what is actually happening? We
> keep receiving this error by email (see below) every time we
> submit an MCNP job. At first we thought it was caused by
> an erroneous input file, but it wasn't: I have checked
> the input file with the application on another PC.
>
> TIA.
>
> - Ikmal
>
> ==============================
> Job 155 caused action: none
>  User        = seang
>  Queue       = all.q at hptc.local
>  Host        = hptc.local
>  Start Time  = <unknown>
>  End Time    = <unknown>
> failed before writing exit_status:shepherd exited with exit status 19
> Shepherd trace:
> 03/30/2006 09:49:50 [400:1214]: shepherd called with uid = 0, euid = 400
> 03/30/2006 09:49:50 [400:1214]: starting up 6.0u6
> 03/30/2006 09:49:50 [400:1214]: setpgid(1214, 1214) returned 0
> 03/30/2006 09:49:50 [400:1214]: no prolog script to start
> 03/30/2006 09:49:50 [400:1217]: pid=1217 pgrp=1217 sid=1217 old
> pgrp=1214 getlogin()=<no login set>
> 03/30/2006 09:49:50 [400:1217]: reading passwd information for user 'seang'
> 03/30/2006 09:49:50 [400:1217]: setosjobid: uid = 0, euid = 400
> 03/30/2006 09:49:50 [400:1217]: setting limits
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_CPU setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_FSIZE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_DATA setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_STACK setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_CORE setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: RLIMIT_RSS setting: (soft
> 18446744073709551615 hard 18446744073709551615) resulting: (soft
> 18446744073709551615 hard 18446744073709551615)
> 03/30/2006 09:49:50 [400:1217]: setting environment
> 03/30/2006 09:49:50 [400:1217]: Initializing error file
> 03/30/2006 09:49:50 [400:1217]: now doing chown(seang) of trace and error files
> 03/30/2006 09:49:50 [400:1217]: switching to intermediate/target user
> 03/30/2006 09:49:50 [511:1217]: now running with uid=511, euid=511
> 03/30/2006 09:49:50 [511:1217]: closing all filedescriptors
> 03/30/2006 09:49:50 [511:1217]: further messages are in "error" and "trace"
> 03/30/2006 09:49:50 [400:1214]: forked "job" with pid 1217
> 03/30/2006 09:49:50 [400:1214]: child: job - pid: 1217
> 03/30/2006 09:49:50 [511:1217]: using stdout as stderr
> 03/30/2006 09:49:50 [511:1217]: now running with uid=511, euid=511
> 03/30/2006 09:49:50 [511:1217]: execvp(/bin/bash, "-bash"
> "/opt/gridengine/default/spool/hptc/job_scripts/155")
>
> Shepherd pe_hostfile:
> hptc.local 1 all.q at hptc.local <NULL>
> ==============================
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>
