[GE users] Erroneous job execution

Hairul Ikmal Mohamad Fuzi hairul.ikmal at gmail.com
Sat Apr 15 05:14:52 BST 2006



Hi Andreas,

Thanks for the reply.

Sorry to say that I'm still not clear about this shepherd component.
I would appreciate it if somebody could explain what the shepherd is
in SGE terms, and what it generally does in SGE.

Regarding your suggestions,
3) Job's active directory: is it the directory where users put their
job scripts, or is it somewhere in the spool directory?
5) How do I start the shepherd as user 'root'? What command should I use?

And I'm just wondering: is this a software/configuration error, or is
there any possibility that this kind of error is caused by a hardware
failure?

Just FYI, I'm using SGE (v6.something), which comes with the Rocks
4.1 Linux Cluster Distribution.


Thanks again!

On 4/12/06, Andreas Haas <Andreas.Haas at sun.com> wrote:
> Hi Ikmal,
>
> it tells you shepherd "failed before writing exit_status".
> This could mean there was an error condition shepherd could
> not handle. From shepherd's trace file output I can't assess
> what might have caused this.
>
> But you can do the following to figure out what happens:
>
> (1) Use the 'KEEP_ACTIVE' execd_params setting in sge_conf(5)
>     for that particular host to prevent the 'active_jobs'
>     directory from being removed after the job run.
> (2) Shutdown the execd on that host using qconf -ke <host>.
> (3) As user "root" change into the job's active directory
>
> (4) Make sure only those files exist in the directory
>     which get written by execd before it launches the shepherd.
>     The files are: "config", "environment" and "pe_hostfile"
> (5) Start the shepherd binary as user "root" like execd does.
> (6) Wait and see what happens
>
> Note that you can also start the shepherd under control of dbx/gdb
> or truss/strace. You can repeat this by starting over at (4).
>
> Keep in mind to switch 'KEEP_ACTIVE' off once you're done
> with your diagnosis!
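>
> As a rough sketch (commands assume a default $SGE_ROOT layout;
> <hostname>, <jobid> and <arch> are placeholders, not values from
> your site), the steps above might look like:
>
> ```shell
> # (2) shut down the execd on the affected host
> #     (step (1), enabling KEEP_ACTIVE in execd_params, is done
> #      beforehand in the host configuration, see sge_conf(5))
> qconf -ke <hostname>
>
> # (3) as root, change into the job's active directory in the
> #     execd spool area
> cd $SGE_ROOT/default/spool/<hostname>/active_jobs/<jobid>.1
>
> # (4) keep only the files execd writes before launching the shepherd
> ls    # should show only: config  environment  pe_hostfile
>
> # (5) start the shepherd binary as root, as execd would
> $SGE_ROOT/bin/<arch>/sge_shepherd
>
> # or run it under a syscall tracer to capture the failure
> strace -f -o /tmp/shepherd.trace $SGE_ROOT/bin/<arch>/sge_shepherd
> ```
>
> The shepherd reads the "config" file from its working directory, so
> it must be started from inside the active job directory.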
>
> Regards,
> Andreas
>
> On Wed, 12 Apr 2006, Hairul Ikmal Mohamad Fuzi wrote:
>
> > Hi everyone,
> >
> > We have been running a program called MCNP (Monte Carlo N-Particle)
> > through SGE for quite some time. Lately, execution through SGE has
> > been failing. Does anyone have any idea what is actually happening?
> > We keep receiving the error below by email every time we submit an
> > MCNP job. At first we thought it was caused by an erroneous input
> > file; however, it wasn't, as I have checked the input file with the
> > application on another PC.
> >
> > TIA.
> >
> > - Ikmal
> >
> > ==============================
> > Job 155 caused action: none
> >  User        = seang
> >  Queue       = all.q at hptc.local
> >  Host        = hptc.local
> >  Start Time  = <unknown>
> >  End Time    = <unknown>
> > failed before writing exit_status:shepherd exited with exit status 19
> > Shepherd trace:
> > 03/30/2006 09:49:50 [400:1214]: shepherd called with uid = 0, euid = 400
> > 03/30/2006 09:49:50 [400:1214]: starting up 6.0u6
> > 03/30/2006 09:49:50 [400:1214]: setpgid(1214, 1214) returned 0
> > 03/30/2006 09:49:50 [400:1214]: no prolog script to start
> > 03/30/2006 09:49:50 [400:1217]: pid=1217 pgrp=1217 sid=1217 old
> > pgrp=1214 getlogin()=<no login set>
> > 03/30/2006 09:49:50 [400:1217]: reading passwd information for user 'seang'
> > 03/30/2006 09:49:50 [400:1217]: setosjobid: uid = 0, euid = 400
> > 03/30/2006 09:49:50 [400:1217]: setting limits
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_CPU setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_FSIZE setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_DATA setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_STACK setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_CORE setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_VMEM/RLIMIT_AS setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: RLIMIT_RSS setting: (soft
> > 18446744073709551615 hard 18446744073709551615) resulting: (soft
> > 18446744073709551615 hard 18446744073709551615)
> > 03/30/2006 09:49:50 [400:1217]: setting environment
> > 03/30/2006 09:49:50 [400:1217]: Initializing error file
> > 03/30/2006 09:49:50 [400:1217]: now doing chown(seang) of trace and error files
> > 03/30/2006 09:49:50 [400:1217]: switching to intermediate/target user
> > 03/30/2006 09:49:50 [511:1217]: now running with uid=511, euid=511
> > 03/30/2006 09:49:50 [511:1217]: closing all filedescriptors
> > 03/30/2006 09:49:50 [511:1217]: further messages are in "error" and "trace"
> > 03/30/2006 09:49:50 [400:1214]: forked "job" with pid 1217
> > 03/30/2006 09:49:50 [400:1214]: child: job - pid: 1217
> > 03/30/2006 09:49:50 [511:1217]: using stdout as stderr
> > 03/30/2006 09:49:50 [511:1217]: now running with uid=511, euid=511
> > 03/30/2006 09:49:50 [511:1217]: execvp(/bin/bash, "-bash"
> > "/opt/gridengine/default/spool/hptc/job_scripts/155")
> >
> > Shepherd pe_hostfile:
> > hptc.local 1 all.q at hptc.local <NULL>
> > ==============================
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
>




