[GE users] Inaccurate reporting leading to abort of jobs

Alex Shenfield alex.shenfield at gmail.com
Mon Mar 13 11:33:44 GMT 2006



Hi,

I have been told by the administrators of our Sun Grid Engine machines that
there is a potential issue affecting parallel jobs that may be caused by
enabling the ENABLE_ADDGRP_KILL parameter.  Can anybody shed any light
on this?
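
A minimal sketch of the check Mac describes in the quoted reply below, for
anyone who wants to confirm the setting on their own cluster.  It assumes
qconf is on the PATH and that the parameter is written as
ENABLE_ADDGRP_KILL=true (or =1) in the execd_params entry; the function
name addgrp_kill_state is made up for illustration and is not part of
Grid Engine.

```shell
#!/bin/sh
# Reads a cluster configuration (as printed by "qconf -sconf") on stdin and
# reports whether ENABLE_ADDGRP_KILL is set to true or 1.
# addgrp_kill_state is a hypothetical helper name, not a Grid Engine command.
addgrp_kill_state() {
    # Match "=true" or "=1" followed by a separator or end of line, so that
    # values such as "=false" are not reported as enabled.
    if grep -Eiq 'ENABLE_ADDGRP_KILL=(true|1)([^[:alnum:]]|$)'; then
        echo "enabled"
    else
        echo "not enabled"
    fi
}

# On a submit or admin host (assumes qconf is installed and on the PATH):
#   qconf -sconf | addgrp_kill_state
```

The function only reads stdin, so it can be tried on a saved copy of the
configuration before running it against a live qmaster.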

Also, I'm trying to specify wallclock time with h_rt instead of CPU time.
Is wallclock time also inherited with the group id?
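
A minimal sketch of such a request, based on the script header quoted
further down with the h_cpu limit swapped for h_rt.  It assumes h_rt takes
the same hh:mm:ss format as h_cpu; the one-hour value and the task_label
function are only illustrative.

```shell
#!/bin/sh
# Lines starting with "#$" are read by qsub as submit options.
# h_rt requests a hard wallclock limit instead of a CPU-time limit.
#$ -l h_rt=01:00:00
#$ -N ArrayJobsAlex
#$ -t 1-25:1
#$ -S /bin/bash

# SGE_TASK_ID is set by Grid Engine for array tasks; fall back to a
# placeholder so the script can also be dry-run outside the scheduler.
task_label() {
    echo "task ${SGE_TASK_ID:-undefined}"
}

task_label
```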

Thanks,

Alex




On 3/10/06, McCalla, Mac <macmccalla at hess.com> wrote:
>
> hi Alex,
>
>     Look for the ENABLE_ADDGRP_KILL=true (or 1) string in "qconf -sconf"
> output.  See the man page for sge_conf for lots more info.
>
> mac mccalla
>
>  ------------------------------
> *From:* Alex Shenfield [mailto:alex.shenfield at gmail.com]
> *Sent:* Friday, March 10, 2006 7:27 AM
>
> *To:* users at gridengine.sunsource.net
> *Subject:* Re: [GE users] Inaccurate reporting leading to abort of jobs
>
> Andy,
>
> sge_schedd -help tells me that it's GE 6.0u7, but I don't know how to find
> out whether ENABLE_ADDGRP_KILL is activated.
>
> How would I find this out?
>
> Thanks for your help,
>
> Alex
>
>
> On 3/10/06, Andy Schwierskott <andy.schwierskott at sun.com> wrote:
> >
> > Alex,
> >
> > probably a massively parallel job:-)
> >
> > What version are you running? Is it 6.0u7 with ENABLE_ADDGRP_KILL
> > activated?
> > See sge_conf(5) for more information.
> >
> > Most likely there is/was a running process from an old job in the
> > system and the additional group id was recycled. The new 6.0u7
> > ENABLE_ADDGRP_KILL parameter is a fix for this problem.
> >
> > Andy
> >
> > > Hi,
> > >
> > > I am running a set of simple Java programs as an array job.  The Java
> > > programs take seconds to complete, but one task from the array job
> > > often gets aborted.  The report that Sun Grid Engine mails to me
> > > looks something like this:
> > >
> > > Job-array task 549795.22 (ArrayJobsAlex) Aborted
> > > Exit Status      = 137
> > > Signal           = KILL
> > > User             = alex
> > > Queue            = short.q at comp17.iceberg.shef.ac.uk
> > > Host             = comp17.iceberg.shef.ac.uk
> > > Start Time       = 03/10/2006 11:30:38
> > > End Time         = 03/10/2006 11:30:39
> > > CPU              = 41:06:21:29
> > > Max vmem         = 1.470G
> > > failed assumedly after job because:
> > > job 549795.22 died through signal KILL (9)
> > >
> > > According to this error report the CPU usage is 41:06:21:29.  This
> > > cannot be correct, since the same email shows that the job was killed
> > > after 1 second.  My Grid Engine script has this as the header:
> > >
> > > #!/bin/sh
> > > #$ -l h_cpu=00:60:00
> > > #$ -N ArrayJobsAlex
> > > #$ -t 1-25:1
> > > #$ -M alex.shenfield at gmail.com
> > > #$ -m as
> > > #$ -e $HOME/$JOB_NAME.e
> > > #$ -o $HOME/$JOB_NAME.o
> > > #$ -S /bin/bash
> > >
> > > I have increased the h_cpu time to 1 hour (from 60 seconds) following a
> > > suggestion that the low value may cause SGE to check the CPU limit of my
> > > job prematurely (before the first accounting info is available), but
> > > this doesn't seem to have helped.
> > >
> > > Can anybody offer a solution or workaround for this problem?
> > >
> > > Thanks for your time,
> > >
> > > Alex
> > >
> >
> >
> >
>


