[GE users] Inaccurate reporting leading to abort of jobs

Alex Shenfield alex.shenfield at gmail.com
Fri Mar 10 13:26:38 GMT 2006


Andy,

sge_schedd -help tells me that it's GE 6.0u7, but I don't know how to find
out whether ENABLE_ADDGRP_KILL is activated.

How would I find this out?

Thanks for your help,

Alex
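
A minimal sketch of one way to check this, assuming ENABLE_ADDGRP_KILL is set
through execd_params in the cluster configuration, as sge_conf(5) describes,
and that qconf is available on an admin or submit host. The helper name
addgrp_kill_setting is hypothetical; it only filters whatever configuration
text is piped into it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): reads configuration text
# on stdin and prints any ENABLE_ADDGRP_KILL setting found in it.
addgrp_kill_setting() {
    grep -o 'ENABLE_ADDGRP_KILL[^ ,;]*' || echo "ENABLE_ADDGRP_KILL not set"
}

# Typical use (assumes qconf is on the PATH):
#   qconf -sconf global | addgrp_kill_setting
#   qconf -sconf comp17.iceberg.shef.ac.uk | addgrp_kill_setting
```

The second call looks at the host-local configuration of the execution host
named in the abort mail, in case it overrides the global one.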


On 3/10/06, Andy Schwierskott <andy.schwierskott at sun.com> wrote:
>
> Alex,
>
> probably a massively parallel job:-)
>
> What version are you running? Is it 6.0u7 with ENABLE_ADDGRP_KILL
> activated?
> See sge_conf(5) for more information.
>
> Most likely there is/was a running process from an old job in the system,
> and the additional group id became recycled. The new 6.0u7
> ENABLE_ADDGRP_KILL is a fix for this problem.
>
> Andy
>
> > Hi,
> >
> > I am running a set of simple Java programs as an array job.  The Java
> > programs take seconds to complete, but one task from the array job often
> > gets aborted.  The report that Sun Grid Engine mails to me looks
> > something like:
> >
> > Job-array task 549795.22 (ArrayJobsAlex) Aborted
> > Exit Status      = 137
> > Signal           = KILL
> > User             = alex
> > Queue            = short.q at comp17.iceberg.shef.ac.uk
> > Host             = comp17.iceberg.shef.ac.uk
> > Start Time       = 03/10/2006 11:30:38
> > End Time         = 03/10/2006 11:30:39
> > CPU              = 41:06:21:29
> > Max vmem         = 1.470G
> > failed assumedly after job because:
> > job 549795.22 died through signal KILL (9)
> >
> > According to this error report the CPU usage is 41:06:21:29.  This
> > cannot be correct, as you can also see from the email that the job was
> > killed after 1 second.  My grid engine script has this as the header:
> >
> > #!/bin/sh
> > #$ -l h_cpu=00:60:00
> > #$ -N ArrayJobsAlex
> > #$ -t 1-25:1
> > #$ -M alex.shenfield at gmail.com
> > #$ -m as
> > #$ -e $HOME/$JOB_NAME.e
> > #$ -o $HOME/$JOB_NAME.o
> > #$ -S /bin/bash
> >
> > I have increased the h_cpu time to 1 hour (from 60 seconds), following a
> > suggestion that the low value may cause SGE to check the CPU limit of my
> > job prematurely (before the first accounting info is available).
> > However, this doesn't seem to have worked.
> >
> > Can anybody offer a solution or workaround to this problem?
> >
> > Thanks for your time,
> >
> > Alex
> >
>

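The fields of the abort mail quoted above can be read with a little
arithmetic. The sketch below is illustrative only: the function names are
hypothetical, and it assumes that the four fields of the CPU value are
days:hours:minutes:seconds, which the thread does not confirm.

```shell
#!/bin/sh
# Hypothetical helpers for reading the fields of the abort mail.

# Exit statuses above 128 conventionally mean "killed by signal (status - 128)".
signal_from_status() {
    if [ "$1" -gt 128 ]; then
        echo $(( $1 - 128 ))
    else
        echo 0
    fi
}

# Convert a days:hours:minutes:seconds string to seconds.
# Assumption: the four fields of the CPU value are d:h:m:s.
cpu_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

signal_from_status 137          # prints 9 (SIGKILL)
cpu_to_seconds 41:06:21:29      # prints 3565289
```

Under that assumption the reported CPU figure is far above the h_cpu limit of
3600 seconds, even though the task ran for one second of wall-clock time.
Whether the limit check against this figure is what triggered the KILL is not
established in the thread.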