[GE users] Inaccurate reporting leading to abort of jobs

Andy Schwierskott andy.schwierskott at sun.com
Fri Mar 10 12:36:24 GMT 2006


Alex,

probably a massively parallel job:-)

What version are you running? Is it 6.0u7 with ENABLE_ADDGRP_KILL activated?
See sge_conf(5) for more information.

Most likely there is/was a running process from an old job on the system and
the additional group ID was recycled. The new ENABLE_ADDGRP_KILL parameter in
6.0u7 is a fix for this problem.
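
To check whether the parameter is already active, something like the sketch below may help. It assumes ENABLE_ADDGRP_KILL is set under execd_params as described in sge_conf(5), and that the configuration is dumped with `qconf -sconf`. The function name `addgrp_kill_enabled` is made up for illustration and is not part of Grid Engine.

```sh
#!/bin/sh
# Sketch: succeed if ENABLE_ADDGRP_KILL=true appears in the configuration
# text read from stdin (for example the output of `qconf -sconf`).
# addgrp_kill_enabled is a hypothetical helper name, not an SGE command.
addgrp_kill_enabled() {
    grep -Eiq 'ENABLE_ADDGRP_KILL=true'
}

# Usage on a host with the Grid Engine client tools installed:
#   qconf -sconf | addgrp_kill_enabled && echo "ENABLE_ADDGRP_KILL is on"
```

If the check fails, the parameter can be added to the execd_params line by editing the configuration with `qconf -mconf`.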

Andy

> Hi,
>
> I am running a set of simple java programs as an array job.  The java
> programs take seconds to complete, but one task from the array job often
> gets aborted.  The reporting information that I get mailed to me from Sun
> Grid Engine is something like:
>
> Job-array task 549795.22 (ArrayJobsAlex) Aborted
> Exit Status      = 137
> Signal           = KILL
> User             = alex
> Queue            = short.q at comp17.iceberg.shef.ac.uk
> Host             = comp17.iceberg.shef.ac.uk
> Start Time       = 03/10/2006 11:30:38
> End Time         = 03/10/2006 11:30:39
> CPU              = 41:06:21:29
> Max vmem         = 1.470G
> failed assumedly after job because:
> job 549795.22 died through signal KILL (9)
>
> According to this error report the CPU usage is 41:06:21:29.  This cannot be
> correct, since the same email shows that the job was killed after 1
> second.  My grid engine script has this as the header:
>
> #!/bin/sh
> #$ -l h_cpu=00:60:00
> #$ -N ArrayJobsAlex
> #$ -t 1-25:1
> #$ -M alex.shenfield at gmail.com
> #$ -m as
> #$ -e $HOME/$JOB_NAME.e
> #$ -o $HOME/$JOB_NAME.o
> #$ -S /bin/bash
>
> I have increased the h_cpu time to 1 hour (from 60 seconds) following a
> suggestion that the low value may cause SGE to check the CPU limit of my job
> prematurely (before the first accounting info is available); however, this
> doesn't seem to have succeeded.
>
> Can anybody offer a solution or work around to this problem?
>
> Thanks for your time,
>
> Alex
>
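
For scale, the reported CPU figure can be compared with the h_cpu request from the script header. The sketch below assumes the mail's CPU field is formatted as days:hours:minutes:seconds and that h_cpu is hours:minutes:seconds. `to_seconds` is a hypothetical helper written for this example.

```sh
#!/bin/sh
# Sketch: convert [[[d:]h:]m:]s into seconds so two time values can be compared.
# to_seconds is a hypothetical helper; the d:h:m:s layout is an assumption.
to_seconds() {
    echo "$1" | awk -F: '{
        n = NF
        s = $n
        m = (n >= 2) ? $(n-1) : 0
        h = (n >= 3) ? $(n-2) : 0
        d = (n >= 4) ? $(n-3) : 0
        print ((d * 24 + h) * 60 + m) * 60 + s
    }'
}

cpu=$(to_seconds "41:06:21:29")    # CPU value from the abort mail
limit=$(to_seconds "00:60:00")     # h_cpu value from the job script

if [ "$cpu" -gt "$limit" ]; then
    echo "reported CPU ($cpu s) exceeds h_cpu ($limit s)"
fi
```

Under those assumptions the reported usage is 3565289 seconds against a limit of 3600 seconds. That would be consistent with a KILL from limit enforcement if usage from a stale process were being counted against the new task, which is one reading of the reply above.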
