[GE users] Inaccurate reporting leading to abort of jobs

Alex Shenfield alex.shenfield at gmail.com
Fri Mar 10 12:31:47 GMT 2006



Hi,

I am running a set of simple Java programs as an array job.  The Java
programs take seconds to complete, but one task from the array job often
gets aborted.  The report that Sun Grid Engine mails to me looks something
like this:

Job-array task 549795.22 (ArrayJobsAlex) Aborted
Exit Status      = 137
Signal           = KILL
User             = alex
Queue            = short.q at comp17.iceberg.shef.ac.uk
Host             = comp17.iceberg.shef.ac.uk
Start Time       = 03/10/2006 11:30:38
End Time         = 03/10/2006 11:30:39
CPU              = 41:06:21:29
Max vmem         = 1.470G
failed assumedly after job because:
job 549795.22 died through signal KILL (9)

According to this error report the CPU usage is 41:06:21:29.  This cannot be
correct: the start and end times in the same email show that the task was
killed after 1 second.  My Grid Engine script has this as the header:

#!/bin/sh
#$ -l h_cpu=00:60:00
#$ -N ArrayJobsAlex
#$ -t 1-25:1
#$ -M alex.shenfield at gmail.com
#$ -m as
#$ -e $HOME/$JOB_NAME.e
#$ -o $HOME/$JOB_NAME.o
#$ -S /bin/bash

I have increased the h_cpu time to 1 hour (from 60 seconds), following a
suggestion that the low value may cause SGE to check the CPU limit of my job
prematurely (before the first accounting info is available).  However, this
doesn't seem to have helped.
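
To put the numbers from the report side by side, here is a minimal sketch
that converts the time strings to seconds.  It assumes the CPU field is
formatted as dd:hh:mm:ss, which is only a guess from how the report looks;
to_seconds is a helper written for this illustration, not part of SGE.

```shell
#!/bin/bash
# Convert a colon-separated time string to seconds.
# Assumption: fields are [dd:]hh:mm:ss, read from right to left.
to_seconds() {
    local IFS=:
    local -a parts
    local -a weights=(1 60 3600 86400)   # seconds, minutes, hours, days
    read -r -a parts <<< "$1"
    local n=${#parts[@]} total=0 i field
    for (( i = 0; i < n; i++ )); do
        field=${parts[n-1-i]}
        # 10# forces base 10 so fields such as "08" are not read as octal
        total=$(( total + 10#$field * weights[i] ))
    done
    echo "$total"
}

reported_cpu=$(to_seconds "41:06:21:29")   # CPU line from the report
cpu_limit=$(to_seconds "00:60:00")         # h_cpu from the script header
wall_clock=$(( $(to_seconds "11:30:39") - $(to_seconds "11:30:38") ))

echo "reported CPU: ${reported_cpu}s"
echo "h_cpu limit:  ${cpu_limit}s"
echo "wall clock:   ${wall_clock}s"
```

Under that assumption the reported CPU time comes to 3565289 seconds,
against a limit of 3600 seconds and an elapsed time of 1 second, so the
reported figure would exceed even the raised limit.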

Can anybody offer a solution or workaround for this problem?

Thanks for your time,

Alex


