[GE users] Inaccurate reporting leading to abort of jobs

McCalla, Mac macmccalla at hess.com
Fri Mar 10 14:49:45 GMT 2006


Hi Alex,
 
    Look for the string ENABLE_ADDGRP_KILL=true (or =1) in the output of
"qconf -sconf".  See the sge_conf(5) man page for much more information.
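
    A minimal sketch of that check, assuming a POSIX shell and grep. The
helper name addgrp_kill_enabled is made up for illustration and is not part
of Grid Engine; it only scans whatever text it is given on stdin. As far as
I know the parameter is listed under execd_params, but the sketch does not
depend on that.

```shell
#!/bin/sh
# Sketch: report whether ENABLE_ADDGRP_KILL is switched on in a cluster
# configuration dump. Reads "qconf -sconf" style text on stdin.
# addgrp_kill_enabled is a hypothetical helper name, not a Grid Engine command.
addgrp_kill_enabled() {
    # Match ENABLE_ADDGRP_KILL=true or ENABLE_ADDGRP_KILL=1 (case-insensitive),
    # followed by a non-alphanumeric character or the end of the line.
    grep -Eiq 'ENABLE_ADDGRP_KILL=(true|1)([^[:alnum:]]|$)'
}

# Typical use on a host where qconf is available:
#   qconf -sconf | addgrp_kill_enabled && echo "enabled" || echo "not enabled"
```

    The function returns success only when the setting is present with a
value of true or 1, so a missing parameter and ENABLE_ADDGRP_KILL=false both
count as "not enabled".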
 
mac mccalla 

  _____  

From: Alex Shenfield [mailto:alex.shenfield at gmail.com] 
Sent: Friday, March 10, 2006 7:27 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Inaccurate reporting leading to abort of jobs


Andy,

sge_schedd -help tells me that it's GE 6.0u7, but I don't know how to find
out whether ENABLE_ADDGRP_KILL is activated.

How would I find this out?

Thanks for your help,

Alex



On 3/10/06, Andy Schwierskott <andy.schwierskott at sun.com> wrote: 

	Alex,
	
	probably a massively parallel job:-)
	
	What version are you running? Is it 6.0u7 with ENABLE_ADDGRP_KILL
	activated? See sge_conf(5) for more information.
	
	Most likely there is/was a running process from an old job on the
	system and the additional group id was recycled. The new 6.0u7
	ENABLE_ADDGRP_KILL is a fix for this problem.
	
	Andy
	
	> Hi,
	>
	> I am running a set of simple Java programs as an array job. The Java
	> programs take seconds to complete, but one task from the array job
	> often gets aborted. The report that Sun Grid Engine mails to me looks
	> something like this:
	>
	> Job-array task 549795.22 (ArrayJobsAlex) Aborted
	> Exit Status      = 137
	> Signal           = KILL
	> User             = alex
	> Queue            = short.q@comp17.iceberg.shef.ac.uk
	> Host             = comp17.iceberg.shef.ac.uk
	> Start Time       = 03/10/2006 11:30:38
	> End Time         = 03/10/2006 11:30:39
	> CPU              = 41:06:21:29 
	> Max vmem         = 1.470G
	> failed assumedly after job because:
	> job 549795.22 died through signal KILL (9)
	>
	> According to this error report the CPU usage is 41:06:21:29. This
	> cannot be correct, since the same email shows that the job was killed
	> after 1 second. My Grid Engine script has this header:
	>
	> #!/bin/sh
	> #$ -l h_cpu=00:60:00
	> #$ -N ArrayJobsAlex 
	> #$ -t 1-25:1
	> #$ -M alex.shenfield at gmail.com
	> #$ -m as
	> #$ -e $HOME/$JOB_NAME.e
	> #$ -o $HOME/$JOB_NAME.o
	> #$ -S /bin/bash
	>
	> I have increased the h_cpu limit to 1 hour (from 60 seconds) following
	> a suggestion that the low value may cause SGE to check the CPU limit
	> of my job prematurely (before the first accounting info is available),
	> but this doesn't seem to have worked.
	>
	> Can anybody offer a solution or workaround to this problem?
	>
	> Thanks for your time,
	>
	> Alex
	>
	
	