[GE users] Inaccurate reporting leading to abort of jobs

Reuti reuti at staff.uni-marburg.de
Tue Mar 14 08:03:33 GMT 2006


Alex,

Quoting Alex Shenfield <alex.shenfield at gmail.com>:

> Hi,
>
> I have been told by the administrators for our sun grid engine machines that
> there is a potential issue affecting parallel jobs that may be caused by
> implementing the ENABLE_ADDGRP_KILL parameter.  Can anybody shed any light
> on this?

Some startup methods start daemons to build up the parallel environment
(LAM/MPI, PVM, ...). In this case the qrsh that started them finishes and
leaves the machine, and the daemons it started might also be killed.

But I think the main problem is LAM/MPI with its two-stage startup, for others
it might work.
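
Whether the parameter is set at all can be checked from the cluster
configuration, as the hint further down in this thread suggests. A minimal
sketch, assuming the setting appears in the "qconf -sconf" output (usually on
the execd_params line) as ENABLE_ADDGRP_KILL=true or ENABLE_ADDGRP_KILL=1; the
function name addgrp_kill_enabled is only illustrative:

```bash
#!/bin/bash
# Succeeds (exit status 0) if the cluster configuration read from stdin
# switches ENABLE_ADDGRP_KILL on. Both "true" and "1" are accepted, and the
# match is case-insensitive. The trailing group rejects values such as "10".
addgrp_kill_enabled() {
    grep -Eiq 'ENABLE_ADDGRP_KILL=(true|1)([^[:alnum:]]|$)'
}

# Usage on a host where qconf is in PATH:
#   qconf -sconf | addgrp_kill_enabled && echo "enabled" || echo "not enabled"
```

The function only reads stdin, so it can also be tried against a saved copy of
the configuration before running it on the cluster.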

>
> Also, I'm trying to specify wallclock time with h_rt instead of cpu time.
> Is wallclock time also inherited with the group id?

Are you seeing processes that are no longer bound to the shepherd?

Try: ps -e f
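
The tree view has to be read by eye. As a rough complement, the sketch below
lists one user's processes whose parent is PID 1, on the assumption that a
process cut loose from its shepherd has been re-parented to init. That is only
a hint: a daemon started on purpose looks the same. The function name
orphans_of_user is made up for this example.

```bash
#!/bin/bash
# Print the lines for processes owned by user $1 whose parent PID is 1.
# Input on stdin: one process per line as "pid ppid user command...",
# for example from:  ps -e -o pid= -o ppid= -o user= -o args=
orphans_of_user() {
    awk -v user="$1" '$2 == 1 && $3 == user'
}

# Usage on an execution host:
#   ps -e -o pid= -o ppid= -o user= -o args= | orphans_of_user alex
```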

Cheers - Reuti

>
> Thanks,
>
> Alex
>
> On 3/10/06, McCalla, Mac <macmccalla at hess.com> wrote:
>>
>> hi Alex,
>>
>>     Look for the ENABLE_ADDGRP_KILL=true (or 1) string in "qconf -sconf"
>> output.  See the man page for sge_conf for lots more info.
>>
>> mac mccalla
>>
>>  ------------------------------
>> *From:* Alex Shenfield [mailto:alex.shenfield at gmail.com]
>> *Sent:* Friday, March 10, 2006 7:27 AM
>>
>> *To:* users at gridengine.sunsource.net
>> *Subject:* Re: [GE users] Inaccurate reporting leading to abort of jobs
>>
>> Andy,
>>
>> sge_schedd -help tells me that it's GE 6.0u7, but I don't know how to find
>> out whether ENABLE_ADDGRP_KILL is activated.
>>
>> How would I find this out?
>>
>> Thanks for your help,
>>
>> Alex
>>
>>
>> On 3/10/06, Andy Schwierskott <andy.schwierskott at sun.com> wrote:
>> >
>> > Alex,
>> >
>> > probably a massively parallel job:-)
>> >
>> > What version are you running? Is it 6.0u7 with ENABLE_ADDGRP_KILL
>> > activated?
>> > See sge_conf(5) for more information.
>> >
>> > Most likely there is/was a running process from an old job in the system,
>> > and the additional group id was recycled. The new 6.0u7
>> > ENABLE_ADDGRP_KILL is a fix for this problem.
>> >
>> > Andy
>> >
>> > > Hi,
>> > >
>> > > I am running a set of simple Java programs as an array job.  The Java
>> > > programs take seconds to complete, but one task from the array job often
>> > > gets aborted.  The reporting information that I get mailed to me from
>> > > Sun Grid Engine is something like:
>> > >
>> > > Job-array task 549795.22 (ArrayJobsAlex) Aborted
>> > > Exit Status      = 137
>> > > Signal           = KILL
>> > > User             = alex
>> > > Queue            = short.q at comp17.iceberg.shef.ac.uk
>> > > Host             = comp17.iceberg.shef.ac.uk
>> > > Start Time       = 03/10/2006 11:30:38
>> > > End Time         = 03/10/2006 11:30:39
>> > > CPU              = 41:06:21:29
>> > > Max vmem         = 1.470G
>> > > failed assumedly after job because:
>> > > job 549795.22 died through signal KILL (9)
>> > >
>> > > According to this error report the CPU usage is 41:06:21:29.  This cannot
>> > > be correct, as you can also see from the email that the job was killed
>> > > after 1 second.  My grid engine script has this as the header:
>> > >
>> > > #!/bin/sh
>> > > #$ -l h_cpu=00:60:00
>> > > #$ -N ArrayJobsAlex
>> > > #$ -t 1-25:1
>> > > #$ -M alex.shenfield at gmail.com
>> > > #$ -m as
>> > > #$ -e $HOME/$JOB_NAME.e
>> > > #$ -o $HOME/$JOB_NAME.o
>> > > #$ -S /bin/bash
>> > >
>> > > I have increased the h_cpu time to 1 hour (from 60 seconds) following a
>> > > suggestion that the low value may cause sge to check the cpu limit of my
>> > > job prematurely (before the first accounting info is available), however
>> > > this doesn't seem to have succeeded.
>> > >
>> > > Can anybody offer a solution or work around to this problem?
>> > >
>> > > Thanks for your time,
>> > >
>> > > Alex
>> > >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> > For additional commands, e-mail: users-help at gridengine.sunsource.net
>> >
>> >
>>
>


