[GE users] qdel with array task dependencies problem

dangruhn Dan.Gruhn at groupw.com
Thu Apr 15 14:10:43 BST 2010



I also tried qalter -hu tempjob-second with no change.

I have filed this as issue 3263 <http://gridengine.sunsource.net/issues/show_bug.cgi?id=3263>.

Dan

On 04/15/2010 07:47 AM, reuti wrote:

On 14.04.2010 at 20:08, dangruhn wrote:



Yes, the odd thing is that the "second" tasks whose dependent "first" task has already started running are deleted, along with ALL of the "first" tasks. Then the remaining "second" tasks are allowed to run, only to be killed after about 30 seconds.



I suggest filing a bug. I think the bug is not in the array task dependency per se, but in the behavior of the SGE commands. Whether an array task is waiting or executing, it should be deleted.

I tried a workaround:

qalter -h u tempjob-second

before the qdel. Unfortunately it does not work either: no hold is applied to these jobs. It looks like jobs with a -hold_jid_ad are somehow neglected by SGE commands in different ways.
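
For reference, the hold-then-delete sequence described above can be written down as a small script. This is only a sketch of what was tried, not a fix (it did not help here). The function name hold_then_delete and the RUN switch are made up for illustration; by default the script only prints the commands instead of running them.

```shell
#!/bin/bash
# Sketch of the attempted workaround: put a user hold on the dependent
# array job, then delete both jobs.
# RUN is a hypothetical switch: unset -> print the commands (dry run),
# set to the empty string -> actually execute them.
RUN=${RUN-echo}

hold_then_delete() {
    local user=$1 dependent=$2 pattern=$3
    # User hold on the job that carries the -hold_jid_ad dependency
    $RUN qalter -h u "$dependent"
    # Same qdel call as in the "trial" script
    $RUN qdel -u "$user" "$pattern"
}

hold_then_delete dgruhn tempjob-second 'tempjob-*'
```

Saved as, say, hold_then_delete.sh, it could be run for real with "RUN= ./hold_then_delete.sh".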

-- Reuti




Dan

On 04/14/2010 12:24 PM, reuti wrote:


Hi,

so, in short: the problem is that some tasks start to run which are supposed to be killed already - right?

-- Reuti


On 13.04.2010 at 20:17, dangruhn wrote:





Greetings,

I have come across what appears to be an error in SGE 6.2u5 on amd64 Linux Fedora 9 having to do with -hold_jid_ad array dependencies and qdel.  In trying to simplify a reproduction, I have come down to this:

1) I have a queue which can run at most 3 jobs
2) I submit an array job with 9 tasks, and then a second array job with 9 tasks that has an array dependency on the first. The script "trial" does this as follows (I am user dgruhn):

#!/bin/bash

echo "Queueing out jobs"
qsub -q test.q -N tempjob-first -t 1-9 trial.job first
qsub -q test.q -hold_jid_ad tempjob-first -N tempjob-second -t 1-9 trial.job second

sleep 20

echo "Deleting jobs"
qdel -u dgruhn 'tempjob-*'

The script "trial.job" (below) echoes the time and its task number, waits 40 seconds, echoes them again and then exits.

#!/bin/bash

# If SGE was not given a task number, default to task 1
: ${SGE_TASK_ID=1}
if [ "$SGE_TASK_ID" = "undefined" ]
then
    SGE_TASK_ID=1
fi

echo "`date` I am starting $1 task - $SGE_TASK_ID"
sleep 40
echo "`date` I am ending $1 task - $SGE_TASK_ID"

This is what happens:

1) The tasks 1-3 of tempjob-first start to run.

2a) About 20 seconds later, tasks 1-3 of tempjob-first are killed and aborted (I get an email notification):

Job-array task 205030.3 (tempjob-first) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 13:59:22
 End Time         = 04/13/2010 13:59:40
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205030.3 died through signal KILL (9)

2b) Also at that time, the remaining tasks 4-9 of tempjob-first and tasks 1-3 of tempjob-second are deleted from the pending area.

2c) Additionally, tasks 4-6 of tempjob-second start to run.

3) About 30 seconds later, tasks 4-6 of tempjob-second are aborted, and tasks 7-9 of tempjob-second start to run.

Job-array task 205031.4 (tempjob-second) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 13:59:42
 End Time         = 04/13/2010 14:00:13
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205031.4 died through signal KILL (9)


4) In approximately another 30 seconds, tasks 7-9 of tempjob-second are aborted.

Job-array task 205031.7 (tempjob-second) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 14:00:15
 End Time         = 04/13/2010 14:00:46
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205031.7 died through signal KILL (9)
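
As a quick check on the runtimes, the Start Time and End Time fields of the three notifications above can be subtracted from each other. A minimal sketch, assuming GNU date is available; the helper name elapsed is made up for illustration. It gives 18 seconds for the tempjob-first task and 31 seconds for each of the two tempjob-second tasks.

```shell
#!/bin/bash
# Seconds between two "MM/DD/YYYY HH:MM:SS" stamps as printed in the
# SGE abort mails. Requires GNU date; -u keeps DST out of the arithmetic.
elapsed() {
    local start end
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    echo $(( end - start ))
}

elapsed "04/13/2010 13:59:22" "04/13/2010 13:59:40"   # 205030.3 (tempjob-first)
elapsed "04/13/2010 13:59:42" "04/13/2010 14:00:13"   # 205031.4 (tempjob-second)
elapsed "04/13/2010 14:00:15" "04/13/2010 14:00:46"   # 205031.7 (tempjob-second)
```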


This is the output I get on my terminal when running "trial":

Queueing out jobs
Your job-array 205030.1-9:1 ("tempjob-first") has been submitted
Your job-array 205031.1-9:1 ("tempjob-second") has been submitted
Deleting jobs
dgruhn has registered the job-array task 205030.1 for deletion
dgruhn has registered the job-array task 205030.2 for deletion
dgruhn has registered the job-array task 205030.3 for deletion
dgruhn has deleted job 205030
dgruhn has deleted job 205031

This is the output left in my login directory *.o* files (nothing in the *.e* files):

Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 1
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 2
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 3
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 4
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 5
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 6
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 7
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 8
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 9


I have the scheduler interval set to 10 seconds and my notify time on the test.q is 20 seconds.  I have varied both of these and neither seems to change this odd 31 second runtime. I have also tried catching the USR1 and USR2 signals in the "trial.job" script, but that doesn't make a difference either.
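
For illustration, catching the signals in "trial.job" could look roughly like the sketch below. It is not the exact script that was used: run_task and TRIAL_SLEEP are made-up names, and the JOB_ID check is only there so the function can be tried outside of SGE. As far as I understand the qsub man page, the USR1/USR2 warnings are only sent when the job is submitted with -notify, so that is worth double-checking when repeating the test.

```shell
#!/bin/bash
# Hypothetical variant of trial.job that logs SIGUSR1/SIGUSR2.
# run_task and TRIAL_SLEEP are illustrative names, not part of the
# original script.

run_task() {
    local label=${1:-first} pid
    # If SGE was not given a task number, default to task 1
    : "${SGE_TASK_ID:=1}"
    [ "$SGE_TASK_ID" = "undefined" ] && SGE_TASK_ID=1

    trap 'echo "$(date) $label task - $SGE_TASK_ID caught USR1"' USR1
    trap 'echo "$(date) $label task - $SGE_TASK_ID caught USR2"' USR2

    echo "$(date) I am starting $label task - $SGE_TASK_ID"
    # Sleep in the background: bash only runs a trap once the current
    # foreground command has finished, but "wait" is interruptible.
    sleep "${TRIAL_SLEEP:-40}" &
    pid=$!
    # wait returns early when a trapped signal arrives, so keep
    # waiting until the sleep process is really gone.
    while kill -0 "$pid" 2>/dev/null; do
        wait "$pid"
    done
    echo "$(date) I am ending $label task - $SGE_TASK_ID"
}

# Run automatically only inside an SGE job, where JOB_ID is set.
if [ -n "${JOB_ID:-}" ]; then
    run_task "$@"
fi
```

A SIGKILL cannot be caught, so with this variant the most one would expect to see in the *.o* file is a "caught USR2" line shortly before the task is aborted.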

In my real-life situation, the second set of tasks has problems when it runs, because those tasks expect to find the output of the first set of tasks.

Should not tasks 4-9 of tempjob-second just be deleted? Why are they all running for about 31 seconds and then being aborted?

Does anyone have any ideas?

Dan




------------------------------------------------------

http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253399


To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].





------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253513
