[GE users] qdel with array task dependencies problem

dangruhn Dan.Gruhn at groupw.com
Wed Apr 14 19:01:21 BST 2010


Right.

On 04/14/2010 12:24 PM, reuti wrote:

Hi,

so, in short: the problem is that some tasks start to run which are supposed to be killed already - right?

-- Reuti


On 13.04.2010 at 20:17, dangruhn wrote:



Greetings,

I have come across what appears to be an error in SGE 6.2u5 on amd64 Linux Fedora 9 having to do with -hold_jid_ad array dependencies and qdel.  In trying to simplify a reproduction, I have come down to this:

1) I have a queue which can run at most 3 jobs
2) I submit an array job with 9 tasks and then a second array job with 9 tasks that has an array dependency on the first. The script "trial" does this as follows (I am user dgruhn):

#!/bin/bash

echo "Queueing out jobs"
qsub -q test.q -N tempjob-first -t 1-9 trial.job first
qsub -q test.q -hold_jid_ad tempjob-first -N tempjob-second -t 1-9 trial.job second

sleep 20

echo "Deleting jobs"
qdel -u dgruhn 'tempjob-*'

The script "trial.job" (below) echoes the time and its task number, waits 40 seconds, echoes them again and then exits.

#!/bin/bash

# If SGE did not supply a task number, default to 1
: ${SGE_TASK_ID=1}
if [ "$SGE_TASK_ID" = "undefined" ]
then
    SGE_TASK_ID=1
fi

echo "`date` I am starting $1 task - $SGE_TASK_ID"
sleep 40
echo "`date` I am ending $1 task - $SGE_TASK_ID"

This is what happens:

1) The tasks 1-3 of tempjob-first start to run.

2a) About 20 seconds later, tasks 1-3 of tempjob-first are killed and aborted (I get email notification):

Job-array task 205030.3 (tempjob-first) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 13:59:22
 End Time         = 04/13/2010 13:59:40
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205030.3 died through signal KILL (9)

2b) Also at that time, the remaining tasks 4-9 of tempjob-first and tasks 1-3 of tempjob-second are deleted from the pending area.

2c) Additionally, tasks 4-6 of tempjob-second start to run.

3) About 30 seconds later, tasks 4-6 of tempjob-second are aborted, and tasks 7-9 of tempjob-second start to run.

Job-array task 205031.4 (tempjob-second) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 13:59:42
 End Time         = 04/13/2010 14:00:13
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205031.4 died through signal KILL (9)


4) In approximately another 30 seconds, tasks 7-9 of tempjob-second are aborted.

Job-array task 205031.7 (tempjob-second) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 14:00:15
 End Time         = 04/13/2010 14:00:46
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205031.7 died through signal KILL (9)
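
The numbers in the three abort mails above can be cross-checked with a few lines of shell. This is only a sketch: the helper names to_seconds and elapsed are made up for illustration, and it assumes that a start and end time always fall on the same day. Exit status 137 follows the usual shell convention of 128 plus the signal number, which agrees with the KILL (9) reported in the mails.

```shell
#!/bin/bash

# Convert HH:MM:SS to seconds since midnight.
# The function body runs in a subshell, so the IFS change stays local.
to_seconds() (
    IFS=:
    set -- $1    # split the argument on ':' into $1 $2 $3
    # Strip one leading zero so "08" or "09" is not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
)

# Seconds between a start time and an end time on the same day.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 13:59:22 13:59:40    # tempjob-first  task 3 -> 18
elapsed 13:59:42 14:00:13    # tempjob-second task 4 -> 31
elapsed 14:00:15 14:00:46    # tempjob-second task 7 -> 31

# Exit status 137 = 128 + signal number.
echo $(( 137 - 128 ))        # -> 9
```

So the first set of tasks ran for 18 seconds (matching the "sleep 20" before qdel), while both batches of the second set ran for 31 seconds each.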


This is the output I get on my terminal when running "trial":

Queueing out jobs
Your job-array 205030.1-9:1 ("tempjob-first") has been submitted
Your job-array 205031.1-9:1 ("tempjob-second") has been submitted
Deleting jobs
dgruhn has registered the job-array task 205030.1 for deletion
dgruhn has registered the job-array task 205030.2 for deletion
dgruhn has registered the job-array task 205030.3 for deletion
dgruhn has deleted job 205030
dgruhn has deleted job 205031

This is the output left in my login directory *.o* files (nothing in the *.e* files):

Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 1
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 2
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 3
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 4
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 5
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 6
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 7
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 8
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 9


I have the scheduler interval set to 10 seconds and the notify time on test.q set to 20 seconds.  I have varied both of these, and neither seems to change this odd 31-second runtime. I have also tried catching the USR1 and USR2 signals in the "trial.job" script, but that doesn't make a difference either.
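
For reference, catching the warning signals in a script like "trial.job" could look roughly like the sketch below. It is not the script that was actually run: the names on_notify, last_signal and run_task are hypothetical. As far as I read the qsub man page, SGE only sends SIGUSR1 (before SIGSTOP) and SIGUSR2 (before SIGKILL) to jobs submitted with -notify, so the qsub lines would need that option as well.

```shell
#!/bin/bash

last_signal=""

# Hypothetical handler: record and report which warning signal arrived.
on_notify() {
    last_signal=$1
    echo "$(date) caught SIG$1 in task - ${SGE_TASK_ID:-1}"
}

trap 'on_notify USR1' USR1
trap 'on_notify USR2' USR2

# Body of the job. Sleeping in the background and waiting on it lets
# bash run the trap when the signal arrives; with a plain foreground
# "sleep", the trap would only run after sleep had finished.
run_task() {
    echo "$(date) I am starting $1 task - ${SGE_TASK_ID:-1}"
    sleep "${2:-40}" &
    wait $!
    echo "$(date) I am ending $1 task - ${SGE_TASK_ID:-1}"
}

# In the real job script the last line would be:
#   run_task "$1"
```

With such handlers in place, any USR1 or USR2 that is delivered shows up in the *.o* file, which makes it easier to tell whether the 31 seconds are related to the notify time at all.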

In my real-life situation, the second set of tasks runs into problems because they expect to find the output of the first set of tasks.
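
A defensive workaround on the job side, independent of what the scheduler does, would be for each dependent task to check that its input exists before doing any work. The sketch below assumes a hypothetical layout in which task N of the first job leaves a file "first.N.out" in a directory named by RESULT_DIR; neither the variable nor the file name comes from the scripts above, and this does not address the qdel behaviour itself.

```shell
#!/bin/bash

# Hypothetical layout: task N of tempjob-first writes
# "$RESULT_DIR/first.N.out" (current directory if RESULT_DIR is unset).

# Succeeds only if the matching first-stage output exists and is non-empty.
first_output_ready() {
    [ -s "${RESULT_DIR:-.}/first.$1.out" ]
}

# Returns 0 if the task may proceed, otherwise complains on stderr
# and returns 1.
guard_task() {
    if first_output_ready "$1"; then
        return 0
    fi
    echo "task $1: output of tempjob-first task $1 is missing, not starting" >&2
    return 1
}

# At the top of the second job's script one could then write:
#   guard_task "$SGE_TASK_ID" || exit 1
```

A task that gets started even though its predecessor was killed then exits right away instead of failing somewhere in the middle of its work.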

Shouldn't tasks 4-9 of tempjob-second just be deleted? Why are they all running for about 31 seconds and then being aborted?

Does anyone have any ideas?

Dan



------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253399
