[GE users] qdel with array task dependencies problem

reuti reuti at staff.uni-marburg.de
Wed Apr 14 17:24:55 BST 2010


Hi,

so, in short: the problem is that some tasks start to run even though they should already have been deleted - right?

-- Reuti


On 13.04.2010 at 20:17, dangruhn wrote:

> Greetings,
> 
> I have come across what appears to be an error in SGE 6.2u5 on amd64 Linux Fedora 9 having to do with -hold_jid_ad array dependencies and qdel.  In trying to simplify a reproduction, I have come down to this:
> 
> 1) I have a queue which can run at most 3 jobs
> 2) I submit an array job with 9 tasks and then another array job with 9 tasks that has an array dependency on the first 9. The script "trial" does this as follows (I am user dgruhn):
> 
> #!/bin/bash
> 
> echo "Queueing out jobs"
> qsub -q test.q -N tempjob-first -t 1-9 trial.job first
> qsub -q test.q -hold_jid_ad tempjob-first -N tempjob-second -t 1-9 trial.job second
> 
> sleep 20
> 
> echo "Deleting jobs"
> qdel -u dgruhn 'tempjob-*'
> 
> The script "trial.job" (below) echoes the time and its task number, waits 40 seconds, does it again and then exits.
> 
> #!/bin/bash
> 
> # If SGE was not given a rep number
> : ${SGE_TASK_ID=1}
> if [ "$SGE_TASK_ID" = "undefined" ]
> then
>     SGE_TASK_ID=1
> fi
> 
> echo "`date` I am starting $1 task - $SGE_TASK_ID"
> sleep 40
> echo "`date` I am ending $1 task - $SGE_TASK_ID"
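
A side note on the idiom at the top of trial.job, for anyone reading along: `: ${SGE_TASK_ID=1}` assigns 1 only when the variable is unset, and the `if` covers the literal string "undefined" that SGE sets for non-array jobs. Below is a minimal sketch that can be tried outside the cluster; the helper name normalize_task_id is made up for illustration and is not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not part of trial.job): prints the task number the
# script would end up using for a given SGE_TASK_ID state.
normalize_task_id() {
    # Subshell, so the caller's environment is left untouched.
    (
        if [ "$#" -eq 0 ]; then
            unset SGE_TASK_ID            # script started outside SGE
        else
            SGE_TASK_ID=$1
        fi
        : ${SGE_TASK_ID=1}               # default applies only when unset
        if [ "$SGE_TASK_ID" = "undefined" ]; then
            SGE_TASK_ID=1                # non-array job under SGE
        fi
        echo "$SGE_TASK_ID"
    )
}

normalize_task_id             # variable unset
normalize_task_id undefined   # non-array job
normalize_task_id 7           # array task 7
```

Both fallback cases end up as task 1, so the script behaves the same whether it is run by hand, as a plain job, or as a one-task array.
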
> 
> This is what happens:
> 
> 1) The tasks 1-3 of tempjob-first start to run.  
> 
> 2a) About 20 seconds later tasks 1-3 of tempjob-first are killed and aborted (I get email notification):
> 
> Job-array task 205030.3 (tempjob-first) Aborted
>  Exit Status      = 137
>  Signal           = KILL
>  User             = dgruhn
>  Queue            = test.q at dgruhn-f9.group-w-inc.com
>  Host             = dgruhn-f9.group-w-inc.com
>  Start Time       = 04/13/2010 13:59:22
>  End Time         = 04/13/2010 13:59:40
>  CPU              = 00:00:00
>  Max vmem         = 168.094M
> failed assumedly after job because:
> job 205030.3 died through signal KILL (9)
> 
> 2b) Also at that time, the remaining tasks 4-9 of tempjob-first and tasks 1-3 of tempjob-second are deleted from the pending area.
> 
> 2c) Additionally, tasks 4-6 of tempjob-second start to run.
> 
> 3) About 30 seconds later, tasks 4-6 of tempjob-second are aborted, and tasks 7-9 of tempjob-second start to run.
>  
> Job-array task 205031.4 (tempjob-second) Aborted
>  Exit Status      = 137
>  Signal           = KILL
>  User             = dgruhn
>  Queue            = test.q at dgruhn-f9.group-w-inc.com
>  Host             = dgruhn-f9.group-w-inc.com
>  Start Time       = 04/13/2010 13:59:42
>  End Time         = 04/13/2010 14:00:13
>  CPU              = 00:00:00
>  Max vmem         = 168.094M
> failed assumedly after job because:
> job 205031.4 died through signal KILL (9)
> 
> 
> 4) In approximately another 30 seconds, tasks 7-9 of tempjob-second are aborted.
>  
> Job-array task 205031.7 (tempjob-second) Aborted
>  Exit Status      = 137
>  Signal           = KILL
>  User             = dgruhn
>  Queue            = test.q at dgruhn-f9.group-w-inc.com
>  Host             = dgruhn-f9.group-w-inc.com
>  Start Time       = 04/13/2010 14:00:15
>  End Time         = 04/13/2010 14:00:46
>  CPU              = 00:00:00
>  Max vmem         = 168.094M
> failed assumedly after job because:
> job 205031.7 died through signal KILL (9)
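
The run times can be checked from the Start Time and End Time fields of the three notification mails. This is only a small arithmetic sketch; the helper names to_secs and runtime are invented for illustration.

```shell
#!/bin/bash
# Hypothetical helpers for checking the run times reported in the mails.
to_secs() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so that "08" and "09" are not parsed as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

runtime() {
    # $1 = start time, $2 = end time, both HH:MM:SS on the same day
    echo $(( $(to_secs "$2") - $(to_secs "$1") ))
}

echo "205030.3: $(runtime 13:59:22 13:59:40) s"
echo "205031.4: $(runtime 13:59:42 14:00:13) s"
echo "205031.7: $(runtime 14:00:15 14:00:46) s"
```

That gives 18 seconds for the first batch, which roughly matches the "sleep 20" before the qdel, and 31 seconds for each of the two batches of tempjob-second.
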
> 
> 
> This is the output I get to my terminal when running "trial":
> 
> Queueing out jobs
> Your job-array 205030.1-9:1 ("tempjob-first") has been submitted
> Your job-array 205031.1-9:1 ("tempjob-second") has been submitted
> Deleting jobs
> dgruhn has registered the job-array task 205030.1 for deletion
> dgruhn has registered the job-array task 205030.2 for deletion
> dgruhn has registered the job-array task 205030.3 for deletion
> dgruhn has deleted job 205030
> dgruhn has deleted job 205031
> 
> This is the output left in my login directory *.o* files (nothing in the *.e* files):
> 
> Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 1
> Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 2
> Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 3
> Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 4
> Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 5
> Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 6
> Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 7
> Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 8
> Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 9
> 
> 
> I have the scheduler interval set to 10 seconds and my notify time on the test.q is 20 seconds.  I have varied both of these and neither seems to change this odd 31-second runtime. I have also tried catching the USR1 and USR2 signals in the "trial.job" script, but that doesn't make a difference either.
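
On the signal handling: as far as I know, SGE delivers the SIGUSR1/SIGUSR2 warning signals only to jobs submitted with "qsub -notify", and the qsub lines in "trial" do not pass that option, so a trap on its own may never fire. Below is a minimal sketch of such a trap. It signals itself so that it can be tried outside the cluster; the handler name on_warn and the variable caught are made up for illustration.

```shell
#!/bin/bash
caught=""

# Hypothetical handler: record which warning signal arrived.
on_warn() {
    caught=$1
    echo "`date` caught SIG$1, cleaning up"
}

trap 'on_warn USR1' USR1   # with -notify: warning ahead of a suspend
trap 'on_warn USR2' USR2   # with -notify: warning ahead of a kill

# Stand-in for the warning SGE would deliver: signal this shell itself.
kill -USR2 $BASHPID
```

Note that bash runs a trap only after the current foreground command has finished, so with a plain "sleep 40" the handler would be delayed. Running the sleep in the background and calling "wait" lets the trap run promptly.
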
> 
> In my real-life situation, the second set of tasks runs into problems because they expect to find the output of the first set of tasks.
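
Independent of the cause, a dependent task can protect itself by checking for its prerequisite's output before doing any work. This is a minimal sketch only: the helper name require_input and the file names are assumptions and are not taken from the real jobs.

```shell
#!/bin/bash
# Hypothetical guard for a dependent array task: succeed only if the
# prerequisite's output file exists and is not empty.
require_input() {
    if [ ! -s "$1" ]; then
        echo "`date` missing or empty input: $1" >&2
        return 1
    fi
}

# Demonstration with temporary files instead of real job output.
tmpdir=$(mktemp -d)
echo "result" > "$tmpdir/first.1.out"

require_input "$tmpdir/first.1.out" && echo "task 1 may proceed"
require_input "$tmpdir/first.4.out" || echo "task 4 must not proceed"
```

In a real job script the failing branch would exit with a non-zero status instead of echoing, so the task stops before it touches incomplete data.
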
> 
> Should tasks 4-9 of tempjob-second not simply be deleted? Why are they all running for about 31 seconds and then being aborted?
> 
> Does anyone have any ideas? 
> 
> Dan

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253399
