Opened 11 years ago
Last modified 10 years ago
#801 new defect
IZ3263: Qdel With Array Task Dependencies Problem
Reported by: | dangruhn | Owned by: | |
---|---|---|---|
Priority: | high | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | | Keywords: | PC Linux scheduling |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3263]
Issue #: 3263
Platform: PC
Reporter: dangruhn (dangruhn)
Component: gridengine
OS: Linux
Subcomponent: scheduling
Version: 6.2u5
CC: reuti
Status: NEW
Priority: P2
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL:
* Summary: Qdel With Array Task Dependencies Problem
Status whiteboard:
Attachments:
Issue 3263 blocks:
Votes for issue 3263:
Opened: Thu Apr 15 06:08:00 -0700 2010
------------------------

I have come across an error in SGE 6.2u5 on amd64 Linux Fedora 9 (info on VMware Solaris 10 i386 further down) having to do with -hold_jid_ad array dependencies and qdel. In trying to simplify the reproduction, I have come down to this:

1) I have a queue which can run at most 3 jobs.
2) I submit an array job with 9 tasks and then another 9-task array job with an array dependency on the first.

The script "trial" does this as follows (I am user dgruhn):

    #!/bin/bash
    echo "Queueing out jobs"
    qsub -q test.q -N tempjob-first -t 1-9 trial.job first
    qsub -q test.q -hold_jid_ad tempjob-first -N tempjob-second -t 1-9 trial.job second
    sleep 20
    echo "Deleting jobs"
    qdel -u dgruhn 'tempjob-*'

The script "trial.job" (below) echoes the time and its task number, waits 40 seconds, does it again and then exits.

    #!/bin/bash
    #$ -S /bin/bash
    # If SGE was not given a rep number
    : ${SGE_TASK_ID=1}
    if [ "$SGE_TASK_ID" = "undefined" ]
    then
        SGE_TASK_ID=1
    fi
    echo "`date` I am starting $1 task - $SGE_TASK_ID"
    sleep 40
    echo "`date` I am ending $1 task - $SGE_TASK_ID"

This is what happens:

1) Tasks 1-3 of tempjob-first start to run.
2a) About 20 seconds later, tasks 1-3 of tempjob-first are killed and aborted (I get email notification):

    Job-array task 205030.3 (tempjob-first) Aborted
    Exit Status = 137
    Signal = KILL
    User = dgruhn
    Queue = test.q@dgruhn-f9.group-w-inc.com
    Host = dgruhn-f9.group-w-inc.com
    Start Time = 04/13/2010 13:59:22
    End Time = 04/13/2010 13:59:40
    CPU = 00:00:00
    Max vmem = 168.094M
    failed assumedly after job because:
    job 205030.3 died through signal KILL (9)

2b) Also at that time, the remaining tasks 4-9 of tempjob-first and tasks 1-3 of tempjob-second are deleted from the pending area.
2c) Additionally, tasks 4-6 of tempjob-second start to run.
3) About 30 seconds later, tasks 4-6 of tempjob-second are aborted, and tasks 7-9 of tempjob-second start to run.

    Job-array task 205031.4 (tempjob-second) Aborted
    Exit Status = 137
    Signal = KILL
    User = dgruhn
    Queue = test.q@dgruhn-f9.group-w-inc.com
    Host = dgruhn-f9.group-w-inc.com
    Start Time = 04/13/2010 13:59:42
    End Time = 04/13/2010 14:00:13
    CPU = 00:00:00
    Max vmem = 168.094M
    failed assumedly after job because:
    job 205031.4 died through signal KILL (9)

4) In approximately another 30 seconds, tasks 7-9 of tempjob-second are aborted.

    Job-array task 205031.7 (tempjob-second) Aborted
    Exit Status = 137
    Signal = KILL
    User = dgruhn
    Queue = test.q@dgruhn-f9.group-w-inc.com
    Host = dgruhn-f9.group-w-inc.com
    Start Time = 04/13/2010 14:00:15
    End Time = 04/13/2010 14:00:46
    CPU = 00:00:00
    Max vmem = 168.094M
    failed assumedly after job because:
    job 205031.7 died through signal KILL (9)

This is the output I get on my terminal when running "trial":

    Queueing out jobs
    Your job-array 205030.1-9:1 ("tempjob-first") has been submitted
    Your job-array 205031.1-9:1 ("tempjob-second") has been submitted
    Deleting jobs
    dgruhn has registered the job-array task 205030.1 for deletion
    dgruhn has registered the job-array task 205030.2 for deletion
    dgruhn has registered the job-array task 205030.3 for deletion
    dgruhn has deleted job 205030
    dgruhn has deleted job 205031

This is the output left in the *.o* files in my login directory (nothing in the *.e* files):

    Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 1
    Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 2
    Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 3
    Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 4
    Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 5
    Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 6
    Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 7
    Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 8
    Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 9

I have the scheduler interval set to 10 seconds and my notify time on test.q is 20 seconds. I have varied both of these and neither seems to change this odd 31 second runtime. I have also tried catching the USR1 and USR2 signals in the "trial.job" script, but that doesn't make a difference either.

Reuti and I also tried a workaround: qalter -h u tempjob-second before the qdel. His result was that unfortunately this does not work either: no hold is applied to these jobs.
It looks like jobs with a -hold_jid_ad are somehow neglected by SGE commands in different ways.

I have run this on PC Solaris 10 in VMware. I am getting somewhat similar results except that the tempjob-second tasks 4-9 just run to completion, they are never killed. Perhaps there is some timeout that is set differently between the two configurations that I have not been able to determine.

In summary, the problem is that some tasks start to run which are supposed to be killed already. The odd thing is that on my Linux setup they do get killed, but only after they are allowed to run for a short time.

Dan

------- Additional comments from dangruhn Mon Apr 19 06:58:23 -0700 2010 -------

Forget about the 30 second delete of jobs. I found that my example queue had a hard time limit of 30 seconds. What is happening when that hard limit is removed is that tempjob-first has all of its tasks deleted and tempjob-second has tasks 1-3 deleted. Tasks 4-9 of tempjob-second then start running and run to completion. My apologies for the confusion.
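
For reference, the run times implied by the three abort notifications in the description can be recomputed from their Start Time and End Time fields. The following is only a small sketch: the helper names `to_seconds` and `elapsed` are made up for illustration and are not part of SGE, and the timestamps are copied from the emails quoted above.

```shell
#!/bin/bash
# Hypothetical helpers, not SGE commands: compute run time from the
# Start Time / End Time fields of the abort notifications.

# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    h=${1%%:*}        # text before the first colon
    rest=${1#*:}      # text after the first colon
    m=${rest%%:*}
    s=${rest#*:}
    # Strip one leading zero so values such as "08" are not read as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the seconds between two timestamps on the same day.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 13:59:22 13:59:40   # task 205030.3 (tempjob-first)  -> 18
elapsed 13:59:42 14:00:13   # task 205031.4 (tempjob-second) -> 31
elapsed 14:00:15 14:00:46   # task 205031.7 (tempjob-second) -> 31
```

The 18 second figure lines up with the qdel issued after "sleep 20", while the two 31 second figures are consistent with the 30 second hard time limit on the queue mentioned in the additional comment.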