Opened 8 years ago

Last modified 7 years ago

#801 new defect

IZ3263: Qdel With Array Task Dependencies Problem

Reported by: dangruhn
Owned by:
Priority: high
Milestone:
Component: sge
Version: 6.2u5
Severity:
Keywords: PC Linux scheduling
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3263]

        Issue #:      3263             Platform:     PC       Reporter: dangruhn (dangruhn)
       Component:     gridengine          OS:        Linux
     Subcomponent:    scheduling       Version:      6.2u5       CC: reuti
        Status:       NEW              Priority:     P2
      Resolution:                     Issue type:    DEFECT
                                   Target milestone: ---
      Assigned to:    andreas (andreas)
      QA Contact:     andreas
          URL:
       * Summary:     Qdel With Array Task Dependencies Problem
   Status whiteboard:
      Attachments:

     Issue 3263 blocks:
   Votes for issue 3263:


   Opened: Thu Apr 15 06:08:00 -0700 2010 
------------------------


I have come across an error in SGE 6.2u5 on amd64 Linux Fedora 9 (info on VMware Solaris 10 i386 further down) having to do with
-hold_jid_ad array dependencies and qdel.  In trying to simplify the reproduction, I have come down to this:

1) I have a queue which can run at most 3 jobs
2) I submit an array job with 9 tasks and then another array job with 9 tasks that has an array dependency on the first. The script
"trial" does this as follows (I am user dgruhn):

#!/bin/bash

echo "Queueing out jobs"
qsub -q test.q -N tempjob-first -t 1-9 trial.job first
qsub -q test.q -hold_jid_ad tempjob-first -N tempjob-second -t 1-9 trial.job second

sleep 20

echo "Deleting jobs"
qdel -u dgruhn 'tempjob-*'

The script "trial.job" (below) echoes the time and its task number, waits 40 seconds, echoes them again and then exits.

#!/bin/bash
#$ -S /bin/bash

# If SGE was not given a rep number, default the task ID to 1
: ${SGE_TASK_ID=1}
if [ "$SGE_TASK_ID" = "undefined" ]
then
    SGE_TASK_ID=1
fi

echo "`date` I am starting $1 task - $SGE_TASK_ID"
sleep 40
echo "`date` I am ending $1 task - $SGE_TASK_ID"

This is what happens:

1) The tasks 1-3 of tempjob-first start to run.

2a) About 20 seconds later tasks 1-3 of tempjob-first are killed and aborted (I get email notification):

Job-array task 205030.3 (tempjob-first) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 13:59:22
 End Time         = 04/13/2010 13:59:40
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205030.3 died through signal KILL (9)

2b) Also at that time, the remaining tasks 4-9 of tempjob-first and tasks 1-3 of tempjob-second are deleted from the pending area.

2c) Additionally, tasks 4-6 of tempjob-second start to run.

3) About 30 seconds later, tasks 4-6 of tempjob-second are aborted, and tasks 7-9 of tempjob-second start to run.

Job-array task 205031.4 (tempjob-second) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 13:59:42
 End Time         = 04/13/2010 14:00:13
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:

job 205031.4 died through signal KILL (9)


4) In approximately another 30 seconds, tasks 7-9 of tempjob-second are aborted.

Job-array task 205031.7 (tempjob-second) Aborted
 Exit Status      = 137
 Signal           = KILL
 User             = dgruhn
 Queue            = test.q@dgruhn-f9.group-w-inc.com
 Host             = dgruhn-f9.group-w-inc.com
 Start Time       = 04/13/2010 14:00:15
 End Time         = 04/13/2010 14:00:46
 CPU              = 00:00:00
 Max vmem         = 168.094M
failed assumedly after job because:
job 205031.7 died through signal KILL (9)
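
The exit status of 137 in these notifications is consistent with the reported KILL signal: by the usual shell convention, a status above 128 is 128 plus the signal number. A minimal sketch of that arithmetic follows; the function name decode_exit_status is made up for illustration and is not part of SGE.

```bash
#!/bin/bash
# Hypothetical helper: interpret an exit status using the 128+N shell convention.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$(( status - 128 ))
        # "kill -l N" prints the name of signal number N
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with status $status"
    fi
}

decode_exit_status 137   # prints: killed by signal 9 (KILL)
```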


This is the output I get to my terminal when running "trial"

Queueing out jobs
Your job-array 205030.1-9:1 ("tempjob-first") has been submitted
Your job-array 205031.1-9:1 ("tempjob-second") has been submitted
Deleting jobs
dgruhn has registered the job-array task 205030.1 for deletion
dgruhn has registered the job-array task 205030.2 for deletion
dgruhn has registered the job-array task 205030.3 for deletion
dgruhn has deleted job 205030
dgruhn has deleted job 205031

This is the output left in my login directory *.o* files (nothing in the *.e* files):

Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 1
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 2
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 3
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 4
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 5
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 6
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 7
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 8
Tue Apr 13 14:00:15 EDT 2010 I am starting second task - 9
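
One way to check this output mechanically is to pair each "starting" line with its "ending" line. The sketch below assumes the exact message format printed by trial.job above; unfinished_tasks is a hypothetical helper name, not an SGE command.

```bash
#!/bin/bash
# Hypothetical helper: list "<job label> <task id>" pairs that printed a
# "starting" line but never a matching "ending" line.
# Reads the named files, or standard input if none are given.
unfinished_tasks() {
    awk '
        / I am starting / { started[$(NF-3) " " $NF] = 1 }
        / I am ending /   { delete started[$(NF-3) " " $NF] }
        END { for (k in started) print k }
    ' "$@" | sort
}

# Two lines taken from the output above; neither task has an "ending" line,
# so both are reported ("first 1" and "second 4").
unfinished_tasks <<'EOF'
Tue Apr 13 13:59:22 EDT 2010 I am starting first task - 1
Tue Apr 13 13:59:42 EDT 2010 I am starting second task - 4
EOF
```

Against the real output files this could be run as something like `cat *.o* | unfinished_tasks`.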


I have the scheduler interval set to 10 seconds and my notify time on the test.q is 20 seconds.  I have varied both of these and
neither seems to change this odd 31-second runtime. I have also tried catching the USR1 and USR2 signals in the "trial.job" script, but that
doesn't make a difference either.
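
The 31 second figure can be read off the Start Time and End Time fields in the abort notifications above. A small sketch of the arithmetic, assuming zero-padded HH:MM:SS timestamps on the same day; to_seconds and elapsed are illustrative helper names.

```bash
#!/bin/bash
# Convert a zero-padded HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    h=${1%%:*}; rest=${1#*:}
    m=${rest%%:*}; s=${rest#*:}
    # Strip one leading zero so values like "08" are not parsed as octal.
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Elapsed seconds between a start and an end timestamp on the same day.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

elapsed 13:59:22 13:59:40   # task 205030.3: prints 18
elapsed 13:59:42 14:00:13   # task 205031.4: prints 31
elapsed 14:00:15 14:00:46   # task 205031.7: prints 31
```

The first value matches the 20 second sleep before qdel less the time taken to start the task; the other two are the runtime in question, which the follow-up comment below attributes to a hard time limit on the queue.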

Reuti and I also tried a workaround:

qalter -h u tempjob-second

before the qdel. His result was that this unfortunately does not work either: no hold is applied to these jobs. It looks like jobs with a
-hold_jid_ad are somehow neglected by SGE commands in different ways.

I have also run this on Solaris 10 (i386) in VMware.  I get broadly similar results, except that tempjob-second tasks 4-9 just run to
completion; they are never killed.  Perhaps there is some timeout that is set differently between the two configurations that I have not
been able to determine.

In summary, the problem is that some tasks start to run which are supposed to be killed already. The odd thing is that on my Linux setup
they do get killed, but only after they are allowed to run for a short time.

Dan

   ------- Additional comments from dangruhn Mon Apr 19 06:58:23 -0700 2010 -------
Please disregard the deletion of jobs after about 30 seconds described above. I found that my example queue had a hard time limit of 30 seconds.

What is happening when that hard limit is removed is that tempjob-first has all of its tasks deleted and
tempjob-second has tasks 1-3 deleted. Tasks 4-9 of tempjob-second then start running and run to completion.

My apologies for the confusion.
