Opened 14 years ago

Last modified 9 years ago

#252 new defect

IZ1681: killed master task with tight integration does not kill slave jobs in special case

Reported by: roland
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.0u5
Severity:
Keywords: Sun execution
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1681]

   Issue #:           1681
   Platform:          Sun
   Reporter:          roland (roland)
   Component:         gridengine
   OS:                All
   Subcomponent:      execution
   Version:           6.0u5
   CC:                None defined
   Status:            NEW
   Priority:          P3
   Resolution:
   Issue type:        DEFECT
   Target milestone:  ---
   Assigned to:       pollinger (pollinger)
   QA Contact:        pollinger
   URL:
   Summary:           killed master task with tight integration does not kill slave jobs in special case
   Status whiteboard:
   Attachments:

   Issue 1681 blocks:
   Votes for issue 1681:


   Opened: Thu Jun 30 02:18:00 -0700 2005 
------------------------


The tight integration testsuite subtest "tight_integration_master_killed"
occasionally prints the message:

job 180 is in state -1
expected to see job 180 in dr state, but job vanished. OK!

In this case the job has disappeared from the qstat output, but the slave tasks
are still running and will never be killed. A ps on the hosts shows a complete
process tree from shepherd to job script:
rd141302 30267 18302  0 13:34 ?        00:00:00 sge_shepherd-180 -bg
root     30268 30267  0 13:34 ?        00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/rshd -l
rd141302 30271 30268  0 13:34 ?        00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/qrsh_starter /usr/local/testsuite/32204/execd/carc/active_jobs/180.1/1.carc noshell
rd141302 30279 30271  0 13:34 ?        00:00:00 /bin/sh /cod_home/rd141302/source/MAIN/gridengine/testsuite/scripts/pe_task.sh 0 3600
rd141302 30280 30279  0 13:34 ?        00:00:00 sleep 3600
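
To check a host for such a leftover process tree, the ps listing can be walked from the shepherd downwards. The following is only a minimal sketch, assuming `ps -ef` style output with one process per line (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The helper names `parse_ps` and `shepherd_tree` are hypothetical and not part of Grid Engine; the sample data is the listing from this report.

```python
import re
from collections import defaultdict

# Sample taken from the ps listing in this report (one process per line).
SAMPLE_PS = """\
rd141302 30267 18302  0 13:34 ?        00:00:00 sge_shepherd-180 -bg
root     30268 30267  0 13:34 ?        00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/rshd -l
rd141302 30271 30268  0 13:34 ?        00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/qrsh_starter /usr/local/testsuite/32204/execd/carc/active_jobs/180.1/1.carc noshell
rd141302 30279 30271  0 13:34 ?        00:00:00 /bin/sh /cod_home/rd141302/source/MAIN/gridengine/testsuite/scripts/pe_task.sh 0 3600
rd141302 30280 30279  0 13:34 ?        00:00:00 sleep 3600
"""


def parse_ps(ps_text):
    """Parse `ps -ef` style lines into (pid, ppid, command) tuples."""
    procs = []
    for line in ps_text.strip().splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # skip the header line or malformed lines
        procs.append((int(fields[1]), int(fields[2]), fields[7]))
    return procs


def shepherd_tree(ps_text, job_id):
    """Return the sorted PIDs of sge_shepherd-<job_id> and all its descendants."""
    procs = parse_ps(ps_text)
    children = defaultdict(list)
    for pid, ppid, _cmd in procs:
        children[ppid].append(pid)

    # The shepherd shows up as "sge_shepherd-<job id>" in the listing above.
    pattern = re.compile(r"sge_shepherd-" + re.escape(str(job_id)) + r"\b")
    stack = [pid for pid, _ppid, cmd in procs if pattern.match(cmd)]

    found = []
    while stack:
        pid = stack.pop()
        found.append(pid)
        stack.extend(children[pid])
    return sorted(found)


if __name__ == "__main__":
    print(shepherd_tree(SAMPLE_PS, 180))
```

A non-empty result for a job that qstat no longer lists would indicate the situation described here. Comparing against qstat is deliberately left out of the sketch.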

   ------- Additional comments from roland Thu Jun 30 05:10:39 -0700 2005 -------
The qmaster messages file shows:
06/30/2005 13:44:17|qmaster|carc|W|job 6.1 failed on host carc.germany.sun.com assumedly after job because: job 6.1 died through signal KILL (9)
06/30/2005 13:44:20|qmaster|carc|E|execd oin reports running state for job (6.1/1.oin) in queue "tight.q@oin" while job is in state 65536
06/30/2005 13:44:27|qmaster|carc|E|execd@ents.germany.sun.com reports running job (6.1/1.ents) in queue "tight.q@ents" that was not supposed to be there - killing
06/30/2005 13:44:29|qmaster|carc|E|execd@carc.germany.sun.com reports running job (6.1/1.carc) in queue "tight.q@carc.germany.sun.com" that was not supposed to be there - killing

The execd messages file on carc shows the following (the messages file on oin
shows the same error messages):
06/30/2005 13:44:29|execd|carc|E|ja-task "6.1" is unknown - reporting it to qmaster
06/30/2005 13:44:30|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:37|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:38|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:39|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:40|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:41|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:42|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:43|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:44|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
06/30/2005 13:44:44|execd|carc|E|acknowledge for unknown job 6.1/master
06/30/2005 13:44:44|execd|carc|E|ERROR: unlinking "jobs/00/0000/0006.1": No such file or directory
06/30/2005 13:44:44|execd|carc|E|can not remove file job spool file: jobs/00/0000/0006.1
06/30/2005 13:44:45|execd|carc|E|removing unreferenced job 6.1 without job report from ptf
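
The repeated entries in excerpts like these are easier to read when grouped. The following is a rough sketch, assuming each entry has the pipe-separated shape visible above (timestamp, component, host, level, message) and that a line not starting with a timestamp continues the previous entry, as happens when an excerpt is wrapped. The field names and the helpers `parse_messages` and `summarize` are hypothetical; only the sample lines come from this report.

```python
import re
from collections import Counter
from datetime import datetime

# Shape seen in this report: MM/DD/YYYY HH:MM:SS|component|host|level|message
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)

# Sample lines copied from the execd messages excerpt in this report.
SAMPLE_LOG = """\
06/30/2005 13:44:29|execd|carc|E|ja-task "6.1" is unknown - reporting it to qmaster
06/30/2005 13:44:30|execd|carc|E|could not find job report for job 6.1 task
1.carc contained in job usage from ptf
06/30/2005 13:44:37|execd|carc|E|could not find job report for job 6.1 task
1.carc contained in job usage from ptf
"""


def parse_messages(text):
    """Split a messages excerpt into entries, joining wrapped continuation lines."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            stamp, component, host, level, message = match.groups()
            entries.append({
                "time": datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
                "component": component,
                "host": host,
                "level": level,
                "message": message,
            })
        elif entries and line.strip():
            # No timestamp: treat the line as the wrapped tail of the last entry.
            entries[-1]["message"] += " " + line.strip()
    return entries


def summarize(entries):
    """Count how often each (component, level, message) combination occurs."""
    return Counter((e["component"], e["level"], e["message"]) for e in entries)


if __name__ == "__main__":
    for key, count in summarize(parse_messages(SAMPLE_LOG)).most_common():
        print(count, *key)
```

Grouping by message text collapses the per-second "could not find job report" entries into a single line with a count, which makes the remaining distinct errors stand out.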

On carc and oin the job is still running:
rd141302 20330 14097  0 13:44 ?        00:00:00 sge_shepherd-6 -bg
root     20331 20330  0 13:44 ?        00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/rshd -l
rd141302 20334 20331  0 13:44 ?        00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/qrsh_starter /usr/local/testsuite/32204/execd/carc/active_jobs/6.1/1.carc noshell
rd141302 20341 20334  0 13:44 ?        00:00:00 /bin/sh /cod_home/rd141302/source/MAIN/gridengine/testsuite/scripts/pe_task.sh 0 3600
rd141302 20342 20341  0 13:44 ?        00:00:00 sleep 3600
