Opened 16 years ago
Last modified 10 years ago
#252 new defect
IZ1681: killed master task with tight integration does not kill slave jobs in special case
| Reported by: | roland | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | |
| Component: | sge | Version: | 6.0u5 |
| Severity: | | Keywords: | Sun execution |
| Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1681]
Issue #: 1681
Platform: Sun
Reporter: roland (roland)
Component: gridengine
OS: All
Subcomponent: execution
Version: 6.0u5
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: pollinger (pollinger)
QA Contact: pollinger
URL:
* Summary: killed master task with tight integration does not kill slave jobs in special case
Status whiteboard:
Attachments:
Issue 1681 blocks:
Votes for issue 1681:
Opened: Thu Jun 30 02:18:00 -0700 2005
------------------------

The tight integration testsuite subtest "tight_integration_master_killed" prints out from time to time the message:

  job 180 is in state -1 expected to see job 180 in dr state, but job vanished. OK!

In this case the job disappeared from qstat output but the slave tasks are already running and will never be killed. A ps on the hosts shows a complete processtree from shepherd to jobscript:

  rd141302 30267 18302 0 13:34 ? 00:00:00 sge_shepherd-180 -bg
  root 30268 30267 0 13:34 ? 00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/rshd -l
  rd141302 30271 30268 0 13:34 ? 00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/qrsh_starter /usr/local/testsuite/32204/execd/carc/active_jobs/180.1/1.carc noshell
  rd141302 30279 30271 0 13:34 ? 00:00:00 /bin/sh /cod_home/rd141302/source/MAIN/gridengine/testsuite/scripts/pe_task.sh 0 3600
  rd141302 30280 30279 0 13:34 ? 00:00:00 sleep 3600

------- Additional comments from roland Thu Jun 30 05:10:39 -0700 2005 -------

qmaster messages shows:

  06/30/2005 13:44:17|qmaster|carc|W|job 6.1 failed on host carc.germany.sun.com assumedly after job because: job 6.1 died through signal KILL (9)
  06/30/2005 13:44:20|qmaster|carc|E|execd oin reports running state for job (6.1/1.oin) in queue "tight.q@oin" while job is in state 65536
  06/30/2005 13:44:27|qmaster|carc|E|execd@ents.germany.sun.com reports running job (6.1/1.ents) in queue "tight.q@ents" that was not supposed to be there - killing
  06/30/2005 13:44:29|qmaster|carc|E|execd@carc.germany.sun.com reports running job (6.1/1.carc) in queue "tight.q@carc.germany.sun.com" that was not supposed to be there - killing

carc execd messages shows: (message file on oin shows the same error messages)

  06/30/2005 13:44:29|execd|carc|E|ja-task "6.1" is unknown - reporting it to qmaster
  06/30/2005 13:44:30|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:37|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:38|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:39|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:40|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:41|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:42|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:43|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:44|execd|carc|E|could not find job report for job 6.1 task 1.carc contained in job usage from ptf
  06/30/2005 13:44:44|execd|carc|E|acknowledge for unknown job 6.1/master
  06/30/2005 13:44:44|execd|carc|E|ERROR: unlinking "jobs/00/0000/0006.1": No such file or directory
  06/30/2005 13:44:44|execd|carc|E|can not remove file job spool file: jobs/00/0000/0006.1
  06/30/2005 13:44:45|execd|carc|E|removing unreferenced job 6.1 without job report from ptf

on carc and oin the job is already running:

  rd141302 20330 14097 0 13:44 ? 00:00:00 sge_shepherd-6 -bg
  root 20331 20330 0 13:44 ? 00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/rshd -l
  rd141302 20334 20331 0 13:44 ? 00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/qrsh_starter /usr/local/testsuite/32204/execd/carc/active_jobs/6.1/1.carc noshell
  rd141302 20341 20334 0 13:44 ? 00:00:00 /bin/sh /cod_home/rd141302/source/MAIN/gridengine/testsuite/scripts/pe_task.sh 0 3600
  rd141302 20342 20341 0 13:44 ? 00:00:00 sleep 3600
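
For anyone trying to reproduce this, leftover slave tasks can be spotted by walking the process tree under the job's shepherd, as in the ps listings in this report. The following is only a rough sketch and is not part of Grid Engine or the testsuite: the function name `shepherd_tree` and the `SAMPLE` constant are made up for illustration, and it assumes `ps -ef` style output (UID PID PPID C STIME TTY TIME CMD) and that the shepherd process is named `sge_shepherd-<jobid>` as shown in this report.

```python
from collections import defaultdict


def shepherd_tree(ps_output, job_id):
    """Return (pid, command) pairs for sge_shepherd-<job_id> and its descendants.

    Hypothetical helper: expects `ps -ef` style text, one process per line.
    """
    children = defaultdict(list)  # ppid -> list of child pids
    commands = {}                 # pid -> full command line
    for line in ps_output.splitlines():
        fields = line.split(None, 7)
        # Skip the header line and anything that does not look like a process row.
        if len(fields) < 8 or not fields[1].isdigit() or not fields[2].isdigit():
            continue
        pid, ppid = int(fields[1]), int(fields[2])
        commands[pid] = fields[7]
        children[ppid].append(pid)

    marker = "sge_shepherd-%d" % job_id
    roots = [pid for pid, cmd in commands.items() if cmd.split()[0] == marker]

    # Depth-first walk from each shepherd down to the job script and its children.
    result = []
    stack = list(reversed(roots))
    while stack:
        pid = stack.pop()
        result.append((pid, commands[pid]))
        stack.extend(reversed(children[pid]))
    return result


# Process listing copied from the first ps output in this report.
SAMPLE = """\
rd141302 30267 18302 0 13:34 ? 00:00:00 sge_shepherd-180 -bg
root 30268 30267 0 13:34 ? 00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/rshd -l
rd141302 30271 30268 0 13:34 ? 00:00:00 /scratch4/rd141302/ts/utilbin/lx24-amd64/qrsh_starter /usr/local/testsuite/32204/execd/carc/active_jobs/180.1/1.carc noshell
rd141302 30279 30271 0 13:34 ? 00:00:00 /bin/sh /cod_home/rd141302/source/MAIN/gridengine/testsuite/scripts/pe_task.sh 0 3600
rd141302 30280 30279 0 13:34 ? 00:00:00 sleep 3600
"""

if __name__ == "__main__":
    for pid, cmd in shepherd_tree(SAMPLE, 180):
        print(pid, cmd)
```

If the job no longer appears in qstat but this still returns a non-empty list on an execution host, the slave task was presumably left behind in the way described here.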