Opened 15 years ago
Closed 9 years ago
#294 closed defect (fixed)
IZ1882: mutually subordinating queues suspend each other simultaneously
Reported by: | bjfcoomer | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | minor | Keywords: | scheduling |
Cc: | | | |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1882]
Issue #: 1882 Platform: All Reporter: bjfcoomer (bjfcoomer) Component: gridengine OS: All Subcomponent: scheduling Version: 6.2u5 CC: None defined Status: REOPENED Priority: P3 Resolution: Issue type: DEFECT Target milestone: 6.0u7 Assigned to: sgrell (sgrell) QA Contact: andreas URL: * Summary: mutually subordinating queues suspend eachother simultaneously Status whiteboard: Attachments: Issue 1882 blocks: Votes for issue 1882: Opened: Fri Nov 11 04:04:00 -0700 2005 ------------------------ The full issue is reproduced by the stuff below. The basic problem is that jobs get scheduled to queues which are mutually subordinate to eachother simultaneously. So they suspend each other. >>> (1) A parallel job is running, and one is queued, and serial jobs >>> are >>> queued >>> >>> sccomp@test:~/EXAMPLE/serial> qstat -f >>> queuename qtype used/tot. load_avg arch >>> states >>> -------------------------------------------------------------------- >>> -------- >>> master.q@test.grid.cluster P 1/8 0.00 lx24-amd64 >>> 525 500.51000 PMB-MPI1.s sccomp r 11/03/2005 18:44:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp00.grid.cluster P 1/1 0.03 lx24-amd64 >>> 525 500.51000 PMB-MPI1.s sccomp r 11/03/2005 18:44:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp01.grid.cluster P 1/1 0.03 lx24-amd64 >>> 525 500.51000 PMB-MPI1.s sccomp r 11/03/2005 18:44:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp02.grid.cluster P 1/1 0.07 lx24-amd64 >>> 525 500.51000 PMB-MPI1.s sccomp r 11/03/2005 18:44:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp03.grid.cluster P 1/1 0.03 lx24-amd64 >>> 525 500.51000 PMB-MPI1.s sccomp r 11/03/2005 18:44:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp00.grid.cluster BI 0/2 0.03 lx24- >>> amd64 S >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp01.grid.cluster BI 0/2 0.03 lx24- >>> amd64 S >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp02.grid.cluster BI 0/2 0.07 lx24- >>> amd64 S >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp03.grid.cluster BI 0/2 0.03 lx24- >>> amd64 S >>> >>> #################################################################### >>> ######## >>> - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - >>> PENDING >>> JOBS >>> #################################################################### >>> ######## >>> 526 1000.51000 PMB-MPI1.s sccomp qw 11/03/2005 18:44:28 >>> 5 >>> 527 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:45 1 >>> 528 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:45 1 >>> 529 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:46 1 >>> 530 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:46 1 >>> 531 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:47 1 >>> 532 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:47 1 >>> 533 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:48 1 >>> 534 0.51000 hello.sh sccomp qw 11/03/2005 >>> 18:44:48 1 >>> >>> >>> (2) I qdel the running parallel job and then do qstat -f >>> >>> sccomp@test:~/EXAMPLE/serial> qstat -f >>> queuename qtype used/tot. 
load_avg arch >>> states >>> -------------------------------------------------------------------- >>> -------- >>> master.q@test.grid.cluster P 1/8 0.00 lx24-amd64 >>> 526 1000.51000 PMB-MPI1.s sccomp t 11/03/2005 18:45:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp00.grid.cluster P 1/1 0.28 lx24- >>> amd64 S >>> 526 1000.51000 PMB-MPI1.s sccomp St 11/03/2005 18:45:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp01.grid.cluster P 1/1 0.28 lx24- >>> amd64 S >>> 526 1000.51000 PMB-MPI1.s sccomp St 11/03/2005 18:45:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp02.grid.cluster P 1/1 0.31 lx24- >>> amd64 S >>> 526 1000.51000 PMB-MPI1.s sccomp St 11/03/2005 18:45:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> parallel.q@comp03.grid.cluster P 1/1 0.28 lx24- >>> amd64 S >>> 526 1000.51000 PMB-MPI1.s sccomp St 11/03/2005 18:45:27 >>> 1 >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp00.grid.cluster BI 2/2 0.28 lx24- >>> amd64 S >>> 527 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> 533 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp01.grid.cluster BI 2/2 0.28 lx24- >>> amd64 S >>> 529 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> 531 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp02.grid.cluster BI 2/2 0.31 lx24- >>> amd64 S >>> 530 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> 534 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> -------------------------------------------------------------------- >>> -------- >>> serial.q@comp03.grid.cluster BI 2/2 0.28 lx24- >>> amd64 S >>> 528 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> 532 0.51000 hello.sh sccomp St 11/03/2005 >>> 18:45:27 1 >>> >>> >>> >>> And here is the log from the scheduler monitor: >>> :::::::: >>> 525:1:RUNNING:1131043467:600:P:score:slots:5.000000 >>> 525:1:RUNNING: >>> 1131043467:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000 >>> 525:1:RUNNING: >>> 1131043467:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000 >>> 525:1:RUNNING: >>> 1131043467:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000 >>> 525:1:RUNNING: >>> 1131043467:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000 >>> 525:1:RUNNING: >>> 1131043467:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000 >>> :::::::: >>> 526:1:STARTING:1131043527:600:P:score:slots:5.000000 >>> 526:1:STARTING: >>> 1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000 >>> 526:1:STARTING: >>> 1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000 >>> 526:1:STARTING: >>> 1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000 >>> 526:1:STARTING: >>> 1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000 >>> 526:1:STARTING: >>> 1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000 >>> 527:1:STARTING: >>> 1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000 >>> 528:1:STARTING: >>> 1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000 >>> 529:1:STARTING: >>> 
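For reference, dispatch records in this format are normally produced by the scheduler's monitoring facility; a minimal sketch of how to enable it, assuming a stock 6.x installation (the cell name and file location are assumptions based on the defaults):

```sh
# Enable per-dispatch monitoring in the scheduler configuration.
# In the editor that opens, add MONITOR=1 to the "params" attribute, e.g.:
#   params    MONITOR=1
qconf -msconf

# The scheduler then appends records like the ones above to the schedule file:
tail -f "$SGE_ROOT/${SGE_CELL:-default}/common/schedule"
```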
------- Additional comments from sgrell Wed Nov 23 01:33:18 -0700 2005 -------

Started working on this issue.

Stephan

------- Additional comments from sgrell Wed Nov 23 09:03:03 -0700 2005 -------

Fixed in maintrunk and for u7.

Stephan

------- Additional comments from reuti Mon Aug 9 16:41:41 -0700 2010 -------

A parallel job can also suspend itself, when it gets slots in the sub- and superordinated queue at the same time:

```
reuti@pc15370:~> qsub -pe openmpi 8 -l h=pc15370 test_mpich.sh
Your job 1868 ("test_mpich.sh") has been submitted
reuti@pc15370:~> qstat -g t
job-ID  prior    name        user   state submit/start at      queue            master  ja-task-ID
------------------------------------------------------------------------------------------------------------------
   1868 0.75500  test_mpich  reuti  S     08/10/2010 01:31:11  all.q@pc15370    SLAVE
                                                               all.q@pc15370    SLAVE
                                                               all.q@pc15370    SLAVE
                                                               all.q@pc15370    SLAVE
   1868 0.75500  test_mpich  reuti  S     08/10/2010 01:31:11  extra.q@pc15370  MASTER
                                                               extra.q@pc15370  SLAVE
                                                               extra.q@pc15370  SLAVE
                                                               extra.q@pc15370  SLAVE
```

extra.q is entered as a subordinated queue in all.q (classic subordination). There are some other similar issues, so I'm not sure whether this is the most appropriate one, or: 437 / 2397
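The queue configurations behind the two scenarios are not included in the report; the following is a hypothetical sketch of what they presumably look like (queue names are taken from the report, the subordinate_list values are assumptions):

```sh
# (a) Mutual subordination, as in the original report: each queue names the
#     other in its subordinate_list, so filling one suspends the other on
#     the same host.
qconf -sq parallel.q | grep subordinate_list
#   subordinate_list      serial.q
qconf -sq serial.q | grep subordinate_list
#   subordinate_list      parallel.q

# (b) Classic subordination, as in the 2010 comment: extra.q is subordinated
#     to all.q, so a parallel job holding its MASTER task in extra.q and
#     SLAVE slots in all.q on the same host ends up suspending itself.
qconf -sq all.q | grep subordinate_list
#   subordinate_list      extra.q
```

In both layouts the reported defect is the same: slots in the subordinating and the subordinated queue are handed out in the same scheduling run, so the resulting suspensions trigger each other immediately.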
Change History (1)
comment:1 Changed 9 years ago by dlove
- Resolution set to fixed
- Severity set to minor
- Status changed from new to closed
SG-2005-11-23-0