Custom Query (431 matches)
Results (67 - 69 of 431)
Ticket | Resolution | Summary | Owner | Reporter |
---|---|---|---|---|
#294 | fixed | IZ1882: mutually subordinating queues suspend each other simultaneously | bjfcoomer | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1882]
Issue #: 1882            Platform: All                  Reporter: bjfcoomer (bjfcoomer)
Component: gridengine    OS: All                        Subcomponent: scheduling
Version: 6.2u5           CC: None defined               Status: REOPENED
Priority: P3             Resolution:                    Issue type: DEFECT
Target milestone: 6.0u7  Assigned to: sgrell (sgrell)   QA Contact: andreas
URL:
Summary: mutually subordinating queues suspend each other simultaneously
Status whiteboard:       Attachments:
Issue 1882 blocks:       Votes for issue 1882:
Opened: Fri Nov 11 04:04:00 -0700 2005
------------------------
The full issue is reproduced below. The basic problem is that jobs get scheduled to queues which are mutually subordinate to each other simultaneously, so they suspend each other.

(1) A parallel job is running, one is queued, and serial jobs are queued:

sccomp@test:~/EXAMPLE/serial> qstat -f
queuename                        qtype used/tot. load_avg arch       states
----------------------------------------------------------------------------
master.q@test.grid.cluster       P     1/8       0.00     lx24-amd64
    525  500.51000 PMB-MPI1.s sccomp   r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp00.grid.cluster   P     1/1       0.03     lx24-amd64
    525  500.51000 PMB-MPI1.s sccomp   r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp01.grid.cluster   P     1/1       0.03     lx24-amd64
    525  500.51000 PMB-MPI1.s sccomp   r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp02.grid.cluster   P     1/1       0.07     lx24-amd64
    525  500.51000 PMB-MPI1.s sccomp   r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp03.grid.cluster   P     1/1       0.03     lx24-amd64
    525  500.51000 PMB-MPI1.s sccomp   r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
serial.q@comp00.grid.cluster     BI    0/2       0.03     lx24-amd64 S
----------------------------------------------------------------------------
serial.q@comp01.grid.cluster     BI    0/2       0.03     lx24-amd64 S
----------------------------------------------------------------------------
serial.q@comp02.grid.cluster     BI    0/2       0.07     lx24-amd64 S
----------------------------------------------------------------------------
serial.q@comp03.grid.cluster     BI    0/2       0.03     lx24-amd64 S

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS -
############################################################################
    526 1000.51000 PMB-MPI1.s sccomp   qw    11/03/2005 18:44:28     5
    527    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:45     1
    528    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:45     1
    529    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:46     1
    530    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:46     1
    531    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:47     1
    532    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:47     1
    533    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:48     1
    534    0.51000 hello.sh   sccomp   qw    11/03/2005 18:44:48     1

(2) I qdel the running parallel job and then do qstat -f:

sccomp@test:~/EXAMPLE/serial> qstat -f
queuename                        qtype used/tot. load_avg arch       states
----------------------------------------------------------------------------
master.q@test.grid.cluster       P     1/8       0.00     lx24-amd64
    526 1000.51000 PMB-MPI1.s sccomp   t     11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp00.grid.cluster   P     1/1       0.28     lx24-amd64 S
    526 1000.51000 PMB-MPI1.s sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp01.grid.cluster   P     1/1       0.28     lx24-amd64 S
    526 1000.51000 PMB-MPI1.s sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp02.grid.cluster   P     1/1       0.31     lx24-amd64 S
    526 1000.51000 PMB-MPI1.s sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp03.grid.cluster   P     1/1       0.28     lx24-amd64 S
    526 1000.51000 PMB-MPI1.s sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp00.grid.cluster     BI    2/2       0.28     lx24-amd64 S
    527    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
    533    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp01.grid.cluster     BI    2/2       0.28     lx24-amd64 S
    529    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
    531    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp02.grid.cluster     BI    2/2       0.31     lx24-amd64 S
    530    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
    534    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp03.grid.cluster     BI    2/2       0.28     lx24-amd64 S
    528    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1
    532    0.51000 hello.sh   sccomp   St    11/03/2005 18:45:27     1

And here is the log from the scheduler monitor:
::::::::
525:1:RUNNING:1131043467:600:P:score:slots:5.000000
525:1:RUNNING:1131043467:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:STARTING:1131043527:600:P:score:slots:5.000000
526:1:STARTING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
527:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
528:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
529:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
530:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
531:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
532:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
533:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
534:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:RUNNING:1131043527:600:P:score:slots:5.000000
526:1:RUNNING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
::::::::

------- Additional comments from sgrell Wed Nov 23 01:33:18 -0700 2005 -------
Started working on this issue.
Stephan

------- Additional comments from sgrell Wed Nov 23 09:03:03 -0700 2005 -------
Fixed in maintrunk and for u7.
Stephan

------- Additional comments from reuti Mon Aug 9 16:41:41 -0700 2010 -------
A parallel job can suspend itself when it gets slots in the sub- and superordinated queue at the same time:

reuti@pc15370:~> qsub -pe openmpi 8 -l h=pc15370 test_mpich.sh
Your job 1868 ("test_mpich.sh") has been submitted
reuti@pc15370:~> qstat -g t
job-ID  prior    name         user   state submit/start at     queue             master  ja-task-ID
----------------------------------------------------------------------------------------------------
  1868  0.75500  test_mpich   reuti  S     08/10/2010 01:31:11 all.q@pc15370     SLAVE
                                                               all.q@pc15370     SLAVE
                                                               all.q@pc15370     SLAVE
                                                               all.q@pc15370     SLAVE
  1868  0.75500  test_mpich   reuti  S     08/10/2010 01:31:11 extra.q@pc15370   MASTER
                                                               extra.q@pc15370   SLAVE
                                                               extra.q@pc15370   SLAVE
                                                               extra.q@pc15370   SLAVE

extra.q is entered as subordinated queue in all.q (classic subordination). There are some other issues which are similar, so I'm not sure whether this is the most appropriate one or: 437 / 2397 |
|||
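For context on ticket #294: classic subordination is configured per queue via the subordinate_list attribute described in queue_conf(5). The following shell sketch shows how two queues can end up mutually subordinated; it is illustrative only, and the queue names and thresholds are assumptions rather than the reporter's actual cluster configuration.

    # Illustrative only: a mutually subordinating pair of cluster queues.
    # parallel.q suspends serial.q on a host once one of its slots is in use ...
    qconf -sq parallel.q | grep subordinate_list
    subordinate_list      serial.q=1

    # ... and serial.q in turn suspends parallel.q.
    qconf -sq serial.q | grep subordinate_list
    subordinate_list      parallel.q=1

With such a configuration, a scheduling run that fills both queues on the same hosts at once (as in the qstat output above) makes each queue trigger the other's suspension, which is exactly the simultaneous "S"/"St" state shown in the ticket.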
#295 | fixed | IZ1887: complex man page describes regex incorrectly | ovid | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1887]
Issue #: 1887            Platform: PC                      Reporter: ovid (ovid)
Component: gridengine    OS: All                           Subcomponent: man
Version: 6.0u6           CC: None defined                  Status: NEW
Priority: P4             Resolution:                       Issue type: DEFECT
Target milestone: ---    Assigned to: andreas (andreas)    QA Contact: andreas
URL:
Summary: complex man page describes regex incorrectly
Status whiteboard:       Attachments:
Issue 1887 blocks:       Votes for issue 1887:
Opened: Fri Nov 11 16:09:00 -0700 2005
------------------------
The man page for complex(5) says under RESTRING:

- "[xx]": specifies an array or a range of allowed characters for one character at a specific position

That is incorrect. It should say "[x-y]". The behaviour specified in the man page is not supported.

------- Additional comments from pollinger Mon Mar 23 04:02:04 -0700 2009 -------
Changed Subcomponent to man

------- Additional comments from pollinger Mon Mar 23 04:02:27 -0700 2009 -------
Changed Subcomponent to man

------- Additional comments from pollinger Mon Mar 23 04:03:10 -0700 2009 -------
Changing Subcomponent didn't work, 3rd try... |
|||
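For ticket #295, the two bracket forms at issue look like this when used in a resource request. This is a hedged illustration with hypothetical arch values; the ticket's point is that the man page documents the character-set form where it should document the range form "[x-y]".

    # Set form described by the man page as "[xx]": one character drawn from
    # an explicit list, e.g. intended to match lx24-amd64 or lx26-amd64:
    qsub -l arch="lx2[46]-amd64" job.sh

    # Range form "[x-y]" that the reporter says the page should describe
    # instead: one character from a range, e.g. any digit at that position:
    qsub -l arch="lx2[0-9]-amd64" job.sh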
#298 | fixed | IZ1894: qstat incorrectly reports job scheduling failure | templedf | |
Description |
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1894]
Issue #: 1894            Platform: All                  Reporter: templedf (templedf)
Component: gridengine    OS: All                        Subcomponent: scheduling
Version: current         CC: None defined               Status: NEW
Priority: P4             Resolution:                    Issue type: DEFECT
Target milestone: ---    Assigned to: sgrell (sgrell)   QA Contact: andreas
URL:
Summary: qstat incorrectly reports job scheduling failure
Status whiteboard:       Attachments:
Issue 1894 blocks:       Votes for issue 1894:
Opened: Tue Nov 15 14:23:00 -0700 2005
------------------------
If I submit a job with an unfulfillable resource request, I get the following:

% qsub -l arch=sol-sparc6 /tmp/dant/examples/jobs/sleeper.sh
Your job 6 ("Sleeper") has been submitted.
% qstat -j 6
==============================================================
job_number:        6
...
scheduling info:   (-l arch=sol-sparc6) cannot run globally because
                   (-l arch=sol-sparc6) cannot run at host "balrog.germany.sun.com" because it offers only hl:arch=sol-sparc64
                   (-l arch=sol-sparc6) cannot run at host "balin" because it offers only hl:arch=sol-sparc64

First of all, should there even be a message that the job cannot be run globally? Secondly, the message is incomplete: cannot be run because why? I am seeing this problem in the maintrunk. I have not tested it in any of the release branches.

------- Additional comments from roland Wed Nov 16 01:21:30 -0700 2005 -------
For correct subcomponent tracking I've moved this bug to "scheduling" because the scheduler is responsible for the scheduling info. The qstat command only prints out the messages reported by the scheduler. I assume that by "should there even be a message that the job cannot be run globally" you mean qsub should deny the job at submission time. This is wrong. By default qsub accepts all jobs, but as in this case they will never be scheduled. You can force the consumable verification at submission time with the "-w" switch.

------- Additional comments from templedf Wed Nov 16 12:46:15 -0700 2005 -------
I meant that jobs don't run "globally." They run on hosts. Of course the job cannot be run "globally." It's not possible! Of course, I could be wrong about what the message is supposed to mean...

------- Additional comments from sgrell Tue Nov 22 02:22:10 -0700 2005 -------
I will look into it.
Stephan

------- Additional comments from sgrell Mon Dec 5 02:01:43 -0700 2005 -------
*** Issue 1817 has been marked as a duplicate of this issue. *** |
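Following up on roland's comment in ticket #298 about the "-w" switch: a minimal sketch of requesting verification at submission time. The job script path is the one from the ticket; the exact wording of qsub's rejection message is not reproduced here.

    # "-w e" asks qsub to reject a job whose requests can never be fulfilled,
    # instead of accepting it and leaving it pending in state "qw":
    qsub -w e -l arch=sol-sparc6 /tmp/dant/examples/jobs/sleeper.sh

With "-w e" the unfulfillable request above is refused at submission time rather than accumulating the scheduling-info messages shown in this ticket.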
Note: See TracQuery for help on using queries.