Opened 12 years ago

Closed 6 years ago

#294 closed defect (fixed)

IZ1882: mutually subordinating queues suspend each other simultaneously

Reported by:  bjfcoomer       Owned by:
Priority:     normal          Milestone:
Component:    sge             Version:   6.2u5
Severity:     minor           Keywords:  scheduling
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=1882]

        Issue #:       1882             Platform:         All       Reporter: bjfcoomer (bjfcoomer)
        Component:     gridengine       OS:               All
        Subcomponent:  scheduling       Version:          6.2u5     CC:       None defined
        Status:        REOPENED         Priority:         P3
        Resolution:                     Issue type:       DEFECT
                                        Target milestone: 6.0u7
        Assigned to:   sgrell (sgrell)
        QA Contact:    andreas
        URL:
      * Summary:       mutually subordinating queues suspend each other simultaneously
    Status whiteboard:
       Attachments:

     Issue 1882 blocks:
   Votes for issue 1882:


   Opened: Fri Nov 11 04:04:00 -0700 2005 
------------------------


The full issue is reproduced below. The basic problem is that jobs get
scheduled simultaneously to queues which are mutually subordinate to each
other, so the jobs suspend each other.
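
For context, classic subordination is configured through the subordinate_list
attribute of the queues (see queue_conf(5)). A minimal sketch of the kind of
setup that produces this, with values assumed to match the qstat output below:

    # Each queue lists the other as a subordinate, so occupying a slot in one
    # suspends the other on the same host (hypothetical listings):
    sccomp@test:~> qconf -sq serial.q | grep subordinate_list
    subordinate_list      parallel.q
    sccomp@test:~> qconf -sq parallel.q | grep subordinate_list
    subordinate_list      serial.q

If one scheduling run dispatches jobs to both queues on the same hosts, each
queue then triggers suspension of the other, as seen below.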


(1) A parallel job is running, and one is queued, and serial jobs are queued:

sccomp@test:~/EXAMPLE/serial> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp00.grid.cluster P     1/1       0.03     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp01.grid.cluster P     1/1       0.03     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp02.grid.cluster P     1/1       0.07     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
parallel.q@comp03.grid.cluster P     1/1       0.03     lx24-amd64
    525 500.51000 PMB-MPI1.s sccomp       r     11/03/2005 18:44:27     1
----------------------------------------------------------------------------
serial.q@comp00.grid.cluster   BI    0/2       0.03     lx24-amd64    S
----------------------------------------------------------------------------
serial.q@comp01.grid.cluster   BI    0/2       0.03     lx24-amd64    S
----------------------------------------------------------------------------
serial.q@comp02.grid.cluster   BI    0/2       0.07     lx24-amd64    S
----------------------------------------------------------------------------
serial.q@comp03.grid.cluster   BI    0/2       0.03     lx24-amd64    S

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS -
############################################################################
    526 1000.51000 PMB-MPI1.s sccomp      qw    11/03/2005 18:44:28     5
    527 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
    528 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:45     1
    529 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
    530 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:46     1
    531 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
    532 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:47     1
    533 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1
    534 0.51000 hello.sh   sccomp       qw    11/03/2005 18:44:48     1

(2) I qdel the running parallel job and then do qstat -f:

sccomp@test:~/EXAMPLE/serial> qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
master.q@test.grid.cluster     P     1/8       0.00     lx24-amd64
    526 1000.51000 PMB-MPI1.s sccomp      t     11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp00.grid.cluster P     1/1       0.28     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp01.grid.cluster P     1/1       0.28     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp02.grid.cluster P     1/1       0.31     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
parallel.q@comp03.grid.cluster P     1/1       0.28     lx24-amd64    S
    526 1000.51000 PMB-MPI1.s sccomp      St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp00.grid.cluster   BI    2/2       0.28     lx24-amd64    S
    527 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    533 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp01.grid.cluster   BI    2/2       0.28     lx24-amd64    S
    529 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    531 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp02.grid.cluster   BI    2/2       0.31     lx24-amd64    S
    530 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    534 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
----------------------------------------------------------------------------
serial.q@comp03.grid.cluster   BI    2/2       0.28     lx24-amd64    S
    528 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1
    532 0.51000 hello.sh   sccomp       St    11/03/2005 18:45:27     1

And here is the log from the scheduler monitor:
::::::::
525:1:RUNNING:1131043467:600:P:score:slots:5.000000
525:1:RUNNING:1131043467:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
525:1:RUNNING:1131043467:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:STARTING:1131043527:600:P:score:slots:5.000000
526:1:STARTING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
526:1:STARTING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
527:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
528:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
529:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
530:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
531:1:STARTING:1131043527:600:Q:serial.q@comp01.grid.cluster.ac.uk:slots:1.000000
532:1:STARTING:1131043527:600:Q:serial.q@comp03.grid.cluster.ac.uk:slots:1.000000
533:1:STARTING:1131043527:600:Q:serial.q@comp00.grid.cluster.ac.uk:slots:1.000000
534:1:STARTING:1131043527:600:Q:serial.q@comp02.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:SUSPENDED:1131043527:600:P:score:slots:5.000000
526:1:SUSPENDED:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
::::::::
526:1:RUNNING:1131043527:600:P:score:slots:5.000000
526:1:RUNNING:1131043527:600:Q:master.q@test.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp00.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp02.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp03.grid.cluster.ac.uk:slots:1.000000
526:1:RUNNING:1131043527:600:Q:parallel.q@comp01.grid.cluster.ac.uk:slots:1.000000
::::::::

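(For reference: this trace is the schedule file that the scheduler writes to
$SGE_ROOT/<cell>/common/schedule when MONITOR=1 is set in the params of the
scheduler configuration. Assuming the colon-separated layout visible above,
job:task:state:start_time:duration:level:object:resource:utilization, the
per-queue dispatch decisions can be pulled out with something like:

    # print job id, state and queue instance for queue-level (Q) records;
    # cell name "default" assumed
    awk -F: '$6 == "Q" { print $1, $3, $7 }' $SGE_ROOT/default/common/schedule

which for the run above shows jobs 526-534 all STARTING at timestamp
1131043527, i.e. dispatched into mutually subordinating queues within a
single scheduling interval.)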

   ------- Additional comments from sgrell Wed Nov 23 01:33:18 -0700 2005 -------
Started working on this issue.

Stephan

   ------- Additional comments from sgrell Wed Nov 23 09:03:03 -0700 2005 -------
Fixed in maintrunk and for u7.

Stephan

   ------- Additional comments from reuti Mon Aug 9 16:41:41 -0700 2010 -------
A parallel job can suspend itself when it gets slots in the subordinated and the superordinated queue at the same time:

reuti@pc15370:~> qsub -pe openmpi 8 -l h=pc15370 test_mpich.sh
Your job 1868 ("test_mpich.sh") has been submitted
reuti@pc15370:~> qstat -g t
job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
------------------------------------------------------------------------------------------------------------------
   1868 0.75500 test_mpich reuti        S     08/10/2010 01:31:11 all.q@pc15370 SLAVE
                                                                  all.q@pc15370 SLAVE
                                                                  all.q@pc15370 SLAVE
                                                                  all.q@pc15370 SLAVE
   1868 0.75500 test_mpich reuti        S     08/10/2010 01:31:11 extra.q@pc15370 MASTER
                                                                  extra.q@pc15370 SLAVE
                                                                  extra.q@pc15370 SLAVE
                                                                  extra.q@pc15370 SLAVE

extra.q is entered as a subordinated queue in all.q (classic subordination). There are some other similar issues, so I'm not
sure whether this is the most appropriate one, or 437 / 2397.
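
For completeness, a sketch of the subordination relationship described above
(classic subordination as in queue_conf(5); the exact listing is assumed, not
taken from the reporter's cluster):

    # all.q suspends extra.q on a host as soon as any all.q slot there is
    # in use (hypothetical output):
    reuti@pc15370:~> qconf -sq all.q | grep subordinate_list
    subordinate_list      extra.q

Once job 1868's slave tasks occupy slots in all.q on pc15370, extra.q on that
host is suspended, and with it the job's own MASTER task running there, so
the job suspends itself.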

Change History (1)

comment:1 Changed 6 years ago by dlove

  • Resolution set to fixed
  • Severity set to minor
  • Status changed from new to closed

SG-2005-11-23-0
