Ticket #813 (new enhancement)

Opened 4 years ago

Last modified 3 years ago

IZ3276: qrsh -inherit should allow -q to select a queue out of the granted ones

Reported by: reuti
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u5
Severity:
Keywords: clients
Cc:

Description

[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3276]

Issue #: 3276
Platform: All
Reporter: reuti (reuti)
Component: gridengine
OS: All
Subcomponent: clients
Version: 6.2u5
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: ENHANCEMENT
Target milestone: ---
Assigned to: roland (roland)
QA Contact: roland
URL:
Summary: qrsh -inherit should allow -q to select a queue out of the granted ones
Status whiteboard:
Attachments:

Issue 3276 blocks:
Votes for issue 3276:

   Opened: Tue Aug 10 04:07:00 -0700 2010 
------------------------


Although it is often desirable for a parallel job to get all its slots from a single queue, it is valid to attach the same PE to different
queues and get slots from a mixture of queues. When the job gets slots from a mixture of queues, the application has no means to direct
`qrsh -inherit ...` to the correct queue; SGE will select one of the granted queues on its own. When the parallel application then makes
e.g. 2 `qrsh -inherit ...` calls to the same machine, intending to fork e.g. 2 processes in each of the two granted slots to get 4 in total,
all processes may end up in the same queue with the same $TMPDIR.

$ qsub -pe openmpi 5 -masterq all.q@pc15370 -q "*@pc15370" ./mymy.sh
Your job 1900 ("mymy.sh") has been submitted
$ cat mymy.sh.o1900
pc15370 1 all.q@pc15370 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15370 2 extra1.q@pc15370 UNDEFINED
TMPDIR=/tmp/1900.1.extra1.q ==> here it might fork 2 processes
TMPDIR=/tmp/1900.1.extra1.q ==> here it might fork 2 processes
TMPDIR=/tmp/1900.1.extra.q
TMPDIR=/tmp/1900.1.extra.q
TMPDIR=/tmp/1900.1.all.q

With the script mymy.sh:

#!/bin/sh
cat $PE_HOSTFILE
. /usr/sge/default/common/settings.sh
qrsh -inherit -V pc15370 ./dummy.sh &
qrsh -inherit -V pc15370 ./dummy.sh &
qrsh -inherit -V pc15370 ./dummy.sh &
qrsh -inherit -V pc15370 ./dummy.sh &
wait
./dummy.sh

and dummy.sh:

#!/bin/sh
env | grep TMPDIR
sleep 30


When the application doesn't intend to fork, but starts exactly one process with each `qrsh -inherit ...`, all seems to be fine and SGE
takes care to distribute the tasks across the granted queues, although it can't be predicted which of the `qrsh -inherit ...` calls will
end up in which of the granted queues.
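
A job script can at least detect that it was granted slots from a mixture of queues on one host before it starts issuing
`qrsh -inherit ...` calls. The following is a minimal sketch, not part of SGE: the function name mixed_queue_hosts is made up for
illustration, and it assumes the $PE_HOSTFILE layout printed above (host, slots, queue instance, processor range, one line per
queue instance).

```shell
#!/bin/sh
# Sketch only: list hosts where the job holds slots in more than one queue.
# Assumes one $PE_HOSTFILE line per granted queue instance, as in the output above.

mixed_queue_hosts() {
    # $1: path to a file in $PE_HOSTFILE format
    awk '
        { queues[$1]++; slots[$1] += $2 }   # count queue instances and slots per host
        END {
            for (h in queues)
                if (queues[h] > 1)
                    printf "%s %d queues %d slots\n", h, queues[h], slots[h]
        }
    ' "$1" | sort
}

# Inside a job script; does nothing when SGE has not set the variable.
if [ -n "${PE_HOSTFILE:-}" ]; then
    mixed_queue_hosts "$PE_HOSTFILE"
fi
```

For the hostfile of job 1900 above this prints "pc15370 3 queues 5 slots". It only reports the situation; it cannot steer
`qrsh -inherit ...` to a particular queue.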

   ------- Additional comments from reuti Tue Aug 10 04:08:13 -0700 2010 -------
Changed from Defect to Enhancement.

   ------- Additional comments from reuti Tue Aug 10 13:24:21 -0700 2010 -------
When getting slots from 2 nodes, the last paragraph above does not hold and it again doesn't work: all tasks may end up in one queue:

pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
TMPDIR=/tmp/1934.1.all.q
TMPDIR=/tmp/1934.1.all.q
TMPDIR=/tmp/1934.1.all.q
TMPDIR=/tmp/1934.1.all.q

/tmp/1934.1.extra.q should show up two times. As the application can't control this (-q is not allowed for `qrsh -inherit ...`),
SGE should handle it in a proper way.
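
To check a run for this mismatch, the granted slots per queue can be compared with the queues the tasks actually landed in. This is a
rough sketch under assumptions: both function names are hypothetical, and the TMPDIR pattern /tmp/<job>.<task>.<queue> is simply read
off the output above and may differ on other installations.

```shell
#!/bin/sh
# Sketch only: compare granted slots per queue with where the tasks really ran.
# Function names are hypothetical; the TMPDIR naming is assumed from the output above.

# stdin: $PE_HOSTFILE content (host, slots, queue@host, processor range)
slots_per_queue() {
    awk '{ split($3, q, "@"); s[q[1]] += $2 }      # sum slots per cluster queue name
         END { for (name in s) print name, s[name] }' | sort
}

# stdin: lines such as TMPDIR=/tmp/1934.1.all.q (trailing annotations are ignored)
tasks_per_queue() {
    sed -n 's|^TMPDIR=.*/[0-9][0-9]*\.[0-9][0-9]*\.\([^ ]*\).*$|\1|p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Fed with the hostfile and the task output of job 1934, slots_per_queue prints "all.q 2" and "extra.q 2", while tasks_per_queue prints
only "all.q 4".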