Opened 11 years ago

Last modified 11 years ago

#815 new defect

IZ3278: PE entry "job_is_first_task" lowers number of tasks on slave nodes

Reported by: reuti
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u5
Severity:
Keywords: kernel


[Imported from gridengine issuezilla]

   Issue #:           3278
   Platform:          All
   Reporter:          reuti (reuti)
   Component:         gridengine
   OS:                All
   Subcomponent:      kernel
   Version:           6.2u5
   CC:                None defined
   Status:            NEW
   Priority:          P3
   Resolution:
   Issue type:        DEFECT
   Target milestone:  ---
   Assigned to:       andreas (andreas)
   QA Contact:        andreas
   Summary:           PE entry "job_is_first_task" lowers number of tasks on slave nodes
   Status whiteboard:

     Issue 3278 blocks:
   Votes for issue 3278:

   Opened: Tue Aug 10 13:02:00 -0700 2010 

When a parallel job is granted slots from two (or possibly more) queues, the PE entry "job_is_first_task" erroneously limits
the number of processes allowed on slave nodes. It should only reduce the number of slots on the master node of the
parallel job, i.e. determine whether one more or one fewer local `qrsh` is allowed there.
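
The accounting expected here can be sketched in a few lines of Python. This is only a model of the behavior described in this report, not Grid Engine code; the function name `allowed_inherit_tasks`, its arguments and the host names "master" and "slave" are hypothetical.

```python
def allowed_inherit_tasks(slots_per_host, master_host, job_is_first_task):
    """Return how many `qrsh -inherit` tasks each host should accept.

    Models the expectation stated in this report: when job_is_first_task
    is set, the job script itself occupies one slot on the master host
    only, and slave hosts keep all of their granted slots.
    """
    allowed = dict(slots_per_host)  # copy so the input is not modified
    if job_is_first_task:
        # Only the master host loses a slot to the job script.
        allowed[master_host] = max(allowed[master_host] - 1, 0)
    return allowed


if __name__ == "__main__":
    granted = {"master": 2, "slave": 2}
    print(allowed_inherit_tasks(granted, "master", True))
    print(allowed_inherit_tasks(granted, "master", False))
```

Under this model the slave host accepts the same number of tasks regardless of the setting, which is the behavior the report argues for.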

Observed behavior:

$ qsub -pe openmpi 4

will get a PE_HOSTFILE:

pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED

i.e. the job script is running on pc15381. It should be able to make two `qrsh -inherit ...` calls to pc15370. But instead the output is:

error: executing task of job 1932 failed: execution daemon on host "pc15370" didn't accept task
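
To make the expectation concrete: summing the slot column of the PE_HOSTFILE above per host gives two slots on each machine, which is why two `qrsh -inherit` calls to pc15370 are expected to be accepted. A minimal Python sketch, assuming the four-column layout shown above (host, slots, queue instance, processor range); the helper name `slots_per_host` is made up for illustration.

```python
from collections import Counter

# PE_HOSTFILE contents as quoted in this report.
PE_HOSTFILE = """\
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
"""


def slots_per_host(hostfile_text):
    """Sum the granted slots per host across all queue instances."""
    totals = Counter()
    for line in hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        totals[host] += slots
    return dict(totals)


if __name__ == "__main__":
    for host, slots in slots_per_host(PE_HOSTFILE).items():
        print(host, slots)
```

Each host appears once per queue instance, so the per-host total is what matters when counting how many tasks a host should accept.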

Changing "job_is_first_task" in the PE to "false" solves the issue. But as a slave node is targeted, this setting shouldn't have any influence.
