Opened 11 years ago
Last modified 10 years ago
#815 new defect
IZ3278: PE entry "job_is_first_task" lowers number of tasks on slave nodes
Reported by: | reuti | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | | Keywords: | kernel |
Cc: |
Description
[Imported from gridengine issuezilla http://gridengine.sunsource.net/issues/show_bug.cgi?id=3278]
Issue #: 3278
Platform: All
Reporter: reuti (reuti)
Component: gridengine
OS: All
Subcomponent: kernel
Version: 6.2u5
CC: None defined
Status: NEW
Priority: P3
Resolution:
Issue type: DEFECT
Target milestone: ---
Assigned to: andreas (andreas)
QA Contact: andreas
URL:
* Summary: PE entry "job_is_first_task" lowers number of tasks on slave nodes
Status whiteboard:
Attachments:
Issue 3278 blocks:
Votes for issue 3278:

Opened: Tue Aug 10 13:02:00 -0700 2010
------------------------

When a parallel job gets its slots from two (possibly more) queues, the PE entry "job_is_first_task" erroneously limits the number of processes allowed on slave nodes. It should only reduce the number of slots on the master node of the parallel job, i.e. decide whether one more or one fewer local `qrsh` is allowed there.

Observed behavior:

    $ qsub -pe openmpi 4 job.sh

will get a PE_HOSTFILE:

    pc15381 1 all.q@pc15381 UNDEFINED
    pc15370 1 all.q@pc15370 UNDEFINED
    pc15381 1 extra.q@pc15381 UNDEFINED
    pc15370 1 extra.q@pc15370 UNDEFINED

i.e. the job script is running on pc15381. It should be able to make two `qrsh -inherit ...` calls to pc15370. But instead the output is:

    error: executing task of job 1932 failed: execution daemon on host "pc15370" didn't accept task

Changing "job_is_first_task" in the PE to "false" solves the issue. But as a slave node is targeted, this setting shouldn't have any influence.
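To make the expected accounting concrete, here is a minimal Python sketch that sums the slots per host from the PE_HOSTFILE quoted above and derives how many `qrsh -inherit` tasks each host should accept according to this report. It is not Grid Engine code: the function names `slots_per_host` and `expected_task_limits` are hypothetical, and the rule that "job_is_first_task" subtracts one task only on the master host is the behavior the reporter expects, not the behavior observed in 6.2u5.

```python
# PE_HOSTFILE content as quoted in this report: host, slots, queue instance, processor range.
PE_HOSTFILE = """\
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 1 all.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
pc15370 1 extra.q@pc15370 UNDEFINED
"""


def slots_per_host(text):
    """Sum the slot counts of all queue instances granted on each host."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        totals[host] = totals.get(host, 0) + slots
    return totals


def expected_task_limits(text, master, job_is_first_task):
    """Hypothetical helper: tasks each host should accept, per the report.

    Assumption: with job_is_first_task set, the job script itself uses
    one slot on the master host only; slave hosts keep all their slots.
    """
    limits = slots_per_host(text)
    if job_is_first_task:
        limits[master] -= 1
    return limits


limits = expected_task_limits(PE_HOSTFILE, "pc15381", True)
print(limits)  # {'pc15381': 1, 'pc15370': 2}
```

Under this reading, pc15370 holds two slots across all.q and extra.q, so two `qrsh -inherit` calls to it should succeed regardless of the "job_is_first_task" setting.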