Opened 7 years ago
Last modified 7 years ago
#1451 new defect
PE setting "job_is_first_task" limits the number of `qrsh -inherit ...` calls even to slave systems
Reported by: | Reuti | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | minor | Keywords: | |
Cc: |
Description
The purpose of the PE setting "job_is_first_task" is twofold:
- it will influence the accounting under certain circumstances
- it will control how many (local) qrsh -inherit ... calls are allowed
In the latter case it should allow or disallow an additional local qrsh -inherit ... to the machine where the jobscript is already being executed. Depending on the programming logic, the process started locally by the jobscript might already be part of the parallel application, and hence only (n-1) local qrsh -inherit ... calls should be allowed if the setting is TRUE. This works as intended.
But this setting seems to influence the number of qrsh -inherit ... calls that can be made to slave systems too. On these, the full number of granted slots should always be allowed.
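The expected behavior described above can be sketched as a small model. This is only an illustration of the intended rule, not SGE code: the function name `expected_inherit_limits`, the host names, and the slot counts are made up for the example.

```python
def expected_inherit_limits(granted_slots, master_host, job_is_first_task):
    """Return how many `qrsh -inherit` calls each host should accept.

    granted_slots: dict mapping host name -> number of granted slots
    master_host: host on which the jobscript itself runs
    job_is_first_task: value of the PE setting (True/False)
    """
    limits = {}
    for host, slots in granted_slots.items():
        if host == master_host and job_is_first_task:
            # The jobscript already counts as one task on the master.
            limits[host] = slots - 1
        else:
            # Slave hosts should always get their full slot count.
            limits[host] = slots
    return limits


# Hypothetical allocation: 2 slots on the master, 3 on one slave.
granted = {"node01": 2, "node02": 3}
print(expected_inherit_limits(granted, "node01", True))
print(expected_inherit_limits(granted, "node01", False))
```

In this model the slave host keeps its limit of 3 regardless of the setting; the defect reported here is that the observed limit on the slave changes as well.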
In case of a redesign, this setting might be related to: https://arc.liv.ac.uk/trac/SGE/ticket/197
The reason behind this behavior is also related to https://arc.liv.ac.uk/trac/SGE/ticket/813
(machines is the unfolded $PE_HOSTFILE, as for MPICH1.) With only one queue in the system:
it’s working even with job_is_first_task TRUE. Also, the value of $NSLOTS on the slave is set to the number of slots from this queue. But with two queues in the system, it only works with job_is_first_task FALSE, as maybe the wrong queue is addressed (the one granted slot is already used up):
If the number of used slots is now increased, even job_is_first_task FALSE won’t help for the slave node and it crashes:
The 6th call can’t be made.
NB: current MPI implementations make only one qrsh -inherit ... call to each slave host and shouldn’t be hit by this.
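For reference, the "unfolded" machines file mentioned in the test setup can be produced roughly as follows. This is a minimal sketch that assumes each $PE_HOSTFILE line starts with the host name followed by the slot count; the sample host and queue names are invented, and the helper name `unfold_pe_hostfile` is hypothetical.

```python
def unfold_pe_hostfile(text):
    """Expand hostfile lines into one host entry per granted slot."""
    machines = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        # Repeat the host name once for every slot granted on it.
        machines.extend([host] * slots)
    return machines


# Invented sample content; real files are written by the scheduler.
sample = """node01 2 all.q@node01 UNDEFINED
node02 3 all.q@node02 UNDEFINED
"""
print("\n".join(unfold_pe_hostfile(sample)))
```

A launcher that issues one qrsh -inherit ... per line of such a file makes several calls to the same slave host, which is the pattern that runs into the limit described in this ticket.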