Opened 9 years ago
Last modified 9 years ago
#1438 new defect
Parallel jobs will not start outside the default queue while RQS are active
Reported by: | Carsten | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | |
Component: | sge | Version: | 6.2u5 |
Severity: | minor | Keywords: | PE RQS |
Cc: |
Description
Using
a PE (no matter if MPI or SMP),
having slot limiting RQS active and
required resources not met by the default queue
will lead to:
.....
cannot run because it exceeds limit "lxb712.gsi.de/" in rule "max_slots_per_host/1"
cannot run in PE "smp" because it only offers 0 slots
In the default queue, or with deactivated RQS it works as expected.
Change History (2)
comment:1 follow-up: ↓ 2 Changed 9 years ago by dlove
comment:2 in reply to: ↑ 1 Changed 9 years ago by Carsten
Replying to dlove:
Any idea if this is a problem with the current SGE?
I can't say offhand, as we only have 6.2u5 installed, but I've heard that this behaviour still exists in newer versions, and I have not found a bug report or fix for it so far.
But sure, we will move to a newer version.
Using
a PE (no matter if MPI or SMP),
having slot limiting RQS active and
required resources not met by the default queue
I don't know what "default queue" means. What is the difference between
the queues you have?
Our queues differ in runtime, memory and slot counts; the default queue is just all.q, which has been reconfigured.
cannot run because it exceeds limit "lxb712.gsi.de/" in rule
"max_slots_per_host/1"
cannot run in PE "smp" because it only offers 0 slots
In the default queue, or with deactivated RQS it works as expected.
It seems clear that the RQS is limiting the number of slots on that
host.
For this example I've submitted a job which only requests 2 slots to be on the safe side.
Presumably different queues define a different slot count for the
host. (You have to be careful that parallel jobs don't get slots from
multiple queues on the same host, which can lead to over-subscription.)
Ah, very interesting. This is something I wasn't aware of until now; thanks for the hint.
I'll think about it and do some tests. Is it possible to put this ticket on hold (or into some similar state)?
Best regards,
Carsten
SGE <sge-bugs@…> writes:
Any idea if this is a problem with the current SGE?
I don't know what "default queue" means. What is the difference between
the queues you have?
It seems clear that the RQS is limiting the number of slots on that
host. Presumably different queues define a different slot count for the
host. (You have to be careful that parallel jobs don't get slots from
multiple queues on the same host, which can lead to over-subscription.)
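The over-subscription risk mentioned above can be illustrated with a small sketch. It is not SGE code and does not query a real cluster: the host names, the queue long.q, and all slot and core counts are hypothetical, and the function only sums the slots that each queue offers per host and compares the total with the host's core count.

```python
from collections import defaultdict

# Hypothetical data (assumption, not taken from the ticket):
# slots each queue offers per host, and the number of cores per host.
queue_slots = {
    "all.q": {"node01": 8, "node02": 8},
    "long.q": {"node01": 8, "node02": 4},
}
host_cores = {"node01": 8, "node02": 16}


def oversubscribed_hosts(queue_slots, host_cores):
    """Return {host: (total_slots, cores)} for hosts whose queues
    together offer more slots than the host has cores."""
    totals = defaultdict(int)
    for per_host in queue_slots.values():
        for host, slots in per_host.items():
            totals[host] += slots

    result = {}
    for host, total in totals.items():
        cores = host_cores.get(host, 0)  # unknown hosts count as 0 cores
        if total > cores:
            result[host] = (total, cores)
    return result


if __name__ == "__main__":
    for host, (total, cores) in oversubscribed_hosts(queue_slots, host_cores).items():
        print(f"{host}: {total} slots offered across queues, {cores} cores")
```

With these made-up numbers only node01 is flagged, because its two queues together offer 16 slots on 8 cores. A per-host slot limit in an RQS is one way to cap such a total, which is presumably what the rule "max_slots_per_host" in the report is for.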