Opened 7 years ago

Last modified 7 years ago

#1438 new defect

Parallel jobs will not start outside the default queue while RQS are active

Reported by: Carsten
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u5
Severity: minor
Keywords: PE RQS
Cc:

Description

Using

a PE (no matter if MPI or SMP),
having slot limiting RQS active and
required resources not met by the default queue

will lead to:

.....
cannot run because it exceeds limit "lxb712.gsi.de/" in rule "max_slots_per_host/1"
cannot run in PE "smp" because it only offers 0 slots

In the default queue, or with deactivated RQS it works as expected.

Change History (2)

comment:1 follow-up: Changed 7 years ago by dlove

SGE <sge-bugs@…> writes:

Component: sge | Version: 6.2u5

Any idea if this is a problem with the current SGE?

Severity: minor | Keywords: PE RQS


Using

a PE (no matter if MPI or SMP),
having slot limiting RQS active and
required resources not met by the default queue

I don't know what "default queue" means. What is the difference between
the queues you have?

will lead to :

.....
cannot run because it exceeds limit "lxb712.gsi.de/" in rule
"max_slots_per_host/1"
cannot run in PE "smp" because it only offers 0 slots

In the default queue, or with deactivated RQS it works as expected.

It seems clear that the RQS is limiting the number of slots on that
host. Presumably different queues define a different slot count for the
host. (You have to be careful that parallel jobs don't get slots from
multiple queues on the same host, which can lead to over-subscription.)
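
To make the over-subscription point concrete, here is a toy model in Python. It is only a sketch and does not reproduce SGE's scheduler logic. The queue name `long.q`, the slot counts, and the assumed 8 cores are hypothetical; only the host name and `all.q` come from this ticket. It shows how two queues on one host can together offer more slots than the host has cores, and how a per-host slot limit that is already used up leaves 0 slots for a new request.

```python
from dataclasses import dataclass


@dataclass
class QueueInstance:
    queue: str  # cluster queue name (long.q is hypothetical)
    host: str   # execution host
    slots: int  # slots this queue is configured to offer on the host


def configured_slots_per_host(instances):
    """Sum the slots that all queue instances offer on each host."""
    totals = {}
    for qi in instances:
        totals[qi.host] = totals.get(qi.host, 0) + qi.slots
    return totals


def free_under_host_limit(limit, used):
    """Slots still grantable on a host under a per-host slot limit."""
    return max(0, limit - used)


# Assumed layout: two queues on the same host, each offering 8 slots.
instances = [
    QueueInstance("all.q", "lxb712.gsi.de", 8),
    QueueInstance("long.q", "lxb712.gsi.de", 8),
]
cores = 8  # assumed core count of the host

totals = configured_slots_per_host(instances)
oversubscribed = {host: n for host, n in totals.items() if n > cores}
print(oversubscribed)  # {'lxb712.gsi.de': 16}

# With a per-host limit equal to the core count, a host that already
# runs 8 slots has nothing left, even for a 2-slot parallel job.
print(free_under_host_limit(limit=cores, used=8))  # 0
print(free_under_host_limit(limit=cores, used=6))  # 2
```

Whether this is what actually happens in the reported case is not established here; the model only illustrates why the configured slot counts per queue and the RQS limit per host need to be looked at together.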

comment:2 in reply to: ↑ 1 Changed 7 years ago by Carsten

Replying to dlove:

Component: sge | Version: 6.2u5

Any idea if this is a problem with the current SGE?

I can't say offhand, as we only have 6.2u5 installed, but I've heard that this behaviour still exists in newer versions, and I could not find a bug report or fix for it so far.
But sure, we will move to a newer version.

Using
a PE (no matter if MPI or SMP),
having slot limiting RQS active and
required resources not met by the default queue

I don't know what "default queue" means. What is the difference between
the queues you have?

Our queues differ in runtime, memory and slot counts; the default queue is just all.q, which has been reconfigured.

cannot run because it exceeds limit "lxb712.gsi.de/" in rule
"max_slots_per_host/1"
cannot run in PE "smp" because it only offers 0 slots
In the default queue, or with deactivated RQS it works as expected.

It seems clear that the RQS is limiting the number of slots on that
host.

For this example I've submitted a job which only requests 2 slots to be on the safe side.

Presumably different queues define a different slot count for the
host. (You have to be careful that parallel jobs don't get slots from
multiple queues on the same host, which can lead to over-subscription.)

Ah, very interesting, this is something I wasn't aware of so far; thanks for the hint.

I'll think about it and do some tests. Is it possible to put this ticket on hold (or in some similar state)?

Best regards,

Carsten
