Opened 7 years ago

Last modified 7 years ago

#1429 new defect

Large, short jobs do not run.

Reported by: wish
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 6.2u3
Severity: minor
Keywords:
Cc:

Description

Using SGE 6.2u3.
Short jobs (short as requested via h_rt) that request a lot of slots do
not obtain a reservation or run, even on an empty cluster when they are
the highest priority job.
The minimum requested time at which a job becomes schedulable appears to
be proportional to the number of slots requested:

1 slot (PE): works at 1 second requested

Slots  Buggy at h_rt (s)  OK at h_rt (s)
2      3                  4
3      5                  6
4      7                  8
6      11                 12
8      14                 15
12     22                 23
16     29                 30
24     44                 45
32     59                 60
48     89                 90
64     119                120
96     178                179
128    238                239
192    357                358
256    476                477
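The proportionality claimed above can be checked directly from the reported figures. The following is a minimal sketch, not anything taken from SGE itself: the names OBSERVATIONS and ratio_bounds are made up for illustration, and the numbers are copied from the report. The 1-slot case is left out because the report gives no failing time for it and it does not follow the same ratio.

```python
# (slots, longest h_rt seen to fail, shortest h_rt seen to work),
# copied from the figures in this report. Names here are illustrative only.
OBSERVATIONS = [
    (2, 3, 4), (3, 5, 6), (4, 7, 8), (6, 11, 12), (8, 14, 15),
    (12, 22, 23), (16, 29, 30), (24, 44, 45), (32, 59, 60), (48, 89, 90),
    (64, 119, 120), (96, 178, 179), (128, 238, 239), (192, 357, 358),
    (256, 476, 477),
]


def ratio_bounds(observations):
    """Return (largest failing h_rt/slots, smallest working h_rt/slots)."""
    highest_failing = max(bad / slots for slots, bad, _ in observations)
    lowest_working = min(ok / slots for slots, _, ok in observations)
    return highest_failing, lowest_working


if __name__ == "__main__":
    failing, working = ratio_bounds(OBSERVATIONS)
    print(f"every failing job had h_rt/slots <= {failing:.4f}")
    print(f"every working job had h_rt/slots >= {working:.4f}")
```

For these figures the two bounds come out at about 1.859 and 1.863 seconds per slot, and the failing bound is below the working bound. So a single constant in that narrow range separates every buggy case from every OK case, which is consistent with the threshold being proportional to the slot count. The sketch says nothing about why the scheduler behaves this way.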

A job with the problem can be detected by running qalter -w v on it.
The report will claim it cannot run because each PE only offers
2147483648 slots.

Job 785731 cannot run in PE "qlc-H" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-X" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-A" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-W" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-F" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-P" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-O" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-1" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-J" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-D" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-T" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-K" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-M" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-L" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-I" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-N" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-B" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-E" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-C" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-G" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-KLB" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc-2" because it only offers 2147483648 slots

The above output does not appear with qsub -w v, only with qalter -w v.
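Since the symptom is a recognisable message, affected jobs could be picked out by scanning saved qalter -w v output for the 2147483648 (2^31) slot count. This is a rough sketch under assumptions: the regular expression is modelled only on the lines quoted in this ticket, the names PE_LINE, BOGUS_SLOTS and bogus_pe_reports are hypothetical, and the third sample line (PE "other", 0 slots) is invented for illustration, following the wording in comment:1.

```python
import re

# Message format modelled on the qalter -w v lines quoted in this ticket;
# other SGE versions may word it differently.
PE_LINE = re.compile(
    r'Job (\d+) cannot run in PE "([^"]+)" because it only offers (\d+) slots'
)

BOGUS_SLOTS = 2**31  # 2147483648, the value reported for every PE above


def bogus_pe_reports(text):
    """Return (job_id, pe_name) pairs whose offered slot count is 2**31."""
    return [
        (int(m.group(1)), m.group(2))
        for m in PE_LINE.finditer(text)
        if int(m.group(3)) == BOGUS_SLOTS
    ]


# First two lines are from the report; the last one is a made-up example
# of a PE the user cannot access.
SAMPLE = '''\
Job 785731 cannot run in PE "qlc-H" because it only offers 2147483648 slots
Job 785731 cannot run in PE "qlc" because it only offers 2147483648 slots
Job 785731 cannot run in PE "other" because it only offers 0 slots
'''

if __name__ == "__main__":
    for job_id, pe in bogus_pe_reports(SAMPLE):
        print(f"job {job_id}: PE {pe} reports {BOGUS_SLOTS} slots")
```

In practice the text would come from the captured output of qalter -w v for a pending job. Lines reporting 0 slots are ignored on purpose, since those indicate an access restriction rather than this problem.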

Change History (5)

comment:1 Changed 7 years ago by wish

A little more info. qalter -w v reports "only offers 0 slots" for PEs to which the user does not have access. Using qalter to increase the runtime over the threshold does not cause the jobs to become schedulable.

comment:2 Changed 7 years ago by dlove

  • Version changed from 8.1.0 to 6.2u3

I can't reproduce that with the current version in our configuration.
Anything unusual about yours (apart from an impressive number of
qlogical PEs)?

comment:3 Changed 7 years ago by admin

I don't think we have anything too odd anymore. The only thing that I
think might be unusual is our use of complexes with relop EXCL on
queues as well as hosts. Plus we specify a value for slots in each
host's complex_values. This is to implement our policy that multi-node
jobs have exclusive access to nodes while single-node jobs do not, in an
environment with more than one type of node. If it doesn't affect recent
versions, that's great; I can work around it till we upgrade.

comment:4 Changed 7 years ago by dlove

William Hay <w.hay@…> writes:

I don't think we have anything too odd anymore. The only thing that I
think might be unusual is our use of complexes with relop EXCL on
queues as well as hosts.

I'd blame EXCL, if anything. It was likely in play when I found #793,
but I've since removed it. I don't know whether qconf -tsm produces any
different information from qstat/qalter, but it might be worth checking.

comment:5 Changed 7 years ago by dlove

William Hay <w.hay@…> writes:

If it doesn't affect recent versions that's great I can work around it
till we upgrade.

I should have said it's not clear it's fixed.
