Opened 4 years ago

Last modified 4 years ago

#1555 new defect

Problems with job verification where exclusive flag extensively used

Reported by: markdixon
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.8
Severity: minor
Keywords:
Cc:

Description

We are seeing occasional problems with job verification (-w v / -w e) on a cluster where we make extensive use of the exclusive flag for most jobs.

To reproduce, set up a simple cluster with a single host running SoGE 8.1.8:

  • Exclusive flag enabled

$ qconf -sc | grep excl
exclusive excl BOOL EXCL YES YES 0 0
$ qconf -se compute1 | grep exclusive
complex_values h_vmem=64G,node_type=16core-64G,env=centos6,exclusive=true
$ qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
polaris1.q@… BIP 0/0/2 0.00 lx-amd64
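
The complex definition shown by `qconf -sc` above can also be checked from a script. This is only a minimal sketch: `is_exclusive_complex` is a hypothetical helper, not part of SGE, and it assumes the usual `qconf -sc` column order (name, shortcut, type, relop, requestable, consumable, default, urgency).

```shell
# Hypothetical helper (not an SGE command): reads `qconf -sc`-style lines
# on stdin and succeeds only if the named complex uses the EXCL relop and
# is marked consumable=YES.
# Assumed column order: name shortcut type relop requestable consumable default urgency
is_exclusive_complex() {
    # $1: name of the complex to look for, e.g. "exclusive"
    awk -v name="$1" '
        $1 == name { found = ($4 == "EXCL" && $6 == "YES") }
        END { exit found ? 0 : 1 }
    '
}
```

Possible usage on the cluster: `qconf -sc | is_exclusive_complex exclusive && echo "exclusive is an EXCL consumable"`.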

Example 1 (job verification, parallel jobs):

$ qsub -clear -pe smp 2 -l h_rt=1:0:0 sleep_time.sh 600
Your job 2114545 ("sleep_time.sh") has been submitted

$ qstat -f
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
polaris1.q@… BIP 0/2/2 0.00 lx-amd64
2114545 0.50050 sleep_time test1 r 08/27/2015 13:45:07 2

$ qsub -clear -w v -pe smp 2 -l h_rt=1:0:0 sleep_time.sh 600
verification: found suitable queue(s)

$ qsub -clear -w v -pe smp 2 -l h_rt=1:0:0,exclusive sleep_time.sh 600
Unable to run job: Job 2114547 (-l exclusive=TRUE,h_rt=3600) cannot run in queue "compute1.quack.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Job 2114547 cannot run in PE "smp" because it only offers 0 slots
verification: no suitable queues

Example 2 (job verification, serial jobs):

$ qsub -clear -pe smp 2 -l h_rt=1:0:0,exclusive sleep_time.sh 600
Your job 2114550 ("sleep_time.sh") has been submitted

$ qsub -clear -w v -pe smp 2 -l h_rt=1:0:0 sleep_time.sh 600
Unable to run job: Job 2114551 (-l h_rt=3600) cannot run in queue "compute1.quack.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Job 2114551 cannot run in PE "smp" because it only offers 0 slots
verification: no suitable queues
Exiting.

These jobs should not be rejected.
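
To catch these rejections in a wrapper script, the verification result can be read from the two messages shown in the examples above. A minimal sketch follows. `verify_result` is a hypothetical helper, and it assumes the qsub output (including any error text) is piped to it on stdin.

```shell
# Hypothetical helper: classify the output of `qsub -w v ...` read from stdin.
# It only matches the two verification messages seen in this ticket.
verify_result() {
    _out=$(cat)
    case "$_out" in
        *"verification: found suitable queue(s)"*) echo pass ;;
        *"verification: no suitable queues"*)      echo fail ;;
        *)                                          echo unknown ;;
    esac
}
```

Possible usage: `qsub -clear -w v -pe smp 2 -l h_rt=1:0:0 sleep_time.sh 600 2>&1 | verify_result`.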

Change History (1)

comment:1 Changed 4 years ago by dlove

SGE <sge-bugs@…> writes:

> We are seeing occasional problems with job verification (-w v / -w e) on a
> cluster where we make extensive use of the exclusive flag for most jobs.

[I don't think it's specific to exclusive. It's annoyed me for other
sorts of jobs with a default -w w but not exclusive, and it makes the -w w
not very useful.]

I've forgotten how it works, and quite what the difference is between
the qalter version and the qsub one, but I guess it's related to
submit(1)'s

    It should also be noted that load values are not taken into account
    with the verification since they are assumed to be too volatile. To
    cause -w e verification to be passed at submission time, it is
    possible to specify non-volatile values (non-consumables) or maximum
    values (consumables) in complex_values.
