Opened 4 years ago

#1563 new defect

Odd behaviour with exclusive resource and non-exclusive jobs

Reported by: markdixon
Owned by:
Priority: normal
Milestone:
Component: sge
Version: 8.1.1
Severity: minor
Keywords:
Cc:

Description

Seen something distinctly odd on an 8.1.1 system that I don't understand.

Saw some serial jobs with resource reservations not being able to start, yet the relevant queue instances were idle. The resource reservations kept being recreated every scheduling interval on the same queue instance, so the rr just kept drifting into the future.

The jobs were right at the top of the pending qstat list, so had the highest priority. Very few other jobs from the user were running compared to everyone else, so they remained at the top of the list.

qaltering one of the jobs so that rr was turned off got just that one to start running on the queue instance it had the reservation on! The jobs I didn't qalter, with a reservation on the same node, still did not start.

Eventually a couple of other serial jobs started running, but from further down the priority list and from a different user.

This was right after some maintenance work yesterday, using an AR to block the nodes out. The qrsub was:

qrsub -a 201510210830 -e 201510221200 -u @outage_users -N upgrade_work -l h_rt=48:0:0,h_vmem=2G,exclusive -q '*@g8s*' -pe ib 704
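For anyone reading the scheduler log quoted further down in this report, it may help to place its timestamps relative to the AR window requested above. The following is only a sketch: the helper name `in_ar_window` is made up for illustration, it assumes the `-a`/`-e` values are in CCYYMMDDhhmm form as written in the qrsub call, and it says nothing about whether the AR still existed at that time (the report does not state when it was removed).

```python
from datetime import datetime, timedelta

AR_FORMAT = "%Y%m%d%H%M"              # CCYYMMDDhhmm, as in the qrsub call above
LOG_FORMAT = "%a %b %d %H:%M:%S %Y"   # timestamp layout seen in schedd_runlog


def ar_window(ar_start, ar_end):
    """Return the (start, end) datetimes of the requested AR window."""
    return (datetime.strptime(ar_start, AR_FORMAT),
            datetime.strptime(ar_end, AR_FORMAT))


def in_ar_window(log_stamp, ar_start, ar_end):
    """Hypothetical helper: does a schedd_runlog timestamp fall inside the window?"""
    start, end = ar_window(ar_start, ar_end)
    when = datetime.strptime(log_stamp, LOG_FORMAT)
    return start <= when < end


start, end = ar_window("201510210830", "201510221200")
print(end - start)  # length of the window requested with -a / -e
print(in_ar_window("Wed Oct 21 15:43:43 2015", "201510210830", "201510221200"))
```

The month and weekday names are parsed with the default C locale; a different locale setting would need a different format string.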

grep'ing for host g8s0n1, saw the following lines in the schedd_runlog file generated by a qconf -tsm:

Wed Oct 21 15:43:43 2015|Job 2846584 (-l env=centos6,h_rt=14400,h_vmem=2G,node_type=16core-64G) cannot run at host "g8s0n1.polaris.leeds.ac.uk" because for default request exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:43 2015|Job 2846155 (-l env=centos6,h_rt=43200,h_vmem=20G,node_type=16core-64G) cannot run at host "g8s0n1.polaris.leeds.ac.uk" because for default request exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2846133 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=3G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2846969 (-l env=centos6,h_rt=21600,h_vmem=1G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2846192 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=4G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2837314 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=4G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2841491 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=4G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2846772 (-l env=centos6,h_rt=86400,h_vmem=6G,node_type=16core-64G) cannot run in queue "polaris1.q@…" because it offers only qc:slots=0.000000
Wed Oct 21 15:43:45 2015|Job 2845099 (-l env=centos6,exclusive=true,h_rt=86400,h_vmem=4G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2823366 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=4G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2840589 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=16G,node_type=16core-256G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use
Wed Oct 21 15:43:45 2015|Job 2840589 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=16G,node_type=16core-256G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in useit offers only hf:node_type=16core-64G

This was despite the fact that no job was running on the host at all!
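To see at a glance how many of the rejected jobs actually asked for the exclusive resource, the log lines above can be summarised with a short script. This is a rough sketch only: the regular expression is inferred from the lines quoted in this ticket rather than from any documented schedd_runlog format, and the names `parse_line` and `summarise` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the lines quoted in this ticket (an assumption,
# not a documented format).
LINE_RE = re.compile(
    r'^(?P<when>[^|]+)\|Job (?P<job>\d+) \(-l (?P<requests>[^)]*)\) '
    r'cannot run (?:at host|in queue) "(?P<target>[^"]+)" '
    r'because (?P<reason>.*)$'
)


def parse_line(line):
    """Split one schedd_runlog line into fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    requests = {}
    for item in m.group("requests").split(","):
        name, _, value = item.partition("=")
        requests[name] = value
    return {
        "job": int(m.group("job")),
        "target": m.group("target"),
        "reason": m.group("reason"),
        "requests": requests,
    }


def summarise(lines):
    """Count jobs rejected for the exclusive resource, split by whether they requested it."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is None or "exclusive resource" not in rec["reason"]:
            continue
        if rec["requests"].get("exclusive") == "true":
            counts["requested exclusive"] += 1
        else:
            counts["did not request exclusive"] += 1
    return counts


# Three lines copied from the log excerpt in this ticket.
SAMPLE = [
    'Wed Oct 21 15:43:43 2015|Job 2846584 (-l env=centos6,h_rt=14400,h_vmem=2G,node_type=16core-64G) cannot run at host "g8s0n1.polaris.leeds.ac.uk" because for default request exclusive resource (exclusive) is already in use',
    'Wed Oct 21 15:43:45 2015|Job 2846133 (-l env=centos6,exclusive=true,h_rt=172800,h_vmem=3G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use',
    'Wed Oct 21 15:43:45 2015|Job 2846969 (-l env=centos6,h_rt=21600,h_vmem=1G,node_type=16core-64G) cannot run in queue "g8s0n1.polaris.leeds.ac.uk" because exclusive resource (exclusive) is already in use',
]

print(dict(summarise(SAMPLE)))
```

In practice the list of lines would come from the schedd_runlog file itself, filtered for the host of interest.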

We make extensive use of the exclusive flag for parallel jobs. We also have a lot of serial jobs.

  • I tried disabling/enabling the queue instances - no joy.
  • I tried restarting the execds - no joy.
  • I tried deleting the queue instance and exec host definition and recreating them - that fixed it for that instance.

Just removing and adding the queue instance might fix it. Or even modifying the queue. Will try that if I notice it again.

I tried copying the cell's files to a test copy of gridengine and running it in SIMULATE_EXECDS mode, but I couldn't reproduce it.

This would be easier to spot if the scheduler logged to the qmaster's messages file when a job cannot be started in its resource reservation, although I don't know if it would help debug it.

Mark
