[GE users] Incorrect queue suspension

Chris Rudge chris.rudge at astro.le.ac.uk
Tue Nov 6 14:07:34 GMT 2007


I've seen on a couple of occasions an issue with the suspension of
subordinate queues. I have a default queue for serial and OpenMP jobs
and a separate MPI queue. They are subordinates of each other, with
suspension occurring as soon as one slot is used in the other queue.

The default queue has
	subordinate_list      mpi.q=1
and the MPI queue has
	subordinate_list      default.q=1

This works correctly almost all of the time. However, I've seen occasions
where the queue instances for both queues are suspended on a node. It
appears that SGE has attempted to launch jobs in both queues
simultaneously, which results in both queue instances then being put into
the suspended state because each is a subordinate of the other. This
occurred earlier today.

# qstat -qs S
job-ID  prior   name       user         state submit/start at     queue            slots ja-task-ID 
 869047 0.57323 RunHunter. rgw          S     11/06/2007 13:21:36 default.q@comp60     4
 869048 0.54696 RunHunter. rgw          S     11/06/2007 13:21:36 default.q@comp63     4
 869057 0.72422 scatter.sh sn85         S     11/06/2007 13:21:36 mpi.q@comp63         8

Further investigation shows that, as the two default.q jobs had a
slightly higher priority, they must have been launched before the mpi.q
job, so their processes were suspended while the mpi.q job's processes
were running normally.
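
To spot this state without reading the listing by eye, something like the
following Python sketch might help. It is only an illustration, not part of
SGE: the function name hosts_with_both_suspended and the SAMPLE text are
hypothetical. It assumes the input is the text printed by "qstat -qs S" and
that queue instances appear in the queue@host form that qstat itself prints.

```python
from collections import defaultdict


def hosts_with_both_suspended(qstat_text, queues=("default.q", "mpi.q")):
    """Return the sorted hosts that list a job in every queue in `queues`.

    Assumption: `qstat_text` comes from `qstat -qs S`, so every job shown
    sits in a suspended queue instance.
    """
    seen = defaultdict(set)  # host -> queue names seen on that host
    for line in qstat_text.splitlines():
        for field in line.split():
            # Queue instances are assumed to look like "queue@host".
            if "@" in field:
                queue, host = field.split("@", 1)
                if queue in queues:
                    seen[host].add(queue)
    wanted = set(queues)
    return sorted(host for host, found in seen.items() if found == wanted)


# Hypothetical sample, modelled on the listing in this message.
SAMPLE = """\
 869047 0.57323 RunHunter. rgw  S 11/06/2007 13:21:36 default.q@comp60 4
 869048 0.54696 RunHunter. rgw  S 11/06/2007 13:21:36 default.q@comp63 4
 869057 0.72422 scatter.sh sn85 S 11/06/2007 13:21:36 mpi.q@comp63     8
"""

print(hosts_with_both_suspended(SAMPLE))
```

With the sample above only comp63 is reported, since comp60 has a suspended
job in default.q but none in mpi.q.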

Is this a known issue? Is it fixed in 6.1 (I am currently using 6.0u9), or
is there a workaround?


Dr Chris Rudge
chris.rudge at astro.le.ac.uk

UKAFF Facility Manager & Dept. Research Computing Manager
Dept of Physics & Astronomy
University of Leicester

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420
