[GE users] One queue in a subordinated cluster queue not suspending

Tim Cera tcera at sjrwmd.com
Thu Dec 6 03:17:31 GMT 2007


Hello,

I have a subordinated cluster queue (boinc) and a production queue
(single_core).  The corresponding boinc queue instance suspends when a
job runs in single_core, EXCEPT on one node.  I have tried many things
and am starting to go around in circles, hence this e-mail and the hope
that someone out there has an answer.

> qstat
job-ID  prior   name       user         state submit/start at     queue
--------------------------------------------------------------------------------
   3867 0.56000 node01     tcera        r     12/05/2007 21:25:00 boinc@node01
   3868 0.56000 node02     tcera        r     12/05/2007 21:25:00 boinc@node02
   3869 0.56000 node03     tcera        r     12/05/2007 21:25:00 boinc@node03
   3870 0.56000 node04     tcera        r     12/05/2007 21:25:00 boinc@node04
   3871 0.56000 node05     tcera        r     12/05/2007 21:25:00 boinc@node05
   3872 0.56000 node06     tcera        r     12/05/2007 21:25:00 boinc@node06
   3873 0.56000 node07     tcera        r     12/05/2007 21:25:00 boinc@node07
   3874 0.56000 node08     tcera        r     12/05/2007 21:25:00 boinc@node08

Let's add some load jobs to the single_core cluster queue (nodes 1
through 4)...

> qstat
job-ID  prior   name       user         state submit/start at     queue
------------------------------------------------------------------------------------
   3904 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node01
   3908 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node01
   3903 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node02
   3907 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node02
   3901 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node03
   3905 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node03
   3902 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node04
   3906 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node04
   3867 0.56000 node01     tcera        S     12/05/2007 21:25:00 boinc@node01
   3868 0.56000 node02     tcera        r     12/05/2007 21:25:00 boinc@node02
   3869 0.56000 node03     tcera        S     12/05/2007 21:25:00 boinc@node03
   3870 0.56000 node04     tcera        S     12/05/2007 21:25:00 boinc@node04
   3871 0.56000 node05     tcera        r     12/05/2007 21:25:00 boinc@node05
   3872 0.56000 node06     tcera        r     12/05/2007 21:25:00 boinc@node06
   3873 0.56000 node07     tcera        r     12/05/2007 21:25:00 boinc@node07
   3874 0.56000 node08     tcera        r     12/05/2007 21:25:00 boinc@node08
   3909 0.55929 load_scr.s tcera        qw    12/05/2007 22:01:50

Note that the boinc job on every node in single_core has been suspended
EXCEPT on node02.  All slots in single_core are filled, with one job
waiting in the queue.

Subordinate queue suspension works correctly for the dual_core queue
(nodes 5 through 8).  It is ONLY node02 that doesn't suspend.
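
The check being made by eye on the listing can be written as a short
script.  This is only a minimal sketch: it assumes each job row is a
single line whose last column has the queue@host form that qstat prints,
and the function name unsuspended_hosts and the trimmed sample rows are
illustrative, not part of Grid Engine.

```python
# Sample rows copied from the second qstat listing (trimmed to three hosts).
QSTAT = """\
   3904 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node01
   3903 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node02
   3867 0.56000 node01     tcera        S     12/05/2007 21:25:00 boinc@node01
   3868 0.56000 node02     tcera        r     12/05/2007 21:25:00 boinc@node02
   3871 0.56000 node05     tcera        r     12/05/2007 21:25:00 boinc@node05
   3909 0.55929 load_scr.s tcera        qw    12/05/2007 22:01:50
"""


def unsuspended_hosts(qstat_text, master="single_core", subordinate="boinc"):
    """Return hosts where `master` has a running job while a job in
    `subordinate` on the same host is still in state 'r'."""
    states = {}  # host -> queue -> set of job states seen
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows: id, prior, name, user, state, date, time, queue@host.
        # Pending jobs have no queue column and are skipped.
        if len(fields) < 8 or not fields[0].isdigit() or "@" not in fields[7]:
            continue
        queue, host = fields[7].split("@", 1)
        states.setdefault(host, {}).setdefault(queue, set()).add(fields[4])

    suspect = []
    for host in sorted(states):
        queues = states[host]
        if "r" in queues.get(master, set()) and "r" in queues.get(subordinate, set()):
            suspect.append(host)
    return suspect


if __name__ == "__main__":
    print(unsuspended_hosts(QSTAT))
```

Run against the sample rows, this flags only node02; node05 is not
flagged because it has no single_core job.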

Any ideas on what could be wrong?  There is no other indication that
node02 has any problem with Grid Engine.  There are no errors in the
messages files, and as near as I can tell it is configured identically
to the other nodes.

Kindest regards,
Tim Cera, P.E.
Senior Professional Engineer
St. Johns River Water Management District
