[GE users] One queue in a subordinated cluster queue not suspending

Tim Cera tcera at sjrwmd.com
Thu Dec 6 13:21:37 GMT 2007


Sure:

> qconf -sq boinc
qname                 boinc
hostlist              @dual_core @single_core
seq_no                1000
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH
ckpt_list             NONE
pe_list               NONE
rerun                 TRUE
slots                 1
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        /sjr/beodata/local/boinc/boinc_controller.sh $host suspend
resume_method         /sjr/beodata/local/boinc/boinc_controller.sh $host resume
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY


> qconf -sq single_core
qname                 single_core
hostlist              @single_core
seq_no                1
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpi
rerun                 TRUE
slots                 2
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      boinc=1
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

> qconf -shgrp @single_core
group_name @single_core
hostlist node01 node02 node03 node04

> qconf -shgrp @dual_core
group_name @dual_core
hostlist node05 node06 node07 node08
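
With subordinate_list set to boinc=1, a boinc queue instance should be suspended as soon as one slot of single_core is filled on the same host. The check below is only a sketch of how the odd instance could be found mechanically. The function name find_unsuspended is made up, and it assumes raw qstat output with one job per line, the job state in column 5, queue@host in column 8 and, if present, the slot count in column 9.

```shell
#!/bin/sh
# Sketch only: reads qstat job lines on stdin and prints "host job-ID" for
# every host where the superordinate queue has at least THRESHOLD filled
# slots while a job in the subordinate queue is not in a suspended state.
# Usage: find_unsuspended SUPER_QUEUE SUB_QUEUE THRESHOLD
find_unsuspended() {
    super=$1; sub_q=$2; threshold=$3
    awk -v super="$super" -v sub_q="$sub_q" -v thr="$threshold" '
        $8 ~ /@/ {                       # only jobs assigned to a queue instance
            split($8, qi, "@")           # qi[1] = queue name, qi[2] = host
            if (qi[1] == super)          # count filled slots (default 1 per job)
                busy[qi[2]] += ($9 ~ /^[0-9]+$/ ? $9 : 1)
            if (qi[1] == sub_q && $5 !~ /[sST]/)
                awake[qi[2]] = $1        # subordinate job that is not suspended
        }
        END {
            for (h in awake)
                if ((busy[h] + 0) >= (thr + 0)) print h, awake[h]
        }
    ' | sort
}
```

It could be run as `qstat -u '*' | find_unsuspended single_core boinc 1`. Given the second qstat listing quoted below (unwrapped and without the quote markers), it should print node02 with job 3868.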

Thanks,
Tim

-----Original Message-----
From: Dan.Templeton at Sun.COM [mailto:Dan.Templeton at Sun.COM] 
Sent: Thursday, December 06, 2007 12:28 AM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] One queue in a subordinated cluster queue not suspending

Tim,

Can you send your queue configurations (qconf -sq) for boinc and 
single_core?

Daniel

Tim Cera wrote:
> Hello,
>
> I have a subordinated cluster queue (boinc) and a production queue
> (single_core).  The correct queue within boinc suspends when a job runs
> in single_core EXCEPT for one queue.  I have tried so many things and I
> am starting to go around in circles - hence this e-mail and the hope
> that there is an answer out there.
>
>> qstat
> job-ID  prior   name       user         state submit/start at     queue
> --------------------------------------------------------------------------------
>    3867 0.56000 node01     tcera        r     12/05/2007 21:25:00 boinc@node01
>    3868 0.56000 node02     tcera        r     12/05/2007 21:25:00 boinc@node02
>    3869 0.56000 node03     tcera        r     12/05/2007 21:25:00 boinc@node03
>    3870 0.56000 node04     tcera        r     12/05/2007 21:25:00 boinc@node04
>    3871 0.56000 node05     tcera        r     12/05/2007 21:25:00 boinc@node05
>    3872 0.56000 node06     tcera        r     12/05/2007 21:25:00 boinc@node06
>    3873 0.56000 node07     tcera        r     12/05/2007 21:25:00 boinc@node07
>    3874 0.56000 node08     tcera        r     12/05/2007 21:25:00 boinc@node08
>
> Let's add some load jobs to the single_core cluster queue (nodes 1
> through 4)...
>
>> qstat
> job-ID  prior   name       user         state submit/start at     queue
> --------------------------------------------------------------------------------
>    3904 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node01
>    3908 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node01
>    3903 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node02
>    3907 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node02
>    3901 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node03
>    3905 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node03
>    3902 0.56000 load_scr.s tcera        r     12/05/2007 22:01:45 single_core@node04
>    3906 0.56000 load_scr.s tcera        r     12/05/2007 22:02:00 single_core@node04
>    3867 0.56000 node01     tcera        S     12/05/2007 21:25:00 boinc@node01
>    3868 0.56000 node02     tcera        r     12/05/2007 21:25:00 boinc@node02
>    3869 0.56000 node03     tcera        S     12/05/2007 21:25:00 boinc@node03
>    3870 0.56000 node04     tcera        S     12/05/2007 21:25:00 boinc@node04
>    3871 0.56000 node05     tcera        r     12/05/2007 21:25:00 boinc@node05
>    3872 0.56000 node06     tcera        r     12/05/2007 21:25:00 boinc@node06
>    3873 0.56000 node07     tcera        r     12/05/2007 21:25:00 boinc@node07
>    3874 0.56000 node08     tcera        r     12/05/2007 21:25:00 boinc@node08
>    3909 0.55929 load_scr.s tcera        qw    12/05/2007 22:01:50
>
> Note that the boinc queue on every node in single_core has suspended
> EXCEPT on node02.  All slots in single_core are filled, with one job
> queued.
>
> Subordinate queue suspension works correctly for the dual_core queue
> (nodes 5 through 8).  It is ONLY node02 that doesn't suspend.
>
> Any ideas on what could be wrong?  There is no other indication that
> node02 has any problem with Grid Engine.  There are no errors in the
> messages files, and as near as I can tell it is configured identically
> to the other nodes.
>
> Kindest regards,
> Tim Cera, P.E.
> Senior Professional Engineer
> St. Johns River Water Management District
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




