[GE users] Startup times and other issues with 6.0u3

Brian R Smith brian at cypher.acomp.usf.edu
Sat Mar 19 01:27:59 GMT 2005

First, sorry for getting so excited... this problem has been bugging me 
all day.  I am having some problems with MM5 when it comes to deleting 
the processes since shutting off "control slaves".  Most of my other MPI 
jobs are running much better.  So, does shutting off control_slaves 
disable tight integration?

To answer your questions:

1) I have the precompiled binaries.  That is what I have always used on 
all of our other clusters.
2) Here are my settings:


algorithm                         default
schedule_interval                 0:0:10
maxujobs                          0
queue_sort_method                 seqno
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     0.010000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   1.000000
max_reservation                   0
default_duration                  0:10:0

Main queue:

qname                 all.q
hostlist              gbn001 gbn002 gbn003 gbn004 gbn005 gbn006 gbn007 gbn008 \
                      gbn009 gbn010 gbn011 gbn012 gbn013 gbn014 gbn015 gbn016 \
                      gbn017 gbn018 gbn019 gbn020 gbn021 gbn022 gbn023 gbn024 \
                      gbn025 gbn026 gbn027 gbn028 gbn029 gbn030 gbn031 gbn032 \
                      gbn033 gbn034 gbn035 gbn036 gbn037 gbn038 gbn039 gbn040 \
                      gbn041 gbn042
seq_no                0
load_thresholds       NONE
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              0
min_cpu_interval      00:05:00
processors            1
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make mpich mpich-vasp
rerun                 FALSE
slots                 1
tmpdir                /tmp
shell                 /bin/csh
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         enabled
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               10240K
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY

Parallel environment:

pe_name           mpich
slots             44
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task FALSE
urgency_slots     min

3) As for the processes: on the primary execution node, I see 
sge_shepherd-96 -bg, my job script, the mpirun command and the slew of 
rsh calls that go with it.  On all the slave nodes, I see only in.rshd 
and two copies of the MPI binary that I originally started with mpirun.
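
For reference, here is a minimal sketch of how the integration mode 
could be read off a PE definition like the one above.  It assumes the 
"name value" layout that qconf -sp prints; pe_integration is a 
hypothetical helper name, not something shipped with SGE.  With 
control_slaves set to FALSE the slave tasks are not started under SGE's 
control, which is the loose (non-tight) case.

```sh
#!/bin/sh
# pe_integration: hypothetical helper (not part of SGE).
# Reads a PE definition on stdin, in the "name value" layout printed by
# `qconf -sp <pe>`, and reports the integration mode implied by its
# control_slaves line: "tight" for TRUE, "loose" for anything else,
# "unknown" if the line is missing.
pe_integration() {
    awk '$1 == "control_slaves" {
             v = toupper($2)
             print (v == "TRUE" ? "tight" : "loose")
             found = 1
         }
         END { if (!found) print "unknown" }'
}

# Example using two lines from the PE listing in this post
# (reads stdin, so no running cluster is needed):
printf 'pe_name           mpich\ncontrol_slaves    FALSE\n' | pe_integration

# On a live cluster (assumes qconf is in PATH and the PE is named mpich):
#   qconf -sp mpich | pe_integration
```

To switch the PE back to tight integration, control_slaves would be set 
to TRUE in the editor opened by qconf -mp mpich.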

Hope this helps.


Ron Chen wrote:

>--- Brian R Smith <brian at cypher.acomp.usf.edu> wrote: 
>>You are absolutely the man.  Setting "control
>>slaves" to false fixed all of my problems.
>No, it is not fixing anything!
>Setting "control slaves" to false means non-tight integration, so you
>won't get process control/accounting of the slaves MPI
>In SGE 6 update 4, the slow start problem was fixed.
>But the original problem was that starting a 400-node
>parallel job with tight integration takes several tens
>seconds or something. But for your case it takes 10
>minutes! So there is still something going on with
>your configuration.
>Did you get the precompiled binaries or compile from
>source? Also, are you using the default settings or
>you have already played around with the settings a
>Also, logon to the nodes and see what processes are
>running when a parallel job starts.
> -Ron
