[GE users] Startup times and other issues with 6.0u3

Reuti reuti at staff.uni-marburg.de
Sat Mar 19 08:24:38 GMT 2005



Hi folks,

Just got up, and there is always a little extra delay in getting and reading
all the posts from last night. Okay, I have caught up.

One thing I see:

> shell                 /bin/csh
> shell_start_mode      posix_compliant

Is this okay for you and the scripts you use (most of the time unix_behavior is
preferred)? What is different in your mpich-vasp PE?
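
With posix_compliant the shell entry of the queue (/bin/csh here) is used to
start the job script, regardless of its #! line; with unix_behavior the #! line
of the script itself decides. If you want to switch, just a sketch (queue name
taken from your output below, details in queue_conf(5)):

   qconf -mq all.q
   # in the editor that opens, change:
   shell_start_mode      unix_behavior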

Is this the MM5 you mean: http://www.mmm.ucar.edu/mm5/? Which version of MPICH
are you using, and how did you compile it (./configure ...???....)? What is
your mpirun command/script for submitting the job?
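
Just to give an idea of the kind of information that helps (only a sketch,
assuming MPICH 1.2.x with the ch_p4 device; the prefix, paths and binary name
are placeholders):

   # building MPICH so that remote processes are started via rsh
   ./configure --prefix=/opt/mpich-1.2.x --with-device=ch_p4 -rsh=/usr/bin/rsh
   make && make install

   # job script, submitted e.g. with: qsub -pe mpich 8 mm5.sh
   #!/bin/sh
   #$ -cwd
   # $TMPDIR/machines is created by the startmpi.sh from start_proc_args
   /opt/mpich-1.2.x/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./mm5.mpp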

In the archive on their website I found this:

http://mailman.ucar.edu/pipermail/mm5-users/2004/000477.html

It seems that MM5-MPP generates a lot of network traffic. What type of
network/switch do you have? With 6.0u2 there was no slowdown of your jobs?
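
One more thing about control_slaves, since it comes up in the quoted mails
below: with control_slaves FALSE you lose the tight integration, i.e. the slave
tasks are started by plain rsh and are not children of a sge_shepherd, so SGE
cannot control or account for them. Roughly, a tightly integrated mpich PE
would look like this (just a sketch based on your PE below; whether
job_is_first_task should be TRUE or FALSE depends on whether mpirun starts the
first MPI process locally in the job script, as ch_p4 usually does, or via rsh
like the others):

   pe_name           mpich
   slots             44
   user_lists        NONE
   xuser_lists       NONE
   start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
   stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
   allocation_rule   $round_robin
   control_slaves    TRUE
   job_is_first_task TRUE
   urgency_slots     min

With control_slaves TRUE the rsh wrapper created by -catch_rsh routes the
remote starts through qrsh, so the slave tasks run under their own sge_shepherd
on the nodes. You can edit the PE with "qconf -mp mpich".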

Cheers - Reuti


Quoting Brian R Smith <brian at cypher.acomp.usf.edu>:

> Ron,
> 
> First, sorry for getting so excited... this problem has been bugging me
> all day.  I am having some problems with MM5 with regard to deleting
> the processes since shutting off control_slaves.  Most of my other MPI
> jobs are running much better.  So, shutting off control_slaves disables
> tight integration?
> 
> To answer your questions:
> 
> 1) I have the precompiled binaries.  That is what I have always used on 
> all of our other clusters.
> 2) Here are my settings:
> 
> Scheduler:
> 
> algorithm                         default
> schedule_interval                 0:0:10
> maxujobs                          0
> queue_sort_method                 seqno
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         0
> weight_tickets_share              0
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.010000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   0
> default_duration                  0:10:0
> 
> Main queue:
> 
> qname                 all.q
> hostlist              gbn001 gbn002 gbn003 gbn004 gbn005 gbn006 gbn007 
> gbn008 \
>                       gbn009 gbn010 gbn011 gbn012 gbn013 gbn014 gbn015 
> gbn016 \
>                       gbn017 gbn018 gbn019 gbn020 gbn021 gbn022 gbn023 
> gbn024 \
>                       gbn025 gbn026 gbn027 gbn028 gbn029 gbn030 gbn031 
> gbn032 \
>                       gbn033 gbn034 gbn035 gbn036 gbn037 gbn038 gbn039 
> gbn040 \
>                       gbn041 gbn042
> seq_no                0
> load_thresholds       NONE
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            1
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpich mpich-vasp
> rerun                 FALSE
> slots                 1
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         enabled
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               10240K
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> Parallel Environment
> 
> pe_name           mpich
> slots             44
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    FALSE
> job_is_first_task FALSE
> urgency_slots     min
> 
> 3) As for the processes, on the primary execution node, I see 
> sge_shepherd-96 -bg, my job script, the mpirun command and the slew of 
> rsh calls that go with it.  On all other slave nodes, I see only in.rshd 
> and two copies of the mpi binary that I originally started with mpirun.
> 
> Hope this helps.
> 
> 
> Brian
> 
> Ron Chen wrote:
> 
> >--- Brian R Smith <brian at cypher.acomp.usf.edu> wrote: 
> >  
> >
> >>You are absolutely the man.  Setting "control
> >>slaves" to false fixed all of my problems.
> >>    
> >>
> >
> >No, it is not fixing anything!
> >
> >"control_slaves FALSE" means non-tight integration, so you
> >won't get process control/accounting of the slave MPI
> >tasks.
> >
> >In SGE 6 update 4, the slow start problem was fixed.
> >But the original problem was that starting a 400-node
> >parallel job with tight integration took several tens
> >of seconds. In your case it takes 10 minutes! So there
> >is still something going on with your configuration.
> >
> >Did you get the precompiled binaries or compile from
> >source? Also, are you using the default settings, or
> >have you already played around with the settings a
> >bit?
> >
> >Also, log on to the nodes and see what processes are
> >running when a parallel job starts.
> >
> > -Ron
> >
> >
> >
> >
> >
> 
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



