[GE users] Startup times and other issues with 6.0u3

Charu Chaubal Charu.Chaubal at Sun.COM
Sat Mar 19 01:41:56 GMT 2005


Did you NFS mount the SGE files to all hosts?  Could the NFS latency 
with multiple simultaneous processes starting up be causing a delay?
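One quick way to sanity-check that theory is to time repeated metadata operations on the mount holding the cell directory. A rough sketch (using $SGE_ROOT as the path is an assumption; substitute your actual NFS mount point):

```shell
#!/bin/sh
# Crude probe of NFS metadata latency on the SGE cell directory.
# $SGE_ROOT is an assumption -- point DIR at your actual mount.
DIR="${SGE_ROOT:-/tmp}"
start=$(date +%s)
i=0
while [ "$i" -lt 100 ]; do
    ls "$DIR" > /dev/null   # each listing is an NFS metadata round trip
    i=$((i + 1))
done
end=$(date +%s)
echo "100 directory listings of $DIR took $((end - start)) seconds"
```

On a healthy mount this should finish in well under a second; multi-second totals while many jobs are starting would point at NFS as the bottleneck.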

Regards,
	Charu

On Mar 18, 2005, at 5:27 PM, Brian R Smith wrote:

> Ron,
>
> First, sorry for getting so excited... this problem has been bugging 
> me all day.  I am having some problems with MM5 with regard to 
> cleaning up its processes since shutting off "control slaves".  Most of my 
> other MPI jobs are running much better.  So, shutting off 
> control_slaves disables tight integration?
>
> To answer your questions:
>
> 1) I have the precompiled binaries.  That is what I have always used 
> on all of our other clusters.
> 2) Here are my settings:
>
> Scheduler:
>
> algorithm                         default
> schedule_interval                 0:0:10
> maxujobs                          0
> queue_sort_method                 seqno
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   true
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         0
> weight_tickets_share              0
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.010000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   0
> default_duration                  0:10:0
>
> Main queue:
>
> qname                 all.q
> hostlist              gbn001 gbn002 gbn003 gbn004 gbn005 gbn006 gbn007 
> gbn008 \
>                      gbn009 gbn010 gbn011 gbn012 gbn013 gbn014 gbn015 
> gbn016 \
>                      gbn017 gbn018 gbn019 gbn020 gbn021 gbn022 gbn023 
> gbn024 \
>                      gbn025 gbn026 gbn027 gbn028 gbn029 gbn030 gbn031 
> gbn032 \
>                      gbn033 gbn034 gbn035 gbn036 gbn037 gbn038 gbn039 
> gbn040 \
>                      gbn041 gbn042
> seq_no                0
> load_thresholds       NONE
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            1
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make mpich mpich-vasp
> rerun                 FALSE
> slots                 1
> tmpdir                /tmp
> shell                 /bin/csh
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            NONE
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         enabled
> s_rt                  INFINITY
> h_rt                  INFINITY
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               10240K
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 INFINITY
> h_rss                 INFINITY
> s_vmem                INFINITY
> h_vmem                INFINITY
>
> Parallel Environment
>
> pe_name           mpich
> slots             44
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh 
> $pe_hostfile
> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    FALSE
> job_is_first_task FALSE
> urgency_slots     min
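For reference, re-enabling tight integration on this PE would mean flipping control_slaves back on; a sketch of the standard SGE 6 mpich setup (shown as a starting point, not a verified fix for the startup delay):

```
pe_name           mpich
slots             44
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
allocation_rule   $round_robin
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min
```

applied with `qconf -mp mpich`. The -catch_rsh argument is what routes the mpirun rsh calls through SGE so the slave tasks run under shepherds.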
>
> 3) As for the processes, on the primary execution node, I see 
> sge_shepherd-96 -bg, my job script, the mpirun command and the slew of 
> rsh calls that go with it.  On all other slave nodes, I see only 
> in.rshd and two copies of the mpi binary that I originally started 
> with mpirun.
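That process list is consistent with loose integration. A quick way to check what is running on a node (a sketch; the process names matched here are the usual SGE/MPICH ones, adjust to taste):

```shell
#!/bin/sh
# List SGE/MPI-related processes on this node.
# With control_slaves FALSE the slave nodes show only in.rshd plus
# the MPI binaries; with tight integration you would instead expect
# an sge_shepherd / qrsh_starter ancestry above each slave task.
ps -e -o pid=,ppid=,comm= | egrep 'sge_shepherd|in.rshd|qrsh|mpirun' \
    || echo "no SGE/MPI processes found"
```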
>
> Hope this helps.
>
>
> Brian
>
> Ron Chen wrote:
>
>> --- Brian R Smith <brian at cypher.acomp.usf.edu> wrote:
>>> You are absolutely the man.  Setting "control
>>> slaves" to false fixed all of my problems.
>>>
>>
>> No, it is not fixing anything!
>>
>> "control slaves" set to FALSE means loose (non-tight)
>> integration, so you won't get process control or
>> accounting for the slave MPI tasks.
>>
>> In SGE 6 update 4, the slow-start problem was fixed.
>> But the original problem was that starting a 400-node
>> parallel job with tight integration took several tens
>> of seconds. In your case it takes 10
>> minutes, so there is still something going on with
>> your configuration.
>>
>> Did you get the precompiled binaries or compile from
>> source? Also, are you using the default settings, or
>> have you already played around with the settings a
>> bit?
>>
>> Also, logon to the nodes and see what processes are
>> running when a parallel job starts.
>>
>> -Ron
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>> For additional commands, e-mail: users-help at gridengine.sunsource.net
>>
>
>
>
>
###############################################################
# Charu V. Chaubal				# Phone: (650) 786-7672 (x87672)
# Grid Computing Technologist	# Fax:   (650) 786-4591
# Sun Microsystems, Inc.			# Email: charu.chaubal at sun.com
###############################################################

