[GE users] Sort by sequence number question

Erik Lönroth erik.lonroth at scania.com
Thu Jul 12 12:49:55 BST 2007



Hmmm...

The only way I can get SGE not to schedule slave tasks on the same node as
the MASTER process is to use "$round_robin" as the allocation rule for my
PE, as Reuti suggested. I really can't see the logic in this, though.

Regardless of how I set the sequence numbers for my master nodes, SGE will
ALWAYS assign 1 MASTER + 4 SLAVES onto the selected master node (if I use
$fill_up).
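
(For reference, here is roughly what the PE configuration looks like on
our side. This is a sketch from memory, so the slot count and most fields
are illustrative; the only field that matters here is allocation_rule:)

bash-3.00$ qconf -sp powerflow_ts101_pe
pe_name           powerflow_ts101_pe
slots             999
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

With $fill_up as above I get the packed 1 MASTER + 4 SLAVES behaviour;
changing allocation_rule to $round_robin is the only thing that spreads
the slaves out.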

What's worse is that my application does not run optimally when the
parallel job is split round-robin, so the only solution for me is to
explicitly remove the PE from all master nodes, thereby losing available
resources.
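
(By "remove the PE" I mean deleting it from the pe_list of the master
queues; assuming qconf -dattr behaves as documented on this version, that
would be something like:)

bash-3.00$ qconf -dattr queue pe_list powerflow_ts101_pe master.101.q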

/Erik

On ons, 2007-07-11 at 11:47 -0700, Daniel Templeton wrote:
> Erik,
> 
> You may also want to read this post from Stephan's blog:
> 
> http://blog.sun.com/sgrell/entry/n1ge_6_scheduler_hacks_seperated
> 
> Daniel
> 
> Lönroth Erik wrote:
> > I am.
> >
> > I'm submitting this job to the queue.
> > bash-3.00$ cat slot-allocation.job 
> > #!/bin/bash
> > #$ -S /bin/bash
> > #$ -N slot-allocation
> > #$ -cwd 
> > #$ -o output.$JOB_ID
> > #$ -e errors.$JOB_ID
> >
> > #$ -pe powerflow_*_pe  5
> >
> > #$ -masterq master.*.q
> > echo "Starting on: ${HOSTNAME}"
> > echo "$PE_HOSTFILE contains:"
> > cat $PE_HOSTFILE
> > sleep 30
> >
> >
> >
> >     215 0.55500 slot-alloc sssler       r     07/11/2007 17:19:24 master.101.q at ts101-1-0.sss.se. MASTER 
> >     215 0.55500 slot-alloc sssler       r     07/11/2007 17:19:24 short.101.q at ts101-1-0.sss.se.s SLAVE  
> >                                                                   short.101.q at ts101-1-0.sss.se.s SLAVE  
> >                                                                   short.101.q at ts101-1-0.sss.se.s SLAVE  
> >                                                                   short.101.q at ts101-1-0.sss.se.s SLAVE  
> >                                            
> >
> > I also sometimes see that it allocates a slot on a "MASTER" node even when slots are available on other machines, taking only 3 of the 4 slots on one host and putting the last slot on a completely different host. 
> >
> >
> > ... like here for example, where ts103-3-13 gets only 3 slots filled even though it has 4 to offer. I would expect all 4 of its slots to be taken before master.103.q at ts103-3-0 would be considered at all, since that queue has a higher sequence number. That doesn't seem to happen... *cry*
> >
> >  (For those of you who have followed this thread: ts101 is a smaller test cluster we use for testing out queues, and ts103+ts102 are partitions of a larger SGE_CELL.) 
> >
> >
> >
> >     977 0.55500 slot-alloc sssler       r     07/11/2007 18:04:54 short.103.q at ts103-3-12.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-12.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-12.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-12.sss.se. SLAVE         
> >     977 0.55500 slot-alloc sssler       r     07/11/2007 18:04:54 short.103.q at ts103-3-13.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-13.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-13.sss.se. SLAVE         
> >     977 0.55500 slot-alloc sssler       r     07/11/2007 18:04:54 short.103.q at ts103-3-14.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-14.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-14.sss.se. SLAVE         
> >                                                                   short.103.q at ts103-3-14.sss.se. SLAVE         
> >     977 0.55500 slot-alloc sssler       r     07/11/2007 18:04:54 short.103.q at ts103-3-15.sss.se. SLAVE        
> >     977 0.55500 slot-alloc sssler       r     07/11/2007 18:04:54 master.103.q at ts103-3-0.sss.se. MASTER        
> >     977 0.55500 slot-alloc sssler       r     07/11/2007 18:04:54 master.103.q at ts103-3-1.sss.se. SLAVE
> >
> > I'm in pain. Arhhhh!
> >
> > /Erik
> >
> > -----Original Message-----
> > From: Ravi Chandra Nallan [mailto:Ravichandra.Nallan at Sun.COM]
> > Sent: Wed 7/11/2007 5:05 PM
> > To: users at gridengine.sunsource.net
> > Subject: Re: [GE users] Sort by sequence number question
> >  
> > Can you try with some simple batch/array jobs,
> > e.g. qsub -t 1-5 examples/jobs/sleeper.sh 10000,
> > and see which one gets filled first!
> > regards,
> > -Ravi
> >
> > Lönroth Erik wrote:
> >   
> >>> Didn't seem to work.
> >>>  qconf -sconf 
> >>>  qconf -ssconf
> >>>  qconf -sq \*
> >>>  qconf -se global
> >>>
> >>> Might be a better option.
> >>> /mark
> >>>     
> >>>       
> >> Here it goes:
> >>
> >> bash-3.00$ qconf -sconf
> >> global:
> >> execd_spool_dir              /opt/gridengine/narcissus/spool
> >> mailer                       /opt/gridengine/scania/utils/mailing/mailer1.sh
> >> xterm                        /usr/bin/X11/xterm
> >> load_sensor                  /opt/gridengine/scania/utils/licensecheck.sh
> >> prolog                       none
> >> epilog                       none
> >> shell_start_mode             posix_compliant
> >> login_shells                 sh,ksh,csh,tcsh
> >> min_uid                      0
> >> min_gid                      0
> >> user_lists                   none
> >> xuser_lists                  none
> >> projects                     none
> >> xprojects                    none
> >> enforce_project              false
> >> enforce_user                 auto
> >> load_report_time             00:00:40
> >> max_unheard                  00:05:00
> >> reschedule_unknown           00:00:00
> >> loglevel                     log_warning
> >> administrator_mail           erik.lonroth at scania.com
> >> set_token_cmd                none
> >> pag_cmd                      none
> >> token_extend_time            none
> >> shepherd_cmd                 none
> >> qmaster_params               none
> >> execd_params                 none
> >> reporting_params             accounting=true reporting=false \
> >>                              flush_time=00:00:15 joblog=false sharelog=00:00:00
> >> finished_jobs                100
> >> gid_range                    20000-20500
> >> qlogin_command               /opt/gridengine/scania/utils/qlogin/qlogin.sh
> >> qlogin_daemon                /usr/sbin/sshd -i
> >> rlogin_daemon                /usr/sbin/sshd -i
> >> max_aj_instances             0
> >> max_aj_tasks                 0
> >> max_u_jobs                   0
> >> max_jobs                     0
> >> auto_user_oticket            0
> >> auto_user_fshare             0
> >> auto_user_default_project    none
> >> auto_user_delete_time        86400
> >> delegated_file_staging       false
> >> rsh_daemon                   /usr/sbin/sshd -i
> >> rsh_command                  /usr/bin/ssh
> >> rlogin_command               /usr/bin/ssh
> >> reprioritize                 0
> >>
> >>
> >>
> >>
> >> bash-3.00$   qconf -ssconf
> >> algorithm                         default
> >> schedule_interval                 0:0:15
> >> maxujobs                          0
> >> queue_sort_method                 seqno
> >> job_load_adjustments              np_load_avg=0.50
> >> load_adjustment_decay_time        0:7:30
> >> load_formula                      np_load_avg
> >> schedd_job_info                   true
> >> flush_submit_sec                  0
> >> flush_finish_sec                  0
> >> params                            none
> >> reprioritize_interval             0:0:0
> >> halftime                          168
> >> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> >> compensation_factor               5.000000
> >> weight_user                       0.250000
> >> weight_project                    0.250000
> >> weight_department                 0.250000
> >> weight_job                        0.250000
> >> weight_tickets_functional         0
> >> weight_tickets_share              0
> >> share_override_tickets            TRUE
> >> share_functional_shares           TRUE
> >> max_functional_jobs_to_schedule   200
> >> report_pjob_tickets               TRUE
> >> max_pending_tasks_per_job         50
> >> halflife_decay_list               none
> >> policy_hierarchy                  OFS
> >> weight_ticket                     0.010000
> >> weight_waiting_time               0.000000
> >> weight_deadline                   3600000.000000
> >> weight_urgency                    0.100000
> >> weight_priority                   1.000000
> >> max_reservation                   0
> >> default_duration                  0:10:0
> >>
> >>
> >>
> >> bash-3.00$   qconf -sq \*
> >> qname                 master.101.q
> >> hostlist              ts101-1-0.sss.se.scania.com
> >> seq_no                0
> >> load_thresholds       np_load_avg=1.75
> >> suspend_thresholds    NONE
> >> nsuspend              1
> >> suspend_interval      00:05:00
> >> priority              0
> >> min_cpu_interval      00:05:00
> >> processors            UNDEFINED
> >> qtype                 BATCH INTERACTIVE
> >> ckpt_list             NONE
> >> pe_list               dummy_ts101_pe fire_101_pe fluent_ts101_pe make \
> >>                       mpich_ts101_pe powerflow_ts101_pe
> >> rerun                 FALSE
> >> slots                 1
> >> tmpdir                /tmp
> >> shell                 /bin/csh
> >> prolog                NONE
> >> epilog                NONE
> >> shell_start_mode      posix_compliant
> >> starter_method        NONE
> >> suspend_method        NONE
> >> resume_method         NONE
> >> terminate_method      NONE
> >> notify                00:00:60
> >> owner_list            NONE
> >> user_lists            NONE
> >> xuser_lists           NONE
> >> subordinate_list      NONE
> >> complex_values        NONE
> >> projects              NONE
> >> xprojects             NONE
> >> calendar              NONE
> >> initial_state         default
> >> s_rt                  INFINITY
> >> h_rt                  INFINITY
> >> s_cpu                 INFINITY
> >> h_cpu                 INFINITY
> >> s_fsize               INFINITY
> >> h_fsize               INFINITY
> >> s_data                INFINITY
> >> h_data                INFINITY
> >> s_stack               INFINITY
> >> h_stack               INFINITY
> >> s_core                INFINITY
> >> h_core                INFINITY
> >> s_rss                 INFINITY
> >> h_rss                 INFINITY
> >> s_vmem                INFINITY
> >> h_vmem                INFINITY
> >>
> >>
> >>
> >> qname                 short.101.q
> >> hostlist              @ts101_X_hg
> >> seq_no                101,[ts101-1-0.sss.se.scania.com=9999]
> >> load_thresholds       np_load_avg=1.75
> >> suspend_thresholds    NONE
> >> nsuspend              1
> >> suspend_interval      00:05:00
> >> priority              0
> >> min_cpu_interval      00:05:00
> >> processors            UNDEFINED
> >> qtype                 BATCH INTERACTIVE
> >> ckpt_list             NONE
> >> pe_list               dummy_ts101_pe fire_101_pe fluent_ts101_pe make \
> >>                       mpich_ts101_pe powerflow_ts101_pe
> >> rerun                 FALSE
> >> slots                 4
> >> tmpdir                /tmp
> >> shell                 /bin/csh
> >> prolog                NONE
> >> epilog                NONE
> >> shell_start_mode      posix_compliant
> >> starter_method        NONE
> >> suspend_method        NONE
> >> resume_method         NONE
> >> terminate_method      NONE
> >> notify                00:00:60
> >> owner_list            NONE
> >> user_lists            NONE
> >> xuser_lists           NONE
> >> subordinate_list      NONE
> >> complex_values        NONE
> >> projects              NONE
> >> xprojects             NONE
> >> calendar              NONE
> >> initial_state         default
> >> s_rt                  INFINITY
> >> h_rt                  INFINITY
> >> s_cpu                 INFINITY
> >> h_cpu                 INFINITY
> >> s_fsize               INFINITY
> >> h_fsize               INFINITY
> >> s_data                INFINITY
> >> h_data                INFINITY
> >> s_stack               INFINITY
> >> h_stack               INFINITY
> >> s_core                INFINITY
> >> h_core                INFINITY
> >> s_rss                 INFINITY
> >> h_rss                 INFINITY
> >> s_vmem                INFINITY
> >> h_vmem                INFINITY
> >>
> >>
> >>
> >>
> >>
> >> bash-3.00$ qconf -se global
> >> hostname              global
> >> load_scaling          NONE
> >> complex_values        fluent_all=10,fluent_par=48,gtpowerx=7,dyna=18
> >> load_values           dyna=2,fluent_all=8,fluent_par=41,gtpowerx=2
> >> processors            0
> >> user_lists            NONE
> >> xuser_lists           NONE
> >> projects              NONE
> >> xprojects             NONE
> >> usage_scaling         NONE
> >> report_variables      NONE
> >>
> >>   
> >
> 
> 
