[GE users] Wildcards in PE still broken in 6.0u3

Reuti reuti at staff.uni-marburg.de
Fri Apr 29 18:24:39 BST 2005


Aha Tim,

now I understand your setup. Since the name of the master queue, e.g.
mymachine.q.1 at local9, is within your intended configuration, what does

qstat -g t

show? Maybe only the output of the granted PE is wrong, and everything is
working as intended? - Reuti
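
PS: a minimal sketch of what I would run to check this (the user filter is
just an example, adjust it to your own account):

qstat -g t            # one line per queue instance, with a MASTER/SLAVE column
qstat -g t -u tim     # same listing, restricted to one user's jobs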

Quoting Tim Mueller <tim_mueller at hotmail.com>:

> There are 32 machines, each dual-processor with names
> 
> local0
> local1
> ..
> local31
> 
> They are grouped together with four 8-port gigabit switches. Each group was
> given a queue, a PE, and a hostgroup. So for example @mymachine-0 contains
> 
> local0
> local1
> ..
> local7
> 
> local0-local7 are all connected via both the central cluster switch and a 
> local gigabit switch.
> 
> I should also note that I am using hostname aliasing to ensure that the
> ethernet interface connected to the gigabit switch is used by Grid Engine.
> So I have a host_aliases file set up as follows:
> 
> local0 node0
> local1 node1
> ..
> local31 node31
> 
> Where "nodeX" is the primary hostname for each machine and resolves to the 
> interface that connets to the central cluster switch.  "localX" resolves to
> 
> an address that connects via the gigabit interface if possible.  The 
> "localX" names do not resolve consistently across the cluster -- for example
> 
> if I am on node0 and I ping local1, it will do so over the gigabit 
> interface.  However if I am on node31 and I ping local1, it will do so over
> 
> the non-gigabit interface, because there is no gigabit connection between 
> node31 and node1.
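> 
> To make the aliasing concrete, this is roughly how I check it (the path
> assumes the default cell name, and local1 is just an example host):
> 
> cat $SGE_ROOT/default/common/host_aliases   # the file Grid Engine reads the aliases from
> getent hosts local1                         # which address local1 resolves to on this node
> ping -c 1 local1                            # shows the address actually being reached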
> 
> Tim
> 
> ----- Original Message ----- 
> From: "Reuti" <reuti at staff.uni-marburg.de>
> To: <users at gridengine.sunsource.net>
> Sent: Friday, April 29, 2005 12:15 PM
> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> 
> 
> > Tim,
> >
> > thanks, but I'm still not sure about your setup. You stated that you have 32
> > dual machines. So you made a hostgroup @mymachine-0 - with which machines in
> > it? And why so many queues at all?
> >
> > CU - Reuti
> >
> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
> >
> >> Hi,
> >>
> >> I get:
> >>
> >>      59 0.55500 Job1    user        r     04/29/2005 10:47:08
> >> mymachine.q.0 at local0                     8
> >>        Full jobname:     Job1
> >>        Master queue:     mymachine.q.0 at local0
> >>        Requested PE:     *.mpi 8
> >>        Granted PE:       mymachine.3.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>      47 0.55500 Job2    user        r     04/27/2005 14:45:04
> >> mymachine.q.0 at local6                     8
> >>        Full jobname:     Job2
> >>        Master queue:     mymachine.q.0 at local6
> >>        Requested PE:     *.mpi 8
> >>        Granted PE:       mymachine.3.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>      44 0.55500 Job3    user        r     04/27/2005 11:55:49
> >> mymachine.q.1 at local12                    8
> >>        Full jobname:     Job3
> >>        Master queue:     mymachine.q.1 at local12
> >>        Requested PE:     *.mpi 8
> >>        Granted PE:       mymachine.3.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>      60 0.55500 Job4    user        r     04/29/2005 10:55:53
> >> mymachine.q.1 at local9                     8
> >>        Full jobname:     Job4
> >>        Master queue:     mymachine.q.1 at local9
> >>        Requested PE:     *.mpi 8
> >>        Granted PE:       mymachine.3.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>      49 0.55500 Job5    user        r     04/27/2005 15:01:53
> >> mymachine.q.2 at local16                    8
> >>        Full jobname:     Job5
> >>        Master queue:     mymachine.q.2 at local16
> >>        Requested PE:     *.mpi 8
> >>        Granted PE:       mymachine.3.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>      48 0.55500 Job6    user        r     04/27/2005 14:57:53
> >> mymachine.q.2 at local20                    8
> >>        Full jobname:     Job6
> >>        Master queue:     mymachine.q.2 at local20
> >>        Requested PE:     *.mpi 8
> >>        Granted PE:       mymachine.3.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>      61 0.55500 Job7    user        r    04/29/2005 11:19:54
> >> 8
> >>        Full jobname:     Job7
> >>        Requested PE:     *.mpi 8
> >>        Hard Resources:
> >>        Soft Resources:
> >>
> >> When I do qconf -sp mymachine.3.mpi, I get:
> >>
> >> pe_name           mymachine.3.mpi
> >> slots             16
> >> user_lists        NONE
> >> xuser_lists       NONE
> >> start_proc_args   /bin/true
> >> stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
> >> allocation_rule   $round_robin
> >> control_slaves    TRUE
> >> job_is_first_task FALSE
> >> urgency_slots     avg
> >>
> >> When I do qconf -sq mymachine.q.0, I get
> >>
> >> qname                 mymachine.q.0
> >> hostlist              @mymachine-0
> >> seq_no                0
> >> load_thresholds       NONE
> >> suspend_thresholds    NONE
> >> nsuspend              1
> >> suspend_interval      00:05:00
> >> priority              0
> >> min_cpu_interval      00:05:00
> >> processors            UNDEFINED
> >> qtype                 BATCH INTERACTIVE
> >> ckpt_list             NONE
> >> pe_list               mymachine.0.mpi
> >> rerun                 FALSE
> >> slots                 2
> >> tmpdir                /tmp
> >> shell                 /bin/bash
> >> prolog                NONE
> >> epilog                NONE
> >> shell_start_mode      posix_compliant
> >> starter_method        NONE
> >> suspend_method        NONE
> >> resume_method         NONE
> >> terminate_method      NONE
> >> notify                00:00:60
> >> owner_list            sgeadmin
> >> user_lists            NONE
> >> xuser_lists           NONE
> >> subordinate_list      NONE
> >> complex_values        NONE
> >> projects              NONE
> >> xprojects             NONE
> >> calendar              NONE
> >> initial_state         default
> >> s_rt                  84:00:00
> >> h_rt                  84:15:00
> >> s_cpu                 INFINITY
> >> h_cpu                 INFINITY
> >> s_fsize               INFINITY
> >> h_fsize               INFINITY
> >> s_data                INFINITY
> >> h_data                INFINITY
> >> s_stack               INFINITY
> >> h_stack               INFINITY
> >> s_core                INFINITY
> >> h_core                INFINITY
> >> s_rss                 1G
> >> h_rss                 1G
> >> s_vmem                INFINITY
> >> h_vmem                INFINITY
> >>
> >> And so on, up to mymachine.q.3.
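> >>
> >> For reference, the complete lists of queue and PE names can be dumped with
> >> the stock commands (nothing here is specific to my setup):
> >>
> >> qconf -sql    # list all cluster queues
> >> qconf -spl    # list all parallel environments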
> >>
> >> Tim
> >>
> >> ----- Original Message ----- 
> >> From: "Reuti" <reuti at staff.uni-marburg.de>
> >> To: <users at gridengine.sunsource.net>
> >> Sent: Friday, April 29, 2005 11:14 AM
> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >>
> >>
> >> > Hi Tim,
> >> >
> >> > what is:
> >> >
> >> > qstat -r
> >> >
> >> > showing as granted PEs? - Reuti
> >> >
> >> >
> >> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
> >> >
> >> >> Hi,
> >> >>
> >> >> That's the problem.  The setup is actually
> >> >>
> >> >> mymachine.q.0 references mymachine.0.mpi
> >> >> mymachine.q.1 references mymachine.1.mpi
> >> >> mymachine.q.2 references mymachine.2.mpi
> >> >> mymachine.q.3 references mymachine.3.mpi
> >> >>
> >> >> There is no reason, as far as I can tell, that a job could ever be in both
> >> >> mymachine.3.mpi and mymachine.q.1. And oddly enough, when I use wildcards,
> >> >> the scheduler won't put a job assigned to mymachine.3.mpi into
> >> >> mymachine.q.3 until all of the other queues are full. At that point it's
> >> >> too late, because mymachine.3.mpi is using 48 slots when it's only allowed
> >> >> to use up to 16.
> >> >>
> >> >> When I don't use wildcards, I get the behavior I expect: a job submitted to
> >> >> mymachine.3.mpi gets put in mymachine.q.3, etc.
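> >> >>
> >> >> For reference, the two submission styles I am comparing look roughly like
> >> >> this (job.sh is just a placeholder script name):
> >> >>
> >> >> qsub -pe '*.mpi' 8 job.sh             # wildcard request: any matching PE may be granted
> >> >> qsub -pe 'mymachine.*.mpi' 8 job.sh   # narrower wildcard pattern
> >> >> qsub -pe mymachine.3.mpi 8 job.sh     # explicit PE name: behaves as expected here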
> >> >>
> >> >> Tim
> >> >>
> >> >> ----- Original Message ----- 
> >> >> From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
> >> >> <stephan.grell at sun.com>
> >> >> To: <users at gridengine.sunsource.net>
> >> >> Sent: Friday, April 29, 2005 2:34 AM
> >> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >> >>
> >> >>
> >> >> > Hi Tim,
> >> >> >
> >> >> > I am not quite sure I understand your setup. Could you please attach your
> >> >> > cqueue configuration? From the results you posted, it reads as if:
> >> >> >
> >> >> > queue mymachine.q.0  references mymachine.3.mpi
> >> >> > queue mymachine.q.1  references mymachine.3.mpi
> >> >> >
> >> >> > and so on.
> >> >> >
> >> >> > Cheers,
> >> >> > Stephan
> >> >> >
> >> >> > Tim Mueller wrote:
> >> >> >
> >> >> >> Hi,
> >> >> >>
> >> >> >> It appears that wildcards in the Parallel Environment name still have
> >> >> >> problems in 6.0u3. I have set up a Linux cluster of 32 dual-processor
> >> >> >> Nocona machines. There are 4 queues of 16 processors each, and a
> >> >> >> corresponding PE for each queue. The queues are named as follows:
> >> >> >>
> >> >> >> mymachine.q.0
> >> >> >> mymachine.q.1
> >> >> >> mymachine.q.2
> >> >> >> mymachine.q.3
> >> >> >>
> >> >> >> And the PEs are
> >> >> >>
> >> >> >> mymachine.0.mpi
> >> >> >> mymachine.1.mpi
> >> >> >> mymachine.2.mpi
> >> >> >> mymachine.3.mpi
> >> >> >>
> >> >> >> All of the PEs have 16 slots. When I submit a job with the following
> >> >> >> line:
> >> >> >>
> >> >> >> #$ -pe *.mpi 8
> >> >> >> the job will be assigned to a seemingly random PE, but then placed in a
> >> >> >> queue that does not correspond to that PE. I can submit up to 6 jobs this
> >> >> >> way, each of which will get assigned to the same PE and placed in any
> >> >> >> queue that does not correspond to the PE. This causes 48 processors to be
> >> >> >> used for a PE with only 16 slots. E.g., I might get:
> >> >> >>
> >> >> >> Job 1    mymachine.3.mpi    mymachine.q.0    8 processors
> >> >> >> Job 2    mymachine.3.mpi    mymachine.q.0    8 processors
> >> >> >> Job 3    mymachine.3.mpi    mymachine.q.1    8 processors
> >> >> >> Job 4    mymachine.3.mpi    mymachine.q.1    8 processors
> >> >> >> Job 5    mymachine.3.mpi    mymachine.q.2    8 processors
> >> >> >> Job 6    mymachine.3.mpi    mymachine.q.2    8 processors
> >> >> >> Job 7    qw
> >> >> >> Job 8    qw
> >> >> >> When I should get:
> >> >> >>
> >> >> >> Job 1    mymachine.0.mpi    mymachine.q.0    8 processors
> >> >> >> Job 2    mymachine.0.mpi    mymachine.q.0    8 processors
> >> >> >> Job 3    mymachine.1.mpi    mymachine.q.1    8 processors
> >> >> >> Job 4    mymachine.1.mpi    mymachine.q.1    8 processors
> >> >> >> Job 5    mymachine.2.mpi    mymachine.q.2    8 processors
> >> >> >> Job 6    mymachine.2.mpi    mymachine.q.2    8 processors
> >> >> >> Job 7    mymachine.3.mpi    mymachine.q.3    8 processors
> >> >> >> Job 8    mymachine.3.mpi    mymachine.q.3    8 processors
> >> >> >> If I then try to submit a job directly (with no wildcard) to the PE that
> >> >> >> all of the jobs were assigned to, it will not run, because I have already
> >> >> >> far exceeded the slot limit for this PE.
> >> >> >>
> >> >> >> I should note that when I do not use wildcards, everything behaves as it
> >> >> >> should. E.g., a job submitted to mymachine.2.mpi will be assigned to
> >> >> >> mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16 slots in
> >> >> >> mymachine.2.mpi at once.
> >> >> >>
> >> >> >> I searched the list, and although there seem to have been other problems
> >> >> >> with wildcards in the past, I have seen nothing that references this
> >> >> >> behavior. Does anyone have an explanation / workaround?
> >> >> >>
> >> >> >> Tim



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



