[GE users] Wildcards in PE still broken in 6.0u3

Reuti reuti at staff.uni-marburg.de
Fri Apr 29 17:15:59 BST 2005


Tim,

thanks, but I'm still not sure about your setup. You stated that you have 32 
dual machines. So you made a hostgroup @mymachine-0 - which machines did you 
put in it? And why do you need so many queues at all?
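
Could you post e.g. the output of (taking the hostgroup name from your queue 
configuration below):

     qconf -shgrp @mymachine-0

so we can see which execution hosts are grouped together?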

CU - Reuti

Quoting Tim Mueller <tim_mueller at hotmail.com>:

> Hi,
> 
> I get:
> 
>      59 0.55500 Job1    user        r     04/29/2005 10:47:08 mymachine.q.0@local0      8
>        Full jobname:     Job1
>        Master queue:     mymachine.q.0@local0
>        Requested PE:     *.mpi 8
>        Granted PE:       mymachine.3.mpi 8
>        Hard Resources:
>        Soft Resources:
>      47 0.55500 Job2    user        r     04/27/2005 14:45:04 mymachine.q.0@local6      8
>        Full jobname:     Job2
>        Master queue:     mymachine.q.0@local6
>        Requested PE:     *.mpi 8
>        Granted PE:       mymachine.3.mpi 8
>        Hard Resources:
>        Soft Resources:
>      44 0.55500 Job3    user        r     04/27/2005 11:55:49 mymachine.q.1@local12     8
>        Full jobname:     Job3
>        Master queue:     mymachine.q.1@local12
>        Requested PE:     *.mpi 8
>        Granted PE:       mymachine.3.mpi 8
>        Hard Resources:
>        Soft Resources:
>      60 0.55500 Job4    user        r     04/29/2005 10:55:53 mymachine.q.1@local9      8
>        Full jobname:     Job4
>        Master queue:     mymachine.q.1@local9
>        Requested PE:     *.mpi 8
>        Granted PE:       mymachine.3.mpi 8
>        Hard Resources:
>        Soft Resources:
>      49 0.55500 Job5    user        r     04/27/2005 15:01:53 mymachine.q.2@local16     8
>        Full jobname:     Job5
>        Master queue:     mymachine.q.2@local16
>        Requested PE:     *.mpi 8
>        Granted PE:       mymachine.3.mpi 8
>        Hard Resources:
>        Soft Resources:
>      48 0.55500 Job6    user        r     04/27/2005 14:57:53 mymachine.q.2@local20     8
>        Full jobname:     Job6
>        Master queue:     mymachine.q.2@local20
>        Requested PE:     *.mpi 8
>        Granted PE:       mymachine.3.mpi 8
>        Hard Resources:
>        Soft Resources:
>      61 0.55500 Job7    user        r     04/29/2005 11:19:54                           8
>        Full jobname:     Job7
>        Requested PE:     *.mpi 8
>        Hard Resources:
>        Soft Resources:
> 
> When I do qconf -sp mymachine.3.mpi, I get:
> 
> pe_name           mymachine.3.mpi
> slots             16
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     avg
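> 
> (As a cross-check - assuming a plain 6.0 installation - the queue instances
> that offer this PE can be listed with:
> 
>      qselect -pe mymachine.3.mpi
> 
> which I would expect to show only the mymachine.q.3 instances.)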
> 
> When I do qconf -sq mymachine.q.0, I get:
> 
> qname                 mymachine.q.0
> hostlist              @mymachine-0
> seq_no                0
> load_thresholds       NONE
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               mymachine.0.mpi
> rerun                 FALSE
> slots                 2
> tmpdir                /tmp
> shell                 /bin/bash
> prolog                NONE
> epilog                NONE
> shell_start_mode      posix_compliant
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      NONE
> notify                00:00:60
> owner_list            sgeadmin
> user_lists            NONE
> xuser_lists           NONE
> subordinate_list      NONE
> complex_values        NONE
> projects              NONE
> xprojects             NONE
> calendar              NONE
> initial_state         default
> s_rt                  84:00:00
> h_rt                  84:15:00
> s_cpu                 INFINITY
> h_cpu                 INFINITY
> s_fsize               INFINITY
> h_fsize               INFINITY
> s_data                INFINITY
> h_data                INFINITY
> s_stack               INFINITY
> h_stack               INFINITY
> s_core                INFINITY
> h_core                INFINITY
> s_rss                 1G
> h_rss                 1G
> s_vmem                INFINITY
> h_vmem                INFINITY
> 
> And so on, up to mymachine.q.3.
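> 
> (For reference, the pe_list line of each of the four queues can be pulled
> in one pass - queue names as above:
> 
>      for i in 0 1 2 3; do qconf -sq mymachine.q.$i | grep pe_list; done
> 
> Each queue references only its own PE.)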
> 
> Tim
> 
> ----- Original Message ----- 
> From: "Reuti" <reuti at staff.uni-marburg.de>
> To: <users at gridengine.sunsource.net>
> Sent: Friday, April 29, 2005 11:14 AM
> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> 
> 
> > Hi Tim,
> >
> > what is:
> >
> > qstat -r
> >
> > showing as granted PEs? - Reuti
> >
> >
> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
> >
> >> Hi,
> >>
> >> That's the problem.  The setup is actually
> >>
> >> mymachine.q.0 references mymachine.0.mpi
> >> mymachine.q.1 references mymachine.1.mpi
> >> mymachine.q.2 references mymachine.2.mpi
> >> mymachine.q.3 references mymachine.3.mpi
> >>
> >> There is no reason, as far as I can tell, that a job could ever be in
> >> both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
> >> wildcards, the scheduler won't put a job assigned to mymachine.3.mpi
> >> into mymachine.q.3 until all of the other queues are full.  At that
> >> point, it's too late, because mymachine.3.mpi is then using 48 slots
> >> when it's only allowed to use up to 16.
> >>
> >> When I don't use wildcards, I get the behavior I expect: a job
> >> submitted to mymachine.3.mpi gets put in mymachine.q.3, etc.
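> >>
> >> For illustration, the two request forms (job.sh standing in for the
> >> real job script):
> >>
> >>      qsub -pe "*.mpi" 8 job.sh           # wildcard: lands in the wrong queue
> >>      qsub -pe mymachine.3.mpi 8 job.sh   # explicit: behaves as expected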
> >>
> >> Tim
> >>
> >> ----- Original Message ----- 
> >> From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
> >> <stephan.grell at sun.com>
> >> To: <users at gridengine.sunsource.net>
> >> Sent: Friday, April 29, 2005 2:34 AM
> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >>
> >>
> >> > Hi Tim,
> >> >
> >> > I am not quite sure I understand your setup. Could you please attach
> >> > your cqueue configuration? From the results you posted, it reads as if:
> >> > queue
> >> > mymachine.q.0  references mymachine.3.mpi
> >> > mymachine.q.1  references mymachine.3.mpi
> >> >
> >> > and so on.
> >> >
> >> > Cheers,
> >> > Stephan
> >> >
> >> > Tim Mueller wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> It appears that wildcards in the Parallel Environment name still have
> >> >> problems in 6.0u3.  I have set up a Linux cluster of 32 dual-processor
> >> >> Nocona machines.  There are 4 queues of 16 processors each, and a
> >> >> corresponding PE for each queue.  The queues are named as follows:
> >> >>
> >> >> mymachine.q.0
> >> >> mymachine.q.1
> >> >> mymachine.q.2
> >> >> mymachine.q.3
> >> >>
> >> >> And the PEs are
> >> >>
> >> >> mymachine.0.mpi
> >> >> mymachine.1.mpi
> >> >> mymachine.2.mpi
> >> >> mymachine.3.mpi
> >> >>
> >> >> All of the PEs have 16 slots.  When I submit a job with the following
> >> >> line:
> >> >>
> >> >> #$ -pe *.mpi 8
> >> >>
> >> >> the job will be assigned to a seemingly random PE, but then placed in
> >> >> a queue that does not correspond to that PE.  I can submit up to 6
> >> >> jobs this way, each of which will get assigned to the same PE and
> >> >> placed in any queue that does not correspond to the PE.  This causes
> >> >> 48 processors to be used for a PE with only 16 slots.  E.g., I might
> >> >> get:
> >> >>
> >> >> Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
> >> >> Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
> >> >> Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
> >> >> Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
> >> >> Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
> >> >> Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
> >> >> Job 7        qw
> >> >> Job 8        qw
> >> >>
> >> >> When I should get:
> >> >>
> >> >> Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
> >> >> Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
> >> >> Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
> >> >> Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
> >> >> Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
> >> >> Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
> >> >> Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
> >> >> Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
> >> >>
> >> >> If I try to then submit a job directly (with no wildcard) to the PE
> >> >> that all of the jobs were assigned to, it will not run, because I
> >> >> have already far exceeded the slot limit for this PE.
> >> >>
> >> >> I should note that when I do not use wildcards, everything behaves as
> >> >> it should.  E.g., a job submitted to mymachine.2.mpi will be assigned
> >> >> to mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16
> >> >> slots in mymachine.2.mpi at once.
> >> >>
> >> >> I searched the list, and although there seem to have been other
> >> >> problems with wildcards in the past, I have seen nothing that
> >> >> references this behavior.  Does anyone have an explanation /
> >> >> workaround?
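> >> >>
> >> >> For concreteness, a minimal submit script of the kind I'm using (the
> >> >> script body is a placeholder; the relevant part is the -pe line):
> >> >>
> >> >> #!/bin/bash
> >> >> #$ -cwd
> >> >> #$ -pe *.mpi 8
> >> >> mpirun -np $NSLOTS ./my_mpi_program
> >> >>
> >> >> where my_mpi_program stands for the actual MPI binary.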
> >> >> Tim



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



