[GE users] Wildcards in PE still broken in 6.0u3

Tim Mueller tim_mueller at hotmail.com
Fri Apr 29 17:46:03 BST 2005



There are 32 machines, each dual-processor, with the names

local0
local1
..
local31

They are grouped into four sets of eight, each set sharing one of the four 
8-port gigabit switches.  Each group was given its own queue, PE, and 
hostgroup.  So, for example, @mymachine-0 contains

local0
local1
..
local7

local0-local7 are all connected via both the central cluster switch and a 
local gigabit switch.
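
(For reference, each hostgroup is just the eight "localX" names behind one 
switch, so the first one looks roughly like the sketch below -- the others 
follow the same pattern:

  qconf -shgrp @mymachine-0
  group_name @mymachine-0
  hostlist local0 local1 local2 local3 local4 local5 local6 local7

Each queue then points at its group via "hostlist @mymachine-X", as in the 
qconf -sq output quoted further down.)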

I should also note that I am using hostname aliasing to ensure that Grid 
Engine uses the ethernet interface connected to the gigabit switch.  So I 
have a host_aliases file set up as follows:

local0 node0
local1 node1
..
local31 node31

Where "nodeX" is the primary hostname for each machine and resolves to the 
interface that connets to the central cluster switch.  "localX" resolves to 
an address that connects via the gigabit interface if possible.  The 
"localX" names do not resolve consistently across the cluster -- for example 
if I am on node0 and I ping local1, it will do so over the gigabit 
interface.  However if I am on node31 and I ping local1, it will do so over 
the non-gigabit interface, because there is no gigabit connection between 
node31 and node1.
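
(As a quick way to check how these names resolve on a given node, the 
gethostbyname utility that ships with Grid Engine under 
$SGE_ROOT/utilbin/<arch>/ can be used -- the arch directory below is only 
an example, adjust it to whatever your nodes use:

  $SGE_ROOT/utilbin/lx24-amd64/gethostbyname local1

Run from node0 this should report the gigabit address for local1; run from 
node31 it should report the non-gigabit one, matching the ping behavior 
described above.)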

Tim

----- Original Message ----- 
From: "Reuti" <reuti at staff.uni-marburg.de>
To: <users at gridengine.sunsource.net>
Sent: Friday, April 29, 2005 12:15 PM
Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3


> Tim,
>
> thanks, but I'm still not sure about your setup. You stated that you have 32
> dual machines. So you made a hostgroup @mymachine-0 -- which machines are set
> up in it? And why so many queues at all?
>
> CU - Reuti
>
> Quoting Tim Mueller <tim_mueller at hotmail.com>:
>
>> Hi,
>>
>> I get:
>>
>>      59 0.55500 Job1    user        r     04/29/2005 10:47:08
>> mymachine.q.0 at local0                     8
>>        Full jobname:     Job1
>>        Master queue:     mymachine.q.0 at local0
>>        Requested PE:     *.mpi 8
>>        Granted PE:       mymachine.3.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>      47 0.55500 Job2    user        r     04/27/2005 14:45:04
>> mymachine.q.0 at local6                     8
>>        Full jobname:     Job2
>>        Master queue:     mymachine.q.0 at local6
>>        Requested PE:     *.mpi 8
>>        Granted PE:       mymachine.3.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>      44 0.55500 Job3    user        r     04/27/2005 11:55:49
>> mymachine.q.1 at local12                    8
>>        Full jobname:     Job3
>>        Master queue:     mymachine.q.1 at local12
>>        Requested PE:     *.mpi 8
>>        Granted PE:       mymachine.3.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>      60 0.55500 Job4    user        r     04/29/2005 10:55:53
>> mymachine.q.1 at local9                     8
>>        Full jobname:     Job4
>>        Master queue:     mymachine.q.1 at local9
>>        Requested PE:     *.mpi 8
>>        Granted PE:       mymachine.3.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>      49 0.55500 Job5    user        r     04/27/2005 15:01:53
>> mymachine.q.2 at local16                    8
>>        Full jobname:     Job5
>>        Master queue:     mymachine.q.2 at local16
>>        Requested PE:     *.mpi 8
>>        Granted PE:       mymachine.3.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>      48 0.55500 Job6    user        r     04/27/2005 14:57:53
>> mymachine.q.2 at local20                    8
>>        Full jobname:     Job6
>>        Master queue:     mymachine.q.2 at local20
>>        Requested PE:     *.mpi 8
>>        Granted PE:       mymachine.3.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>      61 0.55500 Job7    user        r    04/29/2005 11:19:54
>> 8
>>        Full jobname:     Job7
>>        Requested PE:     *.mpi 8
>>        Hard Resources:
>>        Soft Resources:
>>
>> When I do qconf -sp mymachine.3.mpi, I get:
>>
>> pe_name           mymachine.3.mpi
>> slots             16
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /bin/true
>> stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     avg
>>
>> When I do qconf -sq mymachine.q.0, I get
>>
>> qname                 mymachine.q.0
>> hostlist              @mymachine-0
>> seq_no                0
>> load_thresholds       NONE
>> suspend_thresholds    NONE
>> nsuspend              1
>> suspend_interval      00:05:00
>> priority              0
>> min_cpu_interval      00:05:00
>> processors            UNDEFINED
>> qtype                 BATCH INTERACTIVE
>> ckpt_list             NONE
>> pe_list               mymachine.0.mpi
>> rerun                 FALSE
>> slots                 2
>> tmpdir                /tmp
>> shell                 /bin/bash
>> prolog                NONE
>> epilog                NONE
>> shell_start_mode      posix_compliant
>> starter_method        NONE
>> suspend_method        NONE
>> resume_method         NONE
>> terminate_method      NONE
>> notify                00:00:60
>> owner_list            sgeadmin
>> user_lists            NONE
>> xuser_lists           NONE
>> subordinate_list      NONE
>> complex_values        NONE
>> projects              NONE
>> xprojects             NONE
>> calendar              NONE
>> initial_state         default
>> s_rt                  84:00:00
>> h_rt                  84:15:00
>> s_cpu                 INFINITY
>> h_cpu                 INFINITY
>> s_fsize               INFINITY
>> h_fsize               INFINITY
>> s_data                INFINITY
>> h_data                INFINITY
>> s_stack               INFINITY
>> h_stack               INFINITY
>> s_core                INFINITY
>> h_core                INFINITY
>> s_rss                 1G
>> h_rss                 1G
>> s_vmem                INFINITY
>> h_vmem                INFINITY
>>
>> And so on, up to mymachine.q.3.
>>
>> Tim
>>
>> ----- Original Message ----- 
>> From: "Reuti" <reuti at staff.uni-marburg.de>
>> To: <users at gridengine.sunsource.net>
>> Sent: Friday, April 29, 2005 11:14 AM
>> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>
>>
>> > Hi Tim,
>> >
>> > what is:
>> >
>> > qstat -r
>> >
>> > showing as granted PEs? - Reuti
>> >
>> >
>> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
>> >
>> >> Hi,
>> >>
>> >> That's the problem.  The setup is actually
>> >>
>> >> mymachine.q.0 references mymachine.0.mpi
>> >> mymachine.q.1 references mymachine.1.mpi
>> >> mymachine.q.2 references mymachine.2.mpi
>> >> mymachine.q.3 references mymachine.3.mpi
>> >>
>> >> There is no reason, as far as I can tell, that a job could ever be in
>> >> both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
>> >> wildcards, the scheduler won't put a job assigned to mymachine.3.mpi into
>> >> mymachine.q.3 until all of the other queues are full.  At that point,
>> >> it's too late, because mymachine.3.mpi is using 48 slots when it's only
>> >> allowed to use up to 16.
>> >>
>> >> When I don't use wildcards, I get the behavior I expect: a job submitted
>> >> to mymachine.3.mpi gets put in mymachine.q.3, etc.
>> >>
>> >> Tim
>> >>
>> >> ----- Original Message ----- 
>> >> From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
>> >> <stephan.grell at sun.com>
>> >> To: <users at gridengine.sunsource.net>
>> >> Sent: Friday, April 29, 2005 2:34 AM
>> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >>
>> >>
>> >> > Hi Tim,
>> >> >
>> >> > I am not quite sure I understand your setup. Could you please attach
>> >> > your
>> >>
>> >> > cqueue configuration? From
>> >> > the results you posted, it reads as if:
>> >> > queue
>> >> > mymachine.q.0  references mymachine.3.mpi
>> >> > mymachine.q.1  references mymachine.3.mpi
>> >> >
>> >> > and so on.
>> >> >
>> >> > Cheers,
>> >> > Stephan
>> >> >
>> >> > Tim Mueller wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> It appears that wildcards in the Parallel Environment name still have
>> >> >> problems in 6.0u3.  I have set up a cluster of 32 dual-processor
>> >> >> Noconas running Linux.  There are 4 queues of 16 processors each, and
>> >> >> a corresponding PE for each queue.  The queues are named as follows:
>> >> >>
>> >> >> mymachine.q.0
>> >> >> mymachine.q.1
>> >> >> mymachine.q.2
>> >> >> mymachine.q.3
>> >> >>
>> >> >> And the PEs are
>> >> >>
>> >> >> mymachine.0.mpi
>> >> >> mymachine.1.mpi
>> >> >> mymachine.2.mpi
>> >> >> mymachine.3.mpi
>> >> >>
>> >> >> All of the PEs have 16 slots.  When I submit a job with the following
>> >> >> line:
>> >> >>
>> >> >> #$ -pe *.mpi 8
>> >> >>
>> >> >> the job will be assigned to a seemingly random PE, but then placed in
>> >> >> a queue that does not correspond to that PE.  I can submit up to 6
>> >> >> jobs this way, each of which will get assigned to the same PE and
>> >> >> placed in any queue that does not correspond to the PE.  This causes
>> >> >> 48 processors to be used for a PE with only 16 slots.  E.g., I might
>> >> >> get:
>> >> >>
>> >> >> Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
>> >> >> Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
>> >> >> Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
>> >> >> Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
>> >> >> Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
>> >> >> Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
>> >> >> Job 7        qw
>> >> >> Job 8        qw
>> >> >>
>> >> >> When I should get:
>> >> >>
>> >> >> Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
>> >> >> Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
>> >> >> Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
>> >> >> Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
>> >> >> Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
>> >> >> Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
>> >> >> Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
>> >> >> Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
>> >> >>
>> >> >> If I then try to submit a job directly (with no wildcard) to the PE
>> >> >> that all of the jobs were assigned to, it will not run, because I
>> >> >> have already far exceeded the slot limit for this PE.
>> >> >>
>> >> >> I should note that when I do not use wildcards, everything behaves as
>> >> >> it should.  E.g., a job submitted to mymachine.2.mpi will be assigned
>> >> >> to mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16
>> >> >> slots in mymachine.2.mpi at once.
>> >> >>
>> >> >> I searched the list, and although there seem to have been other
>> >> >> problems with wildcards in the past, I have seen nothing that
>> >> >> references this behavior.  Does anyone have an explanation /
>> >> >> workaround?
>> >> >>
>> >> >> Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list