[GE users] Wildcards in PE still broken in 6.0u3

Tim Mueller tim_mueller at hotmail.com
Fri Apr 29 19:11:44 BST 2005



That's what I had hoped initially.  However, it does not explain why no jobs 
get assigned to mymachine.q.3, which is the only queue to which they should 
get assigned.  It appears that jobs get rejected from this queue because the 
scheduler believes mymachine.3.mpi is too full.
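
For reference, the scheduler's own reasoning for the pending job (job 61 below) 
can be checked with something like

qstat -j 61

and looking at the "scheduling info" section at the end of its output; this 
assumes schedd_job_info is enabled in the scheduler configuration (qconf -ssconf 
/ qconf -msconf).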

qstat -g t gives the following:

job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
------------------------------------------------------------------------------------------------------------------
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local0           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local0           MASTER
                                                                   mymachine.q.0@local0           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local1           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local1           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local2           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local2           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local3           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local3           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local4           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local4           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local5           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local5           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local6           MASTER
                                                                   mymachine.q.0@local6           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local6           SLAVE
     47 0.55500 Job2       user         r     04/27/2005 14:45:04 mymachine.q.0@local7           SLAVE
     59 0.55500 Job1       user         r     04/29/2005 10:47:08 mymachine.q.0@local7           SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local10          SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local10          SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local11          SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local11          SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local12          MASTER
                                                                   mymachine.q.1@local12          SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local12          SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local13          SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local13          SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local14          SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local14          SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local15          SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local15          SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local8           SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local8           SLAVE
     44 0.55500 Job3       user         r     04/27/2005 11:55:49 mymachine.q.1@local9           SLAVE
     60 0.55500 Job4       user         r     04/29/2005 10:55:53 mymachine.q.1@local9           MASTER
                                                                   mymachine.q.1@local9           SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local16          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local16          MASTER
                                                                   mymachine.q.2@local16          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local17          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local17          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local18          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local18          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local19          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local19          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local20          MASTER
                                                                   mymachine.q.2@local20          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local20          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local21          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local21          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local22          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local22          SLAVE
     48 0.55500 Job6       user         r     04/27/2005 14:57:53 mymachine.q.2@local23          SLAVE
     49 0.55500 Job5       user         r     04/27/2005 15:01:53 mymachine.q.2@local23          SLAVE
     61 0.55500 Job7       user         qw    04/29/2005 11:19:54
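
For what it's worth, a rough way to tally how many slots each PE has actually 
been granted -- based on the "Granted PE:" lines that qstat -r prints, as in the 
output quoted further down -- is something like:

qstat -r | awk '/Granted PE/ {slots[$3] += $4} END {for (pe in slots) print pe, slots[pe]}'

Given that qstat -r output, this would show mymachine.3.mpi with 48 granted 
slots against its 16-slot limit.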

Tim

----- Original Message ----- 
From: "Reuti" <reuti at staff.uni-marburg.de>
To: <users at gridengine.sunsource.net>
Sent: Friday, April 29, 2005 1:24 PM
Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3


> Aha Tim,
>
> now I understand your setup. Since the naming of the master queue, e.g.
> mymachine.q.1@local9, is within your intended configuration, what does
>
> qstat -g t
>
> show? Maybe the output of the granted PE is just wrong, but everything is
> working as intended? - Reuti
>
> Quoting Tim Mueller <tim_mueller at hotmail.com>:
>
>> There are 32 machines, each dual-processor with names
>>
>> local0
>> local1
>> ..
>> local31
>>
>> They are grouped together with four 8-port gigabit switches.  Each group
>> was given a queue, a PE, and a hostgroup.  So for example @mymachine-0
>> contains
>>
>> local0
>> local1
>> ..
>> local7
>>
>> local0-local7 are all connected via both the central cluster switch and a
>> local gigabit switch.
>>
>> I should also note that I am using hostname aliasing to ensure that the
>> Ethernet interface connected to the gigabit switch is used by Grid Engine.
>> So I have a host_aliases file set up as follows:
>>
>> local0 node0
>> local1 node1
>> ..
>> local31 node31
>>
>> Where "nodeX" is the primary hostname for each machine and resolves to 
>> the
>> interface that connets to the central cluster switch.  "localX" resolves 
>> to
>>
>> an address that connects via the gigabit interface if possible.  The
>> "localX" names do not resolve consistently across the cluster -- for 
>> example
>>
>> if I am on node0 and I ping local1, it will do so over the gigabit
>> interface.  However if I am on node31 and I ping local1, it will do so 
>> over
>>
>> the non-gigabit interface, because there is no gigabit connection between
>> node31 and node1.
>>
>> Tim
>>
>> ----- Original Message ----- 
>> From: "Reuti" <reuti at staff.uni-marburg.de>
>> To: <users at gridengine.sunsource.net>
>> Sent: Friday, April 29, 2005 12:15 PM
>> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>
>>
>> > Tim,
>> >
>> > thanks, but I'm still not sure about your setup. You stated that you have
>> > 32 dual machines. So you made a hostgroup @mymachine-0 -- which machines are
>> > set up in it? And why so many queues at all?
>> >
>> > CU - Reuti
>> >
>> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
>> >
>> >> Hi,
>> >>
>> >> I get:
>> >>
>> >>      59 0.55500 Job1    user        r     04/29/2005 10:47:08 mymachine.q.0@local0          8
>> >>        Full jobname:     Job1
>> >>        Master queue:     mymachine.q.0@local0
>> >>        Requested PE:     *.mpi 8
>> >>        Granted PE:       mymachine.3.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>      47 0.55500 Job2    user        r     04/27/2005 14:45:04 mymachine.q.0@local6          8
>> >>        Full jobname:     Job2
>> >>        Master queue:     mymachine.q.0@local6
>> >>        Requested PE:     *.mpi 8
>> >>        Granted PE:       mymachine.3.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>      44 0.55500 Job3    user        r     04/27/2005 11:55:49 mymachine.q.1@local12         8
>> >>        Full jobname:     Job3
>> >>        Master queue:     mymachine.q.1@local12
>> >>        Requested PE:     *.mpi 8
>> >>        Granted PE:       mymachine.3.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>      60 0.55500 Job4    user        r     04/29/2005 10:55:53 mymachine.q.1@local9          8
>> >>        Full jobname:     Job4
>> >>        Master queue:     mymachine.q.1@local9
>> >>        Requested PE:     *.mpi 8
>> >>        Granted PE:       mymachine.3.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>      49 0.55500 Job5    user        r     04/27/2005 15:01:53 mymachine.q.2@local16         8
>> >>        Full jobname:     Job5
>> >>        Master queue:     mymachine.q.2@local16
>> >>        Requested PE:     *.mpi 8
>> >>        Granted PE:       mymachine.3.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>      48 0.55500 Job6    user        r     04/27/2005 14:57:53 mymachine.q.2@local20         8
>> >>        Full jobname:     Job6
>> >>        Master queue:     mymachine.q.2@local20
>> >>        Requested PE:     *.mpi 8
>> >>        Granted PE:       mymachine.3.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>      61 0.55500 Job7    user        r     04/29/2005 11:19:54                               8
>> >>        Full jobname:     Job7
>> >>        Requested PE:     *.mpi 8
>> >>        Hard Resources:
>> >>        Soft Resources:
>> >>
>> >> When I do qconf -sp mymachine.3.mpi, I get:
>> >>
>> >> pe_name           mymachine.3.mpi
>> >> slots             16
>> >> user_lists        NONE
>> >> xuser_lists       NONE
>> >> start_proc_args   /bin/true
>> >> stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
>> >> allocation_rule   $round_robin
>> >> control_slaves    TRUE
>> >> job_is_first_task FALSE
>> >> urgency_slots     avg
>> >>
>> >> When I do qconf -sq mymachine.q.0, I get
>> >>
>> >> qname                 mymachine.q.0
>> >> hostlist              @mymachine-0
>> >> seq_no                0
>> >> load_thresholds       NONE
>> >> suspend_thresholds    NONE
>> >> nsuspend              1
>> >> suspend_interval      00:05:00
>> >> priority              0
>> >> min_cpu_interval      00:05:00
>> >> processors            UNDEFINED
>> >> qtype                 BATCH INTERACTIVE
>> >> ckpt_list             NONE
>> >> pe_list               mymachine.0.mpi
>> >> rerun                 FALSE
>> >> slots                 2
>> >> tmpdir                /tmp
>> >> shell                 /bin/bash
>> >> prolog                NONE
>> >> epilog                NONE
>> >> shell_start_mode      posix_compliant
>> >> starter_method        NONE
>> >> suspend_method        NONE
>> >> resume_method         NONE
>> >> terminate_method      NONE
>> >> notify                00:00:60
>> >> owner_list            sgeadmin
>> >> user_lists            NONE
>> >> xuser_lists           NONE
>> >> subordinate_list      NONE
>> >> complex_values        NONE
>> >> projects              NONE
>> >> xprojects             NONE
>> >> calendar              NONE
>> >> initial_state         default
>> >> s_rt                  84:00:00
>> >> h_rt                  84:15:00
>> >> s_cpu                 INFINITY
>> >> h_cpu                 INFINITY
>> >> s_fsize               INFINITY
>> >> h_fsize               INFINITY
>> >> s_data                INFINITY
>> >> h_data                INFINITY
>> >> s_stack               INFINITY
>> >> h_stack               INFINITY
>> >> s_core                INFINITY
>> >> h_core                INFINITY
>> >> s_rss                 1G
>> >> h_rss                 1G
>> >> s_vmem                INFINITY
>> >> h_vmem                INFINITY
>> >>
>> >> And so on, up to mymachine.q.3.
>> >>
>> >> Tim
>> >>
>> >> ----- Original Message ----- 
>> >> From: "Reuti" <reuti at staff.uni-marburg.de>
>> >> To: <users at gridengine.sunsource.net>
>> >> Sent: Friday, April 29, 2005 11:14 AM
>> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >>
>> >>
>> >> > Hi Tim,
>> >> >
>> >> > what is:
>> >> >
>> >> > qstat -r
>> >> >
>> >> > showing as granted PEs? - Reuti
>> >> >
>> >> >
>> >> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> That's the problem.  The setup is actually
>> >> >>
>> >> >> mymachine.q.0 references mymachine.0.mpi
>> >> >> mymachine.q.1 references mymachine.1.mpi
>> >> >> mymachine.q.2 references mymachine.2.mpi
>> >> >> mymachine.q.3 references mymachine.3.mpi
>> >> >>
>> >> >> There is no reason, as far as I can tell, that a job could ever be in
>> >> >> both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
>> >> >> wildcards, the scheduler won't put a job assigned to mymachine.3.mpi into
>> >> >> mymachine.q.3 until all of the other queues are full.  At that point,
>> >> >> it's too late, because mymachine.3.mpi is using 48 slots when it's only
>> >> >> allowed to use up to 16.
>> >> >>
>> >> >> When I don't use wildcards, I get the behavior I expect: a job submitted
>> >> >> to mymachine.3.mpi gets put in mymachine.q.3, etc.
>> >> >>
>> >> >> Tim
>> >> >>
>> >> >> ----- Original Message ----- 
>> >> >> From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
>> >> >> <stephan.grell at sun.com>
>> >> >> To: <users at gridengine.sunsource.net>
>> >> >> Sent: Friday, April 29, 2005 2:34 AM
>> >> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >> >>
>> >> >>
>> >> >> > Hi Tim,
>> >> >> >
>> >> >> > I am not quite sure I understand your setup. Could you please attach
>> >> >> > your cqueue configuration? From the results you posted, it reads as if:
>> >> >> >
>> >> >> > queue mymachine.q.0 references mymachine.3.mpi
>> >> >> > queue mymachine.q.1 references mymachine.3.mpi
>> >> >> >
>> >> >> > and so on.
>> >> >> >
>> >> >> > Cheers,
>> >> >> > Stephan
>> >> >> >
>> >> >> > Tim Mueller wrote:
>> >> >> >
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> It appears that wildcards in the Parallel Environment name still have
>> >> >> >> problems in 6.0u3.  I have set up a Linux cluster of 32 dual-processor
>> >> >> >> Noconas.  There are 4 queues of 16 processors each, and a corresponding
>> >> >> >> PE for each queue.  The queues are named as follows:
>> >> >> >>
>> >> >> >> mymachine.q.0
>> >> >> >> mymachine.q.1
>> >> >> >> mymachine.q.2
>> >> >> >> mymachine.q.3
>> >> >> >>
>> >> >> >> And the PEs are
>> >> >> >>
>> >> >> >> mymachine.0.mpi
>> >> >> >> mymachine.1.mpi
>> >> >> >> mymachine.2.mpi
>> >> >> >> mymachine.3.mpi
>> >> >> >>
>> >> >> >> All of the PEs have 16 slots.  When I submit a job with the following
>> >> >> >> line:
>> >> >> >>
>> >> >> >> #$ -pe *.mpi 8
>> >> >> >>
>> >> >> >> the job will be assigned to a seemingly random PE, but then placed in a
>> >> >> >> queue that does not correspond to that PE.  I can submit up to 6 jobs
>> >> >> >> this way, each of which will get assigned to the same PE and placed in
>> >> >> >> any queue that does not correspond to the PE.  This causes 48 processors
>> >> >> >> to be used for a PE with only 16 slots.  E.g., I might get:
>> >> >> >>
>> >> >> >> Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
>> >> >> >> Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
>> >> >> >> Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
>> >> >> >> Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
>> >> >> >> Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
>> >> >> >> Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
>> >> >> >> Job 7        qw
>> >> >> >> Job 8        qw
>> >> >> >>
>> >> >> >> When I should get:
>> >> >> >>
>> >> >> >> Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
>> >> >> >> Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
>> >> >> >> Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
>> >> >> >> Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
>> >> >> >> Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
>> >> >> >> Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
>> >> >> >> Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
>> >> >> >> Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
>> >> >> >>
>> >> >> >> If I then try to submit a job directly (with no wildcard) to the PE that
>> >> >> >> all of the jobs were assigned to, it will not run, because I have
>> >> >> >> already far exceeded the slots limit for this PE.
>> >> >> >>
>> >> >> >> I should note that when I do not use wildcards, everything behaves as it
>> >> >> >> should.  E.g., a job submitted to mymachine.2.mpi will be assigned to
>> >> >> >> mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16 slots
>> >> >> >> in mymachine.2.mpi at once.
>> >> >> >>
>> >> >> >> I searched the list, and although there seem to have been other problems
>> >> >> >> with wildcards in the past, I have seen nothing that references this
>> >> >> >> behavior.  Does anyone have an explanation / workaround?
>> >> >> >>
>> >> >> >> Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



