[GE users] Wildcards in PE still broken in 6.0u3

Tim Mueller tim_mueller at hotmail.com
Sat Apr 30 02:49:26 BST 2005



For future reference, here is the issue:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=1597
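
For anyone hitting the same problem, below is a quick sketch of how to double-check the queue/PE wiring and how to work around the bug until it is fixed. The commands are standard SGE 6.0 client calls; the queue, PE, and script names are taken from the example configuration discussed in this thread, so adjust them to your own site:

  # confirm that each cluster queue references only its own PE
  for q in mymachine.q.0 mymachine.q.1 mymachine.q.2 mymachine.q.3; do
      qconf -sq $q | grep pe_list
  done

  # list all PEs and inspect one of them (each should have slots 16)
  qconf -spl
  qconf -sp mymachine.3.mpi

  # workaround: name the PE explicitly instead of "-pe *.mpi 8";
  # per this thread, explicit (non-wildcard) requests behave as expected
  qsub -pe mymachine.3.mpi 8 Job7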

Tim

----- Original Message ----- 
From: "Reuti" <reuti at staff.uni-marburg.de>
To: <users at gridengine.sunsource.net>
Sent: Friday, April 29, 2005 4:01 PM
Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3


> Hi Tim,
>
> I could reproduce the weird behavior. Can you please file a bug? As I
> found, it was still working in 6.0u1, so it must have been introduced in
> one of the following releases. It seems that there is now also an order in
> which the slots for the PEs are taken - the ones from mymachine.q.0 are
> taken first, then the ones from mymachine.q.1 ..
>
> Cheers - Reuti
>
> Quoting Tim Mueller <tim_mueller at hotmail.com>:
>
>> Some more information...  If I run qstat -j 61, I get the output below.
>>
>> Tim
>>
>> ....................
>> JOB INFO CUT
>> ......................
>> script_file:                Job7
>> parallel environment:  *.mpi range: 8
>> scheduling info:            queue instance "mymachine.q.0@local0" dropped because it is full
>>                             queue instance "mymachine.q.0@local1" dropped because it is full
>>                             queue instance "mymachine.q.0@local2" dropped because it is full
>>                             queue instance "mymachine.q.0@local3" dropped because it is full
>>                             queue instance "mymachine.q.0@local4" dropped because it is full
>>                             queue instance "mymachine.q.0@local5" dropped because it is full
>>                             queue instance "mymachine.q.0@local6" dropped because it is full
>>                             queue instance "mymachine.q.0@local7" dropped because it is full
>>                             queue instance "mymachine.q.1@local10" dropped because it is full
>>                             queue instance "mymachine.q.1@local11" dropped because it is full
>>                             queue instance "mymachine.q.1@local12" dropped because it is full
>>                             queue instance "mymachine.q.1@local13" dropped because it is full
>>                             queue instance "mymachine.q.1@local14" dropped because it is full
>>                             queue instance "mymachine.q.1@local15" dropped because it is full
>>                             queue instance "mymachine.q.1@local8" dropped because it is full
>>                             queue instance "mymachine.q.1@local9" dropped because it is full
>>                             queue instance "mymachine.q.2@local16" dropped because it is full
>>                             queue instance "mymachine.q.2@local17" dropped because it is full
>>                             queue instance "mymachine.q.2@local18" dropped because it is full
>>                             queue instance "mymachine.q.2@local19" dropped because it is full
>>                             queue instance "mymachine.q.2@local20" dropped because it is full
>>                             queue instance "mymachine.q.2@local21" dropped because it is full
>>                             queue instance "mymachine.q.2@local22" dropped because it is full
>>                             queue instance "mymachine.q.2@local23" dropped because it is full
>>                             cannot run in queue instance "mymachine.q.3@local30" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local26" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local25" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local24" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local27" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local28" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local29" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run in queue instance "mymachine.q.3@local31" because PE "mymachine.2.mpi" is not in pe list
>>                             cannot run because resources requested are not available for parallel job
>>                             cannot run because available slots combined under PE "mymachine.2.mpi" are not in range of job
>>                             cannot run because available slots combined under PE "mymachine.3.mpi" are not in range of job
>>
>> ----- Original Message ----- 
>> From: "Tim Mueller" <tim_mueller at hotmail.com>
>> To: <users at gridengine.sunsource.net>
>> Sent: Friday, April 29, 2005 2:11 PM
>> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>
>>
>> > That's what I had hoped initially.  However, it does not explain why no
>> > jobs get assigned to mymachine.q.3, which is the only queue to which they
>> > should get assigned.  It appears that jobs get rejected from this queue
>> > because the scheduler believes mymachine.3.mpi is too full.
>> >
>> > qstat -g t gives the following:
>> >
>> > job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
>> > ------------------------------------------------------------------------------------------------------------------
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local0                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local0                 MASTER
>> >                                                                  mymachine.q.0@local0                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local1                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local1                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local2                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local2                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local3                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local3                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local4                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local4                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local5                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local5                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local6                 MASTER
>> >                                                                  mymachine.q.0@local6                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local6                 SLAVE
>> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 mymachine.q.0@local7                 SLAVE
>> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 mymachine.q.0@local7                 SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local10                SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local10                SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local11                SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local11                SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local12                MASTER
>> >                                                                  mymachine.q.1@local12                SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local12                SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local13                SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local13                SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local14                SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local14                SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local15                SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local15                SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local8                 SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local8                 SLAVE
>> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 mymachine.q.1@local9                 SLAVE
>> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 mymachine.q.1@local9                 MASTER
>> >                                                                  mymachine.q.1@local9                 SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local16                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local16                MASTER
>> >                                                                  mymachine.q.2@local16                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local17                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local17                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local18                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local18                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local19                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local19                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local20                MASTER
>> >                                                                  mymachine.q.2@local20                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local20                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local21                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local21                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local22                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local22                SLAVE
>> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 mymachine.q.2@local23                SLAVE
>> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 mymachine.q.2@local23                SLAVE
>> >     61 0.55500 Job7        user        qw    04/29/2005 11:19:54
>> >
>> > Tim
>> >
>> > ----- Original Message ----- 
>> > From: "Reuti" <reuti at staff.uni-marburg.de>
>> > To: <users at gridengine.sunsource.net>
>> > Sent: Friday, April 29, 2005 1:24 PM
>> > Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >
>> >
>> >> Aha Tim,
>> >>
>> >> now I understand your setup. As the naming of the master queue, e.g.
>> >> mymachine.q.1@local9, is within your intended configuration, what does
>> >>
>> >> qstat -g t
>> >>
>> >> show? Maybe the output of the granted PE is just wrong, but everything
>> >> is working as intended? - Reuti
>> >>
>> >> Quoting Tim Mueller <tim_mueller at hotmail.com>:
>> >>
>> >>> There are 32 machines, each dual-processor, with names
>> >>>
>> >>> local0
>> >>> local1
>> >>> ..
>> >>> local31
>> >>>
>> >>> They are grouped together with four 8-port gigabit switches.  Each group
>> >>> was given a queue, a PE, and a hostgroup.  So for example @mymachine-0
>> >>> contains
>> >>>
>> >>> local0
>> >>> local1
>> >>> ..
>> >>> local7
>> >>>
>> >>> local0-local7 are all connected via both the central cluster switch and
>> >>> a local gigabit switch.
>> >>>
>> >>> I should also note that I am using hostname aliasing to ensure that the
>> >>> ethernet interface connected to the gigabit switch is used by Grid
>> >>> Engine.  So I have a host_aliases file set up as follows:
>> >>>
>> >>> local0 node0
>> >>> local1 node1
>> >>> ..
>> >>> local31 node31
>> >>>
>> >>> Where "nodeX" is the primary hostname for each machine and resolves to
>> >>> the interface that connects to the central cluster switch.  "localX"
>> >>> resolves to an address that connects via the gigabit interface if
>> >>> possible.  The "localX" names do not resolve consistently across the
>> >>> cluster -- for example, if I am on node0 and I ping local1, it will do
>> >>> so over the gigabit interface.  However, if I am on node31 and I ping
>> >>> local1, it will do so over the non-gigabit interface, because there is
>> >>> no gigabit connection between node31 and node1.
>> >>>
>> >>> Tim
>> >>>
>> >>> ----- Original Message ----- 
>> >>> From: "Reuti" <reuti at staff.uni-marburg.de>
>> >>> To: <users at gridengine.sunsource.net>
>> >>> Sent: Friday, April 29, 2005 12:15 PM
>> >>> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >>>
>> >>>
>> >>> > Tim,
>> >>> >
>> >>> > thanks, but I'm still not sure about your setup. You stated that you
>> >>> > have 32 dual machines. So you made a hostgroup @mymachine-0 - with
>> >>> > which machines set up therein? And why so many queues at all?
>> >>> >
>> >>> > CU - Reuti
>> >>> >
>> >>> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
>> >>> >
>> >>> >> Hi,
>> >>> >>
>> >>> >> I get:
>> >>> >>
>> >>> >>      59 0.55500 Job1    user        r     04/29/2005 10:47:08 mymachine.q.0@local0                     8
>> >>> >>        Full jobname:     Job1
>> >>> >>        Master queue:     mymachine.q.0@local0
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Granted PE:       mymachine.3.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>      47 0.55500 Job2    user        r     04/27/2005 14:45:04 mymachine.q.0@local6                     8
>> >>> >>        Full jobname:     Job2
>> >>> >>        Master queue:     mymachine.q.0@local6
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Granted PE:       mymachine.3.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>      44 0.55500 Job3    user        r     04/27/2005 11:55:49 mymachine.q.1@local12                    8
>> >>> >>        Full jobname:     Job3
>> >>> >>        Master queue:     mymachine.q.1@local12
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Granted PE:       mymachine.3.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>      60 0.55500 Job4    user        r     04/29/2005 10:55:53 mymachine.q.1@local9                     8
>> >>> >>        Full jobname:     Job4
>> >>> >>        Master queue:     mymachine.q.1@local9
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Granted PE:       mymachine.3.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>      49 0.55500 Job5    user        r     04/27/2005 15:01:53 mymachine.q.2@local16                    8
>> >>> >>        Full jobname:     Job5
>> >>> >>        Master queue:     mymachine.q.2@local16
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Granted PE:       mymachine.3.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>      48 0.55500 Job6    user        r     04/27/2005 14:57:53 mymachine.q.2@local20                    8
>> >>> >>        Full jobname:     Job6
>> >>> >>        Master queue:     mymachine.q.2@local20
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Granted PE:       mymachine.3.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>      61 0.55500 Job7    user        r     04/29/2005 11:19:54                                          8
>> >>> >>        Full jobname:     Job7
>> >>> >>        Requested PE:     *.mpi 8
>> >>> >>        Hard Resources:
>> >>> >>        Soft Resources:
>> >>> >>
>> >>> >> When I do qconf -sp mymachine.3.mpi, I get:
>> >>> >>
>> >>> >> pe_name           mymachine.3.mpi
>> >>> >> slots             16
>> >>> >> user_lists        NONE
>> >>> >> xuser_lists       NONE
>> >>> >> start_proc_args   /bin/true
>> >>> >> stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
>> >>> >> allocation_rule   $round_robin
>> >>> >> control_slaves    TRUE
>> >>> >> job_is_first_task FALSE
>> >>> >> urgency_slots     avg
>> >>> >>
>> >>> >> When I do qconf -sq mymachine.q.0, I get
>> >>> >>
>> >>> >> qname                 mymachine.q.0
>> >>> >> hostlist              @mymachine-0
>> >>> >> seq_no                0
>> >>> >> load_thresholds       NONE
>> >>> >> suspend_thresholds    NONE
>> >>> >> nsuspend              1
>> >>> >> suspend_interval      00:05:00
>> >>> >> priority              0
>> >>> >> min_cpu_interval      00:05:00
>> >>> >> processors            UNDEFINED
>> >>> >> qtype                 BATCH INTERACTIVE
>> >>> >> ckpt_list             NONE
>> >>> >> pe_list               mymachine.0.mpi
>> >>> >> rerun                 FALSE
>> >>> >> slots                 2
>> >>> >> tmpdir                /tmp
>> >>> >> shell                 /bin/bash
>> >>> >> prolog                NONE
>> >>> >> epilog                NONE
>> >>> >> shell_start_mode      posix_compliant
>> >>> >> starter_method        NONE
>> >>> >> suspend_method        NONE
>> >>> >> resume_method         NONE
>> >>> >> terminate_method      NONE
>> >>> >> notify                00:00:60
>> >>> >> owner_list            sgeadmin
>> >>> >> user_lists            NONE
>> >>> >> xuser_lists           NONE
>> >>> >> subordinate_list      NONE
>> >>> >> complex_values        NONE
>> >>> >> projects              NONE
>> >>> >> xprojects             NONE
>> >>> >> calendar              NONE
>> >>> >> initial_state         default
>> >>> >> s_rt                  84:00:00
>> >>> >> h_rt                  84:15:00
>> >>> >> s_cpu                 INFINITY
>> >>> >> h_cpu                 INFINITY
>> >>> >> s_fsize               INFINITY
>> >>> >> h_fsize               INFINITY
>> >>> >> s_data                INFINITY
>> >>> >> h_data                INFINITY
>> >>> >> s_stack               INFINITY
>> >>> >> h_stack               INFINITY
>> >>> >> s_core                INFINITY
>> >>> >> h_core                INFINITY
>> >>> >> s_rss                 1G
>> >>> >> h_rss                 1G
>> >>> >> s_vmem                INFINITY
>> >>> >> h_vmem                INFINITY
>> >>> >>
>> >>> >> And so on, up to mymachine.q.3.
>> >>> >>
>> >>> >> Tim
>> >>> >>
>> >>> >> ----- Original Message ----- 
>> >>> >> From: "Reuti" <reuti at staff.uni-marburg.de>
>> >>> >> To: <users at gridengine.sunsource.net>
>> >>> >> Sent: Friday, April 29, 2005 11:14 AM
>> >>> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >>> >>
>> >>> >>
>> >>> >> > Hi Tim,
>> >>> >> >
>> >>> >> > what is:
>> >>> >> >
>> >>> >> > qstat -r
>> >>> >> >
>> >>> >> > showing as granted PEs? - Reuti
>> >>> >> >
>> >>> >> >
>> >>> >> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
>> >>> >> >
>> >>> >> >> Hi,
>> >>> >> >>
>> >>> >> >> That's the problem.  The setup is actually
>> >>> >> >>
>> >>> >> >> mymachine.q.0 references mymachine.0.mpi
>> >>> >> >> mymachine.q.1 references mymachine.1.mpi
>> >>> >> >> mymachine.q.2 references mymachine.2.mpi
>> >>> >> >> mymachine.q.3 references mymachine.3.mpi
>> >>> >> >>
>> >>> >> >> There is no reason, as far as I can tell, that a job could ever be
>> >>> >> >> in both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when
>> >>> >> >> I use wildcards, the scheduler won't put a job assigned to
>> >>> >> >> mymachine.3.mpi into mymachine.q.3 until all of the other queues
>> >>> >> >> are full.  At that point, it's too late because mymachine.3.mpi is
>> >>> >> >> using 48 slots, when it's only allowed to use up to 16.
>> >>> >> >>
>> >>> >> >> When I don't use wildcards, I get the behavior I expect:  A job
>> >>> >> >> submitted to mymachine.3.mpi gets put in mymachine.q.3, etc.
>> >>> >> >>
>> >>> >> >> Tim
>> >>> >> >>
>> >>> >> >> ----- Original Message ----- 
>> >>> >> >> From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
>> >>> >> >> <stephan.grell at sun.com>
>> >>> >> >> To: <users at gridengine.sunsource.net>
>> >>> >> >> Sent: Friday, April 29, 2005 2:34 AM
>> >>> >> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> > Hi Tim,
>> >>> >> >> >
>> >>> >> >> > I am not quite sure I understand your setup. Could you please
>> >>> >> >> > attach your cqueue configuration? From the results you posted,
>> >>> >> >> > it reads as if:
>> >>> >> >> >
>> >>> >> >> > queue
>> >>> >> >> > mymachine.q.0  references mymachine.3.mpi
>> >>> >> >> > mymachine.q.1  references mymachine.3.mpi
>> >>> >> >> >
>> >>> >> >> > and so on.
>> >>> >> >> >
>> >>> >> >> > Cheers,
>> >>> >> >> > Stephan
>> >>> >> >> >
>> >>> >> >> > Tim Mueller wrote:
>> >>> >> >> >
>> >>> >> >> >> Hi,
>> >>> >> >> >>
>> >>> >> >> >> It appears that wildcards in the Parallel Environment name still
>> >>> >> >> >> have problems in 6.0u3.  I have set up a Linux cluster of 32
>> >>> >> >> >> dual-processor Noconas.  There are 4 queues of 16 processors
>> >>> >> >> >> each, and a corresponding PE for each queue.  The queues are
>> >>> >> >> >> named as follows:
>> >>> >> >> >>
>> >>> >> >> >> mymachine.q.0
>> >>> >> >> >> mymachine.q.1
>> >>> >> >> >> mymachine.q.2
>> >>> >> >> >> mymachine.q.3
>> >>> >> >> >>
>> >>> >> >> >> And the PE's are
>> >>> >> >> >>
>> >>> >> >> >> mymachine.0.mpi
>> >>> >> >> >> mymachine.1.mpi
>> >>> >> >> >> mymachine.2.mpi
>> >>> >> >> >> mymachine.3.mpi
>> >>> >> >> >>
>> >>> >> >> >> All of the PE's have 16 slots.  When I submit a job with the
>> >>> >> >> >> following line:
>> >>> >> >> >>
>> >>> >> >> >> #$ -pe *.mpi 8
>> >>> >> >> >>
>> >>> >> >> >> the job will be assigned to a seemingly random PE, but then
>> >>> >> >> >> placed in a queue that does not correspond to that PE.  I can
>> >>> >> >> >> submit up to 6 jobs this way, each of which will get assigned to
>> >>> >> >> >> the same PE and placed in any queue that does not correspond to
>> >>> >> >> >> the PE.  This causes 48 processors to be used for a PE with only
>> >>> >> >> >> 16 slots.  E.g., I might get:
>> >>> >> >> >>
>> >>> >> >> >> Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
>> >>> >> >> >> Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
>> >>> >> >> >> Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
>> >>> >> >> >> Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
>> >>> >> >> >> Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
>> >>> >> >> >> Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
>> >>> >> >> >> Job 7        qw
>> >>> >> >> >> Job 8        qw
>> >>> >> >> >>
>> >>> >> >> >> When I should get:
>> >>> >> >> >>
>> >>> >> >> >> Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
>> >>> >> >> >> Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
>> >>> >> >> >> Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
>> >>> >> >> >> Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
>> >>> >> >> >> Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
>> >>> >> >> >> Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
>> >>> >> >> >> Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
>> >>> >> >> >> Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
>> >>> >> >> >>
>> >>> >> >> >> If I try to then submit a job directly (with no wildcard) to the
>> >>> >> >> >> PE that all of the jobs were assigned to, it will not run
>> >>> >> >> >> because I have already far exceeded the slots limit for this PE.
>> >>> >> >> >>
>> >>> >> >> >> I should note that when I do not use wildcards, everything
>> >>> >> >> >> behaves as it should.  E.g., a job submitted to mymachine.2.mpi
>> >>> >> >> >> will be assigned to mymachine.2.mpi and mymachine.q.2, and I
>> >>> >> >> >> cannot use more than 16 slots in mymachine.2.mpi at once.
>> >>> >> >> >>
>> >>> >> >> >> I searched the list, and although there seem to have been other
>> >>> >> >> >> problems with wildcards in the past, I have seen nothing that
>> >>> >> >> >> references this behavior.  Does anyone have an explanation /
>> >>> >> >> >> workaround?
>> >>> >> >> >>
>> >>> >> >> >> Tim
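
For reference, a minimal job script of the kind that triggers the behavior described above might look like the sketch below. The script body and the mpirun line are illustrative assumptions (the thread only shows the -pe request and a LAM-based PE configuration), not the actual script used:

  #!/bin/bash
  #$ -N Job1
  #$ -cwd
  #$ -pe *.mpi 8
  # wildcard PE request as above; $NSLOTS is set by SGE to the granted slots.
  # The LAM/MPI launch and binary name are placeholders.
  mpirun -np $NSLOTS ./my_mpi_program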
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



