[GE users] Wildcards in PE still broken in 6.0u3

Reuti reuti at staff.uni-marburg.de
Fri Apr 29 21:01:12 BST 2005



Hi Tim,

I could reproduce the weird behavior. Can you please file a bug? As far as I can 
tell it was still working in 6.0u1, so it must have been introduced in one of the 
following releases. It also seems that there is now an order in which the slots 
for the PEs are taken - the ones from mymachine.q.0 are taken first, then the 
ones from mymachine.q.1, and so on.

Cheers - Reuti
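
For reference, a minimal way to reproduce this as I understand it (PE request as
in Tim's report below; the script name and the sleep payload are only illustrative
placeholders):

  $ cat wildcard_pe_test.sh
  #!/bin/sh
  #$ -pe *.mpi 8
  # placeholder payload - just holds the 8 slots long enough to inspect qstat
  sleep 600

  $ for i in 1 2 3 4 5 6 7; do qsub wildcard_pe_test.sh; done
  $ qstat -r | grep -B 4 'Granted PE'

With correct behavior every granted PE should correspond to the queue the job
actually runs in; with the bug all jobs report the same granted PE (here
mymachine.3.mpi) while being spread over the other queues.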

Quoting Tim Mueller <tim_mueller at hotmail.com>:

> Some more information...  If I run qstat -j 61, I get the output below.
> 
> Tim
> 
> ....................
> JOB INFO CUT
> ......................
> script_file:                Job7
> parallel environment:  *.mpi range: 8
> scheduling info:            queue instance "mymachine.q.0 at local0" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local1" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local2" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local3" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local4" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local5" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local6" dropped 
> because it is full
>                             queue instance "mymachine.q.0 at local7" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local10" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local11" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local12" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local13" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local14" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local15" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local8" dropped 
> because it is full
>                             queue instance "mymachine.q.1 at local9" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local16" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local17" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local18" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local19" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local20" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local21" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local22" dropped 
> because it is full
>                             queue instance "mymachine.q.2 at local23" dropped 
> because it is full
>                             cannot run in queue instance 
> "mymachine.q.3 at local30" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local26" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local25" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local24" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local27" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local28" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local29" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run in queue instance 
> "mymachine.q.3 at local31" because PE "mymachine.2.mpi" is not in pe list
>                             cannot run because resources requested are not 
> available for parallel job
>                             cannot run because available slots combined 
> under PE "mymachine.2.mpi" are not in range of job
>                             cannot run because available slots combined 
> under PE "mymachine.3.mpi" are not in range of job
> 
> ----- Original Message ----- 
> From: "Tim Mueller" <tim_mueller at hotmail.com>
> To: <users at gridengine.sunsource.net>
> Sent: Friday, April 29, 2005 2:11 PM
> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> 
> 
> > That's what I had hoped initially.  However, it does not explain why no 
> > jobs get assigned to mymachine.q.3, which is the only queue to which they 
> > should get assigned.  It appears that jobs get rejected from this queue 
> > because the scheduler believes mymachine.3.mpi is too full.
> >
> > qstat -g t gives the following:
> >
> > job-ID  prior   name       user         state submit/start at     queue 
> > master ja-task-ID
> >
> 
> > --------------------------------------------------------------------------------------------------------------
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local0                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local0                 MASTER
> > 
> > mymachine.q.0 at local0 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local1                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local1                 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local2                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local2                 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local3                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local3                 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local4                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local4                 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local5                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local5                 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local6                 MASTER
> > 
> > mymachine.q.0 at local6 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local6                 SLAVE
> >     47 0.55500 Job2        user        r     04/27/2005 14:45:04 
> > mymachine.q.0 at local7                 SLAVE
> >     59 0.55500 Job1        user        r     04/29/2005 10:47:08 
> > mymachine.q.0 at local7                 SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local10                SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local10                SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local11                SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local11                SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local12                MASTER
> > 
> > mymachine.q.1 at local12 SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local12                SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local13                SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local13                SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local14                SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local14                SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local15                SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local15                SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local8                 SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local8                 SLAVE
> >     44 0.55500 Job3        user        r     04/27/2005 11:55:49 
> > mymachine.q.1 at local9                 SLAVE
> >     60 0.55500 Job4        user        r     04/29/2005 10:55:53 
> > mymachine.q.1 at local9                 MASTER
> > 
> > mymachine.q.1 at local9 SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local16                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local16                MASTER
> > 
> > mymachine.q.2 at local16 SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local17                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local17                SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local18                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local18                SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local19                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local19                SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local20                MASTER
> > 
> > mymachine.q.2 at local20 SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local20                SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local21                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local21                SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local22                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local22                SLAVE
> >     48 0.55500 Job6        user        r     04/27/2005 14:57:53 
> > mymachine.q.2 at local23                SLAVE
> >     49 0.55500 Job5        user        r     04/27/2005 15:01:53 
> > mymachine.q.2 at local23                SLAVE
> >     61 0.55500 Job7        user        qw    04/29/2005 11:19:54
> >
> > Tim
> >
> > ----- Original Message ----- 
> > From: "Reuti" <reuti at staff.uni-marburg.de>
> > To: <users at gridengine.sunsource.net>
> > Sent: Friday, April 29, 2005 1:24 PM
> > Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >
> >
> >> Aha Tim,
> >>
> >> now I understand your setup. As the naming of the master queue, e.g.
> >> mymachine.q.1 at local9, is within your intended configuration, what does the following show:
> >>
> >> qstat -g t
> >>
> >> Maybe the output of the granted PE is just wrong, but all is working as
> >> intended? - Reuti
> >>
> >> Quoting Tim Mueller <tim_mueller at hotmail.com>:
> >>
> >>> There are 32 machines, each dual-processor with names
> >>>
> >>> local0
> >>> local1
> >>> ..
> >>> local31
> >>>
> >>> They are grouped together with four 8-port gigabit switches.  Each group was
> >>> given a queue, a PE, and a hostgroup.  So for example @mymachine-0 contains
> >>>
> >>> local0
> >>> local1
> >>> ..
> >>> local7
> >>>
> >>> local0-local7 are all connected via both the central cluster switch and a
> >>> local gigabit switch.
> >>>
> >>> I should also note that I am using hostname aliasing to ensure that the
> >>> ethernet interface connected to the gigabit switch is used by Grid 
> >>> Engine.
> >>> So I have a host_aliases file set up as follows:
> >>>
> >>> local0 node0
> >>> local1 node1
> >>> ..
> >>> local31 node31
> >>>
> >>> Where "nodeX" is the primary hostname for each machine and resolves to 
> >>> the
> >>> interface that connets to the central cluster switch.  "localX" resolves
> 
> >>> to
> >>>
> >>> an address that connects via the gigabit interface if possible.  The
> >>> "localX" names do not resolve consistently across the cluster -- for 
> >>> example
> >>>
> >>> if I am on node0 and I ping local1, it will do so over the gigabit
> >>> interface.  However if I am on node31 and I ping local1, it will do so over
> >>> the non-gigabit interface, because there is no gigabit connection between
> >>> node31 and node1.
> >>>
> >>> Tim
> >>>
> >>> ----- Original Message ----- 
> >>> From: "Reuti" <reuti at staff.uni-marburg.de>
> >>> To: <users at gridengine.sunsource.net>
> >>> Sent: Friday, April 29, 2005 12:15 PM
> >>> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >>>
> >>>
> >>> > Tim,
> >>> >
> >>> > thanks, but I'm still not sure about your setup. You stated that you have
> >>> > 32 dual machines. So you made a hostgroup @mymachine-0 - with which machines
> >>> > set up therein? And why so many queues at all?
> >>> >
> >>> > CU - Reuti
> >>> >
> >>> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
> >>> >
> >>> >> Hi,
> >>> >>
> >>> >> I get:
> >>> >>
> >>> >>      59 0.55500 Job1    user        r     04/29/2005 10:47:08
> >>> >> mymachine.q.0 at local0                     8
> >>> >>        Full jobname:     Job1
> >>> >>        Master queue:     mymachine.q.0 at local0
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Granted PE:       mymachine.3.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>      47 0.55500 Job2    user        r     04/27/2005 14:45:04
> >>> >> mymachine.q.0 at local6                     8
> >>> >>        Full jobname:     Job2
> >>> >>        Master queue:     mymachine.q.0 at local6
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Granted PE:       mymachine.3.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>      44 0.55500 Job3    user        r     04/27/2005 11:55:49
> >>> >> mymachine.q.1 at local12                    8
> >>> >>        Full jobname:     Job3
> >>> >>        Master queue:     mymachine.q.1 at local12
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Granted PE:       mymachine.3.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>      60 0.55500 Job4    user        r     04/29/2005 10:55:53
> >>> >> mymachine.q.1 at local9                     8
> >>> >>        Full jobname:     Job4
> >>> >>        Master queue:     mymachine.q.1 at local9
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Granted PE:       mymachine.3.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>      49 0.55500 Job5    user        r     04/27/2005 15:01:53
> >>> >> mymachine.q.2 at local16                    8
> >>> >>        Full jobname:     Job5
> >>> >>        Master queue:     mymachine.q.2 at local16
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Granted PE:       mymachine.3.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>      48 0.55500 Job6    user        r     04/27/2005 14:57:53
> >>> >> mymachine.q.2 at local20                    8
> >>> >>        Full jobname:     Job6
> >>> >>        Master queue:     mymachine.q.2 at local20
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Granted PE:       mymachine.3.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>      61 0.55500 Job7    user        r    04/29/2005 11:19:54
> >>> >> 8
> >>> >>        Full jobname:     Job7
> >>> >>        Requested PE:     *.mpi 8
> >>> >>        Hard Resources:
> >>> >>        Soft Resources:
> >>> >>
> >>> >> When I do qconf -sp mymachine.3.mpi, I get:
> >>> >>
> >>> >> pe_name           mymachine.3.mpi
> >>> >> slots             16
> >>> >> user_lists        NONE
> >>> >> xuser_lists       NONE
> >>> >> start_proc_args   /bin/true
> >>> >> stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
> >>> >> allocation_rule   $round_robin
> >>> >> control_slaves    TRUE
> >>> >> job_is_first_task FALSE
> >>> >> urgency_slots     avg
> >>> >>
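
A quick way to cross-check which PEs a "*.mpi" request can match, and which PE
each queue actually offers, is the standard qconf queries (a sketch; queue name
as in this configuration):

  qconf -spl                                  # list all configured PE names
  qconf -sq mymachine.q.0 | grep pe_list      # PE(s) referenced by one queue

With the setup described here, "*.mpi" should match exactly the four
mymachine.N.mpi PEs, and each mymachine.q.N should list only its own
mymachine.N.mpi in pe_list.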
> >>> >> When I do qconf -sq mymachine.q.0, I get
> >>> >>
> >>> >> qname                 mymachine.q.0
> >>> >> hostlist              @mymachine-0
> >>> >> seq_no                0
> >>> >> load_thresholds       NONE
> >>> >> suspend_thresholds    NONE
> >>> >> nsuspend              1
> >>> >> suspend_interval      00:05:00
> >>> >> priority              0
> >>> >> min_cpu_interval      00:05:00
> >>> >> processors            UNDEFINED
> >>> >> qtype                 BATCH INTERACTIVE
> >>> >> ckpt_list             NONE
> >>> >> pe_list               mymachine.0.mpi
> >>> >> rerun                 FALSE
> >>> >> slots                 2
> >>> >> tmpdir                /tmp
> >>> >> shell                 /bin/bash
> >>> >> prolog                NONE
> >>> >> epilog                NONE
> >>> >> shell_start_mode      posix_compliant
> >>> >> starter_method        NONE
> >>> >> suspend_method        NONE
> >>> >> resume_method         NONE
> >>> >> terminate_method      NONE
> >>> >> notify                00:00:60
> >>> >> owner_list            sgeadmin
> >>> >> user_lists            NONE
> >>> >> xuser_lists           NONE
> >>> >> subordinate_list      NONE
> >>> >> complex_values        NONE
> >>> >> projects              NONE
> >>> >> xprojects             NONE
> >>> >> calendar              NONE
> >>> >> initial_state         default
> >>> >> s_rt                  84:00:00
> >>> >> h_rt                  84:15:00
> >>> >> s_cpu                 INFINITY
> >>> >> h_cpu                 INFINITY
> >>> >> s_fsize               INFINITY
> >>> >> h_fsize               INFINITY
> >>> >> s_data                INFINITY
> >>> >> h_data                INFINITY
> >>> >> s_stack               INFINITY
> >>> >> h_stack               INFINITY
> >>> >> s_core                INFINITY
> >>> >> h_core                INFINITY
> >>> >> s_rss                 1G
> >>> >> h_rss                 1G
> >>> >> s_vmem                INFINITY
> >>> >> h_vmem                INFINITY
> >>> >>
> >>> >> And so on, up to mymachine.q.3.
> >>> >>
> >>> >> Tim
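
Since there are four such queues, the whole queue-to-PE pairing can be dumped in
one go with a small shell loop (just a convenience sketch around the same
qconf -sq call shown above):

  for i in 0 1 2 3; do
      echo "== mymachine.q.$i =="
      qconf -sq mymachine.q.$i | egrep 'hostlist|pe_list|slots'
  done

Each queue should report its own hostgroup @mymachine-$i, 2 slots per host, and
only mymachine.$i.mpi in its pe_list.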
> >>> >>
> >>> >> ----- Original Message ----- 
> >>> >> From: "Reuti" <reuti at staff.uni-marburg.de>
> >>> >> To: <users at gridengine.sunsource.net>
> >>> >> Sent: Friday, April 29, 2005 11:14 AM
> >>> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >>> >>
> >>> >>
> >>> >> > Hi Tim,
> >>> >> >
> >>> >> > what is:
> >>> >> >
> >>> >> > qstat -r
> >>> >> >
> >>> >> > showing as granted PEs? - Reuti
> >>> >> >
> >>> >> >
> >>> >> > Quoting Tim Mueller <tim_mueller at hotmail.com>:
> >>> >> >
> >>> >> >> Hi,
> >>> >> >>
> >>> >> >> That's the problem.  The setup is actually
> >>> >> >>
> >>> >> >> mymachine.q.0 references mymachine.0.mpi
> >>> >> >> mymachine.q.1 references mymachine.1.mpi
> >>> >> >> mymachine.q.2 references mymachine.2.mpi
> >>> >> >> mymachine.q.3 references mymachine.3.mpi
> >>> >> >>
> >>> >> >> There is no reason, as far as I can tell, that a job could ever be in
> >>> >> >> both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
> >>> >> >> wildcards, the scheduler won't put a job assigned to mymachine.3.mpi into
> >>> >> >> mymachine.q.3 until all of the other queues are full.  At that point,
> >>> >> >> it's too late, because mymachine.3.mpi is using 48 slots, when it's only
> >>> >> >> allowed to use up to 16.
> >>> >> >>
> >>> >> >> When I don't use wildcards, I get the behavior I expect:  A job submitted
> >>> >> >> to mymachine.3.mpi gets put in mymachine.q.3, etc.
> >>> >> >>
> >>> >> >> Tim
> >>> >> >>
> >>> >> >> ----- Original Message ----- 
> >>> >> >> From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
> >>> >> >> <stephan.grell at sun.com>
> >>> >> >> To: <users at gridengine.sunsource.net>
> >>> >> >> Sent: Friday, April 29, 2005 2:34 AM
> >>> >> >> Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
> >>> >> >>
> >>> >> >>
> >>> >> >> > Hi Tim,
> >>> >> >> >
> >>> >> >> > I am not quite sure I understand your setup. Could you please attach
> >>> >> >> > your cqueue configuration? From the results you posted, it reads as if:
> >>> >> >> >
> >>> >> >> > queue mymachine.q.0  references mymachine.3.mpi
> >>> >> >> > queue mymachine.q.1  references mymachine.3.mpi
> >>> >> >> >
> >>> >> >> > and so on.
> >>> >> >> >
> >>> >> >> > Cheers,
> >>> >> >> > Stephan
> >>> >> >> >
> >>> >> >> > Tim Mueller wrote:
> >>> >> >> >
> >>> >> >> >> Hi,
> >>> >> >> >>
> >>> >> >> >> It appears that wildcards in the Parallel Environment name still have
> >>> >> >> >> problems in 6.0u3.  I have set up a Linux cluster of 32 dual-processor
> >>> >> >> >> Noconas.  There are 4 queues of 16 processors each, and a corresponding
> >>> >> >> >> PE for each queue.  The queues are named as follows:
> >>> >> >> >>
> >>> >> >> >> mymachine.q.0
> >>> >> >> >> mymachine.q.1
> >>> >> >> >> mymachine.q.2
> >>> >> >> >> mymachine.q.3
> >>> >> >> >>
> >>> >> >> >> And the PEs are
> >>> >> >> >>
> >>> >> >> >> mymachine.0.mpi
> >>> >> >> >> mymachine.1.mpi
> >>> >> >> >> mymachine.2.mpi
> >>> >> >> >> mymachine.3.mpi
> >>> >> >> >>
> >>> >> >> >> All of the PEs have 16 slots.  When I submit a job with the following
> >>> >> >> >> line:
> >>> >> >> >>
> >>> >> >> >> #$ -pe *.mpi 8
> >>> >> >> >>
> >>> >> >> >> the job will be assigned to a seemingly random PE, but then placed in a
> >>> >> >> >> queue that does not correspond to that PE.  I can submit up to 6 jobs
> >>> >> >> >> this way, each of which will get assigned to the same PE and placed in
> >>> >> >> >> any queue that does not correspond to the PE.  This causes 48 processors
> >>> >> >> >> to be used for a PE with only 16 slots.  E.g., I might get:
> >>> >> >> >>
> >>> >> >> >> Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
> >>> >> >> >> Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
> >>> >> >> >> Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
> >>> >> >> >> Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
> >>> >> >> >> Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
> >>> >> >> >> Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
> >>> >> >> >> Job 7        qw
> >>> >> >> >> Job 8        qw
> >>> >> >> >>
> >>> >> >> >> When I should get:
> >>> >> >> >>
> >>> >> >> >> Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
> >>> >> >> >> Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
> >>> >> >> >> Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
> >>> >> >> >> Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
> >>> >> >> >> Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
> >>> >> >> >> Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
> >>> >> >> >> Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
> >>> >> >> >> Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
> >>> >> >> >>
> >>> >> >> >> If I then try to submit a job directly (with no wildcard) to the PE that
> >>> >> >> >> all of the jobs were assigned to, it will not run, because I have already
> >>> >> >> >> far exceeded the slots limit for this PE.
> >>> >> >> >>
> >>> >> >> >> I should note that when I do not use wildcards, everything behaves as it
> >>> >> >> >> should.  E.g., a job submitted to mymachine.2.mpi will be assigned to
> >>> >> >> >> mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16 slots in
> >>> >> >> >> mymachine.2.mpi at once.
> >>> >> >> >>
> >>> >> >> >> I searched the list, and although there seem to have been other problems
> >>> >> >> >> with wildcards in the past, I have seen nothing that references this
> >>> >> >> >> behavior.  Does anyone have an explanation / workaround?
> >>> >> >> >>
> >>> >> >> >> Tim
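
As a stop-gap, and consistent with the observation above that non-wildcard
requests behave correctly, the PE (and, if desired, its matching queue) can be
requested explicitly at submit time - only a sketch of the obvious workaround,
with the job script name taken from the qstat -j output above:

  qsub -pe mymachine.0.mpi 8 Job7
  (or, to pin the queue as well:  qsub -pe mymachine.0.mpi 8 -q mymachine.q.0 Job7)

picking whichever PE/queue pair currently has free slots. This of course gives
up the point of the wildcard until the matching is fixed.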



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



