[GE users] Wildcards in PE still broken in 6.0u3

Reuti reuti at staff.uni-marburg.de
Mon May 2 11:23:06 BST 2005


Hi Stephan,

the problem is not the range of slots selected for a parallel job, but a 
mismatch between the selected queue and the granted PE. In 6.0u1 everything 
is working fine (as far as I observed), but in 6.0u3 you get the behavior 
below.

You have four PEs like:

$ qconf -sp mymachine.0.mpi
pe_name           mymachine.0.mpi
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $round_robin
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min

and for "mymachine.1.mpi", "mymachine.2.mpi", "mymachine.3.mpi" similar.

Then attach each PE to its own queue like this (and the same for {1,2,3}):

$ qconf -sq mymachine.q.0
qname                 mymachine.q.0
hostlist              @mymachine-0
...
pe_list               mymachine.0.mpi
...

The hostgroup @mymachine-0 is:

$ qconf -shgrp @mymachine-0
group_name @mymachine-0
hostlist ic001 ic002 ic003 ic004 ic005 ic006 ic007 ic008
$ qconf -shgrp @mymachine-1
group_name @mymachine-1
hostlist ic009 ic010 ic011 ic012 ic013 ic014 ic015 ic016

and so on for 2 and 3.

With this setup, you can force a job to stay inside a single hostgroup (e.g. 
"@mymachine-0") by requesting "-pe mymachine.*.mpi", because each matching 
PE is attached to exactly one queue.

Now I submit a parallel job:

$ qsub -pe mymachine.*.mpi 4 waiter.sh
Your job 458 ("waiter.sh") has been submitted.
$ qstat -r
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    458 0.55500 waiter.sh  reuti        r     05/02/2005 12:12:37 mymachine.q.0@ic004                4
        Full jobname:     waiter.sh
        Master queue:     mymachine.q.0@ic004
        Requested PE:     mymachine.*.mpi 4
        Granted PE:       mymachine.3.mpi 4
        Hard Resources:
        Soft Resources:

And the problem can already be seen here: the job is running in queue 
"mymachine.q.0" with master queue instance "mymachine.q.0@ic004", but the 
granted PE is "mymachine.3.mpi". This makes no sense, because the PE attached 
to "mymachine.q.0" is "mymachine.0.mpi" - not "mymachine.3.mpi". 
"mymachine.3.mpi" is only attached to "mymachine.q.3".

Nevertheless, the slot accounting for such jobs goes entirely to the PE 
"mymachine.3.mpi". Once jobs fill the other three queues, 3 * 16 = 48 slots 
(from the jobs in mymachine.q.{0,1,2}) are charged against this PE, which 
offers only 16, so its free slot count is already 16 - 48 = -32 and the real 
"mymachine.3.mpi" / "mymachine.q.3" will never get a job. Workaround: give 
"mymachine.3.mpi" 999 slots.


Cheers - Reuti


Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
> Hi,
> 
> I tried to replicate the bug and could not. I used 6.0u4. From what I get
> out of the description, it is bug 1216, which was found in 6.0u1 and fixed
> in 6.0u2.
> 
> I do not follow the current version discussion. It would also be good to
> have the qsub line and the PE config.
> 
> Cheers,
> Stephan
> 
> Reuti wrote:
> 
> 
>>Hi Tim,
>>
>>I could reproduce the weird behavior. Can you please file a bug? As I found,
>>it was still working in 6.0u1, so it must have been introduced in one of the
>>following releases. It also seems that there is now an order in which the
>>slots for the PEs are taken - the ones from mymachine.q.0 are taken first,
>>then the ones from mymachine.q.1, and so on.
>>
>>Cheers - Reuti
>>
>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>
>> 
>>
>>
>>>Some more information...  If I run qstat -j 61, I get the output below.
>>>
>>>Tim
>>>
>>>....................
>>>JOB INFO CUT
>>>......................
>>>script_file:                Job7
>>>parallel environment:  *.mpi range: 8
>>>scheduling info:            queue instance "mymachine.q.0 at local0" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local1" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local2" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local3" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local4" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local5" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local6" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.0 at local7" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local10" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local11" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local12" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local13" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local14" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local15" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local8" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.1 at local9" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local16" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local17" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local18" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local19" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local20" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local21" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local22" dropped 
>>>because it is full
>>>                           queue instance "mymachine.q.2 at local23" dropped 
>>>because it is full
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local30" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local26" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local25" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local24" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local27" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local28" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local29" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run in queue instance 
>>>"mymachine.q.3 at local31" because PE "mymachine.2.mpi" is not in pe list
>>>                           cannot run because resources requested are not 
>>>available for parallel job
>>>                           cannot run because available slots combined 
>>>under PE "mymachine.2.mpi" are not in range of job
>>>                           cannot run because available slots combined 
>>>under PE "mymachine.3.mpi" are not in range of job
>>>
>>>----- Original Message ----- 
>>>From: "Tim Mueller" <tim_mueller at hotmail.com>
>>>To: <users at gridengine.sunsource.net>
>>>Sent: Friday, April 29, 2005 2:11 PM
>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>
>>>
>>>   
>>>
>>>
>>>>That's what I had hoped initially.  However, it does not explain why no
>>>>jobs get assigned to mymachine.q.3, which is the only queue to which they
>>>>should get assigned.  It appears that jobs get rejected from this queue
>>>>because the scheduler believes mymachine.3.mpi is too full.
>>>>
>>>>qstat -g t gives the following:
>>>>
>>>>job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
>>>>------------------------------------------------------------------------------------------------------------------
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local0                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local0                 MASTER
>>>>
>>>>mymachine.q.0 at local0 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local1                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local1                 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local2                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local2                 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local3                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local3                 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local4                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local4                 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local5                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local5                 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local6                 MASTER
>>>>
>>>>mymachine.q.0 at local6 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local6                 SLAVE
>>>>   47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>mymachine.q.0 at local7                 SLAVE
>>>>   59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>mymachine.q.0 at local7                 SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local10                SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local10                SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local11                SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local11                SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local12                MASTER
>>>>
>>>>mymachine.q.1 at local12 SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local12                SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local13                SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local13                SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local14                SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local14                SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local15                SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local15                SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local8                 SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local8                 SLAVE
>>>>   44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>mymachine.q.1 at local9                 SLAVE
>>>>   60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>mymachine.q.1 at local9                 MASTER
>>>>
>>>>mymachine.q.1 at local9 SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local16                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local16                MASTER
>>>>
>>>>mymachine.q.2 at local16 SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local17                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local17                SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local18                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local18                SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local19                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local19                SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local20                MASTER
>>>>
>>>>mymachine.q.2 at local20 SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local20                SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local21                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local21                SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local22                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local22                SLAVE
>>>>   48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>mymachine.q.2 at local23                SLAVE
>>>>   49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>mymachine.q.2 at local23                SLAVE
>>>>   61 0.55500 Job7        user        qw    04/29/2005 11:19:54
>>>>
>>>>Tim
>>>>
>>>>----- Original Message ----- 
>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>To: <users at gridengine.sunsource.net>
>>>>Sent: Friday, April 29, 2005 1:24 PM
>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>
>>>>
>>>>     
>>>>
>>>>
>>>>>Aha Tim,
>>>>>
>>>>>now I understand your setup. Since the naming of the master queue, e.g.
>>>>>mymachine.q.1 at local9, is within your intended configuration, what does
>>>>>
>>>>>qstat -g t
>>>>>
>>>>>show? Maybe the output of the granted PE is just wrong, but everything is
>>>>>working as intended? - Reuti
>>>>>
>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>
>>>>>       
>>>>>
>>>>>
>>>>>>There are 32 machines, each dual-processor with names
>>>>>>
>>>>>>local0
>>>>>>local1
>>>>>>..
>>>>>>local31
>>>>>>
>>>>>>They are grouped together with four 8-port gigabit switches.  Each group
>>>>>>was given a queue, a PE, and a hostgroup.  So for example @mymachine-0
>>>>>>contains
>>>>>>
>>>>>>local0
>>>>>>local1
>>>>>>..
>>>>>>local7
>>>>>>
>>>>>>local0-local7 are all connected via both the central cluster switch and
>>>>>>a local gigabit switch.
>>>>>>
>>>>>>I should also note that I am using hostname aliasing to ensure that the
>>>>>>ethernet interface connected to the gigabit switch is used by Grid Engine.
>>>>>>So I have a host_aliases file set up as follows:
>>>>>>
>>>>>>local0 node0
>>>>>>local1 node1
>>>>>>..
>>>>>>local31 node31
>>>>>>
>>>>>>Where "nodeX" is the primary hostname for each machine and resolves to the
>>>>>>interface that connects to the central cluster switch.  "localX" resolves to
>>>>>>an address that connects via the gigabit interface if possible.  The
>>>>>>"localX" names do not resolve consistently across the cluster -- for example
>>>>>>if I am on node0 and I ping local1, it will do so over the gigabit
>>>>>>interface.  However if I am on node31 and I ping local1, it will do so over
>>>>>>the non-gigabit interface, because there is no gigabit connection between
>>>>>>node31 and node1.
>>>>>>
>>>>>>Tim
>>>>>>
>>>>>>----- Original Message ----- 
>>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>Sent: Friday, April 29, 2005 12:15 PM
>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>
>>>>>>
>>>>>>         
>>>>>>
>>>>>>
>>>>>>>Tim,
>>>>>>>
>>>>>>>thanks, but I'm still not sure about your setup. You stated that you have
>>>>>>>32 dual machines. So which machines did you set up in the hostgroup
>>>>>>>@mymachine-0? - And why so many queues at all?
>>>>>>>
>>>>>>>CU - Reuti
>>>>>>>
>>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>>
>>>>>>>           
>>>>>>>
>>>>>>>
>>>>>>>>Hi,
>>>>>>>>
>>>>>>>>I get:
>>>>>>>>
>>>>>>>>    59 0.55500 Job1    user        r     04/29/2005 10:47:08
>>>>>>>>mymachine.q.0 at local0                     8
>>>>>>>>      Full jobname:     Job1
>>>>>>>>      Master queue:     mymachine.q.0 at local0
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Granted PE:       mymachine.3.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>    47 0.55500 Job2    user        r     04/27/2005 14:45:04
>>>>>>>>mymachine.q.0 at local6                     8
>>>>>>>>      Full jobname:     Job2
>>>>>>>>      Master queue:     mymachine.q.0 at local6
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Granted PE:       mymachine.3.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>    44 0.55500 Job3    user        r     04/27/2005 11:55:49
>>>>>>>>mymachine.q.1 at local12                    8
>>>>>>>>      Full jobname:     Job3
>>>>>>>>      Master queue:     mymachine.q.1 at local12
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Granted PE:       mymachine.3.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>    60 0.55500 Job4    user        r     04/29/2005 10:55:53
>>>>>>>>mymachine.q.1 at local9                     8
>>>>>>>>      Full jobname:     Job4
>>>>>>>>      Master queue:     mymachine.q.1 at local9
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Granted PE:       mymachine.3.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>    49 0.55500 Job5    user        r     04/27/2005 15:01:53
>>>>>>>>mymachine.q.2 at local16                    8
>>>>>>>>      Full jobname:     Job5
>>>>>>>>      Master queue:     mymachine.q.2 at local16
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Granted PE:       mymachine.3.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>    48 0.55500 Job6    user        r     04/27/2005 14:57:53
>>>>>>>>mymachine.q.2 at local20                    8
>>>>>>>>      Full jobname:     Job6
>>>>>>>>      Master queue:     mymachine.q.2 at local20
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Granted PE:       mymachine.3.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>    61 0.55500 Job7    user        r    04/29/2005 11:19:54
>>>>>>>>8
>>>>>>>>      Full jobname:     Job7
>>>>>>>>      Requested PE:     *.mpi 8
>>>>>>>>      Hard Resources:
>>>>>>>>      Soft Resources:
>>>>>>>>
>>>>>>>>When I do qconf -sp mymachine.3.mpi, I get:
>>>>>>>>
>>>>>>>>pe_name           mymachine.3.mpi
>>>>>>>>slots             16
>>>>>>>>user_lists        NONE
>>>>>>>>xuser_lists       NONE
>>>>>>>>start_proc_args   /bin/true
>>>>>>>>stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
>>>>>>>>allocation_rule   $round_robin
>>>>>>>>control_slaves    TRUE
>>>>>>>>job_is_first_task FALSE
>>>>>>>>urgency_slots     avg
>>>>>>>>
>>>>>>>>When I do qconf -sq mymachine.q.0, I get
>>>>>>>>
>>>>>>>>qname                 mymachine.q.0
>>>>>>>>hostlist              @mymachine-0
>>>>>>>>seq_no                0
>>>>>>>>load_thresholds       NONE
>>>>>>>>suspend_thresholds    NONE
>>>>>>>>nsuspend              1
>>>>>>>>suspend_interval      00:05:00
>>>>>>>>priority              0
>>>>>>>>min_cpu_interval      00:05:00
>>>>>>>>processors            UNDEFINED
>>>>>>>>qtype                 BATCH INTERACTIVE
>>>>>>>>ckpt_list             NONE
>>>>>>>>pe_list               mymachine.0.mpi
>>>>>>>>rerun                 FALSE
>>>>>>>>slots                 2
>>>>>>>>tmpdir                /tmp
>>>>>>>>shell                 /bin/bash
>>>>>>>>prolog                NONE
>>>>>>>>epilog                NONE
>>>>>>>>shell_start_mode      posix_compliant
>>>>>>>>starter_method        NONE
>>>>>>>>suspend_method        NONE
>>>>>>>>resume_method         NONE
>>>>>>>>terminate_method      NONE
>>>>>>>>notify                00:00:60
>>>>>>>>owner_list            sgeadmin
>>>>>>>>user_lists            NONE
>>>>>>>>xuser_lists           NONE
>>>>>>>>subordinate_list      NONE
>>>>>>>>complex_values        NONE
>>>>>>>>projects              NONE
>>>>>>>>xprojects             NONE
>>>>>>>>calendar              NONE
>>>>>>>>initial_state         default
>>>>>>>>s_rt                  84:00:00
>>>>>>>>h_rt                  84:15:00
>>>>>>>>s_cpu                 INFINITY
>>>>>>>>h_cpu                 INFINITY
>>>>>>>>s_fsize               INFINITY
>>>>>>>>h_fsize               INFINITY
>>>>>>>>s_data                INFINITY
>>>>>>>>h_data                INFINITY
>>>>>>>>s_stack               INFINITY
>>>>>>>>h_stack               INFINITY
>>>>>>>>s_core                INFINITY
>>>>>>>>h_core                INFINITY
>>>>>>>>s_rss                 1G
>>>>>>>>h_rss                 1G
>>>>>>>>s_vmem                INFINITY
>>>>>>>>h_vmem                INFINITY
>>>>>>>>
>>>>>>>>And so on, up to mymachine.q.3.
>>>>>>>>
>>>>>>>>Tim
>>>>>>>>
>>>>>>>>----- Original Message ----- 
>>>>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>>Sent: Friday, April 29, 2005 11:14 AM
>>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>>
>>>>>>>>
>>>>>>>>             
>>>>>>>>
>>>>>>>>
>>>>>>>>>Hi Tim,
>>>>>>>>>
>>>>>>>>>what is:
>>>>>>>>>
>>>>>>>>>qstat -r
>>>>>>>>>
>>>>>>>>>showing as granted PEs? - Reuti
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>>>>
>>>>>>>>>               
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>Hi,
>>>>>>>>>>
>>>>>>>>>>That's the problem.  The setup is actually
>>>>>>>>>>
>>>>>>>>>>mymachine.q.0 references mymachine.0.mpi
>>>>>>>>>>mymachine.q.1 references mymachine.1.mpi
>>>>>>>>>>mymachine.q.2 references mymachine.2.mpi
>>>>>>>>>>mymachine.q.3 references mymachine.3.mpi
>>>>>>>>>>
>>>>>>>>>>There is no reason, as far as I can tell, that a job could ever be in
>>>>>>>>>>both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
>>>>>>>>>>wildcards, the scheduler won't put a job assigned to mymachine.3.mpi into
>>>>>>>>>>mymachine.q.3 until all of the other queues are full.  At that point, it's
>>>>>>>>>>too late because mymachine.3.mpi is using 48 slots, when it's only allowed
>>>>>>>>>>to use up to 16.
>>>>>>>>>>
>>>>>>>>>>When I don't use wildcards, I get the behavior I expect:  A job submitted
>>>>>>>>>>to mymachine.3.mpi gets put in mymachine.q.3, etc.
>>>>>>>>>>
>>>>>>>>>>Tim
>>>>>>>>>>
>>>>>>>>>>----- Original Message ----- 
>>>>>>>>>>From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
>>>>>>>>>><stephan.grell at sun.com>
>>>>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>>>>Sent: Friday, April 29, 2005 2:34 AM
>>>>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                 
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>Hi Tim,
>>>>>>>>>>>
>>>>>>>>>>>I am not quite sure I understand your setup. Could you please attach
>>>>>>>>>>>your cqueue configuration? From the results you posted, it reads as if:
>>>>>>>>>>>
>>>>>>>>>>>queue mymachine.q.0  references mymachine.3.mpi
>>>>>>>>>>>queue mymachine.q.1  references mymachine.3.mpi
>>>>>>>>>>>
>>>>>>>>>>>and so on.
>>>>>>>>>>>
>>>>>>>>>>>Cheers,
>>>>>>>>>>>Stephan
>>>>>>>>>>>
>>>>>>>>>>>Tim Mueller wrote:
>>>>>>>>>>>
>>>>>>>>>>>                   
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>Hi,
>>>>>>>>>>>>It appears that wildcards in the Parallel Environment name still have
>>>>>>>>>>>>problems in 6.0u3.  I have set up a linux cluster of 32 dual processor
>>>>>>>>>>>>Noconas running Linux.  There are 4 queues of 16 processors each, and a
>>>>>>>>>>>>corresponding pe for each queue.  The queues are named as follows:
>>>>>>>>>>>>mymachine.q.0
>>>>>>>>>>>>mymachine.q.1
>>>>>>>>>>>>mymachine.q.2
>>>>>>>>>>>>mymachine.q.3
>>>>>>>>>>>>And the PE's are
>>>>>>>>>>>>mymachine.0.mpi
>>>>>>>>>>>>mymachine.1.mpi
>>>>>>>>>>>>mymachine.2.mpi
>>>>>>>>>>>>mymachine.3.mpi
>>>>>>>>>>>>All of the PE's have 16 slots.  When I submit a job with the following
>>>>>>>>>>>>line:
>>>>>>>>>>>>
>>>>>>>>>>>>#$ -pe *.mpi 8
>>>>>>>>>>>>
>>>>>>>>>>>>the job will be assigned to a seemingly random PE, but then placed in a
>>>>>>>>>>>>queue that does not correspond to that PE.  I can submit up to 6 jobs
>>>>>>>>>>>>this way, each of which will get assigned to the same PE and placed in
>>>>>>>>>>>>any queue that does not correspond to the PE.  This causes 48 processors
>>>>>>>>>>>>to be used for a PE with only 16 slots.  E.g., I might get:
>>>>>>>>>>>>Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>Job 7        qw
>>>>>>>>>>>>Job 8        qw
>>>>>>>>>>>>When I should get:
>>>>>>>>>>>>Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
>>>>>>>>>>>>Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
>>>>>>>>>>>>If I try to then submit a job directly (with no wildcard) to the PE that
>>>>>>>>>>>>all of the jobs were assigned to, it will not run because I have already
>>>>>>>>>>>>far exceeded the slots limit for this PE.
>>>>>>>>>>>>I should note that when I do not use wildcards, everything behaves as it
>>>>>>>>>>>>should.  E.g., a job submitted to mymachine.2.mpi will be assigned to
>>>>>>>>>>>>mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16 slots in
>>>>>>>>>>>>mymachine.2.mpi at once.
>>>>>>>>>>>>I searched the list, and although there seem to have been other problems
>>>>>>>>>>>>with wildcards in the past, I have seen nothing that references this
>>>>>>>>>>>>behavior.  Does anyone have an explanation / workaround?
>>>>>>>>>>>>Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net




More information about the gridengine-users mailing list