[GE users] Wildcards in PE still broken in 6.0u3

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon May 2 13:53:46 BST 2005


Hi Reuti,

I will test it again. But your setup is very similar to mine. I just tested
u4. I will set it up your way and test u3 and u4.

Cheers,
Stephan

Reuti wrote:

>Hi Stephan,
>
>the problem is not the range of slots selected for a parallel job, but a
>mismatch between the selected queue and PE. In 6.0u1 everything was working
>fine (as far as I observed), but in 6.0u3 you get this behavior.
>
>You have four PEs like:
>
>$ qconf -sp mymachine.0.mpi
>pe_name           mymachine.0.mpi
>slots             16
>user_lists        NONE
>xuser_lists       NONE
>start_proc_args   /bin/true
>stop_proc_args    /bin/true
>allocation_rule   $round_robin
>control_slaves    FALSE
>job_is_first_task TRUE
>urgency_slots     min
>
>and for "mymachine.1.mpi", "mymachine.2.mpi", "mymachine.3.mpi" similar.
>
>Then attach one PE to one queue, like this (and the same for {1,2,3}):
>
>$ qconf -sq mymachine.q.0
>qname                 mymachine.q.0
>hostlist              @mymachine-0
>...
>pe_list               mymachine.0.mpi
>...
>
>The @mymachine-0 is:
>
>$ qconf -shgrp @mymachine-0
>group_name @mymachine-0
>hostlist ic001 ic002 ic003 ic004 ic005 ic006 ic007 ic008
>$ qconf -shgrp @mymachine-1
>group_name @mymachine-1
>hostlist ic009 ic010 ic011 ic012 ic013 ic014 ic015 ic016
>
>and so on for 2 and 3.
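>
>If it helps for recreating this: the whole setup can be loaded from plain
>files with something like the following (untested sketch; hgrp0.txt,
>pe0.txt and queue0.txt are just placeholder names for files holding the
>configurations shown above):
>
>$ qconf -Ahgrp hgrp0.txt   # group_name/hostlist lines of @mymachine-0
>$ qconf -Ap pe0.txt        # the mymachine.0.mpi PE configuration
>$ qconf -Aq queue0.txt     # the mymachine.q.0 queue configuration
>
>and the same again for 1, 2 and 3.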
>
>With this setup, you can force a job to stay inside a single "@mymachine-*"
>group by using "-pe mymachine.*.mpi".
>
>Now I submit a parallel job:
>
>$ qsub -pe mymachine.*.mpi 4 waiter.sh
>Your job 458 ("waiter.sh") has been submitted.
>$ qstat -r
>job-ID  prior   name       user         state submit/start at     queue 
>                          slots ja-task-ID
>-----------------------------------------------------------------------------------------------------------------
>     458 0.55500 waiter.sh  reuti        r     05/02/2005 12:12:37 
>mymachine.q.0 at ic004                4
>        Full jobname:     waiter.sh
>        Master queue:     mymachine.q.0 at ic004
>        Requested PE:     mymachine.*.mpi 4
>        Granted PE:       mymachine.3.mpi 4
>        Hard Resources:
>        Soft Resources:
>
>And already here the problem can be seen: the job is running in the queue
>"mymachine.q.0" with master queue "mymachine.q.0 at ic004", but the granted
>PE is "mymachine.3.mpi". This makes no sense, as the PE attached to
>"mymachine.q.0" is "mymachine.0.mpi" - not "mymachine.3.mpi".
>
>"mymachine.3.mpi" is only attached to "mymachine.q.3".
>
>The counting of used slots really all goes to the PE "mymachine.3.mpi" as
>well. With only 16 slots configured there, and the 3 * 16 slots used in
>mymachine.q.{0,1,2} charged against it, its count is already at
>16 - 3 * 16 = -32. So the real "mymachine.3.mpi" will never get a job.
>Workaround: give "mymachine.3.mpi" 999 slots.
>
>
>Cheers - Reuti
>
>
>Stephan Grell - Sun Germany - SSG - Software Engineer wrote:
>  
>
>>Hi,
>>
>>I tried to replicate the bug and could not. I used 6.0u4. From what I get
>>out of the description, it is bug 1216, which was found in 6.0u1 and
>>fixed in 6.0u2.
>>
>>I do not follow the discussion about the current version, though. It
>>would also be good to have the qsub line and the PE config.
>>
>>Cheers,
>>Stephan
>>
>>Reuti wrote:
>>
>>>Hi Tim,
>>>
>>>I could reproduce the weird behavior. Can you please file a bug? As far as
>>>I found, it was still working in 6.0u1, so it must have been introduced in
>>>one of the following releases. It also seems that there is now an order in
>>>which the slots for the PEs are taken - the ones from mymachine.q.0 are
>>>taken first, then the ones from mymachine.q.1 ...
>>>
>>>Cheers - Reuti
>>>
>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>
>>>>Some more information...  If I run qstat -j 61, I get the output below.
>>>>
>>>>Tim
>>>>
>>>>....................
>>>>JOB INFO CUT
>>>>......................
>>>>script_file:                Job7
>>>>parallel environment:  *.mpi range: 8
>>>>scheduling info:            queue instance "mymachine.q.0 at local0" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local1" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local2" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local3" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local4" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local5" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local6" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.0 at local7" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local10" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local11" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local12" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local13" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local14" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local15" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local8" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.1 at local9" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local16" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local17" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local18" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local19" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local20" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local21" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local22" dropped 
>>>>because it is full
>>>>                          queue instance "mymachine.q.2 at local23" dropped 
>>>>because it is full
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local30" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local26" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local25" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local24" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local27" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local28" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local29" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run in queue instance 
>>>>"mymachine.q.3 at local31" because PE "mymachine.2.mpi" is not in pe list
>>>>                          cannot run because resources requested are not 
>>>>available for parallel job
>>>>                          cannot run because available slots combined 
>>>>under PE "mymachine.2.mpi" are not in range of job
>>>>                          cannot run because available slots combined 
>>>>under PE "mymachine.3.mpi" are not in range of job
>>>>
>>>>----- Original Message ----- 
>>>>From: "Tim Mueller" <tim_mueller at hotmail.com>
>>>>To: <users at gridengine.sunsource.net>
>>>>Sent: Friday, April 29, 2005 2:11 PM
>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>
>>>>>That's what I had hoped initially.  However, it does not explain why no 
>>>>>jobs get assigned to mymachine.q.3, which is the only queue to which they
>>>>>should get assigned.  It appears that jobs get rejected from this queue 
>>>>>because the scheduler believes mymachine.3.mpi is too full.
>>>>>
>>>>>qstat -g t gives the following:
>>>>>
>>>>>job-ID  prior   name       user         state submit/start at     queue 
>>>>>master ja-task-ID
>>>>>-----------------------------------------------------------------------------------------------------------------
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local0                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local0                 MASTER
>>>>>
>>>>>mymachine.q.0 at local0 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local1                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local1                 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local2                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local2                 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local3                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local3                 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local4                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local4                 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local5                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local5                 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local6                 MASTER
>>>>>
>>>>>mymachine.q.0 at local6 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local6                 SLAVE
>>>>>  47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>>>mymachine.q.0 at local7                 SLAVE
>>>>>  59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>>>mymachine.q.0 at local7                 SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local10                SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local10                SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local11                SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local11                SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local12                MASTER
>>>>>
>>>>>mymachine.q.1 at local12 SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local12                SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local13                SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local13                SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local14                SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local14                SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local15                SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local15                SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local8                 SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local8                 SLAVE
>>>>>  44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>>>mymachine.q.1 at local9                 SLAVE
>>>>>  60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>>>mymachine.q.1 at local9                 MASTER
>>>>>
>>>>>mymachine.q.1 at local9 SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local16                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local16                MASTER
>>>>>
>>>>>mymachine.q.2 at local16 SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local17                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local17                SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local18                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local18                SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local19                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local19                SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local20                MASTER
>>>>>
>>>>>mymachine.q.2 at local20 SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local20                SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local21                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local21                SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local22                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local22                SLAVE
>>>>>  48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>>>mymachine.q.2 at local23                SLAVE
>>>>>  49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>>>mymachine.q.2 at local23                SLAVE
>>>>>  61 0.55500 Job7        user        qw    04/29/2005 11:19:54
>>>>>
>>>>>Tim
>>>>>
>>>>>----- Original Message ----- 
>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>To: <users at gridengine.sunsource.net>
>>>>>Sent: Friday, April 29, 2005 1:24 PM
>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>
>>>>>>Aha Tim,
>>>>>>
>>>>>>now I understand your setup. As the master queue, e.g.
>>>>>>mymachine.q.1 at local9, is within your intended configuration, what does
>>>>>>
>>>>>>qstat -g t
>>>>>>
>>>>>>show? Maybe just the output of the granted PE is wrong, and everything is
>>>>>>working as intended? - Reuti
>>>>>>
>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>
>>>>>>>There are 32 machines, each dual-processor with names
>>>>>>>
>>>>>>>local0
>>>>>>>local1
>>>>>>>..
>>>>>>>local31
>>>>>>>
>>>>>>>They are grouped together with four 8-port gigabit switches.  Each group
>>>>>>>was given a queue, a PE, and a hostgroup.  So for example @mymachine-0
>>>>>>>contains
>>>>>>>
>>>>>>>local0
>>>>>>>local1
>>>>>>>..
>>>>>>>local7
>>>>>>>
>>>>>>>local0-local7 are all connected via both the central cluster switch and
>>>>>>>a local gigabit switch.
>>>>>>>
>>>>>>>I should also note that I am using hostname aliasing to ensure that the
>>>>>>>ethernet interface connected to the gigabit switch is used by Grid 
>>>>>>>Engine.
>>>>>>>So I have a host_aliases file set up as follows:
>>>>>>>
>>>>>>>local0 node0
>>>>>>>local1 node1
>>>>>>>..
>>>>>>>local31 node31
>>>>>>>
>>>>>>>Where "nodeX" is the primary hostname for each machine and resolves to 
>>>>>>>the
>>>>>>>interface that connets to the central cluster switch.  "localX" resolves
>>>>>>>        
>>>>>>>
>>>>>>>to
>>>>>>>
>>>>>>>an address that connects via the gigabit interface if possible.  The
>>>>>>>"localX" names do not resolve consistently across the cluster -- for 
>>>>>>>example
>>>>>>>
>>>>>>>if I am on node0 and I ping local1, it will do so over the gigabit
>>>>>>>interface.  However if I am on node31 and I ping local1, it will do so 
>>>>>>>over
>>>>>>>
>>>>>>>the non-gigabit interface, because there is no gigabit connection 
>>>>>>>between
>>>>>>>node31 and node1.
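>>>>>>>
>>>>>>>(For reference, on a default installation that aliases file is
>>>>>>>$SGE_ROOT/default/common/host_aliases - or the common directory of your
>>>>>>>cell if you don't use "default" - with one line per host listing its
>>>>>>>names, as shown above.)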
>>>>>>>
>>>>>>>Tim
>>>>>>>
>>>>>>>----- Original Message ----- 
>>>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>Sent: Friday, April 29, 2005 12:15 PM
>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>
>>>>>>>>Tim,
>>>>>>>>
>>>>>>>>thanks, but I'm still not sure about your setup. You stated that you have
>>>>>>>>32 dual machines. So you made a hostgroup @mymachine-0 - which machines
>>>>>>>>are set up therein? - And why so many queues at all?
>>>>>>>>
>>>>>>>>CU - Reuti
>>>>>>>>
>>>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>>>
>>>>>>>>
>>>>>>>>>Hi,
>>>>>>>>>
>>>>>>>>>I get:
>>>>>>>>>
>>>>>>>>>   59 0.55500 Job1    user        r     04/29/2005 10:47:08
>>>>>>>>>mymachine.q.0 at local0                     8
>>>>>>>>>     Full jobname:     Job1
>>>>>>>>>     Master queue:     mymachine.q.0 at local0
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Granted PE:       mymachine.3.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>   47 0.55500 Job2    user        r     04/27/2005 14:45:04
>>>>>>>>>mymachine.q.0 at local6                     8
>>>>>>>>>     Full jobname:     Job2
>>>>>>>>>     Master queue:     mymachine.q.0 at local6
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Granted PE:       mymachine.3.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>   44 0.55500 Job3    user        r     04/27/2005 11:55:49
>>>>>>>>>mymachine.q.1 at local12                    8
>>>>>>>>>     Full jobname:     Job3
>>>>>>>>>     Master queue:     mymachine.q.1 at local12
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Granted PE:       mymachine.3.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>   60 0.55500 Job4    user        r     04/29/2005 10:55:53
>>>>>>>>>mymachine.q.1 at local9                     8
>>>>>>>>>     Full jobname:     Job4
>>>>>>>>>     Master queue:     mymachine.q.1 at local9
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Granted PE:       mymachine.3.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>   49 0.55500 Job5    user        r     04/27/2005 15:01:53
>>>>>>>>>mymachine.q.2 at local16                    8
>>>>>>>>>     Full jobname:     Job5
>>>>>>>>>     Master queue:     mymachine.q.2 at local16
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Granted PE:       mymachine.3.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>   48 0.55500 Job6    user        r     04/27/2005 14:57:53
>>>>>>>>>mymachine.q.2 at local20                    8
>>>>>>>>>     Full jobname:     Job6
>>>>>>>>>     Master queue:     mymachine.q.2 at local20
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Granted PE:       mymachine.3.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>   61 0.55500 Job7    user        r    04/29/2005 11:19:54
>>>>>>>>>8
>>>>>>>>>     Full jobname:     Job7
>>>>>>>>>     Requested PE:     *.mpi 8
>>>>>>>>>     Hard Resources:
>>>>>>>>>     Soft Resources:
>>>>>>>>>
>>>>>>>>>When I do qconf -sp mymachine.3.mpi, I get:
>>>>>>>>>
>>>>>>>>>pe_name           mymachine.3.mpi
>>>>>>>>>slots             16
>>>>>>>>>user_lists        NONE
>>>>>>>>>xuser_lists       NONE
>>>>>>>>>start_proc_args   /bin/true
>>>>>>>>>stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
>>>>>>>>>allocation_rule   $round_robin
>>>>>>>>>control_slaves    TRUE
>>>>>>>>>job_is_first_task FALSE
>>>>>>>>>urgency_slots     avg
>>>>>>>>>
>>>>>>>>>When I do qconf -sq mymachine.q.0, I get
>>>>>>>>>
>>>>>>>>>qname                 mymachine.q.0
>>>>>>>>>hostlist              @mymachine-0
>>>>>>>>>seq_no                0
>>>>>>>>>load_thresholds       NONE
>>>>>>>>>suspend_thresholds    NONE
>>>>>>>>>nsuspend              1
>>>>>>>>>suspend_interval      00:05:00
>>>>>>>>>priority              0
>>>>>>>>>min_cpu_interval      00:05:00
>>>>>>>>>processors            UNDEFINED
>>>>>>>>>qtype                 BATCH INTERACTIVE
>>>>>>>>>ckpt_list             NONE
>>>>>>>>>pe_list               mymachine.0.mpi
>>>>>>>>>rerun                 FALSE
>>>>>>>>>slots                 2
>>>>>>>>>tmpdir                /tmp
>>>>>>>>>shell                 /bin/bash
>>>>>>>>>prolog                NONE
>>>>>>>>>epilog                NONE
>>>>>>>>>shell_start_mode      posix_compliant
>>>>>>>>>starter_method        NONE
>>>>>>>>>suspend_method        NONE
>>>>>>>>>resume_method         NONE
>>>>>>>>>terminate_method      NONE
>>>>>>>>>notify                00:00:60
>>>>>>>>>owner_list            sgeadmin
>>>>>>>>>user_lists            NONE
>>>>>>>>>xuser_lists           NONE
>>>>>>>>>subordinate_list      NONE
>>>>>>>>>complex_values        NONE
>>>>>>>>>projects              NONE
>>>>>>>>>xprojects             NONE
>>>>>>>>>calendar              NONE
>>>>>>>>>initial_state         default
>>>>>>>>>s_rt                  84:00:00
>>>>>>>>>h_rt                  84:15:00
>>>>>>>>>s_cpu                 INFINITY
>>>>>>>>>h_cpu                 INFINITY
>>>>>>>>>s_fsize               INFINITY
>>>>>>>>>h_fsize               INFINITY
>>>>>>>>>s_data                INFINITY
>>>>>>>>>h_data                INFINITY
>>>>>>>>>s_stack               INFINITY
>>>>>>>>>h_stack               INFINITY
>>>>>>>>>s_core                INFINITY
>>>>>>>>>h_core                INFINITY
>>>>>>>>>s_rss                 1G
>>>>>>>>>h_rss                 1G
>>>>>>>>>s_vmem                INFINITY
>>>>>>>>>h_vmem                INFINITY
>>>>>>>>>
>>>>>>>>>And so on, up to mymachine.q.3.
>>>>>>>>>
>>>>>>>>>Tim
>>>>>>>>>
>>>>>>>>>----- Original Message ----- 
>>>>>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>>>Sent: Friday, April 29, 2005 11:14 AM
>>>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>>>
>>>>>>>>>>Hi Tim,
>>>>>>>>>>
>>>>>>>>>>what is:
>>>>>>>>>>
>>>>>>>>>>qstat -r
>>>>>>>>>>
>>>>>>>>>>showing as granted PEs? - Reuti
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>>>>>
>>>>>>>>>>>Hi,
>>>>>>>>>>>
>>>>>>>>>>>That's the problem.  The setup is actually
>>>>>>>>>>>
>>>>>>>>>>>mymachine.q.0 references mymachine.0.mpi
>>>>>>>>>>>mymachine.q.1 references mymachine.1.mpi
>>>>>>>>>>>mymachine.q.2 references mymachine.2.mpi
>>>>>>>>>>>mymachine.q.3 references mymachine.3.mpi
>>>>>>>>>>>
>>>>>>>>>>>There is no reason, as far as I can tell, that a job could ever be in
>>>>>>>>>>>both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
>>>>>>>>>>>wildcards, the scheduler won't put a job assigned to mymachine.3.mpi
>>>>>>>>>>>into mymachine.q.3 until all of the other queues are full.  At that
>>>>>>>>>>>point, it's too late because mymachine.3.mpi is using 48 slots, when
>>>>>>>>>>>it's only allowed to use up to 16.
>>>>>>>>>>>
>>>>>>>>>>>When I don't use wildcards, I get the behavior I expect:  A job
>>>>>>>>>>>submitted to mymachine.3.mpi gets put in mymachine.q.3, etc.
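>>>>>>>>>>>
>>>>>>>>>>>In other words (just a sketch of the two submission forms, with "job.sh"
>>>>>>>>>>>standing in for the real job script):
>>>>>>>>>>>
>>>>>>>>>>>$ qsub -pe mymachine.3.mpi 8 job.sh   # lands in mymachine.q.3 as expected
>>>>>>>>>>>$ qsub -pe '*.mpi' 8 job.sh           # granted PE and queue no longer match
>>>>>>>>>>>
>>>>>>>>>>>(the quotes around '*.mpi' just keep the shell from globbing the wildcard
>>>>>>>>>>>before qsub sees it)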
>>>>>>>>>>>
>>>>>>>>>>>Tim
>>>>>>>>>>>
>>>>>>>>>>>----- Original Message ----- 
>>>>>>>>>>>From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
>>>>>>>>>>><stephan.grell at sun.com>
>>>>>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>>>>>Sent: Friday, April 29, 2005 2:34 AM
>>>>>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>>>>>
>>>>>>>>>>>>Hi Tim,
>>>>>>>>>>>>
>>>>>>>>>>>>I am not quite sure I understand your setup. Could you please attach
>>>>>>>>>>>>your cqueue configuration? From the results you posted, it reads as if:
>>>>>>>>>>>>
>>>>>>>>>>>>queue mymachine.q.0  references mymachine.3.mpi
>>>>>>>>>>>>queue mymachine.q.1  references mymachine.3.mpi
>>>>>>>>>>>>
>>>>>>>>>>>>and so on.
>>>>>>>>>>>>
>>>>>>>>>>>>Cheers,
>>>>>>>>>>>>Stephan
>>>>>>>>>>>>
>>>>>>>>>>>>Tim Mueller wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>Hi,
>>>>>>>>>>>>>It appears that wildcards in the Parallel Environment name still have
>>>>>>>>>>>>>problems in 6.0u3.  I have set up a Linux cluster of 32 dual-processor
>>>>>>>>>>>>>Noconas.  There are 4 queues of 16 processors each, and a corresponding
>>>>>>>>>>>>>PE for each queue.  The queues are named as follows:
>>>>>>>>>>>>>mymachine.q.0
>>>>>>>>>>>>>mymachine.q.1
>>>>>>>>>>>>>mymachine.q.2
>>>>>>>>>>>>>mymachine.q.3
>>>>>>>>>>>>>And the PE's are
>>>>>>>>>>>>>mymachine.0.mpi
>>>>>>>>>>>>>mymachine.1.mpi
>>>>>>>>>>>>>mymachine.2.mpi
>>>>>>>>>>>>>mymachine.3.mpi
>>>>>>>>>>>>>All of the PE's have 16 slots.  When I submit a job with the following
>>>>>>>>>>>>>line:
>>>>>>>>>>>>>#$ -pe *.mpi 8
>>>>>>>>>>>>>the job will be assigned to a seemingly random PE, but then placed in
>>>>>>>>>>>>>a queue that does not correspond to that PE.  I can submit up to 6 jobs
>>>>>>>>>>>>>this way, each of which will get assigned to the same PE and placed in
>>>>>>>>>>>>>any queue that does not correspond to the PE.  This causes 48
>>>>>>>>>>>>>processors to be used for a PE with only 16 slots.  E.g., I might get:
>>>>>>>>>>>>>Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>>Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>>Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>>Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>>Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>>Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>>Job 7        qw
>>>>>>>>>>>>>Job 8        qw
>>>>>>>>>>>>>When I should get:
>>>>>>>>>>>>>Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>>Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>>>Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>>Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>>>Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>>Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>>>Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
>>>>>>>>>>>>>Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
>>>>>>>>>>>>>If I try to then submit a job directly (with no wildcard) to the PE
>>>>>>>>>>>>>that all of the jobs were assigned to, it will not run because I have
>>>>>>>>>>>>>already far exceeded the slots limit for this PE.
>>>>>>>>>>>>>I should note that when I do not use wildcards, everything behaves as
>>>>>>>>>>>>>it should.  E.g., a job submitted to mymachine.2.mpi will be assigned
>>>>>>>>>>>>>to mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16
>>>>>>>>>>>>>slots in mymachine.2.mpi at once.
>>>>>>>>>>>>>I searched the list, and although there seem to have been other
>>>>>>>>>>>>>problems with wildcards in the past, I have seen nothing that
>>>>>>>>>>>>>references this behavior.  Does anyone have an explanation /
>>>>>>>>>>>>>workaround?
>>>>>>>>>>>>>Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



