[GE users] Wildcards in PE still broken in 6.0u3

Stephan Grell - Sun Germany - SSG - Software Engineer stephan.grell at sun.com
Mon May 2 10:59:10 BST 2005


Hi,

I tried to replicate the bug with 6.0u4 and could not. From the
description, it sounds like bug 1216, which was found in 6.0u1 and fixed
in 6.0u2.

I do not quite follow the current discussion about versions. It would also
be good to have the qsub line and the PE config.
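
For example, something along the lines of the following would help (just a
sketch based on the request quoted further down; the real submit line and
script name may of course differ):

  qsub -pe "*.mpi" 8 Job7
  qconf -sp mymachine.0.mpi    (and likewise for the other PEs)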

Cheers,
Stephan

Reuti wrote:

>Hi Tim,
>
>I could reproduce the weird behavior. Can you please file a bug? As I found, it
>was still working in 6.0u1, so it must have been introduced in one of the
>following releases. It also seems that there is now an order in which the slots
>for the PEs are taken - the ones from mymachine.q.0 are taken first, then the
>ones from mymachine.q.1, and so on.
>
>Cheers - Reuti
>
>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>
>  
>
>>Some more information...  If I run qstat -j 61, I get the output below.
>>
>>Tim
>>
>>....................
>>JOB INFO CUT
>>......................
>>script_file:                Job7
>>parallel environment:  *.mpi range: 8
>>scheduling info:            queue instance "mymachine.q.0 at local0" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local1" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local2" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local3" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local4" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local5" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local6" dropped 
>>because it is full
>>                            queue instance "mymachine.q.0 at local7" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local10" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local11" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local12" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local13" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local14" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local15" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local8" dropped 
>>because it is full
>>                            queue instance "mymachine.q.1 at local9" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local16" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local17" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local18" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local19" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local20" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local21" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local22" dropped 
>>because it is full
>>                            queue instance "mymachine.q.2 at local23" dropped 
>>because it is full
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local30" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local26" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local25" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local24" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local27" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local28" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local29" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run in queue instance 
>>"mymachine.q.3 at local31" because PE "mymachine.2.mpi" is not in pe list
>>                            cannot run because resources requested are not 
>>available for parallel job
>>                            cannot run because available slots combined 
>>under PE "mymachine.2.mpi" are not in range of job
>>                            cannot run because available slots combined 
>>under PE "mymachine.3.mpi" are not in range of job
>>
>>----- Original Message ----- 
>>From: "Tim Mueller" <tim_mueller at hotmail.com>
>>To: <users at gridengine.sunsource.net>
>>Sent: Friday, April 29, 2005 2:11 PM
>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>
>>
>>    
>>
>>>That's what I had hoped initially.  However, it does not explain why no 
>>>jobs get assigned to mymachine.q.3, which is the only queue to which they 
>>>should get assigned.  It appears that jobs get rejected from this queue 
>>>because the scheduler believes mymachine.3.mpi is too full.
>>>
>>>qstat -g t gives the following:
>>>
>>>job-ID  prior   name       user         state submit/start at     queue                          master ja-task-ID
>>>------------------------------------------------------------------------------------------------------------------
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local0                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local0                 MASTER
>>>
>>>mymachine.q.0 at local0 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local1                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local1                 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local2                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local2                 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local3                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local3                 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local4                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local4                 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local5                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local5                 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local6                 MASTER
>>>
>>>mymachine.q.0 at local6 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local6                 SLAVE
>>>    47 0.55500 Job2        user        r     04/27/2005 14:45:04 
>>>mymachine.q.0 at local7                 SLAVE
>>>    59 0.55500 Job1        user        r     04/29/2005 10:47:08 
>>>mymachine.q.0 at local7                 SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local10                SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local10                SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local11                SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local11                SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local12                MASTER
>>>
>>>mymachine.q.1 at local12 SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local12                SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local13                SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local13                SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local14                SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local14                SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local15                SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local15                SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local8                 SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local8                 SLAVE
>>>    44 0.55500 Job3        user        r     04/27/2005 11:55:49 
>>>mymachine.q.1 at local9                 SLAVE
>>>    60 0.55500 Job4        user        r     04/29/2005 10:55:53 
>>>mymachine.q.1 at local9                 MASTER
>>>
>>>mymachine.q.1 at local9 SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local16                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local16                MASTER
>>>
>>>mymachine.q.2 at local16 SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local17                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local17                SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local18                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local18                SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local19                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local19                SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local20                MASTER
>>>
>>>mymachine.q.2 at local20 SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local20                SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local21                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local21                SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local22                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local22                SLAVE
>>>    48 0.55500 Job6        user        r     04/27/2005 14:57:53 
>>>mymachine.q.2 at local23                SLAVE
>>>    49 0.55500 Job5        user        r     04/27/2005 15:01:53 
>>>mymachine.q.2 at local23                SLAVE
>>>    61 0.55500 Job7        user        qw    04/29/2005 11:19:54
>>>
>>>Tim
>>>
>>>----- Original Message ----- 
>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>To: <users at gridengine.sunsource.net>
>>>Sent: Friday, April 29, 2005 1:24 PM
>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>
>>>
>>>      
>>>
>>>>Aha Tim,
>>>>
>>>>now I understand your setup. As the naming of the masterq, e.g.
>>>>mymachine.q.1 at local9 is inside your intended configuration, what shows
>>>>
>>>>qstat -g t
>>>>
>>>>Maybe the output of the granted PE is just wrong, but all is working as
>>>>intended? - Reuti
>>>>
>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>
>>>>        
>>>>
>>>>>There are 32 machines, each dual-processor with names
>>>>>
>>>>>local0
>>>>>local1
>>>>>..
>>>>>local31
>>>>>
>>>>>They are grouped together with four 8-port gigabit switches.  Each group
>>>>>was given a queue, a PE, and a hostgroup.  So for example @mymachine-0
>>>>>contains
>>>>>
>>>>>local0
>>>>>local1
>>>>>..
>>>>>local7
>>>>>
>>>>>local0-local7 are all connected via both the central cluster switch and
>>>>>a local gigabit switch.
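>>>>>
>>>>>(For reference, the hostgroup definition is roughly the following - this
>>>>>is a sketch of it rather than the literal qconf -shgrp output:
>>>>>
>>>>>group_name @mymachine-0
>>>>>hostlist local0 local1 local2 local3 local4 local5 local6 local7
>>>>>
>>>>>and likewise @mymachine-1 to @mymachine-3 for the other three groups.)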
>>>>>
>>>>>I should also note that I am using hostname aliasing to ensure that the
>>>>>ethernet interface connected to the gigabit switch is used by Grid
>>>>>Engine.  So I have a host_aliases file set up as follows:
>>>>>
>>>>>local0 node0
>>>>>local1 node1
>>>>>..
>>>>>local31 node31
>>>>>
>>>>>where "nodeX" is the primary hostname for each machine and resolves to
>>>>>the interface that connects to the central cluster switch.  "localX"
>>>>>resolves to an address that connects via the gigabit interface if
>>>>>possible.  The "localX" names do not resolve consistently across the
>>>>>cluster -- for example, if I am on node0 and I ping local1, it will do
>>>>>so over the gigabit interface.  However, if I am on node31 and I ping
>>>>>local1, it will do so over the non-gigabit interface, because there is
>>>>>no gigabit connection between node31 and node1.
>>>>>
>>>>>Tim
>>>>>
>>>>>----- Original Message ----- 
>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>To: <users at gridengine.sunsource.net>
>>>>>Sent: Friday, April 29, 2005 12:15 PM
>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>
>>>>>
>>>>>          
>>>>>
>>>>>>Tim,
>>>>>>
>>>>>>thanks, but I'm still not sure about your setup. You stated that you
>>>>>>have 32 dual machines. So you made a hostgroup @mymachine-0 - which
>>>>>>machines are set up therein? And why so many queues at all?
>>>>>>
>>>>>>CU - Reuti
>>>>>>
>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>
>>>>>>            
>>>>>>
>>>>>>>Hi,
>>>>>>>
>>>>>>>I get:
>>>>>>>
>>>>>>>     59 0.55500 Job1    user        r     04/29/2005 10:47:08
>>>>>>>mymachine.q.0 at local0                     8
>>>>>>>       Full jobname:     Job1
>>>>>>>       Master queue:     mymachine.q.0 at local0
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Granted PE:       mymachine.3.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>     47 0.55500 Job2    user        r     04/27/2005 14:45:04
>>>>>>>mymachine.q.0 at local6                     8
>>>>>>>       Full jobname:     Job2
>>>>>>>       Master queue:     mymachine.q.0 at local6
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Granted PE:       mymachine.3.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>     44 0.55500 Job3    user        r     04/27/2005 11:55:49
>>>>>>>mymachine.q.1 at local12                    8
>>>>>>>       Full jobname:     Job3
>>>>>>>       Master queue:     mymachine.q.1 at local12
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Granted PE:       mymachine.3.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>     60 0.55500 Job4    user        r     04/29/2005 10:55:53
>>>>>>>mymachine.q.1 at local9                     8
>>>>>>>       Full jobname:     Job4
>>>>>>>       Master queue:     mymachine.q.1 at local9
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Granted PE:       mymachine.3.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>     49 0.55500 Job5    user        r     04/27/2005 15:01:53
>>>>>>>mymachine.q.2 at local16                    8
>>>>>>>       Full jobname:     Job5
>>>>>>>       Master queue:     mymachine.q.2 at local16
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Granted PE:       mymachine.3.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>     48 0.55500 Job6    user        r     04/27/2005 14:57:53
>>>>>>>mymachine.q.2 at local20                    8
>>>>>>>       Full jobname:     Job6
>>>>>>>       Master queue:     mymachine.q.2 at local20
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Granted PE:       mymachine.3.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>     61 0.55500 Job7    user        r    04/29/2005 11:19:54
>>>>>>>8
>>>>>>>       Full jobname:     Job7
>>>>>>>       Requested PE:     *.mpi 8
>>>>>>>       Hard Resources:
>>>>>>>       Soft Resources:
>>>>>>>
>>>>>>>When I do qconf -sp mymachine.3.mpi, I get:
>>>>>>>
>>>>>>>pe_name           mymachine.3.mpi
>>>>>>>slots             16
>>>>>>>user_lists        NONE
>>>>>>>xuser_lists       NONE
>>>>>>>start_proc_args   /bin/true
>>>>>>>stop_proc_args    /opt/lam/intel/bin/sge-lamhalt
>>>>>>>allocation_rule   $round_robin
>>>>>>>control_slaves    TRUE
>>>>>>>job_is_first_task FALSE
>>>>>>>urgency_slots     avg
>>>>>>>
>>>>>>>When I do qconf -sq mymachine.q.0, I get
>>>>>>>
>>>>>>>qname                 mymachine.q.0
>>>>>>>hostlist              @mymachine-0
>>>>>>>seq_no                0
>>>>>>>load_thresholds       NONE
>>>>>>>suspend_thresholds    NONE
>>>>>>>nsuspend              1
>>>>>>>suspend_interval      00:05:00
>>>>>>>priority              0
>>>>>>>min_cpu_interval      00:05:00
>>>>>>>processors            UNDEFINED
>>>>>>>qtype                 BATCH INTERACTIVE
>>>>>>>ckpt_list             NONE
>>>>>>>pe_list               mymachine.0.mpi
>>>>>>>rerun                 FALSE
>>>>>>>slots                 2
>>>>>>>tmpdir                /tmp
>>>>>>>shell                 /bin/bash
>>>>>>>prolog                NONE
>>>>>>>epilog                NONE
>>>>>>>shell_start_mode      posix_compliant
>>>>>>>starter_method        NONE
>>>>>>>suspend_method        NONE
>>>>>>>resume_method         NONE
>>>>>>>terminate_method      NONE
>>>>>>>notify                00:00:60
>>>>>>>owner_list            sgeadmin
>>>>>>>user_lists            NONE
>>>>>>>xuser_lists           NONE
>>>>>>>subordinate_list      NONE
>>>>>>>complex_values        NONE
>>>>>>>projects              NONE
>>>>>>>xprojects             NONE
>>>>>>>calendar              NONE
>>>>>>>initial_state         default
>>>>>>>s_rt                  84:00:00
>>>>>>>h_rt                  84:15:00
>>>>>>>s_cpu                 INFINITY
>>>>>>>h_cpu                 INFINITY
>>>>>>>s_fsize               INFINITY
>>>>>>>h_fsize               INFINITY
>>>>>>>s_data                INFINITY
>>>>>>>h_data                INFINITY
>>>>>>>s_stack               INFINITY
>>>>>>>h_stack               INFINITY
>>>>>>>s_core                INFINITY
>>>>>>>h_core                INFINITY
>>>>>>>s_rss                 1G
>>>>>>>h_rss                 1G
>>>>>>>s_vmem                INFINITY
>>>>>>>h_vmem                INFINITY
>>>>>>>
>>>>>>>And so on, up to mymachine.q.3.
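>>>>>>>
>>>>>>>(The other three queues differ only in the obvious substitutions -
>>>>>>>roughly, not the literal qconf -sq output:
>>>>>>>
>>>>>>>qname                 mymachine.q.1
>>>>>>>hostlist              @mymachine-1
>>>>>>>pe_list               mymachine.1.mpi
>>>>>>>
>>>>>>>and correspondingly for mymachine.q.2 and mymachine.q.3.)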
>>>>>>>
>>>>>>>Tim
>>>>>>>
>>>>>>>----- Original Message ----- 
>>>>>>>From: "Reuti" <reuti at staff.uni-marburg.de>
>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>Sent: Friday, April 29, 2005 11:14 AM
>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>
>>>>>>>
>>>>>>>              
>>>>>>>
>>>>>>>>Hi Tim,
>>>>>>>>
>>>>>>>>what is:
>>>>>>>>
>>>>>>>>qstat -r
>>>>>>>>
>>>>>>>>showing as granted PEs? - Reuti
>>>>>>>>
>>>>>>>>
>>>>>>>>Quoting Tim Mueller <tim_mueller at hotmail.com>:
>>>>>>>>
>>>>>>>>                
>>>>>>>>
>>>>>>>>>Hi,
>>>>>>>>>
>>>>>>>>>That's the problem.  The setup is actually
>>>>>>>>>
>>>>>>>>>mymachine.q.0 references mymachine.0.mpi
>>>>>>>>>mymachine.q.1 references mymachine.1.mpi
>>>>>>>>>mymachine.q.2 references mymachine.2.mpi
>>>>>>>>>mymachine.q.3 references mymachine.3.mpi
>>>>>>>>>
>>>>>>>>>There is no reason, as far as I can tell, that a job could ever be in
>>>>>>>>>both mymachine.3.mpi and mymachine.q.1.  And oddly enough, when I use
>>>>>>>>>wildcards, the scheduler won't put a job assigned to mymachine.3.mpi
>>>>>>>>>into mymachine.q.3 until all of the other queues are full.  At that
>>>>>>>>>point, it's too late because mymachine.3.mpi is using 48 slots, when
>>>>>>>>>it's only allowed to use up to 16.
>>>>>>>>>
>>>>>>>>>When I don't use wildcards, I get the behavior I expect: a job
>>>>>>>>>submitted to mymachine.3.mpi gets put in mymachine.q.3, etc.
>>>>>>>>>
>>>>>>>>>Tim
>>>>>>>>>
>>>>>>>>>----- Original Message ----- 
>>>>>>>>>From: "Stephan Grell - Sun Germany - SSG - Software Engineer"
>>>>>>>>><stephan.grell at sun.com>
>>>>>>>>>To: <users at gridengine.sunsource.net>
>>>>>>>>>Sent: Friday, April 29, 2005 2:34 AM
>>>>>>>>>Subject: Re: [GE users] Wildcards in PE still broken in 6.0u3
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                  
>>>>>>>>>
>>>>>>>>>>Hi Tim,
>>>>>>>>>>
>>>>>>>>>>I am not quite sure I understand your setup. Could you please attach
>>>>>>>>>>your cqueue configuration? From the results you posted, it reads as
>>>>>>>>>>if:
>>>>>>>>>>
>>>>>>>>>>queue mymachine.q.0 references mymachine.3.mpi
>>>>>>>>>>queue mymachine.q.1 references mymachine.3.mpi
>>>>>>>>>>
>>>>>>>>>>and so on.
>>>>>>>>>>
>>>>>>>>>>Cheers,
>>>>>>>>>>Stephan
>>>>>>>>>>
>>>>>>>>>>Tim Mueller wrote:
>>>>>>>>>>
>>>>>>>>>>                    
>>>>>>>>>>
>>>>>>>>>>>Hi,
>>>>>>>>>>>
>>>>>>>>>>>It appears that wildcards in the Parallel Environment name still have
>>>>>>>>>>>problems in 6.0u3.  I have set up a cluster of 32 dual-processor
>>>>>>>>>>>Noconas running Linux.  There are 4 queues of 16 processors each, and
>>>>>>>>>>>a corresponding PE for each queue.  The queues are named as follows:
>>>>>>>>>>>
>>>>>>>>>>>mymachine.q.0
>>>>>>>>>>>mymachine.q.1
>>>>>>>>>>>mymachine.q.2
>>>>>>>>>>>mymachine.q.3
>>>>>>>>>>>
>>>>>>>>>>>And the PEs are
>>>>>>>>>>>
>>>>>>>>>>>mymachine.0.mpi
>>>>>>>>>>>mymachine.1.mpi
>>>>>>>>>>>mymachine.2.mpi
>>>>>>>>>>>mymachine.3.mpi
>>>>>>>>>>>
>>>>>>>>>>>All of the PEs have 16 slots.  When I submit a job with the following
>>>>>>>>>>>line:
>>>>>>>>>>>
>>>>>>>>>>>#$ -pe *.mpi 8
>>>>>>>>>>>
>>>>>>>>>>>the job will be assigned to a seemingly random PE, but then placed in
>>>>>>>>>>>a queue that does not correspond to that PE.  I can submit up to 6
>>>>>>>>>>>jobs this way, each of which will get assigned to the same PE and
>>>>>>>>>>>placed in any queue that does not correspond to the PE.  This causes
>>>>>>>>>>>48 processors to be used for a PE with only 16 slots.  E.g., I might
>>>>>>>>>>>get:
>>>>>>>>>>>
>>>>>>>>>>>Job 1        mymachine.3.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>Job 2        mymachine.3.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>Job 3        mymachine.3.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>Job 4        mymachine.3.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>Job 5        mymachine.3.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>Job 6        mymachine.3.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>Job 7        qw
>>>>>>>>>>>Job 8        qw
>>>>>>>>>>>
>>>>>>>>>>>When I should get:
>>>>>>>>>>>
>>>>>>>>>>>Job 1        mymachine.0.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>Job 2        mymachine.0.mpi        mymachine.q.0        8 processors
>>>>>>>>>>>Job 3        mymachine.1.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>Job 4        mymachine.1.mpi        mymachine.q.1        8 processors
>>>>>>>>>>>Job 5        mymachine.2.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>Job 6        mymachine.2.mpi        mymachine.q.2        8 processors
>>>>>>>>>>>Job 7        mymachine.3.mpi        mymachine.q.3        8 processors
>>>>>>>>>>>Job 8        mymachine.3.mpi        mymachine.q.3        8 processors
>>>>>>>>>>>
>>>>>>>>>>>If I try to then submit a job directly (with no wildcard) to the PE
>>>>>>>>>>>that all of the jobs were assigned to, it will not run because I have
>>>>>>>>>>>already far exceeded the slots limit for this PE.
>>>>>>>>>>>
>>>>>>>>>>>I should note that when I do not use wildcards, everything behaves as
>>>>>>>>>>>it should.  E.g., a job submitted to mymachine.2.mpi will be assigned
>>>>>>>>>>>to mymachine.2.mpi and mymachine.q.2, and I cannot use more than 16
>>>>>>>>>>>slots in mymachine.2.mpi at once.
>>>>>>>>>>>
>>>>>>>>>>>I searched the list, and although there seem to have been other
>>>>>>>>>>>problems with wildcards in the past, I have seen nothing that
>>>>>>>>>>>references this behavior.  Does anyone have an explanation /
>>>>>>>>>>>workaround?
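>>>>>>>>>>>
>>>>>>>>>>>(For now I am simply avoiding the wildcard and requesting a specific
>>>>>>>>>>>PE, e.g. something like
>>>>>>>>>>>
>>>>>>>>>>>#$ -pe mymachine.0.mpi 8
>>>>>>>>>>>
>>>>>>>>>>>in the job script, which behaves correctly but of course defeats the
>>>>>>>>>>>purpose of the wildcard.)
>>>>>>>>>>>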
>>>>>>>>>>> Tim


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



