[GE users] SMP trouble

Chris Dagdigian dag at sonsorol.org
Mon Sep 1 16:43:48 BST 2008


According to Grid Engine all is well -- your job is running and
consuming 4 out of 4 available slots. Anything that is not working at
this point is most likely an application issue. In this context all
SGE gives you is exclusive use of the node for your threaded job; it
does not start the threads itself, so the job has to do that.
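
For what it is worth, here is a minimal sketch of a job script that
passes the granted slot count on to the application. It assumes the
program is OpenMP-based and reads OMP_NUM_THREADS, which nothing in
this thread confirms, and the solver command at the end is a
hypothetical placeholder. NSLOTS is set by Grid Engine for parallel
jobs, and -S forces /bin/sh because the calcolo queue is configured
with shell /bin/csh.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -q calcolo
#$ -pe smp4 4
#$ -cwd

# Grid Engine exports NSLOTS with the number of slots granted by the PE.
# Fall back to 1 so the script also works when run outside Grid Engine.
threads="${NSLOTS:-1}"

# ASSUMPTION: the application honours OMP_NUM_THREADS (OpenMP).
# If it takes a thread count some other way, pass "$threads" there instead.
OMP_NUM_THREADS="$threads"
export OMP_NUM_THREADS

echo "starting with $OMP_NUM_THREADS thread(s)"

# Hypothetical placeholder: replace with the real solver command.
# ./my_solver input.dat
```

If the job still uses one CPU with this in place, the thread count is
probably being set somewhere else in the application's own setup.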

-Chris


On Sep 1, 2008, at 11:16 AM, Luca Tola wrote:

> Chris Dagdigian wrote:
>>
>> Your configuration looks OK at first glance. The PE looks good, the  
>> PE is attached to the calcolo queue and your qstat shows empty  
>> queues with one queue down but the others idle and available.
>>
>> How do you know that your job is not running 4 threads via SMP?
> With the "top" command: after pressing "1" I can see the load on
> each of my CPUs.
>
>> Does it run fine on 4 CPUs outside of Grid Engine on that host?
> I don't know. But when I run the same job on the cluster without
> SGE, everything works fine.
>>
>> From SGE's qstat view you would expect to see "4/4" in the slots  
>> column for the calcolo queue indicating that 4 out of 4 available  
>> slots are blocked/consumed by your SGE job. There would really be  
>> no other indication from SGE other than that if the job runs and is  
>> dispatched OK.
> Here is my qstat -f output while the job is running under SGE:
>
> [sgeadmin@sge jobs]$ qstat -f
>
> queuename                      qtype used/tot. load_avg arch          states
> ----------------------------------------------------------------------------
> all.q@PING6                    BIP   0/1       -NA-     lx24-x86      au
> ----------------------------------------------------------------------------
> all.q@cluster                  BIP   0/4       0.00     lx24-amd64
> ----------------------------------------------------------------------------
> all.q@sge-glexec1              BIP   0/1       0.03     lx24-x86
> ----------------------------------------------------------------------------
> calcolo@cluster                BIP   4/4       0.00     lx24-amd64
>     467 0.55500 run_eng    sgeadmin     r     09/01/2008 17:03:13     4
> ----------------------------------------------------------------------------
> post_processing@sge-glexec1    BIP   0/1       0.03     lx24-x86
>
>
>
>> -Chris
>>
>>
>>
>> On Sep 1, 2008, at 10:37 AM, Luca Tola wrote:
>>
>>> Hi all,
>>> I'm trying to run a parallel SMP job on a single 4-CPU execution host.
>>> I've created my PE with qconf -ap smp4:
>>>
>>> pe_name            smp4
>>> slots              4
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    NONE
>>> stop_proc_args     NONE
>>> allocation_rule    $pe_slots
>>> control_slaves     FALSE
>>> job_is_first_task  TRUE
>>> urgency_slots      max
>>>
>>> My queue configuration is:
>>> qname                 calcolo
>>> hostlist              cluster
>>> seq_no                0
>>> load_thresholds       np_load_avg=1.75
>>> suspend_thresholds    NONE
>>> nsuspend              1
>>> suspend_interval      00:05:00
>>> priority              0
>>> min_cpu_interval      00:05:00
>>> processors            4
>>> qtype                 BATCH INTERACTIVE
>>> ckpt_list             NONE
>>> pe_list               make smp4
>>> rerun                 FALSE
>>> slots                 4
>>> tmpdir                /tmp
>>> shell                 /bin/csh
>>> prolog                NONE
>>> epilog                NONE
>>> ...
>>> ...
>>> ...
>>> default
>>> ...
>>> ...
>>>
>>> When I submit with:
>>> qsub -q calcolo -pe smp4 4 run_eng.sh
>>> the job runs, but on only one CPU instead of all the CPUs of my
>>> execution host.
>>>
>>> qstat -f
>>>
>>> queuename                      qtype used/tot. load_avg arch          states
>>> ----------------------------------------------------------------------------
>>> all.q@PING6                    BIP   0/1       -NA-     lx24-x86      au
>>> ----------------------------------------------------------------------------
>>> all.q@cluster                  BIP   0/4       0.03     lx24-amd64
>>> ----------------------------------------------------------------------------
>>> all.q@sge-glexec1              BIP   0/1       0.00     lx24-x86
>>> ----------------------------------------------------------------------------
>>> calcolo@cluster                BIP   0/4       0.03     lx24-amd64
>>> ----------------------------------------------------------------------------
>>> post_processing@sge-glexec1    BIP   0/1       0.00     lx24-x86
>>>
>>> Can somebody tell me where I'm making a mistake?
>>>
>>> Thanks
>>> Luca
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>> For additional commands, e-mail: users-help at gridengine.sunsource.net



