[GE users] PE Slots Problem

Reuti reuti at staff.uni-marburg.de
Thu Aug 9 16:23:13 BST 2007


On 09.08.2007 at 15:52, Brian R. Smith wrote:

> I sent this message out when the mailing lists were down so I'm  
> sending it again.  Also, I'd like to add the fact that I did ensure  
> that any error states on hosts with PE mpi.p4 were cleared, in case  
> anyone is wondering.
>
> ...
>
> Hi all,
>
> We're on GridEngine 6.0-u8 (yeah, I know, but we'll be upgrading to 6.1
> in the next couple of weeks).  It's been fairly trouble-free, but I've
> just run into an interesting problem.  Perhaps someone can shed some
> light.
>
> A user has submitted an 8-processor job to an 8-way Opteron box.  The
> queue for this box has been configured to support the parallel
> environment mpi.p4 as we see here:
>
> [root at host ~]# qconf -sq smp.8.q | grep pe_list
> pe_list               mpi.shm mpi.p4 ompi.tcp ompi openmp

Is something already running on this machine? qstat -f would show it.
I ask because more than one PE is attached to the queue. - Reuti
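
A minimal sketch of that check, assuming the SGE client commands are on
the PATH. The function name show_queue_state is made up for
illustration; the queue name passed to it would be smp.8.q here.

```shell
# Show slot usage and state for one queue, plus its slots/pe_list settings.
# show_queue_state is a hypothetical helper name, not part of SGE.
show_queue_state() {
    queue="$1"
    if ! command -v qstat >/dev/null 2>&1; then
        echo "qstat not found; source the SGE settings file first" >&2
        return 1
    fi
    # Full listing restricted to this queue: used/total slots and state flags
    qstat -f -q "$queue"
    # Configuration lines that bound what a parallel job can get here
    qconf -sq "$queue" | grep -E '^(slots|pe_list)'
}
```

Calling "show_queue_state smp.8.q" should reveal whether slots on the
box are already taken by jobs from one of the other PEs in pe_list.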


> Also, the PE itself is configured like so:
>
> [root at host ~]# qconf -sp mpi.p4
> pe_name           mpi.p4
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile \
>                  /usr/local/priv/mpi/bin/mpirun
> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> We can see from qstat -r that the PE is only being requested by this
> particular job:
>
> [root at host ~]# qstat -r | grep mpi.p4
>       Requested PE:     mpi.p4 8
>
> and that only 8 slots are being requested.
>
> When this job is submitted, it sits in the queue and qstat -j reports
>
> [root at host ~]# qstat -j
> ==============================================================
> job_number:                 41428
> exec_file:                  job_scripts/41428
> submission_time:            Thu Aug  2 11:16:25 2007
> ...
> cannot run in PE "mpi.p4" because it only offers 0 slots
>
> After blowing through a bunch of other queues with lower seq_no, it
> hits smp.8.q.  There are enough slots on the queue to satisfy the job,
> but the scheduler claims that there aren't enough provided by the PE.
> I was tempted to look for some global value for PE slots but that
> seems a bit ridiculous.  Has anyone seen this before?  Is this a bug
> that was corrected in a later release?  Did I miss something obvious?
>
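
Since the message names the PE rather than a queue, it may help to list
every queue that actually references mpi.p4. A rough sketch follows;
queues_with_pe is a hypothetical helper name, and it assumes each
pe_list fits on one line, with no "\" continuation and no [host=...]
overrides.

```shell
# Print the cluster queues whose pe_list names the given PE.
# queues_with_pe is a hypothetical helper name, not part of SGE.
# Assumption: pe_list is a single line in "qconf -sq" output.
queues_with_pe() {
    pe="$1"
    # qconf -sql lists all cluster queues, one per line
    for q in $(qconf -sql); do
        if qconf -sq "$q" | awk -v pe="$pe" '
            $1 == "pe_list" { for (i = 2; i <= NF; i++) if ($i == pe) found = 1 }
            END { exit !found }'; then
            echo "$q"
        fi
    done
}
```

Running "queues_with_pe mpi.p4" gives the set of queues the scheduler
can consider for this job, which can then be compared with qstat -f.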
> In case you are curious, my complete PE requests, in this case, are
> the following:
>
> [root at irce qmaster]# qstat -r | grep PE
>       Requested PE:     ompi.ib 1
>       Granted PE:       ompi.ib 1
>       Requested PE:     ompi.ib 4
>       Granted PE:       ompi.ib 4
>       Requested PE:     ompi.ib 40
>       Granted PE:       ompi.ib 40
>       Requested PE:     ompi.mx 10
>       Granted PE:       ompi.mx 10
>       Requested PE:     mpi.mx 4
>       Granted PE:       mpi.mx 4
>       Requested PE:     ompi.mx 14
>       Granted PE:       ompi.mx 14
>       Requested PE:     ompi.mx 8
>       Granted PE:       ompi.mx 8
>       Requested PE:     ompi.mx 14
>       Requested PE:     ompi.mx 12
>       Requested PE:     ompi.mx 12
>       Requested PE:     ompi.mx 12
>       Requested PE:     ompi.mx 12
>       Requested PE:     ompi.mx 12
>       Requested PE:     ompi.mx 14
>       Requested PE:     mpi.p4 8
>       Requested PE:     ompi.mx 10
>       Requested PE:     ompi.tcp 8
>
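
The listing above can be tallied per PE to see how many slots each one
has actually handed out. This is only a sketch: granted_per_pe is a
made-up helper name, and it assumes the "Granted PE:" lines look exactly
like the ones quoted here, without the leading "> ".

```shell
# Sum granted slots per PE from "qstat -r" style output read on stdin.
# granted_per_pe is a hypothetical helper name, not part of SGE.
granted_per_pe() {
    # Fields on a matching line: "Granted" "PE:" <pe_name> <slots>
    awk '/Granted PE:/ { slots[$3] += $4 }
         END { for (pe in slots) print pe, slots[pe] }' | sort
}
```

Used as "qstat -r | granted_per_pe", the list above would come out as
mpi.mx 4, ompi.ib 45 and ompi.mx 32. mpi.p4 does not appear at all, so
no running job holds slots from that PE, and its own limit of 999 slots
does not look like the constraint.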
> I'd appreciate any suggestions!
>
> Thanks,
> Brian Smith
>
> -- 
> Brian R. Smith
> HPC Systems Administrator
> Research Computing, University of South Florida
> 4202 E. Fowler Ave. LIB618
> Office Phone: +1 813 974-1467
> Mobile Phone: +1 813 230-3441
> Organization URL: http://rc.usf.edu
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net




