[GE users] PE Slots Problem

Andreas.Haas at Sun.COM
Thu Aug 9 17:03:10 BST 2007


Hi Brian,

Are you using a load threshold for smp.8.q? Combined with the
job_load_adjustments setting in sched_conf(5), this can produce a high
hypothetical load for this 8-way job. The scheduler then decides not to
dispatch the job, to prevent smp.8.q from going into load alarm state
afterwards. That is usually the cause.
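
To illustrate the effect, here is a minimal sketch in Python. It is a
simplified model, not the scheduler's actual code: it assumes the
job_load_adjustments value is added once per requested slot, and that the
queue counts as being in load alarm once the result reaches the load
threshold. The function names and the numbers (threshold 1.75, adjustment
0.50, current load 0.10) are made up for illustration and are not taken
from your configuration. The real values can be read with
"qconf -sq smp.8.q" (load_thresholds) and "qconf -ssconf"
(job_load_adjustments).

```python
# Simplified model of the scheduler's "hypothetical load" check.
# Assumption: the per-job load adjustment is applied once per slot.


def hypothetical_load(current_load, adjustment_per_slot, slots):
    """Load the scheduler would expect right after starting the job."""
    return current_load + adjustment_per_slot * slots


def would_trigger_alarm(current_load, adjustment_per_slot, slots, threshold):
    """True if the expected load reaches the queue's load threshold.

    Assumption: the alarm condition is "load >= threshold".
    """
    return hypothetical_load(current_load, adjustment_per_slot, slots) >= threshold


# Illustrative numbers only (hypothetical, not from the poster's cluster).
CURRENT_LOAD = 0.10   # load currently reported by the host
ADJUSTMENT = 0.50     # value configured in job_load_adjustments
THRESHOLD = 1.75      # value configured in the queue's load_thresholds

for slots in (1, 8):
    blocked = would_trigger_alarm(CURRENT_LOAD, ADJUSTMENT, slots, THRESHOLD)
    print(slots, "slot(s):", "not dispatched" if blocked else "dispatched")
```

Under these assumed numbers a single-slot job stays below the threshold
while the 8-slot request does not, which would match the symptom you see.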

Regards,
Andreas

On Thu, 9 Aug 2007, Brian R. Smith wrote:

> I sent this message out when the mailing lists were down so I'm sending it 
> again.  Also, I'd like to add the fact that I did ensure that any error 
> states on hosts with PE mpi.p4 were cleared, in case anyone is wondering.
>
> ...
>
> Hi all,
>
> We're on GridEngine 6.0-u8 (yeah, I know, but we'll be upgrading to 6.1
> in the next couple of weeks).  It's been fairly trouble-free but I've
> just run into an interesting problem.  Perhaps someone can shed some
> light.
>
> A user has submitted an 8-processor job to an 8-way Opteron box.  The
> queue for this box has been configured to support the parallel
> environment mpi.p4 as we see here:
>
> [root at host ~]# qconf -sq smp.8.q | grep pe_list
> pe_list               mpi.shm mpi.p4 ompi.tcp ompi openmp
>
> Also, the PE itself is configured like so:
>
> [root at host ~]# qconf -sp mpi.p4
> pe_name           mpi.p4
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile  \
>                 /usr/local/priv/mpi/bin/mpirun
> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
> allocation_rule   $round_robin
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> We can see from qstat -r that the PE is only being requested by this
> particular job:
>
> [root at host ~]# qstat -r | grep mpi.p4
>      Requested PE:     mpi.p4 8
>
> and that only 8 slots are being requested.
>
> When this job is submitted, it sits in the queue and qstat -j reports
>
> [root at host ~]# qstat -j
> ==============================================================
> job_number:                 41428
> exec_file:                  job_scripts/41428
> submission_time:            Thu Aug  2 11:16:25 2007
> ...
> cannot run in PE "mpi.p4" because it only offers 0 slots
>
> After blowing through a bunch of other queues with lower seq_no, it hits
> smp.8.q.  There are enough slots on the queue to satisfy the job, but
> the scheduler claims that there aren't enough provided by the PE.  I was
> tempted to look for some global value for PE slots but that seems a bit
> ridiculous.  Has anyone seen this before?  Is this a bug that was
> corrected in a later release?  Did I miss something obvious?
>
> In case you are curious, my complete PE requests, in this case, are the
> following:
>
> [root at irce qmaster]# qstat -r | grep PE
>      Requested PE:     ompi.ib 1
>      Granted PE:       ompi.ib 1
>      Requested PE:     ompi.ib 4
>      Granted PE:       ompi.ib 4
>      Requested PE:     ompi.ib 40
>      Granted PE:       ompi.ib 40
>      Requested PE:     ompi.mx 10
>      Granted PE:       ompi.mx 10
>      Requested PE:     mpi.mx 4
>      Granted PE:       mpi.mx 4
>      Requested PE:     ompi.mx 14
>      Granted PE:       ompi.mx 14
>      Requested PE:     ompi.mx 8
>      Granted PE:       ompi.mx 8
>      Requested PE:     ompi.mx 14
>      Requested PE:     ompi.mx 12
>      Requested PE:     ompi.mx 12
>      Requested PE:     ompi.mx 12
>      Requested PE:     ompi.mx 12
>      Requested PE:     ompi.mx 12
>      Requested PE:     ompi.mx 14
>      Requested PE:     mpi.p4 8
>      Requested PE:     ompi.mx 10
>      Requested PE:     ompi.tcp 8
>
> I'd appreciate any suggestions!
>
> Thanks,
> Brian Smith
>
> -- 
> Brian R. Smith
> HPC Systems Administrator
> Research Computing, University of South Florida
> 4202 E. Fowler Ave. LIB618
> Office Phone: +1 813 974-1467
> Mobile Phone: +1 813 230-3441
> Organization URL: http://rc.usf.edu
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>

http://gridengine.info/
