[GE users] PE Slots Problem

Brian R. Smith brs at usf.edu
Thu Aug 9 21:39:01 BST 2007



Andreas & Reuti,

No, there is no load threshold defined for that queue, and there are no 
other jobs running on the host; the load is at 0.00.  Is there any other 
information I can provide?
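
For reference, checks along these lines should confirm both settings (the
execution host name below is just a placeholder):

   qconf -sq smp.8.q | grep load_thresholds    # queue load threshold (stock default is np_load_avg=1.75)
   qconf -ssconf | egrep 'job_load_adjustments|load_adjustment_decay_time'
   qhost -h <exec_host>                        # current load on the execution host

A rough checklist for the "only offers 0 slots" message itself is at the
bottom of this mail.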

Thanks for your help.

-Brian

Andreas.Haas at Sun.COM wrote:
> Hi Brian,
>
> Are you using a load threshold for smp.8.q? In combination with 
> sched_conf(5) job_load_adjustments, this can produce a high hypothetical 
> load for an 8-way job. The result is that the scheduler decides not to 
> assign the job, so as to prevent smp.8.q from going into a load alarm 
> state afterwards. Usually that is the cause.
>
> Regards,
> Andreas
>
> On Thu, 9 Aug 2007, Brian R. Smith wrote:
>
>> I sent this message out when the mailing lists were down, so I'm 
>> sending it again.  I'd also like to add that I made sure any error 
>> states on hosts offering PE mpi.p4 were cleared, in case anyone is 
>> wondering.
>>
>> ...
>>
>> Hi all,
>>
>> We're on GridEngine 6.0-u8 (yeah, I know, but we'll be upgrading to 6.1
>> in the next couple of weeks).  It's been fairly trouble-free, but I've
>> just run into an interesting problem.  Perhaps someone can shed some
>> light on it.
>>
>> A user has submitted an 8-processor job to an 8-way Opteron box.  The
>> queue for this box has been configured to support the parallel
>> environment mpi.p4, as shown here:
>>
>> [root at host ~]# qconf -sq smp.8.q | grep pe_list
>> pe_list               mpi.shm mpi.p4 ompi.tcp ompi openmp
>>
>> Also, the PE itself is configured like so:
>>
>> [root at host ~]# qconf -sp mpi.p4
>> pe_name           mpi.p4
>> slots             999
>> user_lists        NONE
>> xuser_lists       NONE
>> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile \
>>                       /usr/local/priv/mpi/bin/mpirun
>> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
>> allocation_rule   $round_robin
>> control_slaves    TRUE
>> job_is_first_task FALSE
>> urgency_slots     min
>>
>> We can see from qstat -r that this PE is being requested by only this
>> particular job:
>>
>> [root at host ~]# qstat -r | grep mpi.p4
>>      Requested PE:     mpi.p4 8
>>
>> and that only 8 slots are being requested.
>>
>> When this job is submitted, it sits in the queue and qstat -j reports
>>
>> [root at host ~]# qstat -j
>> ==============================================================
>> job_number:                 41428
>> exec_file:                  job_scripts/41428
>> submission_time:            Thu Aug  2 11:16:25 2007
>> ...
>> cannot run in PE "mpi.p4" because it only offers 0 slots
>>
>> After blowing through a bunch of other queues with lower seq_no, it hits
>> smp.8.q.  There are enough slots on the queue to satisfy the job, but
>> the scheduler claims that there aren't enough provided by the PE.  I was
>> tempted to look for some global value for PE slots but that seems a bit
>> ridiculous.  Has anyone seen this before?  Is this a bug that was
>> corrected in a later release?  Did I miss something obvious?
>>
>> In case you're curious, the complete list of current PE requests is the
>> following:
>>
>> [root at irce qmaster]# qstat -r | grep PE
>>      Requested PE:     ompi.ib 1
>>      Granted PE:       ompi.ib 1
>>      Requested PE:     ompi.ib 4
>>      Granted PE:       ompi.ib 4
>>      Requested PE:     ompi.ib 40
>>      Granted PE:       ompi.ib 40
>>      Requested PE:     ompi.mx 10
>>      Granted PE:       ompi.mx 10
>>      Requested PE:     mpi.mx 4
>>      Granted PE:       mpi.mx 4
>>      Requested PE:     ompi.mx 14
>>      Granted PE:       ompi.mx 14
>>      Requested PE:     ompi.mx 8
>>      Granted PE:       ompi.mx 8
>>      Requested PE:     ompi.mx 14
>>      Requested PE:     ompi.mx 12
>>      Requested PE:     ompi.mx 12
>>      Requested PE:     ompi.mx 12
>>      Requested PE:     ompi.mx 12
>>      Requested PE:     ompi.mx 12
>>      Requested PE:     ompi.mx 14
>>      Requested PE:     mpi.p4 8
>>      Requested PE:     ompi.mx 10
>>      Requested PE:     ompi.tcp 8
>>
>> I'd appreciate any suggestions!
>>
>> Thanks,
>> Brian Smith
>>
>
> http://gridengine.info/
>
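
In case it helps anyone searching the archives later, here is a rough
checklist that might help narrow down the "only offers 0 slots" message.
The queue, PE, and job number are the ones from the output above; nothing
else is site-specific:

   qselect -pe mpi.p4              # queue instances that actually offer the PE
   qstat -g c -q smp.8.q           # used vs. available slots in the cluster queue
   qconf -sp mpi.p4 | grep slots   # the PE "slots" value is a cluster-wide total
                                   # across all queues referencing the PE
   qstat -j 41428                  # scheduler's reasons (requires schedd_job_info
                                   # set to "true" in sched_conf)
   qconf -tsm                      # dump the next scheduling run to
                                   # $SGE_ROOT/$SGE_CELL/common/schedd_runlog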


-- 
Brian R. Smith
HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. LIB618
Office Phone: +1 813 974-1467
Mobile Phone: +1 813 230-3441
Organization URL: http://rc.usf.edu

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



