[GE users] PE Slots Problem

Brian R. Smith brs at usf.edu
Thu Aug 9 23:23:43 BST 2007


Got it.  It was just a problem with a boolean complex value that was not
addressed in the queue configuration.  Everything is working fine now.
Thanks for your time.
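
For the archives, a sketch of the kind of check and fix this involved.  The
complex name "exclusive" below is hypothetical (ours differed); the point is
that jobs were requesting a boolean resource the queue never offered, so the
scheduler reported 0 slots:

# hypothetical boolean complex the jobs requested via -l:
[root at host ~]# qconf -sc | grep exclusive
exclusive           excl      BOOL      ==    YES    NO    0    0

# the queue never listed it in complex_values, so nothing could match:
[root at host ~]# qconf -sq smp.8.q | grep complex_values
complex_values        NONE

# fix: edit the queue (qconf -mq smp.8.q) and set
# complex_values        exclusive=TRUE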

-Brian

Brian R. Smith wrote:
> Andreas & Reuti,
>
> No, there is no load threshold defined for that queue and there are no 
> other jobs running on the host.  The load is at 0.00.  Is there any 
> other possible information I can provide?
>
> Thanks for your help.
>
> -Brian
>
> Andreas.Haas at Sun.COM wrote:
>> Hi Brian,
>>
>> Are you using a load threshold for smp.8.q? In combination with the
>> sched_conf(5) job_load_adjustments, this can project a high hypothetical
>> load for an 8-way job. As a result, the scheduler declines to assign the
>> job, so as to prevent smp.8.q from going into a load alarm state
>> afterwards. That is usually the cause.
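>>
>> To check, compare the queue's threshold with the scheduler's per-slot
>> adjustment (a sketch; the values shown are the stock defaults, not
>> necessarily yours):
>>
>> % qconf -sq smp.8.q | grep load_thresholds
>> load_thresholds       np_load_avg=1.75
>> % qconf -ssconf | grep job_load_adjustments
>> job_load_adjustments              np_load_avg=0.50
>>
>> With those defaults an 8-slot job projects roughly 8 x 0.50 = 4.00 of
>> hypothetical np_load_avg, far beyond the 1.75 threshold.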
>>
>> Regards,
>> Andreas
>>
>> On Thu, 9 Aug 2007, Brian R. Smith wrote:
>>
>>> I sent this message out while the mailing lists were down, so I'm
>>> sending it again.  Also, in case anyone is wondering, I did make sure
>>> that any error states on hosts serving PE mpi.p4 were cleared.
>>>
>>> ...
>>>
>>> Hi all,
>>>
>>> We're on GridEngine 6.0-u8 (yeah, I know, but we'll be upgrading to 6.1
>>> in the next couple of weeks).  It's been fairly trouble-free, but I've
>>> just run into an interesting problem.  Perhaps someone can shed some
>>> light.
>>>
>>> A user has submitted an 8-processor job to an 8-way opteron box.  The
>>> queue for this box has been configured to support the parallel
>>> environment mpi.p4 as we see here:
>>>
>>> [root at host ~]# qconf -sq smp.8.q | grep pe_list
>>> pe_list               mpi.shm mpi.p4 ompi.tcp ompi openmp
>>>
>>> Also, the PE itself is configured like so:
>>>
>>> [root at host ~]# qconf -sp mpi.p4
>>> pe_name           mpi.p4
>>> slots             999
>>> user_lists        NONE
>>> xuser_lists       NONE
>>> start_proc_args   /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile \
>>>                   /usr/local/priv/mpi/bin/mpirun
>>> stop_proc_args    /usr/local/sge/mpi/stopmpi.sh
>>> allocation_rule   $round_robin
>>> control_slaves    TRUE
>>> job_is_first_task FALSE
>>> urgency_slots     min
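>>>
>>> (For reference, the job in question would have been submitted along
>>> these lines; the script name here is made up:)
>>>
>>> $ qsub -pe mpi.p4 8 ./solver.sh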
>>>
>>> We can see from qstat -r that this PE is being requested only by this
>>> particular job:
>>>
>>> [root at host ~]# qstat -r | grep mpi.p4
>>>      Requested PE:     mpi.p4 8
>>>
>>> and that only 8 slots are being requested.
>>>
>>> When this job is submitted, it sits in the queue and qstat -j reports
>>>
>>> [root at host ~]# qstat -j
>>> ==============================================================
>>> job_number:                 41428
>>> exec_file:                  job_scripts/41428
>>> submission_time:            Thu Aug  2 11:16:25 2007
>>> ...
>>> cannot run in PE "mpi.p4" because it only offers 0 slots
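>>>
>>> (A useful sanity check at this point, for anyone debugging the same
>>> thing: qalter -w v validates a pending job against a hypothetically
>>> empty cluster, so it reports whether any queue could ever satisfy the
>>> request at all:)
>>>
>>> [root at host ~]# qalter -w v 41428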
>>>
>>> After blowing through a bunch of other queues with lower seq_no, it hits
>>> smp.8.q.  There are enough slots on the queue to satisfy the job, but
>>> the scheduler claims that there aren't enough provided by the PE.  I was
>>> tempted to look for some global value for PE slots, but that seems a bit
>>> ridiculous.  Has anyone seen this before?  Is this a bug that was
>>> corrected in a later release?  Did I miss something obvious?
>>>
>>> In case you are curious, the complete set of PE requests at the moment
>>> is the following:
>>>
>>> [root at irce qmaster]# qstat -r | grep PE
>>>      Requested PE:     ompi.ib 1
>>>      Granted PE:       ompi.ib 1
>>>      Requested PE:     ompi.ib 4
>>>      Granted PE:       ompi.ib 4
>>>      Requested PE:     ompi.ib 40
>>>      Granted PE:       ompi.ib 40
>>>      Requested PE:     ompi.mx 10
>>>      Granted PE:       ompi.mx 10
>>>      Requested PE:     mpi.mx 4
>>>      Granted PE:       mpi.mx 4
>>>      Requested PE:     ompi.mx 14
>>>      Granted PE:       ompi.mx 14
>>>      Requested PE:     ompi.mx 8
>>>      Granted PE:       ompi.mx 8
>>>      Requested PE:     ompi.mx 14
>>>      Requested PE:     ompi.mx 12
>>>      Requested PE:     ompi.mx 12
>>>      Requested PE:     ompi.mx 12
>>>      Requested PE:     ompi.mx 12
>>>      Requested PE:     ompi.mx 12
>>>      Requested PE:     ompi.mx 14
>>>      Requested PE:     mpi.p4 8
>>>      Requested PE:     ompi.mx 10
>>>      Requested PE:     ompi.tcp 8
>>>
>>> I'd appreciate any suggestions!
>>>
>>> Thanks,
>>> Brian Smith
>>>
>>> -- 
>>> Brian R. Smith
>>> HPC Systems Administrator
>>> Research Computing, University of South Florida
>>> 4202 E. Fowler Ave. LIB618
>>> Office Phone: +1 813 974-1467
>>> Mobile Phone: +1 813 230-3441
>>> Organization URL: http://rc.usf.edu
>>>
>>>
>>
>> http://gridengine.info/
>>
>
>


-- 
Brian R. Smith
HPC Systems Administrator
Research Computing, University of South Florida
4202 E. Fowler Ave. LIB618
Office Phone: +1 813 974-1467
Mobile Phone: +1 813 230-3441
Organization URL: http://rc.usf.edu

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net



