[GE users] Re: [GE users] resource reservation not working

reuti reuti at staff.uni-marburg.de
Mon Aug 24 13:03:54 BST 2009



Hiho,

On 24.08.2009 at 11:56, matbradford wrote:

> >> I think I was mistaken about reservation not working --- It just
> >> doesn't work the way I thought it would. What I expected was that,
> >> as resources (slots) came free, the scheduler would set them aside
> >> for the reserving job until it had accumulated enough to run it.
> >> Instead what happens is that the scheduler picks an arbitrary list
> >> of nodes that may *or may not* have free slots, and sets those
> >> slots aside as they come free. If slots come free that are *not*
> >> on this preselected list, they are cheerfully assigned to other
> >> jobs, even those of lower priority than the reserving job.
> >
> > Could it be that these other nodes were actually not suited for this
> > large parallel job? Reason could be 'oneper' is not contained in
> > "pe_list" for the corresponding queue instances, reason could be
> > resource requests with your job that can be satisfied only at a
> > subset of the queue instances, reason could be load thresholds with
> > these queue instances etc.
> >
> >>
> >> The indirect evidence of this was right there in the monitor file
> >> ($SGE_ROOT/$SGE_CELL/common/schedule) when I had it turned on, but
> >> I looked right past it: The reserving job has a list of queue
> >> instances associated with it:
> >>
> >> 3568:1:RESERVING:1190724115:660:P:oneper:slots:20.000000
> >> 3568:1:RESERVING:1190724115:660:Q:all.q@cl023:slots:1.000000
> >> 3568:1:RESERVING:1190724115:660:Q:all.q@cl026:slots:1.000000
> >>
> >> ...and the list never changes! I suspect now that what happened
> >> earlier was that a node *not* on the reserved list came free, and
> >> the job I thought was violating the reservation policy was
> >> scheduled there. That's certainly what happened with some jobs
> >> that were scheduled last night.
> >>
> >> I suppose there ought to be a request-for-enhancement about this:
> >> If the scheduler were smart enough to glom resources *as they
> >> became available*, rather than preselecting them (who knows how?),
> >> then reservation would probably be a more effective function.
> >
> > Jobs' resource reservations are done anew with each scheduling
> > interval. So actually the RFE is already implemented ;-)
> >
> > Regards,
> > Andreas
>
> Was this issue actually resolved? We are seeing the same behaviour.
>
> A user has submitted a job requiring 16 slots and the job is sitting
> at the top of the pending queue; the scheduler has reserved 16 slots
> for this job. At this point there were only 3 available nodes, and
> the scheduler correctly selected the free nodes, plus 13 other nodes,
> for the reservation. No other small jobs were sitting in the pending
> queue at this time. Subsequently, other smaller jobs were submitted
> without a reservation request against them, as they only required 1
> node. As running jobs finished, I would have expected the scheduler
> to begin reserving the freed-up nodes for the 16-slot job at the top
> of the pending queue. This isn't the case: the reserved list of nodes
> isn't changing, and the smaller jobs are still hopping over the large
> job.
>
> Any thoughts?
>
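One way to see what the scheduler is really reserving from one run to
the next is to switch its monitoring on and watch the file quoted
above. A minimal sketch, assuming the usual cell layout:

    qconf -msconf
        params                MONITOR=1

    tail -f $SGE_ROOT/$SGE_CELL/common/schedule

The RESERVING lines are rewritten with every scheduling interval, so
you can check whether the set of queue instances earmarked for the
16-slot job ever picks up the freed nodes.
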
Do you request h_rt or s_rt for all the jobs?
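
Reservation only works out when the scheduler can predict when the
running jobs will end, so either every job should carry a runtime
request or default_duration in the scheduler configuration should be
set to something finite (and max_reservation must be greater than 0,
otherwise -R y is ignored). A rough sketch with made-up runtimes,
reusing the 'oneper' PE name from the quoted thread:

    # small jobs: runtime request, no reservation
    qsub -l h_rt=1:00:00 small.sh

    # the 16-slot job: reservation plus runtime request
    qsub -R y -pe oneper 16 -l h_rt=8:00:00 big.sh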

==

Or could it be related to this issue:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=2761

Do you have any custom complexes?
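
If you are not sure, qconf -sc dumps the complete complex
configuration; anything beyond the stock entries (arch, h_rt, slots,
mem_free and so on) is site-defined, and the "requestable" and
"consumable" columns show how each one is handled:

    qconf -sc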

-- Reuti
