[GE users] Resource Reservation Issue

Bradford, Matthew matthew.bradford at eds.com
Sun Sep 14 12:01:07 BST 2008


Also noticed this effect with Resource Reservation:

I have 2 nodes and 2 queues: each node has a parallel queue and a
serial queue. The queues subordinate each other so that serial and
parallel jobs cannot run on the same node at the same time.
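For reference, a mutual subordination setup like the one described can be sketched with qconf (the queue names serial.q and parallel.q are assumptions, not taken from the actual cluster):

```shell
# Sketch of the mutual subordination described above. The queue names
# serial.q and parallel.q are placeholders for the actual setup.
# subordinate_list suspends the listed queue on a host as soon as a
# job starts in this queue on that host.

# Make a job in serial.q suspend parallel.q on the same host:
qconf -mattr queue subordinate_list parallel.q serial.q

# And the reverse, so parallel jobs suspend the serial queue:
qconf -mattr queue subordinate_list serial.q parallel.q

# Verify the attribute:
qconf -sq serial.q | grep subordinate_list
```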

If I submit a serial job to comp1 node and another serial job to comp2
node, they both happily start executing, and the parallel queues on both
nodes are suspended due to subordination.

I can then submit a parallel job, such as

qsub -R y -pe my_pe 2 mpi_job.sh

If I look in the schedule file, I can see that the scheduler is not
making any reservations, I'm assuming because the two parallel queues
are suspended, and therefore not considered.
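For anyone reproducing this, the scheduler's reservation decisions can be made visible as follows (the path assumes a default cell named "default"):

```shell
# Enable scheduler monitoring so reservation decisions are written to
# the schedule file. Add MONITOR=1 to the "params" line in the
# scheduler configuration editor that opens:
qconf -msconf
#   params    MONITOR=1

# Reservation entries (RESERVING records) then appear in:
tail -f $SGE_ROOT/default/common/schedule
```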

If I kill one of the serial jobs, then the queue state of the parallel.q
on that node becomes available again, as the subordinating queue doesn't
have any jobs running in it.

If I now look at the schedule file, I would expect the scheduler to
reserve that node for the parallel job. This doesn't, however,
appear to be the case: no reservation is recorded in the schedule
file.

Is this because the scheduler will only start reserving nodes for a
specific job if, at that moment in time, there are enough active
(non-suspended) queues to meet the job's requirements, and the only
reason the job can't run is that other jobs are already occupying
those queues? The scheduler does not seem to build up the
reservation for a job as slots become available; it appears to
activate the reservation only once all the required slots are
active. If that is the case, then subordination between queues
causes problems. If, as in the test case described, there is another
job in the pending queue requesting a single slot in the serial
queue, it will be able to run because SGE has not placed a
reservation on the compute node. This means the parallel job may
never have enough resources available to start.
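The suspected behaviour can be sketched as a toy model (this is an assumption based on the observations above, not actual SGE scheduler code; the job names and the next_to_start function are made up for illustration):

```python
# Toy model of the suspected all-or-nothing reservation behaviour.
# A parallel job needing 2 slots only gets a reservation once 2
# non-suspended slots exist simultaneously; until then, a smaller
# job further down the pending queue can backfill into a freed slot.

def next_to_start(pending, free_slots, incremental_reservation):
    """Return the name of the job that starts, or None.

    pending: list of (name, slots_needed) in priority order.
    incremental_reservation: if True, a freed slot is held for the
    head-of-queue job even when it cannot yet start.
    """
    for name, need in pending:
        if need <= free_slots:
            return name            # job fits: it starts
        if incremental_reservation:
            return None            # slot held for the head job
        # all-or-nothing: nothing reserved, try jobs further down
    return None

pending = [("parallel-2slot", 2), ("serial-1slot", 1)]

# One serial job finishes, freeing a single slot.
# Observed behaviour (no incremental reservation): the lower-priority
# serial job jumps in.
print(next_to_start(pending, 1, incremental_reservation=False))
# -> serial-1slot

# Expected behaviour: the freed slot is held for the reserving
# parallel job until a second slot frees up.
print(next_to_start(pending, 1, incremental_reservation=True))
# -> None
```

In the all-or-nothing case the parallel job can starve indefinitely whenever single slots keep being refilled by serial jobs, which matches the scenario described above.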

Any thoughts?



>-----Original Message-----
>From: Bradford, Matthew [mailto:matthew.bradford at eds.com] 
>Sent: 14 September 2008 11:22
>To: users at gridengine.sunsource.net
>Subject: RE: [GE users] Resource Reservation Issue
>No, once the jobs were running we used qalter to manually 
>reduce their h_rt, to see whether this changed which 
>nodes the scheduler thinks will become available first, and 
>would therefore be used in the reservation for the reserving job.
>I think my understanding of h_rt was wrong. If I 
>now understand correctly, it is part of the request, stating 
>that the job requires this much run time, rather than an 
>attribute of the job itself.
>The scheduler will then use the h_rt value to select the 
>appropriate queue. Once the job has started, I assume that 
>altering this value has no effect.
>Maybe that isn't the best way of testing this. We have tried a 
>similar test, with a cluster full of jobs and a pending job 
>requesting a resource reservation of 4 slots. Even after killing 
>some of the running jobs in the cluster to free up 4 slots, the 
>scheduler still doesn't alter the reservation to use them, and 
>if jobs lower down in the pending queue are present, they can 
>jump in and start using the newly available slots.
>Basically, we have users complaining that their jobs, which 
>are sitting at the top of the pending queue, with reservations 
>set, are not the next jobs to execute.
>Any thoughts would be most helpful.
>>-----Original Message-----
>>From: Reuti [mailto:reuti at staff.uni-marburg.de]
>>Sent: 12 September 2008 23:02
>>To: users at gridengine.sunsource.net
>>Subject: Re: [GE users] Resource Reservation Issue
>>Am 11.09.2008 um 13:21 schrieb Bradford, Matthew:
>>> We are running SGE 6.0u8 and have a problem with how resource 
>>> reservation works.
>>> For example:
>>> We have a full cluster running mainly parallel jobs of 
>various sizes 
>>> from 1 node to 16 nodes.
>>> We allow only 2 jobs to be run with a Reserve flag on them.
>>> Users aren't specifying an h_rt, and the default runtime is
>>set to 480
>>> hours.
>>> Occasionally, we set important jobs with a reserve flag to prevent 
>>> resource starvation, and we push the job to the top of the pending 
>>> queue.
>>> What we appear to get when we switch on monitoring is the scheduler 
>>> selecting the nodes for the reserved job, but then not 
>amending that 
>>> selection even when there are free nodes in the system.
>>> During testing of this we noticed no difference when we
>>explicitly set
>>> the h_rt to 480 hours for all jobs, and then reduce the h_rt for 
>>> specific jobs. We thought that the scheduler would recalculate and 
>>> select nodes where it knows that jobs are nearly finished.
>>what do you mean by "reducing" - the waiting jobs? With qalter while 
>>they are waiting?
>>-- Reuti
>>To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
>>For additional commands, e-mail: users-help at gridengine.sunsource.net
