[GE users] Accounting of Parallel Jobs

Bradford, Matthew matthew.bradford@eds.com
Thu Jan 31 12:23:15 GMT 2008


 

>-----Original Message-----
>From: Reuti [mailto:reuti@staff.uni-marburg.de]
>Sent: 30 January 2008 21:48
>To: users@gridengine.sunsource.net
>Subject: Re: [GE users] Accounting of Parallel Jobs
>
>Hi,
>
>On 30.01.2008, at 11:37, Bradford, Matthew wrote:
>
>> I'm not sure whether I explained everything very well.
>>
>> Currently, when a user wants to submit an SCore job, they use 
>> something like the following within their submitted script:
>>
>> #$ -masterq masterq@headnode
>
>Ah, now I see. I thought the discussion was that the slave tasks
>might also end up on the head node?

That would only happen if the allocation rule of the parallel
environment were not set to 1.
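For reference, our PE is defined along these lines (qconf -sp output,
abridged; the PE name and slot count here are illustrative):

    $ qconf -sp score
    pe_name           score
    slots             999
    allocation_rule   1
    control_slaves    FALSE
    job_is_first_task TRUE

With allocation_rule 1 a job receives exactly one slot per host, so
the slave tasks cannot end up stacked on the head node.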

>
>> 	
>> 	scrun 8x4 <application>
>>
>> Where 8 represents the number of execution nodes they require, and 4 
>> represents the number of cores per node.
>>
>> They then request 9 nodes via SGE with
>>
>> 	qsub -pe score 9 <application_script>
>
>But this is a different/bigger issue than just accounting; see:
>
>http://gridengine.sunsource.net/issues/show_bug.cgi?id=75
>
>and all referenced issues.

Do you think there will be any decision on this in the near future?
Would it be possible to use a mechanism similar to the PBS family of
products, where, I think, a range of resources can be requested, such
as:

qsub -l select=2:ncpus=2+1:ncpus=3+3:green=True:ncpus=1 

which requests 2 nodes with 2 CPUs each, plus 1 node with 3 CPUs, plus
3 nodes with the attribute green and 1 CPU each. This would allow a
request to specify one node for the master and several different nodes
for the slaves. Any thoughts?
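The closest we seem to get today is pinning just the master task with
the -masterq switch, e.g.:

    qsub -pe score 9 -masterq masterq@headnode <application_script>

but as far as I can tell there is no way to request a different number
of cores on each slave node.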



>
>
>> Where 9 equates to the 8 execution nodes plus 1 extra for the
>> parallel job's master node (which is the head node of the cluster).
>> This is stripped out using the PE start up script, which then
>> populates the SCore machine file and launches the SCore job.
>>
>> The thought is that there should never be more than 1 parallel job 
>> running on an execution node, for performance reasons, which is why 
>> the parallel queue has only 1 slot.
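For reference, the stripping mentioned above looks roughly like this
in our PE start script (the head node name and the machine file
location are illustrative):

    #!/bin/sh
    # Build the SCore machine file from SGE's $PE_HOSTFILE, leaving
    # out the head node, which only runs the job's master task.
    HEADNODE=headnode                # illustrative name
    MACHINEFILE=$TMPDIR/scorehosts   # illustrative location
    grep -v "^$HEADNODE " $PE_HOSTFILE | cut -d" " -f1 > $MACHINEFILE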
>
>And what if you also request 4 slots in the masterq? Then it might
>of course record too much usage...



>>> integration with the SCore parallel environment, and SGE is unable
>>> to record accurate usage of a job's CPU time. We are looking at the
>
>Why is cputime not working? Is it not tightly integrated?

That's right. I don't know how to go about tightly integrating the SCore
parallel environment with SGE. 
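From what I can gather, a tight integration would mean letting SGE
start the slave tasks itself: the PE would set control_slaves TRUE
(and job_is_first_task FALSE), and the tasks would be launched through
qrsh -inherit so each execd can measure its own task. A minimal
sketch, with the task command purely illustrative:

    #!/bin/sh
    # Start one task per granted host under SGE's control, so that
    # each execd accounts that task's CPU time in the normal way.
    for host in $(cut -d" " -f1 $PE_HOSTFILE); do
        qrsh -inherit $host /path/to/score_task &   # illustrative
    done
    wait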

>
>-- Reuti
>
>
>>> ACCT_RESERVED_USAGE and SHARETREE_RESERVED_USAGE flags in the execd
>>> params, which provides an improvement to the reporting as it gives
>>> us (wallclock time x slots), but the problem is, all the SCore
>>> parallel jobs only use 1 slot per node, even though they are using
>>> all 4 cores on a node. This would be OK if every job was a parallel
>>> SCore job, but some of the jobs are simple serial jobs, which run
>>> within a serial queue, and use 1 slot per core. The accounting
>>> problem is then that a serial job using 1 slot is reported to use
>>> the same amount of CPU as a parallel job, using 1 slot but 4 cores.
>>>
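To put rough numbers on it, for a one-hour job on our 4-core nodes the
reserved-usage accounting gives:

    serial job (1 slot):              3600 s x 1 slot  =   3600 CPU-seconds
    SCore job (8 nodes, 1 slot each): 3600 s x 8 slots =  28800 CPU-seconds
    cores the SCore job really holds: 3600 s x 8 x 4   = 115200 CPU-seconds

so the parallel work is under-charged by a factor of 4 relative to the
cores it actually occupies.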
>> just submit these jobs as parallel ones too and request 4 slots. To
>> get them all on one node you need one PE with allocation_rule
>> $pe_slots and 4 slots on this machine, as there are 4 cores. If, on
>> the other hand, you need 4/8/12/... slots for this job in total, you
>> could alternatively set the allocation_rule to the fixed value 4.
>>
>> In the extreme: make this queue a parallel-only queue (qtype NONE)
>> and attach only one PE with the fixed allocation rule 4.
>>
>> -- Reuti
>>
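If I follow, the setup for the serial jobs would then look roughly
like this (the PE name here is illustrative):

    $ qconf -sp smp4
    pe_name           smp4
    slots             999
    allocation_rule   $pe_slots
    control_slaves    FALSE
    job_is_first_task TRUE

    $ qsub -pe smp4 4 serial_job.sh   # accounted as 4 slots on one node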
>>> This will cause problems when looking at a sharetree setup, as a
>>> group which tends to run serial jobs will be penalised compared to
>>> a group that tends to run parallel jobs.
>>>
>>> Is there any way of scaling the usage of the slots on a cluster
>>> queue basis, so that a single slot within a parallel queue is
>>> equivalent to 4 slots within a serial queue?
>>>
>>> Alternatively, and in the longer term, is there any intention of
>>> providing functionality where a user can request a number of nodes
>>> and then a number of cores per node, rather than the single "slots"
>>> parameter? This would mean that the current configuration we are
>>> using, where the parallel queues only offer 1 slot, could be changed
>>> so that SGE understands that a user is requesting multiple cores,
>>> which would reduce the reporting anomaly.
>>>
>>> Any advice would be much appreciated.
>>>
>>> Thanks,
>>>
>>> Mat
>>>
>>>
>>> Matthew Bradford
>>> Information Analyst
>>> Applications Services Field Operations EMEA UKIMEA RABU
>>> EDS c/o Rolls-Royce Plc, Moor Lane, PO Box 31
>>> Derby DE24 8BJ
>>>
>>> email:  matthew.bradford@eds.com
>>> Office: +44 01332 2 22059
>>>