[GE users] Accounting of Parallel Jobs

Bradford, Matthew matthew.bradford at eds.com
Wed Jan 30 10:37:14 GMT 2008


Reuti,

I'm not sure whether I explained everything very well.

Currently, when a user wants to submit an SCore job, they use something
like the following within their submitted script:

#$ masterq=masterq at headnode
	
	scrun 8x4 <application>

Where 8 represents the number of execution nodes they require, and 4
represents the number of cores per node.

They then request 9 nodes via SGE with

	qsub -pe score 9 <application_script>
 
Where 9 equates to the 8 execution nodes plus 1 extra for the parallel
job's master node (which is the head node of the cluster). The PE startup
script strips out this extra entry, then populates the SCore machine file
and launches the SCore job.
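
For illustration only, here is a minimal Python sketch of the stripping
step described above. It assumes the usual SGE $PE_HOSTFILE layout (host,
slots, queue instance, processor range per line). The function name, the
sample queue names and the one-hostname-per-line output are assumptions,
not our actual startup script or the real SCore machine file format.

```python
def execution_hosts(pe_hostfile_text, master_host):
    """Return the hosts listed in an SGE PE hostfile, minus the master host.

    Each non-empty line is expected to look like:
        <host> <slots> <queue instance> <processor range>
    Only the first field (the host name) is used here.
    """
    hosts = []
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        if host == master_host:
            continue  # drop the extra slot granted on the head node
        hosts.append(host)
    return hosts


# Hypothetical hostfile for a small job; queue names are made up.
sample = """headnode 1 masterq@headnode UNDEFINED
node01 1 parallel.q@node01 UNDEFINED
node02 1 parallel.q@node02 UNDEFINED
"""

# One host per line, as a stand-in for whatever SCore really expects.
print("\n".join(execution_hosts(sample, "headnode")))
```

A real script would read the file named by $PE_HOSTFILE and write the
result wherever scrun looks for its machine file.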

The thought is that there should never be more than 1 parallel job
running on an execution node, for performance reasons, which is why the
parallel queue has only 1 slot. 

The parallel queue only accepts parallel jobs, and there is a separate
queue for serial jobs, which has the same number of slots as there are
cores on the node. To prevent serial jobs and parallel jobs running on
the same node, the queues are subordinates of each other.

It might be possible to use the different allocation rules, as you
suggest, and to modify the PE startup scripts to provide the machine file
in the correct format for SCore. However, that would also cause the master
node to use 4 slots, which is undesirable. It would also be a static
configuration: if the user wanted to request (scrun 8x2 <application>),
we'd need another parallel environment, which is a possibility. I'll need
to investigate this further.
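
To make the numbers concrete, here is a rough Python sketch of the
arithmetic, taking reserved usage to be simply (wallclock time x slots) as
described in my original mail. The helper names and the one-hour wallclock
are assumptions for illustration; this is not how SGE computes its
accounting records internally.

```python
def parse_geometry(geometry):
    """Split an scrun-style geometry such as '8x4' into (nodes, cores_per_node)."""
    nodes, cores = geometry.lower().split("x")
    return int(nodes), int(cores)


def reserved_usage(wallclock_seconds, slots):
    """Simplified reserved usage: wallclock time multiplied by granted slots."""
    return wallclock_seconds * slots


nodes, cores = parse_geometry("8x4")
wallclock = 3600  # assumed one-hour job

# Current setup: 1 slot per execution node plus 1 slot on the head node.
current = reserved_usage(wallclock, nodes + 1)            # 9 slots

# Fixed allocation rule of 4: every host, including the master, takes 4 slots.
fixed_rule = reserved_usage(wallclock, (nodes + 1) * cores)  # 36 slots

# Cores actually busy on the execution nodes, for comparison.
cores_in_use = wallclock * nodes * cores                   # 32 cores

# A serial job of the same length occupies 1 slot and 1 core.
serial = reserved_usage(wallclock, 1)

print(current, fixed_rule, cores_in_use, serial)
```

With these assumptions the current setup under-reports the parallel job (9
slot-hours for 32 busy cores), while the fixed rule over-reports it slightly
(36 slot-hours) because of the master node's 4 slots.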

Thanks,

Mat



-----Original Message-----
From: Reuti [mailto:reuti at staff.uni-marburg.de] 
Sent: 29 January 2008 22:13
To: users at gridengine.sunsource.net
Subject: Re: [GE users] Accounting of Parallel Jobs

Hi,

On 29.01.2008 at 22:24, Bradford, Matthew wrote:

> integration with the SCore parallel environment, and SGE is unable to 
> record accurate usage of a job's CPU time. We are looking at the 
> ACCT_RESERVED_USAGE and SHARETREE_RESERVED_USAGE flags in the execd 
> params, which provides an improvement to the reporting as it gives us 
> (Wallclock time X slots), but the problem is, all the SCore parallel 
> jobs only use 1 slot per node, even though they are using all 4 cores 
> on a node. This would be OK if every job was a parallel SCore job, but
> some of the jobs are simple serial jobs, which run within a serial
> queue, and use 1 slot per core. The accounting problem is then that a
> serial job using 1 slot is reported to use the same amount of CPU as a
> parallel job, using 1 slot but 4 cores.
>
Just submit these jobs as parallel ones too and request 4 slots. To get
them all on one node you need one PE with allocation_rule $PE_SLOTS and
4 slots on this machine, as there are 4 cores. If, on the other hand, you
need 4/8/12/... slots for this job in total, you could alternatively set
the allocation_rule to the fixed value 4.

In the extreme: make this queue a parallel-only queue (qtype NONE) and
attach only one PE with fixed allocation rule 4.

-- Reuti

> This will cause problems when looking at a sharetree set up, as a 
> group which tends to run serial jobs will be penalised compared to a 
> group that tends to run parallel jobs.
>
> Is there any way of scaling the usage of the slots on a cluster queue 
> basis, so that a single slot within a parallel queue is equivalent to 
> 4 slots within a serial queue.
>
> Alternatively, and in the longer term, is there any intention of 
> providing the functionality where a user can request number of nodes, 
> and then number of cores per node, rather than the single "slots" 
> parameter. This would mean that the current configuration that we are 
> using, where the parallel queues only offer 1 slot, could be changed 
> so that SGE understands that a user is requesting multiple cores, and 
> would reduce the reporting anomaly.
>
> Any advice would be much appreciated.
>
> Thanks,
>
> Mat
>
>
> Matthew Bradford
> Information Analyst
> Applications Services Field Operations EMEA UKIMEA RABU EDS c/o 
> Rolls-Royce Plc, Moor Lane PO Box 31 Derby
> DE24 8BJ
>
> email:  matthew.bradford at eds.com
> Office: +44 01332 2 22059


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
For additional commands, e-mail: users-help at gridengine.sunsource.net





