[GE users] array jobs mess up fair share ??

Chris Rudge chris.rudge at astro.le.ac.uk
Tue Jul 8 13:36:21 BST 2008


Andreas,

Hopefully the following information taken from sge_share_mon output will
start to give a clue as to what's going wrong.

My understanding is that the "cpu" value is the cpu time used in
seconds. In a 15 second interval between reports from sge_share_mon, if
a project is efficiently using 10 cpus on the cluster, then the cpu
value would increase by 150 (15 seconds * 10 cpus).

For a project not using array jobs, I can see that this is indeed true -
give or take an allowance for inefficient parallel jobs. In 15 seconds,
the cpu value increases by about 1200 and the project is using about 80
cpus on the cluster.
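As a quick sanity check of that arithmetic, here is a minimal Python sketch. The function name implied_cpus is purely illustrative and not part of SGE. It assumes the "cpu" column is accumulated cpu seconds and that the jobs are fully efficient.

```python
def implied_cpus(cpu_delta, interval_seconds):
    """Average number of busy cpus implied by a change in the cpu value.

    Assumes cpu_delta is in cpu seconds and the jobs are fully efficient.
    """
    return cpu_delta / interval_seconds


# The idealised example: 10 cpus for 15 seconds.
print(implied_cpus(150, 15))
# The project without array jobs: about 1200 per 15 second report.
print(implied_cpus(1200, 15))
```

The same division can be applied to any two consecutive sge_share_mon reports for a project.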

For the project using array jobs, in a 15 second period the cpu value
increases by around 60,000. That would imply about 4,000 cpus
(60,000 / 15), which is obviously wrong on our 264 cpu cluster. I can
see that the jobs for this project are:

 # qstat -u ajh67,hb100
job-ID  prior   name       user         state submit/start at     queue            slots ja-task-ID 
----------------------------------------------------------------------------------------------
 901068 1.37111 fsi_30_ajh ajh67        r     07/07/2008 12:15:45 default.q@comp24     1
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp24     1 9031
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp24     1 9032
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp24     1 9033
 901067 1.37111 fsi_23_ajh ajh67        r     07/07/2008 12:15:45 default.q@comp36     1
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp36     1 9034
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp36     1 9035
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp36     1 9036
 901172 1.36839 run_job.sh hb100        r     07/08/2008 11:16:21 default.q@comp65     1 9037
 901172 1.36839 run_job.sh hb100        r     07/08/2008 11:16:21 default.q@comp65     1 9038
 901172 1.36839 run_job.sh hb100        r     07/08/2008 11:17:06 default.q@comp65     1 9039
 901172 1.36839 run_job.sh hb100        r     07/08/2008 11:17:36 default.q@comp65     1 9040
 901172 1.26780 run_job.sh hb100        qw    07/07/2008 17:23:28                      1 9041-9060:1

i.e. two serial jobs for user ajh67 and an array job with 30 tasks for
user hb100, of which 10 are running. That is 12 slots in use, nowhere
near 4,000. Note that these aren't the last 30 tasks of a 9060-task
array job; they are the 30 tasks of an array job with task range
9031-9060.
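
The slot count can be cross-checked with a rough Python sketch like the one below. It is not part of SGE: running_slots and the trimmed SAMPLE are illustrative only. It assumes the default qstat column layout shown above, with queue instances written as queue@host.

```python
def running_slots(qstat_text):
    """Sum the slots column over qstat entries in state 'r'."""
    total = 0
    for line in qstat_text.splitlines():
        fields = line.split()
        # Running entries have the fields:
        # job-ID prior name user state date time queue slots [ja-task-ID]
        if len(fields) >= 9 and fields[0].isdigit() and fields[4] == "r":
            total += int(fields[8])
    return total


# Trimmed sample: one serial job, one running array task, one pending range.
SAMPLE = """\
 901068 1.37111 fsi_30_ajh ajh67        r     07/07/2008 12:15:45 default.q@comp24     1
 901172 1.36839 run_job.sh hb100        r     07/07/2008 17:23:37 default.q@comp24     1 9031
 901172 1.26780 run_job.sh hb100        qw    07/07/2008 17:23:28                      1 9041-9060:1
"""

slots = running_slots(SAMPLE)
print(slots)       # running slots in the sample
print(slots * 15)  # expected cpu increase per 15 second report, if fully efficient
```

Run over the full listing, the sum is 12 slots, so the cpu value should grow by roughly 180 per report rather than 60,000.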

Regards,
Chris


On Mon, 2008-07-07 at 18:08 +0200, Andreas.Haas at Sun.COM wrote:
> Hi Chris,
> 
> please find my reply in
> 
>     http://gridengine.sunsource.net/issues/show_bug.cgi?id=2298
> 
> 
> Regards,
> Andreas
> 

-- 
Dr Chris Rudge
chris.rudge at astro.le.ac.uk

Research Computing Manager
Dept of Physics & Astronomy
University of Leicester
LE1 7RH

web.  www.ukaff.ac.uk
Tel.  +44 (0)116 2523331
Fax.  +44 (0)116 2231283
Mob.  +44 (0)794 1379420

