[GE users] problem with job distributions

mad margaret_Doll at brown.edu
Tue Mar 10 19:55:37 GMT 2009

On Mar 10, 2009, at 3:35 PM, mad wrote:

> I have compute nodes, each of which has eight processors.  I have
> assigned eight compute nodes to one of my queues.  The compute nodes
> are listed as groupa, which is on the hostlist of the group-a queue.  In
> the General Configuration for the group-a queue, I have slots listed as
> 8.  When I look at the Cluster Queues, queue group-a has 64 total
> slots.  Currently 52 slots are shown as being used in qmon.
> However, when I execute "qstat -f | grep group-a", I get
> group-a@compute-0-0.local          BIP   8/8       8.08     lx26-amd64
> group-a@compute-0-1.local          BIP   8/8       8.06     lx26-amd64
> group-a@compute-0-10.local         BIP   8/8       11.12    lx26-amd64
> group-a@compute-0-11.local         BIP   2/8       10.22    lx26-amd64
> group-a@compute-0-12.local         BIP   6/8       7.16     lx26-amd64
> group-a@compute-0-13.local         BIP   4/8       4.78     lx26-amd64
> group-a@compute-0-2.local          BIP   8/8       8.12     lx26-amd64
> group-a@compute-0-3.local          BIP   8/8       15.13    lx26-amd64    a
> The total number of slots in use is 52, which agrees with qmon.
> However, the load shows 59 jobs.
> If I ssh into compute-0-3, I see 15 jobs being run by one user.
> All jobs except one are using 50% of a CPU.
> My users say they are using variations of
> qsub -pe queue-a 20 scriptp
> Why would the distribution of jobs be so out of whack?  I have been
> running this cluster with this version of the system for about six
> months now.  The only other time the distribution was uneven was
> before one of my users learned to use qsub properly.
> Running ROCKS 5.3 with Red Hat kernel 2.6.18-53.1.14.el5

I believe I found the problem.  qmon shows that one of the users
has two jobs running.  When I ssh into the compute nodes and
look at her jobs, she has jobs submitted on three different days.

I am assuming that she did not successfully delete one of the jobs
that she started.  How do I catch these jobs, other than by keeping a
close eye on the queues with "qstat -f"?  Is there some way of
sending me an email when someone tries to delete a queued job,
so that I can see if the deletion was successful?
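
One way to avoid watching "qstat -f" by hand might be a small script
that compares each host's load average with the slots the scheduler
thinks are in use.  The following is only a rough sketch in Python, not
a tested monitoring tool.  The function names and the margin value are
made up for illustration, and it assumes the "used/total" slot column
and "queue@host" naming that qstat -f prints on this cluster.  A high
load is only a hint, since it also counts processes that never went
through the scheduler.

```python
import subprocess

# The queue-instance lines from the qstat -f output quoted above.
SAMPLE = """\
group-a@compute-0-0.local          BIP   8/8       8.08     lx26-amd64
group-a@compute-0-1.local          BIP   8/8       8.06     lx26-amd64
group-a@compute-0-10.local         BIP   8/8       11.12    lx26-amd64
group-a@compute-0-11.local         BIP   2/8       10.22    lx26-amd64
group-a@compute-0-12.local         BIP   6/8       7.16     lx26-amd64
group-a@compute-0-13.local         BIP   4/8       4.78     lx26-amd64
group-a@compute-0-2.local          BIP   8/8       8.12     lx26-amd64
group-a@compute-0-3.local          BIP   8/8       15.13    lx26-amd64    a
""".splitlines()


def parse_qstat_line(line):
    """Return (queue_instance, used, total, load) or None for other lines."""
    fields = line.split()
    if len(fields) < 5 or "@" not in fields[0] or "/" not in fields[2]:
        return None
    try:
        # Take the last two numbers so a "resv/used/total" column also works.
        used, total = (int(n) for n in fields[2].split("/")[-2:])
        load = float(fields[3])
    except ValueError:
        # For example, a host that reports no load value.
        return None
    return fields[0], used, total, load


def total_used(lines):
    """Sum the used slots over all queue-instance lines."""
    return sum(p[1] for p in map(parse_qstat_line, lines) if p is not None)


def overloaded(lines, margin=1.5):
    """List instances whose load exceeds used slots by more than margin.

    margin is an arbitrary tolerance chosen for this sketch.
    """
    flagged = []
    for parsed in map(parse_qstat_line, lines):
        if parsed is None:
            continue
        name, used, _total, load = parsed
        if load > used + margin:
            flagged.append((name, used, load))
    return flagged


def current_qstat_lines():
    """Run qstat -f on the head node and return its output lines."""
    result = subprocess.run(["qstat", "-f"], capture_output=True,
                            text=True, check=True)
    return result.stdout.splitlines()
```

On the sample above, total_used(SAMPLE) gives the same 52 slots that
qmon reports, and overloaded(SAMPLE) flags compute-0-10, compute-0-11
and compute-0-3.  Run from cron with current_qstat_lines() as input,
the flagged list could be mailed to the administrator.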

Why did the job continue, but not register in qmon?

