[GE users] problem with job distributions

mad margaret_Doll at brown.edu
Tue Mar 10 20:05:24 GMT 2009


On Mar 10, 2009, at 3:55 PM, mad wrote:

> On Mar 10, 2009, at 3:35 PM, mad wrote:
>
>> I have compute nodes, each of which has eight processors.  I have
>> assigned eight compute nodes to one of my queues.  The compute nodes
>> are listed as groupa, which is on the hostlist of the group-a queue.  In
>> the General Configuration for the group-a queue, I have slots listed as
>> 8.  When I look at the Cluster Queues, queue group-a has 64 total
>> slots.  Currently 52 slots are shown as being used in qmon.
>>
>> However, when I execute "qstat -f | grep group-a", I get
>>
>> group-a@compute-0-0.local       BIP   8/8       8.08     lx26-amd64
>> group-a@compute-0-1.local       BIP   8/8       8.06     lx26-amd64
>> group-a@compute-0-10.local      BIP   8/8       11.12    lx26-amd64
>> group-a@compute-0-11.local      BIP   2/8       10.22    lx26-amd64
>> group-a@compute-0-12.local      BIP   6/8       7.16     lx26-amd64
>> group-a@compute-0-13.local      BIP   4/8       4.78     lx26-amd64
>> group-a@compute-0-2.local       BIP   8/8       8.12     lx26-amd64
>> group-a@compute-0-3.local       BIP   8/8       15.13    lx26-amd64    a
>>
>> The total number of slots in use is 52, which agrees with qmon.
>> However, the load shows 59 jobs.
>>
>> If I ssh into compute-0-3, I see 15 jobs being run by one user.
>> All jobs except one are using 50% of a CPU.
>>
>> My users say they are using variations of
>>
>> qsub -pe queue-a 20 scriptp
>>
>>
>> Why would the distribution of jobs be so out of whack?  I have been
>> running this cluster with this version of the system for about six
>> months now.  The only time the distribution was uneven
>> was before one of my users learned to use qsub properly.
>>
>>
>>
>> Running ROCKS 5.3 with Redhat 2.6.18-53.1.14.el5
>
> I believe I found the problem.  qmon shows that one of the users
> has two jobs running.  When I ssh into the compute nodes and
> look at her jobs, she has jobs submitted on three different days.
>
> I am assuming that she did not successfully delete one of the jobs
> that she started.  How do I catch these jobs, other than by keeping a
> close eye on the queues with "qstat -f"?  Is there some way of
> sending me an email when someone tries to delete a queued job,
> so that I can see whether the deletion was successful?
>
> Why did the job continue, but not register in qmon?
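
One rough way to spot such nodes without reading "qstat -f" by eye is to
compare each host's load average with the slots Grid Engine believes are
in use. The sketch below is in Python and only parses text; it assumes
the queue@host, used/total, load_avg column layout of the listing quoted
above. The function names and the margin of 1.0 are made up for
illustration, not anything Grid Engine provides, and a high load can
have causes other than leftover job processes.

```python
import re

# Queue-instance lines in the layout "queue@host qtype used/total load_avg arch [state]".
QSTAT_F = """\
group-a@compute-0-0.local       BIP   8/8       8.08     lx26-amd64
group-a@compute-0-1.local       BIP   8/8       8.06     lx26-amd64
group-a@compute-0-10.local      BIP   8/8       11.12    lx26-amd64
group-a@compute-0-11.local      BIP   2/8       10.22    lx26-amd64
group-a@compute-0-12.local      BIP   6/8       7.16     lx26-amd64
group-a@compute-0-13.local      BIP   4/8       4.78     lx26-amd64
group-a@compute-0-2.local       BIP   8/8       8.12     lx26-amd64
group-a@compute-0-3.local       BIP   8/8       15.13    lx26-amd64    a
"""

LINE = re.compile(r"^(\S+)@(\S+)\s+\S+\s+(\d+)/(\d+)\s+([\d.]+)\s+\S+")


def parse(text):
    """Return (host, used_slots, total_slots, load_avg) for each queue instance."""
    rows = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            _queue, host, used, total, load = m.groups()
            rows.append((host, int(used), int(total), float(load)))
    return rows


def oversubscribed(rows, margin=1.0):
    """Hosts whose load exceeds the slots in use by more than `margin` (hypothetical threshold)."""
    return [(host, used, load) for host, used, _total, load in rows
            if load - used > margin]


rows = parse(QSTAT_F)
used_slots = sum(r[1] for r in rows)
total_slots = sum(r[2] for r in rows)
print(used_slots, total_slots)  # 52 64
for host, used, load in oversubscribed(rows):
    print(f"{host}: {used} slots in use, load {load}")
```

On the listing above this flags compute-0-10, compute-0-11, compute-0-12
and compute-0-3. In practice the text would come from the real "qstat -f"
output rather than a pasted string.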

qstat -g c  shows

queue-a  1.10     52     12     64      8      0

At one time the CQLOAD was up to 1.22.
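
For reading that summary line, here is a small Python sketch. The field
names are an assumption: I am taking the columns to be CQLOAD, USED,
AVAIL, TOTAL, aoACDS and cdsuE, as in the "qstat -g c" header on this
version, and other versions may add columns. I am also assuming the
cluster queue load is a load average normalized per processor (worth
checking against the qstat man page), in which case multiplying it by
the total slots gives a rough idea of how many busy processes there are.

```python
from collections import namedtuple

# Column names are assumed from the "qstat -g c" header; check them on your system.
Summary = namedtuple("Summary", "queue cqload used avail total aoACDS cdsuE")


def parse_summary(line):
    """Split one 'qstat -g c' data line into named fields."""
    f = line.split()
    return Summary(f[0], float(f[1]), *map(int, f[2:7]))


s = parse_summary("queue-a  1.10     52     12     64      8      0")

# Used and available slots should account for the total.
print(s.used + s.avail == s.total)

# Rough estimate of busy processes, assuming cqload is normalized per processor.
print(round(s.cqload * s.total, 1))
```

If that assumption holds, the estimate comes out well above the 52 slots
Grid Engine reports as used, which would fit with processes running
outside its accounting.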



------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=126877

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].


