[GE users] problem with job distributions

mhanby mhanby at uab.edu
Tue Mar 10 20:32:36 GMT 2009


What type of MPI are the users utilizing? I see this behavior with our
LAM MPI jobs. Others can provide more insight, but I believe this will
turn out to be an issue with loosely integrated parallel libraries. Do
a search for 'tight integration' for more detail.
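For comparison, here is a rough sketch of what a tightly integrated PE
definition can look like. This assumes an SGE 6.x install; the PE name
"lam_tight" and the script paths are only placeholders, so check what
your users actually request with "qconf -spl" and "qconf -sp <pe_name>":

  $ qconf -sp lam_tight
  pe_name            lam_tight
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
  stop_proc_args     /opt/sge/mpi/stopmpi.sh
  allocation_rule    $round_robin
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min

The important parts are control_slaves TRUE and launching the slave
tasks through qrsh -inherit (which the -catch_rsh wrapper arranges);
that is what lets SGE account for, and kill, every process belonging
to the job.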

What is probably happening is that your user is deleting their job with
qdel; the master process for the job receives the signal to halt but
doesn't clean up its child tasks. As far as SGE knows the job has been
killed and all of its slots are available again, while the orphaned
children keep running on the nodes.
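A quick way to confirm that (just a sketch; substitute your real queue
and node names) is to compare what SGE thinks is running on a node with
what is actually there. Properly managed tasks hang off an sge_shepherd
process, while orphans are typically re-parented to init (PPID 1):

  # what SGE believes is running on the node
  qstat -f -q 'group-a@compute-0-3.local'

  # what is actually running there
  ssh compute-0-3 'ps -eo pid,ppid,user,args | egrep "sge_shepherd|mpirun|lamd" | grep -v egrep'

Anything MPI-related with no sge_shepherd ancestor is a leftover SGE no
longer knows about, so it has to be killed by hand on the node; qdel
can't reach it once the job has disappeared from SGE's bookkeeping.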

Mike

-----Original Message-----
From: mad [mailto:margaret_Doll at brown.edu] 
Sent: Tuesday, March 10, 2009 3:05 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] problem with job distributions

On Mar 10, 2009, at 3:55 PM, mad wrote:

> On Mar 10, 2009, at 3:35 PM, mad wrote:
>
>> I have compute nodes, each of which has eight processors.  I have
>> assigned eight compute nodes to one of my queues.  The compute nodes
>> are listed in the host group groupa, which is on the hostlist of the
>> group-a queue.  In the General Configuration for the group-a queue,
>> I have slots set to 8.  When I look at the Cluster Queues, queue
>> group-a has 64 total slots.  Currently 52 slots are shown as being
>> used in qmon.
>>
>> However, when I execute "qstat -f | grep group-a", I get:
>>
>> group-a@compute-0-0.local       BIP   8/8       8.08     lx26-amd64
>> group-a@compute-0-1.local       BIP   8/8       8.06     lx26-amd64
>> group-a@compute-0-10.local      BIP   8/8       11.12    lx26-amd64
>> group-a@compute-0-11.local      BIP   2/8       10.22    lx26-amd64
>> group-a@compute-0-12.local      BIP   6/8       7.16     lx26-amd64
>> group-a@compute-0-13.local      BIP   4/8       4.78     lx26-amd64
>> group-a@compute-0-2.local       BIP   8/8       8.12     lx26-amd64
>> group-a@compute-0-3.local       BIP   8/8       15.13    lx26-amd64    a
>>
>> The total number of slots in use is 52, which agrees with qmon.
>> However, the load shows 59 jobs.
>>
>> If I ssh into compute-0-3, I see 15 jobs being run by one user.
>> All of the jobs except one are using 50% of a CPU.
>>
>> My users say they are using variations of
>>
>> qsub -pe queue-a 20 scriptp
>>
>>
>> Why would the distribution of jobs be so out of whack?  I have been
>> running this cluster with this version of the system for about six
>> months now.  The only other time the distribution was uneven was
>> before one of my users learned to use qsub properly.
>>
>>
>>
>> Running ROCKS 5.3 with Red Hat, kernel 2.6.18-53.1.14.el5
>
> I believe I found the problem.  One of the users shows in qmon as
> having two jobs running.  When I ssh into the compute nodes and
> look at her jobs, she has jobs submitted on three different days.
>
> I am assuming that she did not successfully delete one of the jobs
> that she started.  How do I catch these jobs other than by keeping a
> close eye on the queues with "qstat -f"?  Is there some way of
> sending me an email when someone tries to delete a queued job,
> so that I can see whether the deletion was successful?
>
> Why did the job continue running, but not register in qmon?

qstat -g c  shows

queue-a  1.10     52     12     64      8      0

at one point the CQLOAD was as high as 1.22
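For reference, assuming the usual 6.x "qstat -g c" column layout
(CQLOAD, USED, AVAIL, TOTAL, aoACDS, cdsuE), that line would read as:
average load 1.10, 52 slots used, 12 available, 64 total, 8 slots on an
instance in an alarm/suspended/disabled state (the compute-0-3 entry
flagged 'a' above) and 0 in error.  The 12 available slots match the
free slots on compute-0-11, -12 and -13 (6 + 2 + 4).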



------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=126886

To unsubscribe from this discussion, e-mail: [users-unsubscribe at gridengine.sunsource.net].



More information about the gridengine-users mailing list