[GE users] problem with job distributions
mhanby
mhanby at uab.edu
Tue Mar 10 20:32:36 GMT 2009
What type of MPI are the users utilizing? I see this behavior with our
LAM MPI jobs. Others can provide more insight, but I believe this will
turn out to be an issue with loosely integrated parallel libraries. Do
a search for 'tight integration' for more detail.
What is probably happening is that your user is using qdel to delete
their job; the master process for the job receives the signal to halt
but doesn't clean up the child tasks. As far as SGE knows, the job has
been killed and all of its slots are available again.
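The difference, roughly: with a loosely integrated PE the MPI slave
processes are started over rsh/ssh behind SGE's back, so only the
master task sits under an sge_shepherd and qdel can only reap that one
process. With tight integration the slaves are started through
qrsh -inherit and sge_execd tracks them as well, so a qdel cleans up
the whole job. As a sketch only (attribute names as in 'man sge_pe' for
a 6.x install; the PE name is just taken from your qsub example, and
your MPI start-up would still need to call qrsh for this to work), a
tightly integrated PE looks something like:

    $ qconf -sp queue-a
    pe_name            queue-a
    slots              64
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min

control_slaves TRUE is the key line; it lets sge_execd supervise (and
kill) the slave tasks instead of leaving that to the MPI library.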
Mike
-----Original Message-----
From: mad [mailto:margaret_Doll at brown.edu]
Sent: Tuesday, March 10, 2009 3:05 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] problem with job distributions
On Mar 10, 2009, at 3:55 PM, mad wrote:
> On Mar 10, 2009, at 3:35 PM, mad wrote:
>
>> I have compute nodes, each of which has eight processors. I have
>> assigned eight compute nodes to one of my queues. The compute nodes
>> are listed under groupa, which is on the hostlist of the group-a
>> queue. In the General Configuration for the group-a queue, I have
>> slots set to 8. When I look at the Cluster Queues, queue group-a has
>> 64 total slots. Currently 52 slots are shown as being used in qmon.
>>
>> However, when I execute "qstat -f | grep group-a", I get:
>>
>> group-a at compute-0-0.local BIP 8/8 8.08 lx26-amd64
>> group-a at compute-0-1.local BIP 8/8 8.06 lx26-amd64
>> group-a at compute-0-10.local BIP 8/8 11.12 lx26-amd64
>> group-a at compute-0-11.local BIP 2/8 10.22 lx26-amd64
>> group-a at compute-0-12.local BIP 6/8 7.16 lx26-amd64
>> group-a at compute-0-13.local BIP 4/8 4.78 lx26-amd64
>> group-a at compute-0-2.local BIP 8/8 8.12 lx26-amd64
>> group-a at compute-0-3.local BIP 8/8 15.13 lx26-amd64 a
>>
>> The total number of slots being used is 52, which agrees with qmon.
>> However, the load shows 59 jobs.
>>
>> If I ssh into compute-0-3, I see 15 jobs being run by one user.
>> All jobs except one are using 50% of a CPU.
>>
>> My users say they are using variations of
>>
>> qsub -pe queue-a 20 scriptp
>>
>>
>> Why would the distribution of jobs be so out of whack? I have been
>> running this cluster with this version of the system for about six
>> months now. The only time the distribution was uneven was before one
>> of my users learned to use qsub properly.
>>
>>
>>
>> Running ROCKS 5.3 with Red Hat kernel 2.6.18-53.1.14.el5
>
> I believe I found the problem. qmon shows that one of the users has
> two jobs running. When I ssh into the compute nodes and look at her
> jobs, she has jobs submitted on three different days.
>
> I am assuming that she did not successfully delete one of the jobs
> that she started. How do I catch these jobs other than keeping a
> close eye on the queues with "qstat -f"? Is there some way of
> sending me an email when someone tries to delete a queued job,
> so that I can see whether the deletion was successful?
>
> Why did the job continue, but not register in qmon?
qstat -g c shows:
queue-a 1.10 52 12 64 8 0
At one time the CQLOAD was up to 1.22.
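A rough way to catch these by hand, sketched below, is to compare what
SGE has booked for the user with what is actually running on each
host. In the sketch, "username" stands in for the user in question,
the node names are the ones from the qstat -f output above, and
passwordless ssh to the nodes is assumed:

    qstat -u username
    for n in compute-0-0 compute-0-1 compute-0-2 compute-0-3; do
        echo "== $n =="
        # processes the user actually has on the node, with parent PID
        # and elapsed run time
        ssh $n ps -u username -o pid,ppid,etime,comm
    done

Anything whose elapsed time is longer than every job qstat reports for
that user, or whose parent is init (PPID 1) rather than an
sge_shepherd, is a likely leftover from an incompletely deleted job and
can be killed by hand.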