[GE users] problem with job distributions

mhanby mhanby at uab.edu
Tue Mar 10 21:33:41 GMT 2009


OpenMPI should be aware of Grid Engine assuming it was compiled
properly, so it shouldn't suffer from that problem.

If you compile it on the head node, I believe 1.2.* will detect that
Grid Engine is installed and build in support automatically. For 1.3
and later, you have to pass the --with-sge switch to the configure
script.

http://www.open-mpi.org/faq/?category=building#build-rte-sge
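
For example, on 1.3 a build along these lines should pick up the SGE
support (the install prefix here is just an illustration):

    ./configure --prefix=/opt/openmpi-1.3 --with-sge
    make all install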

You can check whether or not OpenMPI has support for GE by running the
command:

ompi_info | grep gridengine
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)

-----Original Message-----
From: mad [mailto:margaret_Doll at brown.edu] 
Sent: Tuesday, March 10, 2009 4:09 PM
To: users at gridengine.sunsource.net
Subject: Re: [GE users] problem with job distributions

Mike,

	We are using OpenMPI. Your explanation sounds correct.
Therefore, I conclude that I just have to keep watch and ask the users
to notify me when they delete one of their jobs, so that I can monitor
the cleanup.

On Mar 10, 2009, at 4:32 PM, mhanby wrote:

> What type of MPI are the users utilizing? I see this behavior with our
> LAM MPI jobs. The others can provide more insight, but I believe this
> will end up being an issue with loosely integrated parallel libraries.
> Do a search for 'tight integration' for more detail; a sample tightly
> integrated PE is sketched below.
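>
> For reference, a tightly integrated PE for OpenMPI typically looks
> something like this (the PE name 'orte' and the exact values are
> illustrative, not taken from this cluster):
>
>     $ qconf -sp orte
>     pe_name            orte
>     slots              999
>     user_lists         NONE
>     xuser_lists        NONE
>     start_proc_args    /bin/true
>     stop_proc_args     /bin/true
>     allocation_rule    $fill_up
>     control_slaves     TRUE
>     job_is_first_task  FALSE
>     urgency_slots      min
>
> With control_slaves set to TRUE, the execd starts the remote tasks
> itself, so a qdel can signal and account for every process of the job.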
>
> What is probably happening is that your user is using qdel to delete
> their job; the master process receives the signal to halt but doesn't
> clean up the child tasks. As far as SGE knows, the job has been killed
> and all slots are now available for use.
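>
> A quick way to spot the leftovers (the node name is just an example)
> is to look on a suspect node for MPI processes that were re-parented
> to init (PID 1) when their master died:
>
>     ssh compute-0-3 "ps -eo pid,ppid,user,etime,args | awk '\$2 == 1' | grep -i mpi"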
>
> Mike
>
> -----Original Message-----
> From: mad [mailto:margaret_Doll at brown.edu]
> Sent: Tuesday, March 10, 2009 3:05 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] problem with job distributions
>
> On Mar 10, 2009, at 3:55 PM, mad wrote:
>
>> On Mar 10, 2009, at 3:35 PM, mad wrote:
>>
>>> I have compute nodes, each of which has eight processors.  I have
>>> assigned eight compute nodes to one of my queues.  The compute nodes
>>> are listed in the group groupa, which is on the hostlist of the
>>> group-a queue.  In the General Configuration for the group-a queue,
>>> I have slots set to 8.  When I look at the Cluster Queues, queue
>>> group-a has 64 total slots.  Currently 52 slots are shown as being
>>> used in qmon.
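>>>
>>> (For reference, the same settings checked from the command line would
>>> look roughly like this; output sketched from memory, hostgroup name
>>> approximate:
>>>
>>>     qconf -sq group-a | egrep 'hostlist|slots'
>>>     hostlist              @groupa
>>>     slots                 8
>>> )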
>>>
>>> However, when I execute "qstat -f | grep group-a", I get
>>>
>>> group-a at compute-0-0.local       BIP   8/8       8.08     lx26-amd64
>>> group-a at compute-0-1.local       BIP   8/8       8.06     lx26-amd64
>>> group-a at compute-0-10.local      BIP   8/8       11.12    lx26-amd64
>>> group-a at compute-0-11.local      BIP   2/8       10.22    lx26-amd64
>>> group-a at compute-0-12.local      BIP   6/8       7.16     lx26-amd64
>>> group-a at compute-0-13.local      BIP   4/8       4.78     lx26-amd64
>>> group-a at compute-0-2.local       BIP   8/8       8.12     lx26-amd64
>>> group-a at compute-0-3.local       BIP   8/8       15.13    lx26-amd64    a
>>>
>>> The total number of slots in use is 52 (8+8+8+2+6+4+8+8), which
>>> agrees with qmon.  However, the load suggests 59 jobs.
>>>
>>> If I ssh into compute-0-3, I see 15 jobs being run by one user.
>>> All jobs except one are using 50% of a CPU.
>>>
>>> My users say they are using variations of
>>>
>>> qsub -pe queue-a 20 scriptp
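>>>
>>> (i.e., a request for 20 slots from the parallel environment named
>>> queue-a)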
>>>
>>>
>>> Why would the distribution of jobs be so out of whack?  I have been
>>> running this cluster with this version of the system for about six
>>> months now.  The only time the distribution was uneven was before
>>> one of my users learned to use qsub properly.
>>>
>>>
>>>
>>> Running ROCKS 5.3 with Red Hat kernel 2.6.18-53.1.14.el5
>>
>> I believe I found the problem.  qmon shows that one of the users has
>> two jobs running.  When I ssh into the compute nodes and look at her
>> jobs, she has jobs submitted on three different days.
>>
>> I am assuming that she did not successfully delete one of the jobs
>> that she started.  How do I catch these jobs except by keeping a
>> close eye on the queues with "qstat -f"?  Is there some way of
>> sending me an email when someone tries to delete a queued job, so
>> that I can see whether the deletion was successful?
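>>
>> (One thing I could try: ask users to submit with the standard mail
>> switches, e.g. "qsub -m ae -M admin@example.edu script", so SGE sends
>> mail when a job ends or is aborted; the address is a placeholder.)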
>>
>> Why did the job continue but not register in qmon?
>
> qstat -g c  shows
>
> queue-a  1.10     52     12     64      8      0
>
> at one time the CQLOAD was up to 1.22