[GE users] problem with job distributions

mad margaret_Doll at brown.edu
Tue Mar 10 21:41:17 GMT 2009


Thanks.  I will look into my PE setup.

In the particular case that I reported, the user thought they had
deleted a job, but the deletion had not actually completed.

On Mar 10, 2009, at 5:38 PM, mhanby wrote:

> Oh, if ompi_info reveals gridengine support, then also make sure that
> your PE is configured properly.  Here's mine (there may already be one
> created called orte):
>
> qconf -sp openmpi
>
> pe_name           openmpi
> slots             9999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /bin/true
> stop_proc_args    /bin/true
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
>
> If the openmpi or orte PE exists and is configured properly, make sure
> that your users are actually using that PE in their job submission
> command / script. I've found many users who'll just pick any PE out of
> the list, without any consideration for whether or not it's right for
> their MPI choice.
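>
> One way to verify that (just a sketch; 12345 is a placeholder job ID) is
> to ask qstat which PE, if any, a pending or running job actually
> requested:
>
> qstat -j 12345 | grep -i "parallel environment"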
>
> Mike
> -----Original Message-----
> From: mhanby [mailto:mhanby at uab.edu]
> Sent: Tuesday, March 10, 2009 4:34 PM
> To: users at gridengine.sunsource.net
> Subject: RE: [GE users] problem with job distributions
>
> OpenMPI should be aware of Grid Engine assuming it was compiled
> properly, so it shouldn't suffer from that problem.
>
> If compiling it on the head node, I believe that 1.2.* will detect that
> grid engine is installed and will build in support. If compiling 1.3 and
> later, you have to provide the --with-sge switch to the configure
> script.
>
> http://www.open-mpi.org/faq/?category=building#build-rte-sge
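>
> For example (just a sketch; the --prefix path is a placeholder and your
> version string will differ), a 1.3-series build with SGE support looks
> roughly like:
>
> ./configure --prefix=/opt/openmpi-1.3 --with-sge
> make all install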
>
> You can check whether or not openmpi has support for GE by running the
> command:
>
> ompi_info | grep gridengine
>                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
>                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
>
> -----Original Message-----
> From: mad [mailto:margaret_Doll at brown.edu]
> Sent: Tuesday, March 10, 2009 4:09 PM
> To: users at gridengine.sunsource.net
> Subject: Re: [GE users] problem with job distributions
>
> Mike,
>
>       We are using openmpi.  Your explanation sounds correct.
> Therefore, I conclude I just have to keep watch and ask the users
> to notify me when they delete one of their jobs, so I can check
> that the deletion actually cleaned everything up.
>
> On Mar 10, 2009, at 4:32 PM, mhanby wrote:
>
>> What type of MPI are the users utilizing? I see this behavior with our
>> LAM MPI jobs. The others can provide more insight, but I believe this
>> will end up being an issue with loosely integrated parallel
>> libraries. Do a search for 'tight integration' for more detail.
>>
>> What is probably happening is that your user is using qdel to delete
>> their job; the master process for the job receives the signal to halt
>> but doesn't clean up the child tasks. As far as SGE knows, the job has
>> been killed and all slots are now available for use.
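>>
>> A rough way to spot those leftovers (just a sketch; the node name and
>> the "grep -v sge_" filter are only examples) is to list what is actually
>> running on a node and compare it against what qstat reports:
>>
>> ssh compute-0-3 'ps -eo user,pid,etime,pcpu,comm | grep -v sge_'
>>
>> Any long-running MPI process belonging to a job that qstat no longer
>> lists has to be killed by hand.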
>>
>> Mike
>>
>> -----Original Message-----
>> From: mad [mailto:margaret_Doll at brown.edu]
>> Sent: Tuesday, March 10, 2009 3:05 PM
>> To: users at gridengine.sunsource.net
>> Subject: Re: [GE users] problem with job distributions
>>
>> On Mar 10, 2009, at 3:55 PM, mad wrote:
>>
>>> On Mar 10, 2009, at 3:35 PM, mad wrote:
>>>
>>>> I have compute nodes, each of which has eight processors.  I have
>>>> assigned eight compute nodes to one of my queues.  The compute nodes
>>>> are listed as groupa, which is on the hostlist of the group-a queue.  In
>>>> the General Configuration for the group-a queue, I have slots listed as
>>>> 8.  When I look at the Cluster Queues, queue group-a has 64 total
>>>> slots.  Currently 52 slots are shown as being used in qmon.
>>>>
>>>> However, when I execute "qstat -f | grep group-a", I get
>>>>
>>>> group-a at compute-0-0.local       BIP   8/8       8.08     lx26-amd64
>>>> group-a at compute-0-1.local       BIP   8/8       8.06     lx26-amd64
>>>> group-a at compute-0-10.local      BIP   8/8       11.12    lx26-amd64
>>>> group-a at compute-0-11.local      BIP   2/8       10.22    lx26-amd64
>>>> group-a at compute-0-12.local      BIP   6/8       7.16     lx26-amd64
>>>> group-a at compute-0-13.local      BIP   4/8       4.78     lx26-amd64
>>>> group-a at compute-0-2.local       BIP   8/8       8.12     lx26-amd64
>>>> group-a at compute-0-3.local       BIP   8/8       15.13    lx26-amd64    a
>>>>
>>>> The total number of slots being used is 52, which agrees with qmon.
>>>> However, the load shows 59 jobs.
>>>>
>>>> If I ssh into compute-0-3, I see 15 jobs run by one user.
>>>> All jobs except one are using 50% of a CPU.
>>>>
>>>> My users say they are using variations of
>>>>
>>>> qsub -pe queue-a 20 scriptp
>>>>
>>>>
>>>> Why would the distribution of jobs be so out of whack?  I have been
>>>> running this cluster with this version of the system for about six
>>>> months now.  The only time the distribution was uneven was before
>>>> one of my users learned to use qsub properly.
>>>>
>>>>
>>>>
>>>> Running ROCKS 5.3 with Red Hat, kernel 2.6.18-53.1.14.el5
>>>
>>> I believe I found the problem.  One of the users in qmon shows that
>>> she has two jobs running.  When I ssh into the compute nodes and
>>> look at her jobs, she has jobs submitted on three different days.
>>>
>>> I am assuming that she did not successfully delete one of the jobs
>>> that she started.  How do I catch these jobs, except by keeping a
>>> close eye on the queues with "qstat -f"?  Is there some way of
>>> sending me an email when someone tries to delete a queued job,
>>> so that I can see if the deletion was successful?
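>>>
>>> One idea (only a sketch; the host names are examples and the reporting
>>> address is up to you) would be a small cron job on the head node that
>>> prints, per host, what SGE thinks is running next to what ps actually
>>> sees, and mails the result:
>>>
>>> for h in compute-0-0 compute-0-3; do
>>>     echo "== $h =="
>>>     qhost -h $h -j
>>>     ssh $h 'ps -eo user,pid,etime,comm'
>>> done | mail -s "SGE vs ps check" root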
>>>
>>> Why did the job continue, but not register  on qmon?
>>
>> qstat -g c  shows
>>
>> CLUSTER QUEUE            CQLOAD   USED  AVAIL  TOTAL aoACDS  cdsuE
>> queue-a                    1.10     52     12     64      8      0
>>
>> at one time the CQLOAD was up to 1.22
