[GE users] Problem with commd communications

Yogesh Chaudhary yogesh.chaudhary at amd.com
Thu Jun 17 18:01:11 BST 2004


We also do not allow users to query qmaster all the time. We now use a 
script that runs qstat every couple of minutes and stores the output in 
a file; users then query that file, so that they do not load qmaster.
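
A minimal sketch of that kind of caching script, in Python. This is not 
our actual script: the function names, the cache path /tmp/qstat.cache 
and the 120 second interval are assumptions chosen for illustration. It 
runs plain qstat once per interval and replaces the cache file 
atomically, so users never read a half-written file.

```python
import os
import subprocess
import tempfile
import time


def refresh_cache(cache_path, command=("qstat",), runner=None):
    """Run `command` once and atomically replace `cache_path` with its output.

    `runner` is a hypothetical hook (not part of SGE) that takes the command
    list and returns its output as text; it exists so the function can be
    exercised without a real qmaster.
    """
    if runner is None:
        # Default runner: invoke the real command and capture stdout as text.
        def runner(cmd):
            result = subprocess.run(cmd, capture_output=True, text=True, check=True)
            return result.stdout

    output = runner(list(command))
    directory = os.path.dirname(os.path.abspath(cache_path))
    # Write to a temporary file in the same directory, then rename it over
    # the cache so readers see either the old or the new content, never a mix.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".qstat-cache-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(output)
        os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; let users read it
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return output


def run_forever(cache_path="/tmp/qstat.cache", interval_seconds=120):
    """Refresh the cache every `interval_seconds`; both defaults are assumed."""
    while True:
        try:
            refresh_cache(cache_path)
        except (OSError, subprocess.CalledProcessError) as exc:
            # Keep the previous cache file if qstat fails or times out.
            print(f"qstat refresh failed: {exc}")
        time.sleep(interval_seconds)
```

Users would then read the cache file (for example with cat or grep) 
instead of calling qstat themselves, so qmaster sees one query per 
interval regardless of how many users are looking.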

But even after doing all this, commd usage still sometimes goes up and 
slows down the whole grid. All the hosts start showing a load average 
of 99.99. I do not know what else is going on. If I do nothing, the 
grid comes back to normal after some time.

We get the following messages in the commd logs when something like 
this happens:
Sun Jun 13 18:16:53 2004|commd|pcsgrid|W|select error: ignoring commproc using fd 10 because the fd is ready to receive AND ready to send an EOF
Mon Jun 14 12:24:48 2004|commd|pcsgrid|E|commproc qsub:54631 was inactive for 301 seconds
Mon Jun 14 12:24:53 2004|commd|pcsgrid|E|commproc qstat:54649 was inactive for 301 seconds
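
To see which clients show up most often in these messages, the log can 
be summarized with a short Python sketch like the one below. The field 
layout (timestamp|daemon|host|level|message) is inferred only from the 
sample lines above, and the function names are hypothetical; the 
number after the colon is treated as an opaque id.

```python
import re
from collections import Counter

# Matches messages such as "commproc qsub:54631 was inactive for 301 seconds".
INACTIVE_RE = re.compile(r"commproc (\S+):(\d+) was inactive for (\d+) seconds")


def parse_line(line):
    """Split one log line into fields, or return None if it does not fit.

    Assumes five pipe-separated fields, as seen in the sample lines.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # e.g. a wrapped continuation line
    timestamp, daemon, host, level, message = parts
    return {
        "timestamp": timestamp,
        "daemon": daemon,
        "host": host,
        "level": level,
        "message": message,
    }


def inactive_commprocs(lines):
    """Count 'was inactive' messages per client name (qsub, qstat, ...)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        match = INACTIVE_RE.search(record["message"])
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the lines of the commd log file gives a per-client count, 
which may help show whether qstat or qsub clients dominate when the 
slowdown happens.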


thanks, Yogesh

On Thu, 17 Jun 2004, Craig Tierney wrote:

>
>> Once, we found that there was a user who was trying to do qstat every
>> second and this hosed commd...
>>
>> I have also seen this when a user deletes a large number of jobs...
>>
>> But we still have this problem once in a while.
>>
>> thanks, Yogesh
>
> Just an update.  We do have a system script that does a
> "qstat -j" on every job every 30 seconds.  This script hasn't
> changed dramatically in 2 years, so I was unlikely to blame
> it.  However, with the increase in job load, it appears that
> this was mostly to blame for the problem.  I reworked the code
> so qstat is called only once per 30 seconds, and "-j" is never
> used.
>
> With the old script, the load on sge_schedd was usually above
> 90%.  With the new script, the load is usually 0% unless it
> is actually scheduling jobs.
>
> Is there a bug in SGE?  I don't know.  I doubt I am going
> to see any more problems with the changes I made.  I know
> that SGE 6.0 has a much better communications model so I suspect
> it can hold up under this load much better.  We are planning
> to migrate in the near future.
>
> Thanks for the help.
>
> Craig
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> For additional commands, e-mail: users-help at gridengine.sunsource.net
>
>

--------------------------------------------------------------------------
Yogesh Chaudhary
Advanced Micro Devices,Inc ( PCS )
9500 Arboretum Blvd., Suite 400                       Phone:  512.602.5422
Austin, TX 78759                                       Fax: 512.602.5051






