[GE users] Problem with commd communications

Ron Chen ron_chen_123 at yahoo.com
Fri Jun 18 13:38:28 BST 2004


Developers,

Does the "event master" (or "event mirror"?) in SGE
6.0 help to offload the frequent qstat queries?

 -Ron

--- Yogesh Chaudhary <yogesh.chaudhary at amd.com> wrote:
> 
> We also do not allow users to query qmaster all the time. Now we use a
> script which runs qstat every couple of minutes and stores the output in
> a file, and users then query that file, so that they do not load qmaster.
> 
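For anyone wanting to try the same approach, here is a minimal sketch of
such a caching script. It is not Yogesh's actual script; the cache path,
the 120 second interval, and the function names are all assumptions made
for illustration. The file is written to a temporary name and then renamed
so that users never read a half-written cache.

```python
import os
import subprocess
import tempfile
import time

CACHE_PATH = "/var/tmp/qstat.cache"  # hypothetical location, pick your own
REFRESH_SECONDS = 120                # assumed value for "every couple of minutes"


def refresh_cache(cache_path=CACHE_PATH, command=("qstat",)):
    """Run the command once and atomically replace the cache file with its output."""
    output = subprocess.run(
        list(command), capture_output=True, text=True, check=True
    ).stdout
    directory = os.path.dirname(os.path.abspath(cache_path))
    # Write to a temporary file in the same directory, then rename it into
    # place, so readers see either the old cache or the new one.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".qstat.")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(output)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
        os.replace(tmp_path, cache_path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return output


def read_cache(cache_path=CACHE_PATH):
    """What users call instead of qstat: read the last stored output."""
    with open(cache_path) as cache:
        return cache.read()


def run_forever(interval=REFRESH_SECONDS):
    """Refresh the cache in a loop; a failed refresh keeps the old cache."""
    while True:
        try:
            refresh_cache()
        except (OSError, subprocess.CalledProcessError) as exc:
            print("qstat refresh failed:", exc)
        time.sleep(interval)
```

Calling run_forever() from a single daemon or init script means qmaster
sees one qstat per interval no matter how many users call read_cache().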
> But even after doing all this, commd usage still sometimes goes up and
> slows down the whole grid. All the hosts start showing a 99.99 load
> average. I do not know what else is going on. If I do not do anything,
> after some time the grid comes back to normal.
> 
> We get the following messages in the commd logs when something like this
> happens:
> Sun Jun 13 18:16:53 2004|commd|pcsgrid|W|select error: ignoring commproc using fd 10 because the fd is ready to receive AND ready to send an EOF
> Mon Jun 14 12:24:48 2004|commd|pcsgrid|E|commproc qsub:54631 was inactive for 301 seconds
> Mon Jun 14 12:24:53 2004|commd|pcsgrid|E|commproc qstat:54649 was inactive for 301 seconds
> 
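To see which client programs pile up when this happens, something like
the following could tally the "was inactive" entries per commproc name.
It is only a sketch: it assumes each log entry sits on one line in the
pipe-separated layout quoted above (timestamp, daemon, host, level,
message), and the function name is made up for this example.

```python
import re
from collections import Counter

# Matches messages such as "commproc qsub:54631 was inactive for 301 seconds".
INACTIVE_RE = re.compile(
    r"commproc ([^\s:]+):(\d+) was inactive for (\d+) seconds"
)


def count_inactive(lines):
    """Count 'was inactive' entries per commproc name (qsub, qstat, ...)."""
    counts = Counter()
    for line in lines:
        # Assumed layout: timestamp|daemon|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not a log entry in the assumed layout
        match = INACTIVE_RE.search(fields[4])
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to count_inactive() returns a Counter keyed by
commproc name; a count dominated by qstat would point at a polling script
or user, as in the cases described in this thread.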
> 
> thanks, Yogesh
> 
> On Thu, 17 Jun 2004, Craig Tierney wrote:
> 
> >
> >> Once, we found that there was a user who was trying to do qstat every
> >> second and this hosed commd...
> >>
> >> I have seen this also when a user deletes a large number of jobs...
> >>
> >> But still we have this problem once in a while..
> >>
> >> thanks, Yogesh
> >
> > Just an update.  We do have a system script that does a
> > "qstat -j" on every job every 30 seconds.  This script hasn't
> > changed dramatically in 2 years, so I was unlikely to blame
> > it.  However, with the increase in job load, it appears that
> > this was mostly to blame for the problem.  I reworked the code
> > so qstat is called only once per 30 seconds, and "-j" is never
> > used.
> >
> > With the old script, the load on sge_schedd was usually above
> > 90%.  With the new script, the load is usually 0% unless it
> > is actually scheduling jobs.
> >
> > Is there a bug in SGE?  I don't know.  I doubt I am going
> > to see any more problems with the changes I made.  I know
> > that SGE 6.0 has a much better communications model so I suspect
> > it can hold up under this load much better.  We are planning
> > to migrate in the near future.
> >
> > Thanks for the help.
> >
> > Craig
> >
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: users-unsubscribe at gridengine.sunsource.net
> > For additional commands, e-mail: users-help at gridengine.sunsource.net
> >
> >
> 
> --------------------------------------------------------------------------
> Yogesh Chaudhary
> Advanced Micro Devices,Inc ( PCS )
> 9500 Arboretum Blvd., Suite 400                      Phone:  512.602.5422
> Austin, TX 78759                                       Fax: 512.602.5051
> 
> 
> 
> 
> 



	
		





More information about the gridengine-users mailing list