[GE users] Problem with commd communications

Craig Tierney ctierney at hpti.com
Thu Jun 17 17:25:30 BST 2004


> Once, we found that a user was running qstat every second, and
> this hosed commd...
> 
> I have also seen this when a user deletes a large number of jobs...
> 
> But we still have this problem once in a while...
> 
> thanks, Yogesh

Just an update.  We do have a system script that runs
"qstat -j" on every job every 30 seconds.  This script hasn't
changed dramatically in 2 years, so I was reluctant to blame
it.  However, with the increase in job load, it appears that
the script was mostly to blame for the problem.  I reworked the
code so that qstat is called only once every 30 seconds, and "-j"
is never used.
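
For anyone with a similar monitoring script, here is a rough sketch
of the "one qstat per interval" idea in Python.  This is not the
actual script, only an illustration.  The names parse_qstat,
poll_once and main are made up for this example, and the parser
assumes that plain "qstat" output puts the job ID in the first
whitespace-separated column and the state in the fifth, which should
be checked against the output of your own installation.

```python
import subprocess
import time

POLL_INTERVAL_SECONDS = 30  # from the post: one qstat call per 30 seconds


def parse_qstat(output):
    """Map job ID -> state from plain `qstat` text.

    Assumption: the job ID is the first whitespace-separated field and
    the state is the fifth. Header and separator lines are skipped
    because their first field is not a number. If a job ID appears on
    several lines (for example array tasks), the last line wins.
    """
    jobs = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        jobs[fields[0]] = fields[4]
    return jobs


def poll_once():
    """Run a single plain `qstat` (no -j) and return the parsed job table."""
    result = subprocess.run(
        ["qstat"], capture_output=True, text=True, check=True
    )
    return parse_qstat(result.stdout)


def main():
    """Poll forever: one qstat call per interval, regardless of job count."""
    while True:
        jobs = poll_once()
        print(f"{len(jobs)} jobs seen")
        time.sleep(POLL_INTERVAL_SECONDS)
```

Calling main() starts the loop.  The point of the design is that the
number of qstat invocations no longer grows with the number of jobs:
everything the script needs per job is taken from the one listing.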

With the old script, the load on sge_schedd was usually above
90%.  With the new script, the load is usually 0% unless it
is actually scheduling jobs.

Is there a bug in SGE?  I don't know.  With the changes I made,
I doubt I am going to see any more problems.  I know that
SGE 6.0 has a much better communications model, so I suspect
it can hold up under this load much better.  We are planning
to migrate in the near future.

Thanks for the help.

Craig

